Doctoral Thesis

The problem

SRAM-FPGAs are widely used in space missions, aerospace, medical devices, data centers, nuclear reactors and high-energy particle accelerators. They are valued for their parallelism, high logic capacity, reconfigurability and the ability to update designs in the field without touching the hardware. The catch is that their configuration memory is built from SRAM cells, which are vulnerable to single event upsets (SEUs): radiation-induced bit-flips that can silently change what the FPGA does. The device miniaturization brought about by continued CMOS scaling has two effects on error mitigation: (i) as transistors become smaller, devices become more susceptible to multiple-bit upsets, which weakens traditional mitigation schemes; and (ii) the growing number of configuration memory bits makes validation through fault injection increasingly expensive.

Proposed solutions

My work tackles three dependability problems: detecting failures in the mitigation hardware itself, making fault injection campaigns cheaper without losing statistical rigor, and locating the configuration bits that actually matter.

Detecting scrubber failures

Scrubbers are the workhorse mitigation scheme for FPGA configuration memory. They continuously read back the bitstream and correct errors. But scrubbers themselves can fail, and those failures are hard to observe. I developed two non-invasive, log-based frameworks for monitoring scrubber health. The first is a Markov chain model that leans on IP specifications to track state transitions and flag anomalies. The second, AnoDe, is a self-supervised system that requires no domain knowledge of the underlying IP, which makes it portable across operational scenarios where specs aren't fully available.
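To make the Markov chain idea concrete, here is a minimal sketch of log-based transition monitoring. It is not the model from the thesis: the state names are hypothetical, and the transition probabilities are estimated from example "healthy" logs rather than derived from IP specifications. A transition is flagged when its estimated probability falls below an assumed threshold.

```python
from collections import defaultdict


def fit_transitions(sequences):
    """Estimate first-order transition probabilities from healthy state logs."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {
        cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for cur, nxts in counts.items()
    }


def flag_anomalies(sequence, probs, threshold=0.01):
    """Return (index, current, next) for every transition rarer than threshold."""
    flagged = []
    for i, (cur, nxt) in enumerate(zip(sequence, sequence[1:])):
        # Transitions never seen in the healthy logs get probability 0.0.
        if probs.get(cur, {}).get(nxt, 0.0) < threshold:
            flagged.append((i, cur, nxt))
    return flagged


# Hypothetical scrubber states parsed from log lines (assumed names).
healthy_logs = [
    ["INIT", "OBSERVE", "DETECT", "CORRECT", "OBSERVE"],
    ["INIT", "OBSERVE", "OBSERVE", "DETECT", "CORRECT", "OBSERVE"],
]
model = fit_transitions(healthy_logs)

# A detection that is never followed by a correction is suspicious here.
suspect_log = ["INIT", "OBSERVE", "DETECT", "OBSERVE"]
anomalies = flag_anomalies(suspect_log, model)
print(anomalies)
```

Because the monitor only consumes a sequence of logged states, it stays non-invasive: nothing in the scrubber itself has to change.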

Optimizing statistical fault injection

Statistical fault injection is the standard way to estimate how an FPGA design will behave under SEUs, but the required sample size grows with uncertainty about the underlying failure rate. I proposed a Bayesian sampling framework that folds prior knowledge into the sampling procedure, cutting the number of injections needed while preserving the statistical confidence and black-box nature of the classical statistical approach.
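The link between sample size and uncertainty can be illustrated with the finite-population sample size formula commonly used in statistical fault injection, n = N / (1 + e²(N − 1) / (t² p(1 − p))). The sketch below assumes this is the classical baseline; it is not the Bayesian framework itself. The function name, the population of one million bits and the 5% estimate are illustrative assumptions.

```python
import math


def sfi_sample_size(population, margin, p=0.5, z=1.96):
    """Injections needed for a given margin of error.

    population: number of candidate fault locations (e.g. configuration bits)
    margin:     desired margin of error on the estimated failure rate
    p:          assumed failure probability; 0.5 is the conservative choice
    z:          normal quantile for the confidence level (1.96 for about 95%)
    """
    n = population / (1 + margin**2 * (population - 1) / (z**2 * p * (1 - p)))
    return math.ceil(n)


bits = 1_000_000  # assumed population size, for illustration only

# With no knowledge of the failure rate, p = 0.5 maximizes p * (1 - p).
n_conservative = sfi_sample_size(bits, margin=0.01, p=0.5)

# If prior knowledge suggests a failure rate near 5%, far fewer injections
# are needed for the same margin and confidence level.
n_informed = sfi_sample_size(bits, margin=0.01, p=0.05)

print(n_conservative, n_informed)
```

The gap between the two results is the room that prior knowledge can exploit: the closer the assumed p is to the true failure rate, the fewer injections are spent guarding against the worst case.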

Accelerating fault injection without reverse-engineering the layout

Identifying the critical bits of a configuration memory usually requires reverse-engineering the FPGA layout, which is slow, vendor-specific and often off-limits for commercial parts. I explored machine learning-based alternatives: a Monte Carlo Tree Search strategy that guides single-bit injection towards critical regions, and Long Short-Term Memory models that predict the outcome of multiple-bit upsets. Both plug into existing fault-injection setups and are especially useful when access to radiation facilities is limited.

The significance

Taken together, these contributions push commercial SRAM-FPGAs further along the path toward safety-critical deployment. They give designers tools to verify that their mitigation infrastructure is working as intended, to run leaner fault-injection campaigns, and to focus testing effort on the critical bits that are most likely to cause harm.

Papers

The thesis is built around seven research papers. The full thesis is available on the KTH DiVA portal.

  1. A Markovian Approach for Detecting Failures in the Xilinx SEM Core
  2. AnoDe: A Log-based Self-Supervised Framework to Detect Scrubber Failures in SRAM-FPGA
  3. Navigating the Challenges of Statistical Fault Injection in SRAM-FPGA
  4. Bayesian Sampling Framework for Improved Statistical Fault Injection
  5. Guided Fault Injection Strategy for Rapid Critical Bit Detection in Radiation-Prone SRAM-FPGA
  6. Predictive Modeling of Multi-Bit Upsets for Emulated Fault Injection
  7. Exploring the Potential of LSTM on Emulating Multiple-bit Fault Injection in SRAM-FPGA

Defense presentation

The slide deck from my doctoral defense
