## Single Event Effects in SRAM based FPGA for space applications: Analysis and Mitigation

## Diagnostic Services in Network-on-Chips (DSNOC'09)

Roland Weigand, David Merodio Codinachs, European Space Agency, Microelectronics Section



24<sup>th</sup> April 2009


# Outline (1)

- Introduction on radiation effects
  - Total Ionising Dose (TID) effects
  - Single Event Latch-up (SEL)
  - Single Event Transient (SET) Effects
  - Single Event Upset (SEU) in user flip-flops and RAM
  - Single Event Upset (SEU) in FPGA configuration memory
  - Single Event Functional Interrupts (SEFI)
  - Quantifying SEE: LET threshold, cross-section, statistical upset rates

- SEE mitigation, in general and dedicated to SRAM FPGA
  - Triple Modular Redundancy (TMR) for flip-flops in ASIC designs
  - Functional TMR (FTMR) and the Xilinx TMR tool (XTMR) for SRAM FPGA
  - Configuration memory scrubbing
  - Reliability Oriented Place & Route algorithm (RoRA)
  - Block and device level redundancy
  - Temporal Redundancy
  - Rad-hard reconfigurable FPGA



# Outline (2)

- Analysis of SEE, verification of mitigation methods
  - Radiation testing: Heavy Ions, Protons, Neutrons
  - Fault simulation and fault injection
  - Functional and formal verification
  - Analysis of circuit topology
- Selection of the appropriate mitigation strategy
- Actual or planned use of SRAM FPGA in space projects
  - Example: Mars Exploration Rover
- Conclusion
  - Are Single Event Effects a concern in non-space applications?
  - Are our SEE mitigation methods suitable for NoC?
  - What happens in future technology generations?

- References



## Radiation effects in space components

- Presence of Galactic Cosmic Rays and Solar Flares
- Total Ionising Dose (TID)
  - Defects in the semiconductor lattice, degradation of mobility and V<sub>th</sub>
  - Reduced speed, increased leakage current at end-of-life
  - Mitigation: process, cell layout (guardrings), design margins (derating)
- Single Event Effects (SEE)
  - Electron-hole pair generation by interaction with heavy ions
  - Glitches when carriers are caught by drain pn-junctions



# **Single Event Effects**

- Single Event Latchup (SEL)
  - SEE induced triggering of parasitic thyristors
  - Mitigation: process and cell layout
- Single Event Transients (SET) in clocks and resets
  - Glitches on clocks  $\rightarrow$  change of state, functional fault
  - Asynchronous resets are clock-like signals
- Single Event Transients (SET) in combinatorial logic
  - SEE glitches in combinatorial logic behave like cross-talk effects
  - Causes SEU when arriving at flip-flop/memory D-input during clock edge
  - Sensitivity increases with clock frequency
  - Synchronous resets are (normal) combinatorial signals
- Single Event Upset (SEU) in Flip-Flops and SRAM
  - SEE glitch inside the bistable feedback loop of storage point
  - Immediate bit flip  $\rightarrow$  loss of information, change of state, functional fault



# Single Event Effects in SRAM FPGA

## Single Event Upset (SEU) in configuration memory

In SRAM FPGA, the circuit itself is stored in a RAM. A bit flip can modify the circuit functionality – e.g.

» modifying a look-up-table (combinatorial function)

» changing IO configuration (revert IO direction)

» causing an open connection

» causing a short circuit

### Single Event Functional Interrupts (SEFI)

- Defined in [2]: SEFI is an SEE that results in the interference of the normal operation of a complex digital circuit. SEFI is typically used to indicate a failure in a support circuit, such as:
  - » a region of configuration memory, or the entire configuration.
  - » loss of JTAG or configuration capability
  - » Clock generators
  - » JTAG functionality
  - » power on reset



# **Quantifying SEE**

## LET (Linear Energy Transfer) threshold (unit: MeV·cm²/mg)

- LET = energy per unit length transferred by an ion travelling through the device (MeV/cm), divided by the mass density of the material (Si = 2320 mg/cm<sup>3</sup>)
- LET threshold is the minimum LET to cause an effect (activation energy)
- (Saturated) Cross-Section (unit: cm²/device or cm²/bit)
  - X-section = Number of errors / Ion fluence
  - Saturated value is the horizontal part of the curve
- During radiation test
  - Measure X-section as a function of LET
  - LET depends on ion energy and on the test setup (tilt)
- But how does my chip behave in orbit, in a real application?
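
The quantities above can be computed directly from a test log. The sketch below is illustrative only: the function names are hypothetical, and it assumes the common convention that tilting the device by an angle θ lengthens the ion path through the sensitive volume, so that the effective LET is LET / cos θ.

```python
import math

SI_DENSITY_MG_PER_CM3 = 2320.0  # silicon mass density quoted on the slide


def let_from_energy_loss(mev_per_cm):
    """LET in MeV*cm^2/mg from the energy lost per unit length (MeV/cm)."""
    return mev_per_cm / SI_DENSITY_MG_PER_CM3


def effective_let(let_normal, tilt_deg):
    """Effective LET for a tilted device (assumed 1/cos(theta) scaling)."""
    return let_normal / math.cos(math.radians(tilt_deg))


def cross_section(errors, fluence_per_cm2):
    """Cross-section in cm^2 = number of errors / ion fluence (ions/cm^2)."""
    return errors / fluence_per_cm2


if __name__ == "__main__":
    # Made-up numbers, not measured data.
    print(let_from_energy_loss(23200.0))
    print(effective_let(10.0, 60.0))
    print(cross_section(50, 1e6))
```

Dividing the device cross-section by the number of bits under test gives the per-bit value (cm²/bit).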



# **Device/Bit Error Rates**

### Error rate in space is related to the energy spectrum

- Depending on the orbit (low earth orbit, geostationary etc.)
- Depending on solar conditions (11 years min/max cycle, flares)
- Influence of the magnetic field
- Radiation belts
- Different Error Rates
  - Bit error rate: # errors/bit/day
  - Device error rate: # errors/device/day
  - FIT = # failures in 10<sup>9</sup> hours
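
The three rate units convert into each other with simple arithmetic. A minimal sketch, assuming upsets are independent so that the device rate is the bit rate times the number of bits; the function names and the sample figures are made up for illustration.

```python
HOURS_PER_DAY = 24.0
FIT_HOURS = 1e9  # FIT counts failures per 10^9 device hours


def device_rate_per_day(bit_rate_per_day, n_bits):
    """Errors/device/day from errors/bit/day (independent bits assumed)."""
    return bit_rate_per_day * n_bits


def fit_from_daily_rate(errors_per_day):
    """FIT from a rate given in errors per day."""
    return errors_per_day / HOURS_PER_DAY * FIT_HOURS


def mean_hours_between_failures(fit):
    """Mean time between failures in hours for a given FIT value."""
    return FIT_HOURS / fit


if __name__ == "__main__":
    rate = device_rate_per_day(1e-7, 1_000_000)  # illustrative values
    fit = fit_from_daily_rate(rate)
    print(rate, fit, mean_hours_between_failures(fit))
```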



## CREME96 [3]

- Numerical models of the ionising radiation environment
- Calculate error rates from LET vs. X-section curve and orbit parameters
- Developed by the US Naval Research Laboratory


# Mitigation of SEU in User Logic

Standard synchronous RTL design



TMR and single voters for flip-flops for hard-wired logic (ASIC)



Functional TMR (FTMR) [4] for SRAM (reprogrammable) FPGA






# FTMR – XTMR

- FTMR is based on full triplication of the design and majority voting at all flip-flop inputs and/or outputs
  - Tolerates single bit flips anywhere in user or configuration memory
    - » Bit flips are 'voted' out in the next clock cycle
  - Mitigates SET effects (glitches in clocks and combinatorial logic)
  - The VHDL approach presented in [4] requires a special coding style, it is synthesis and P&R tool dependent and therefore difficult to use

## XTMR developed by Xilinx has a very similar topology

- Voters only in the feedback paths (counters, state machines)
  - » Bit flips are voted out within N clock cycles
    (N = number of stages of linear data path)
  - » less area and routing overhead
- Implemented automatically by the TMRTool [5]
- Independent of HDL coding style and synthesis tool
- Well integrated with the ISE tool chain
- Also triples primary IO signals
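
The voting principle can be modelled in a few lines. The sketch below is a behavioural model, not FTMR or XTMR output: it triplicates an 8-bit counter, places a bitwise majority voter in the feedback path of each domain, and shows a single bit flip being voted out on the next clock. All names are hypothetical.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


def vote(a, b, c):
    """Bitwise 2-out-of-3 majority vote."""
    return (a & b) | (b & c) | (a & c)


def clock_tmr_counter(regs):
    """Advance a triplicated counter by one clock cycle.

    Every domain votes on all three register outputs before incrementing,
    so a corrupted register is overwritten with the majority value.
    """
    return [(vote(*regs) + 1) & MASK for _ in range(3)]


if __name__ == "__main__":
    regs = [5, 5, 5]
    regs[1] ^= 0x10                   # SEU: flip bit 4 in domain 1
    print(vote(*regs))                # voted value is still the correct one
    print(clock_tmr_counter(regs))    # all three domains agree again
```

Two flips of the same bit position in different domains would defeat the voter, which is why the slides that follow discuss scrubbing and domain separation.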

# Multiple SEU – Configuration Scrubbing

### Multiple bit flips can be

- Single bit flips (SEU), accumulated over time
- A single particle flipping several bits (Multiple Bit Upset, MBU)

## Neither XTMR nor FTMR tolerates multiple bit flips

- Refresh of configuration memory at regular intervals required
- Background configuration scrubbing by partial reconfiguration [6]
  without stopping operation of the user design function
- Scrubbing protects against accumulated single bit flips, provided the scrubbing rate is several times faster than the statistical bit upset rate
- Requires an external rad-hard scrubbing controller
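
Why the scrubbing rate must exceed the upset rate can be illustrated with a Poisson model, which is an assumption made here and not stated in the source: if upsets arrive independently at a constant rate, the chance of two or more accumulating within one scrub period drops quickly as the period shrinks. The function name is hypothetical.

```python
import math


def p_accumulated_upsets(upset_rate_per_s, scrub_period_s, k_min=2):
    """Probability of at least k_min upsets within one scrub period.

    Assumes upsets form a Poisson process with a constant rate.
    """
    mean = upset_rate_per_s * scrub_period_s
    p_fewer = sum(
        math.exp(-mean) * mean ** k / math.factorial(k) for k in range(k_min)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    rate = 1e-3  # illustrative upset rate, upsets per second
    for period in (1000.0, 100.0, 10.0):
        print(period, p_accumulated_upsets(rate, period))
```

The result is an upper bound on the risk to a TMR design, since two accumulated flips only cause a failure when they hit different voter domains of the same logic.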

## Scrubbing does not protect against MBU

- MBU are rare in current technology
- MBU could become an issue in future technology generations
- MBU usually affects physically adjacent memory cells
- MBU mitigation requires in-depth knowledge of the chip topology

# **RoRA: Mitigation at Place and Route**

### In spite of (X)TMR, single point failures (SPF) still exist

- Optimisation during layout leads to close-proximity implementation
  - » Flipping one bit may create a short between two voter domains
  - » Flipping one bit may change a constant (0 or 1) used in two domains

### A malfunction in two domains at the same time cannot be voted out



### The Reliability-oriented Place & Route Algorithm (RoRA) [7]

- Disentangles the three voter domains
- Reduces the number of SPF (bits affecting several resources)
- Besides giving additional fault tolerance to (X)TMR designs, RoRA is applicable also to non- or partial-TMR designs

# **Protection of SRAM blocks (1)**

## EDAC = Error Detection And Correction

- Usually corrects single and detects multiple bit flips per memory word
- Regular access required to prevent error accumulation (scrubbing)
- Control state machine required to rewrite corrected data
- Impact on max. clock frequency (XOR tree)
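
To make "corrects single and detects multiple bit flips" concrete, here is a minimal sketch of an extended Hamming (8,4) code, chosen only because it is the smallest SEC-DED example; the source does not say which code a given EDAC core uses. Position 0 holds an overall parity bit, positions 1, 2 and 4 hold the Hamming check bits.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SEC-DED codeword (list of bits)."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # overall parity over the word
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'double'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the set bit positions
    overall_odd = sum(code) % 2
    if overall_odd:
        code[syndrome] ^= 1                 # single flip, syndrome locates it
        status = "corrected"
    elif syndrome:
        return None, "double"               # even parity but bad syndrome
    else:
        status = "ok"
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[6] ^= 1                            # inject a single bit flip
    print(decode(word))
```

The XOR trees in `encode` and in the syndrome computation correspond to the clock-frequency impact mentioned above. A control state machine would write the corrected word back to memory.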

## Parity protection allows detection but no hardware correction

- When redundant data is available elsewhere in the system
  - » Embedded cache memories (duplicates of external memory)  $\rightarrow$  LEON2-FT
  - » Duplicated memories (reload correct data from replica)  $\rightarrow$  LEON3-FT
- On error: reload by hardware state machine or software (reboot)

## Proprietary solutions from FPGA vendors

- ACTEL core generator [24]
  - » EDAC and scrubbing
- XILINX XTMR [5]
  - » Triplication, voting and scrubbing

# **Protection of SRAM blocks (2)**

### EDAC protected memory (Actel)

- Scrubbing takes place only in idle mode (we, re = inactive)
- Required memory width
  - » 18-bit for data bits <= 12
  - » 36-bit for 12 < data bits <= 29
  - » 54-bit for 29 < data bits <= 47
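
The relation between data width and memory width can be reproduced with the check-bit count of a standard single-error-correcting, double-error-detecting Hamming code. That the Actel core uses exactly this code is an assumption; the sketch only shows that the limits of 12, 29 and 47 data bits are consistent with it.

```python
def secded_check_bits(data_bits):
    """Check bits for a SEC-DED Hamming code protecting data_bits bits."""
    r = 1
    while (1 << r) < data_bits + r + 1:     # Hamming bound for SEC
        r += 1
    return r + 1                            # one extra parity bit for DED


def max_data_bits(word_width):
    """Largest data width whose codeword fits into word_width bits."""
    d = 0
    while d + 1 + secded_check_bits(d + 1) <= word_width:
        d += 1
    return d


if __name__ == "__main__":
    for width in (18, 36, 54):
        d = max_data_bits(width)
        print(width, d, secded_check_bits(d))
```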



- Triplicated memory (Xilinx)
  - Scrubbing in background using spare port of dual-port memory
  - Triplication against configuration upset





# **Other Mitigation Techniques (1)**

### Block and device level redundancy [6]

- Implementation of each design is plain (non-voted)
- Design/verification of plain blocks/devices does not require special tools
- 2x1 implementation ( $\rightarrow$  error detection and restart)
- 3x1 or 2x2 implementation ( $\rightarrow$  continue operation in case of fault)





# **Other Mitigation Techniques (2)**

### ... Block and device level redundancy

- Redundant blocks or devices must be re-synchronised
  - » Context copying when error in one instance is detected
  - » Reset system or restore context from snapshot stored at regular intervals

### Device TMR overcomes shortage of gate resources and IO pins

- Device TMR also protects against SEFI
- Device TMR requires separate rad-hard voting and reconfiguration unit
- Also applied for non-FPGA COTS devices [11]

## Temporal redundancy

- Repeat processing two or more times and vote result
- Employed for embedded microprocessors
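
Temporal redundancy on a processor can be sketched as running the same computation several times and voting on the results. The helper below is hypothetical and assumes the task is deterministic, free of side effects, and returns a hashable value.

```python
from collections import Counter


def vote_over_time(task, arg, repeats=3):
    """Run task(arg) repeatedly and return the strict-majority result."""
    results = [task(arg) for _ in range(repeats)]
    value, count = Counter(results).most_common(1)[0]
    if count <= repeats // 2:
        raise RuntimeError("no majority among repeated executions")
    return value


if __name__ == "__main__":
    print(vote_over_time(lambda x: x * x, 7))
```

A transient fault that corrupts one of the three runs is outvoted; the cost is roughly three times the execution time instead of three times the area.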

## Partial (Selective) TMR [12]

- Triple only the most sensitive parts of a system
- Trade fault tolerance against complexity, but difficult to validate

## Single instance and watchdog

# Rad-Hard Reconfigurable FPGA (1)

- The Atmel ATF280E [8]
  - The ATF280E is a radiation hardened SRAM-based reprogrammable FPGA
  - It has SEE hardened
    - » Configuration memory
    - » User flip-flops
    - » User memory
  - It offers 280K equivalent ASIC gates and 115Kb of RAM
  - Packages MQFP256 / MCGA 472 with 150 / 308 user I/O
  - Implemented in 180 nm technology
  - Development of larger devices is planned in cooperation between
    - » Atmel Aerospace
    - » Abound Logic http://www.aboundlog
    - » CNES (French Space Agency)
    - » JAXA (Japanese Space Agency)
    - » ESA (European Space Agency)








# Rad-Hard Reconfigurable FPGA (2)

## The Xilinx SIRF Project [9]

- SIRF = Single-event effects Immune Reconfigurable FPGA
- Based on the Virtex5 architecture, implemented in 65 nm technology
- Developed under US air force funding
- Subject to export regulations (ITAR)
- Packages FF665/1136/1738 (TBC)

## Flash based FPGA

- Actel ProASIC [10]
- Radiation evaluation is ongoing
- ASIC-like SEE mitigation required
- Flash is reconfigurable
  - » A limited number of reconfiguration cycles
  - » No on-line reconfiguration (while circuit is operating)
- Packages CCGA/LGA-484, 896








## **Rad-Hard FPGA Overview**





# Verification of fault-tolerant designs

- Verification has to answer three main questions
  - Does the mitigation strategy provide adequate fault tolerance?
    - » Radiation testing, fault simulation and fault emulation
  - Was the planned mitigation strategy properly implemented?
    - » Analysis of netlist and physical implementation (layout)
  - Are we sure the TMR did not break the circuit function?
    - » Dedicated formal verification tools are required

## Standard verification methods and tools are not sufficient

- Simulation of a TMR netlist "works" with a defect in one voter domain
- COTS formal verification tools are confused by TMR
- Structural verification of TMR ASIC designs: InFault [19]
- NASA/Mentor: Formal verification for TMR designs [1]
- STAR, the STatic AnalyzeR tool [20]
  - » Performs static analysis of a TMR circuit layout in SRAM FPGA
  - » Identifies critical configuration bits (single bit affecting two voter domains)
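
The idea of a critical configuration bit can be illustrated with a toy data model. This is not the STAR algorithm or its file format: the sketch simply assumes a hypothetical mapping from configuration bit address to the resources it controls, each labelled with its voter domain, and reports bits that touch more than one domain.

```python
def critical_bits(bit_to_resources):
    """Return configuration bits whose flip would affect two or more domains.

    bit_to_resources maps a bit address to a list of
    (resource_name, domain) tuples; both are illustrative placeholders.
    """
    return sorted(
        bit
        for bit, resources in bit_to_resources.items()
        if len({domain for _, domain in resources}) >= 2
    )


if __name__ == "__main__":
    layout = {
        0x1000: [("lut_a", 0)],
        0x1004: [("route_x", 0), ("route_y", 1)],    # bridges two domains
        0x1008: [("const_one", 1), ("const_one", 2)],  # shared constant
        0x100C: [("route_z", 2), ("ff_q", 2)],
    }
    print([hex(b) for b in critical_bits(layout)])
```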

# **Radiation Testing**

- There is nothing like real data to f' up a great theory
  - Richard Katz, NASA Office of Logic Design, circa 1995
- Heavy Ion Testing
  - Using fission products (e.g. Californium 252) [13]
  - Cyclotron, e.g. UCL [15]







- Other Radiation Testing
  - Proton testing e.g. PSI [14]
    - Protons penetrate silicon  $\rightarrow$  backside irradiation, suitable for flip-chip
  - Neutron Testing, interesting for ground and aircraft applications




## **Fault Simulation and Emulation**

- Fault injection to user flip-flops (but not configuration memory)
  - SST, an SEU simulation tool [16]
  - FT-Unshades for user flip-flops and memory [17]
- Fault injection to configuration memory by FPGA emulation
  - The FLIPPER test system [18]
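
A fault injection campaign on user flip-flops can be sketched as: run a golden reference, repeat the run with one bit flipped at a random cycle, and count mismatches. The model below is a toy 8-bit accumulator and does not represent how SST, FT-Unshades or FLIPPER work internally; it covers user registers only, not configuration memory.

```python
import random

WIDTH = 8
MASK = (1 << WIDTH) - 1


def vote(a, b, c):
    """Bitwise 2-out-of-3 majority vote."""
    return (a & b) | (b & c) | (a & c)


def run(inputs, tmr, fault=None):
    """Accumulate inputs; fault = (cycle, replica, bit) flips one register bit."""
    regs = [0, 0, 0] if tmr else [0]
    for cycle, x in enumerate(inputs):
        if fault is not None and fault[0] == cycle:
            regs[fault[1] % len(regs)] ^= 1 << fault[2]
        state = vote(*regs) if tmr else regs[0]
        regs = [(state + x) & MASK] * len(regs)
    return vote(*regs) if tmr else regs[0]


def campaign(inputs, tmr, n_injections, seed=0):
    """Return how many single-bit injections change the final output."""
    rng = random.Random(seed)
    golden = run(inputs, tmr)
    failures = 0
    for _ in range(n_injections):
        fault = (rng.randrange(len(inputs)), rng.randrange(3), rng.randrange(WIDTH))
        if run(inputs, tmr, fault) != golden:
            failures += 1
    return failures


if __name__ == "__main__":
    stimulus = list(range(1, 21))
    print("plain:", campaign(stimulus, tmr=False, n_injections=100))
    print("TMR:  ", campaign(stimulus, tmr=True, n_injections=100))
```

In this toy design every flip in the unprotected accumulator persists to the output, while the voted version masks all single flips; real designs fall somewhere in between.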





[Figure: fault injection results; axis label: Injections (Cfg bit upsets)]






# **Selection of a Mitigation Strategy**

- SEE mitigation has area and performance overhead
- Trade-off between cost and fault tolerance
  - Same hardening scheme for the complete design is easiest to implement
  - Selective hardening of critical parts is often the only acceptable solution
  - Life time requirement of applications can be very different






## **SRAM FPGA in Space Projects**

- FPGA are flying on several, mostly US space missions [21]
- Various mitigation schemes are used
  - Many of them use device level redundancy
  - Most of them involve configuration readback or scrubbing
  - Example: Pyro module on the Mars Exploration Rover, launched 2003



## SEE in non-space applications

### Increasing SEE awareness also in non-space designs

- High reliability products: Avionics, Networking, Medical
- Radiation is different (neutrons and alpha particles)
- Functional effects are the same as in space
- Several companies are affected by SEE effects [22]
  - Recall of Sun Enterprise servers (late 90's)
  - CISCO SEU application note for network products

### Neutron Testing shows non-negligible error rates [23]

| Vendor | Device   | Configuration Upsets (SEUs) | Logic Errors (SEFIs) | Logic Error FIT Rate, Sea Level | Logic Error FIT Rate, 5,000 ft. | Logic Error FIT Rate, 30,000 ft. | Logic Error FIT Rate, 60,000 ft. |
|--------|----------|-----------------------------|----------------------|---------------------------------|---------------------------------|----------------------------------|----------------------------------|
| Actel  | AX1000   | Not Measured                | 0 3                  | <0.08 FITs                      | <0.28 FITs                      | <12 FITs                         | <39 FITs                         |
| Actel  | APA1000  | Not Measured                | 0                    | <0.04 FITs                      | <0.13 FITs                      | <5.6 FITs                        | <18 FITs                         |
| Xilinx | XC2V3000 | 3,459                       | 349                  | 1,150 FITs                      | 3,900 FITs                      | 170,000 FITs                     | 540,000 FITs                     |
| Xilinx | XC3S1000 | 1,936                       | 405                  | 320 FITs                        | 1,100 FITs                      | 47,000 FITs                      | 150,000 FITs                     |
| Altera | EP1C20   | Not Measured                | 453                  | 460 FITs                        | 1,600 FITs                      | 67,000 FITs                      | 220,000 FITs                     |

Table 1: Neutron Test Results (iRoC Testing at LANSCE)


# **SEE Mitigation for NoC**

- Analysis and trade-off required, as for any other design
  - Criticality, area and performance overhead
- A NoC can also be protected by XTMR
  - Block memories should be protected by EDAC and scrubbing
  - The > 3x area overhead may be tolerated if NoC is not too large
  - But do we really need it?

Alternatives:

- Measures at protocol level
  - » Use acknowledgement, retransmission and timeout mechanism
  - » "Running TCP instead of UDP"?
  - » Temporal redundancy: send packets twice and compare

### Error detection and recovery

- » Parity bits on all registers in the data path
- » Reset Network when error detected
- » Resend all ongoing packets
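
The protocol-level alternative can be sketched as a sender that attaches a parity bit and retransmits until the receiver sees a consistent word. Everything here is a hypothetical model: the channel is a plain function, and the "acknowledgement" is the parity check itself.

```python
def parity(word):
    """Even-parity bit of an integer word."""
    return bin(word).count("1") & 1


def send_with_retry(payload, channel, max_retries=3):
    """Send payload with parity; retransmit on mismatch.

    Returns (received_word, number_of_retransmissions).
    """
    for attempt in range(max_retries + 1):
        received, received_parity = channel(payload, parity(payload))
        if parity(received) == received_parity:
            return received, attempt
    raise TimeoutError("link did not deliver an error-free word")


def make_glitchy_channel(corrupt_first_n):
    """Hypothetical link that flips bit 0 of the first n transmissions."""
    state = {"sent": 0}

    def channel(word, parity_bit):
        state["sent"] += 1
        if state["sent"] <= corrupt_first_n:
            word ^= 1
        return word, parity_bit

    return channel


if __name__ == "__main__":
    print(send_with_retry(0xA5, make_glitchy_channel(1)))
```

A single parity bit only detects an odd number of flipped bits per word, so this trades area for coverage compared with XTMR.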

# Conclusion

- Single Event Effects are real, even on ground
  - They are serious for high-reliability applications
- SEE effects increase in smaller technology (<= 65 nm)
  - Redundancy remains applicable, but may need enhancement
  - Upcoming dedicated rad-hard FPGA designs
- Mitigation requires careful analysis, trade-off and verification
- Use scrubbing on configuration memory
- (X)TMR gives good protection and is easy to implement
  - But it has huge overheads if applied to complete systems
  - Alternatives or partial hardening may be preferred
- SEU hardening of the NoC infrastructure
  - A NoC can also be protected by XTMR
  - Alternatives using smart protocols

## **Questions?**

# References/Links (1)

### [1] Melanie Berg: Design for Radiation Effects

http://nepp.nasa.gov/mapId\_2008/presentations/i/01%20-%20Berg\_Melanie\_mapId08\_pres\_1.pdf

### [2] Single-Event Upset Mitigation Selection Guide, Xilinx Application Note XAPP987

http://www.xilinx.com/support/documentation/application\_notes/xapp987.pdf

### [3] CREME96: Cosmic Ray Effects on Micro-Electronics

https://creme96.nrl.navy.mil/

### [4] Sandi Habinc: Functional Triple Modular Redundancy (FTMR)

http://microelectronics.esa.int/techno/fpga\_003\_01-0-2.pdf

### [5] The Xilinx TMRTool

http://www.xilinx.com/ise/optional\_prod/tmrtool.htm

### [6] Xilinx Application Notes concerning SEU mitigation in Virtex-II/Virtex-4

http://www.xilinx.com/support/documentation/application\_notes/xapp987.pdf

http://www.xilinx.com/support/documentation/application\_notes/xapp779.pdf

http://www.xilinx.com/support/documentation/application\_notes/xapp988.pdf

### [7] A new reliability-oriented place and route algorithm for SRAM-based FPGAs, Sterpone, Luca; Violante, Massimo

IEEE Transactions on Computers, Volume 55, Issue 6, June 2006

http://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=12&year=2006

### [8] The Atmel ATF280E Advance Information

http://www.atmel.com/dyn/resources/prod\_documents/doc7750.pdf


# References/Links (2)

### [9] The Xilinx SEU Immune Reconfigurable FPGA (SIRF) project

http://klabs.org/mapId05/presento/176\_bogrow\_p.ppt

### [10] Actel Rad Tolerant ProASIC3

http://www.actel.com/products/milaero/rtpa3/default.aspx

### [11] Super Computer for Space (SCS750), Maxwell, ESCCON 2002

http://www.maxwell.com/microelectronics/support/presentations/ESCCON\_2002.pdf


### [12] Selective Triple Modular Redundancy for SEU Mitigation in FPGAs, Praveen Kumar Samudrala, Jeremy Ramos, and Srinivas Katkoori

http://www.klabs.org/richcontent/MAPLDCon03/abstracts/samudrala\_a.pdf

### [13] The CASE System, Californium 252 radiation facility at ESTEC

https://escies.org/ReadArticle?docId=252

### [14] PIF, the Proton Irradiation Facility at Paul Scherrer Institute, Switzerland

http://pif.web.psi.ch/

### [15] HIF, Heavy Ion Facility at University of Louvain-la-Neuve, Belgium

http://www.cyc.ucl.ac.be/HIF/HIF.html

### [16] SST: The SEU Simulation Tool

http://microelectronics.esa.int/asic/SST-FunctionalDescription1-3.pdf

http://www.nebrija.es/~jmaestro/esa/sst.htm


# References/Links (3)

### [17] FT-Unshades, a Xilinx-based SEU emulator

http://microelectronics.esa.int/mpd2004/FT-UNSHADES\_presentation\_v2.pdf

### [18] The FLIPPER SEU test system

http://microelectronics.esa.int/finalreport/Flipper\_Executive\_Summary.pdf

http://microelectronics.esa.int/techno/Flipper\_ProductSheet.pdf

### [19] Simon Schulz, Giovanni Beltrame, David Merodio Codinachs: Smart Behavioural Netlist Simulation for SEU Protection Verification

http://microelectronics.esa.int/papers/SimonSchulzInFault.pdf

### [20] Static and Dynamic Analysis of SEU effects in SRAM-based FPGAs

L. Sterpone, M. Violante, European Test Symposium ETS2007

### [21] Xilinx space flight heritage, NASA GSFC, June 2006

http://nepp.nasa.gov/DocUploads/6466B702-93C3-4E3E-928BBD09A24CF7FA/Xilinx%20Flight%20Heritage\_NASA\_GSFC.ppt

### [22] Cosmic Radiation comes to ASIC and SOC Design, EDN, May 12, 2005

http://www.edn.com/contents/images/529381.pdf

### [23] Overview of iRoC Technologies' Report

"Radiation Results of the SER Test of Actel, Xilinx and Altera FPGA Instances"

http://www.actel.com/documents/OverviewRadResultsIROC.pdf

### [24] Actel Core generator

http://www.actel.com/documents/EDAC\_AN.pdf
