# Enabling Effective Error Mitigation in Modern Memory Chips that Use On-Die ECC

#### Minesh Patel

Doctoral Examination, 1 October 2021

#### **Advisor:**

Onur Mutlu (ETH Zürich)

#### **Co-Examiners:**

- Mattan Erez (UT Austin)
- Moinuddin Qureshi (Georgia Tech)
- Vilas Sridharan (AMD)
- Christian Weis (TU Kaiserslautern)

#### "Separation of Concerns" between manufacturers

**Processor** and **Main Memory (DRAM)**

Enables each party to solve their own design challenges

#### **Challenge:** DRAM suffers from errors that cause data loss or system failure if ignored

Manufacturers' primary goal is to increase storage density, but this **exacerbates** errors

1. Increases costs for manufacturers and consumers
2. Limits systems' overall potential for growth

## Solution: Error Mitigation Techniques

**Recently**, DRAM manufacturers started using on-die error-correcting codes (on-die ECC)

#### Simple and low-cost

- Preserves **trade secrets** of DRAM manufacturers
- **Convenient** for many commodity systems
- Maintaining low costs means a **limited** error correction capability
- Partial error correction can **complicate** system design and test

## **Problems Introduced by On-Die ECC**

On-die ECC negatively impacts system design and test efforts

1. Predictable and/or well-understood errors due to physical processes
2. **Unknown filtration** from on-die ECC partially correcting the errors
3. Unpredictable, obfuscated errors that are hard to understand or reason about

## Parties Impacted by Obfuscated Errors

Anyone who must understand error characteristics in the course of their work is potentially affected

#### **Error-Mitigation Designers**

Forced to make **limiting assumptions** (e.g., worst-case behavior) that lead to **inefficient designs**

#### **Third-Party Testers**

Hard to **debug observed errors** because on-die ECC conceals the **underlying cause**

#### **Research Scientists**

**Experimental studies** of DRAM technology characteristics polluted by **on-die ECC artifacts**

#### **Thesis Statement**
- Exploit the interaction between **on-die ECC** and the **statistical characteristics** of memory errors
- We can use new memory testing techniques to recover the error characteristics that on-die ECC obfuscates
- Enable scientists and engineers to make **informed decisions** towards building robust systems

## **Thesis Statement (Verbatim)**

#### 1.3 Thesis Statement

Our approach is encompassed by the following thesis statement:

The error characteristics that on-die ECC obfuscates can be recovered using new memory testing techniques that exploit the interaction between on-die ECC and the statistical characteristics of memory error mechanisms to expose physical cell behavior, thereby enabling scientists and engineers to make informed decisions towards building smarter and more robust systems.

#### **Core Contributions**

#### **DRAM Cell**

Stores one bit of data

#### **Data Encoding**

#### DRAM cells **leak charge** over time

- Refresh periodically restores the charge of all cells to prevent data-retention errors
- Significant performance and energy overhead

**REAPER (ISCA'17)**

## Making Refresh More Efficient

Only a few cells require frequent refreshing

Goal: quickly and efficiently identify the error-prone cells

## **Experimental Error Characterization**

We study the data-retention error characteristics in 368 real LPDDR4 DRAM chips

1. Cells are **more likely** to fail at an **increased** (1) refresh interval; or (2) temperature
2. Profiling involves a complex **tradeoff space**: (1) **speed**; (2) **coverage**; and (3) **false positives**

## **Reach Profiling**

[Figure: reach profiling; axis label: refresh interval]

## **Evaluating Reach Profiling**

1. Reach profiling is 2.5x faster than the state-of-the-art baseline for 99% coverage and a 50% false positive rate
   - Even faster (>3.5x) with more false positives (>100%)
2. Enables operating at **longer refresh intervals** by reducing the overall profiling overhead
   - 16.3% end-to-end performance improvement
   - 36.4% **DRAM power** reduction

#### **Core Contributions**

#### **Third-Party DRAM Users**

**Test Engineers** and **Research Scientists**

Study **DRAM errors** to understand a DRAM chip's **reliability characteristics**

- 'Weak' cell locations?
- Inter-chip variation?
- Temperature dependence?
- Statistical error properties?
- Minimum operating timings?

Gain **exploitable insights** to improve performance, reliability, etc.

#### **On-Die ECC Interferes with Studying Errors**

#### **Well-Understood Error Distributions**

- Based on physical properties of DRAM
- Easy to reason about and understand

#### **Unpredictable Error Distributions**

- **Dependent on ECC implementation**
- Hard to reason about and predict

#### Our goal: Recover the **error characteristics** that on-die ECC **obfuscates**

## **Key Idea: Statistical Inference**

## **EIN: Error Inference Methodology**

- **Choose Experimental Setup**: e.g., testing parameters, DRAM chips
- **Simulate Suspected ECCs**: e.g., Hamming, BCH, etc.
4. **Perform Inference**: maximum a posteriori (MAP) estimation

## **Applying EIN to Real Chips**

- Apply EIN to 314 real LPDDR4 DRAM chips
- Show that EIN can infer both:
  - The ECC scheme to be a (136, 128) Hamming code
  - Raw bit error rates of data-retention errors
- EIN works without:
  - Visibility into the ECC mechanism
  - Disabling ECC
  - Tampering with the hardware

#### **Core Contributions**

- REAPER (ISCA'17)
- EIN (DSN'19)
- BEER (MICRO'20)
- HARP (MICRO'21)
- Recommendations

#### Our goal: Determine exactly how on-die ECC obfuscates errors (i.e., its parity-check matrix)

- BEER: Reveals how on-die ECC scrambles errors
- BEEP: Enables inferring raw bit error locations

#### **Key idea:** disabling DRAM refresh induces data-retention errors only in CHARGED cells

**Data-Retention Error**: CHARGED → DISCHARGED

#### We can **selectively** induce errors by **controlling** bit-flip directions

## **BEER Testing Methodology**

## **Using BEER in Practice**

- BEER determines the parity-check matrix without:
  - (1) hardware support or tools
  - (2) prior knowledge about on-die ECC
  - (3) access to ECC metadata (e.g., syndromes)

Open-source C++ tool on GitHub: https://github.com/CMU-SAFARI/BEER

#### **Two-Part Evaluation**

#### **Experimental demonstration**

80 LPDDR4 DRAM chips (3 major manufacturers)

1. Different manufacturers appear to use different parity-check matrices
2. Chips of the same model appear to use identical parity-check matrices

#### Simulation of correctness and practicality

Over 100,000 representative ECC codes of varying word lengths (4 to 247 bits)

1. BEER works for all simulated test cases
2. BEER is practical in both runtime and memory usage

#### **Core Contributions**

#### Profiling a Memory Chip with On-Die ECC

[Figure label: Unreliable Memory]

Goal: understand and address any challenges that on-die ECC introduces for error profiling

#### Challenges Introduced by On-Die ECC

1. **Exponentially increases** the total number of at-risk bits
2. Makes it harder to identify individual at-risk bits
3. **Interferes** with commonly-used data patterns for memory testing

#### **Key Observation: Two Sources of Errors**

1. Direct error
   - Due to errors in the **memory chip**
2. Indirect error
   - Artifact of the on-die ECC algorithm
   - **Upper-bounded** by the ECC algorithm

### **Key Idea**: **decouple** profiling for **direct** and **indirect** errors

#### **Hybrid Active-Reactive Profiling (HARP)**

**Active Profiling**

Quickly identifies all direct errors with existing profiling techniques using an on-die ECC bypass path

[Figure labels: Memory Controller, Active Profiler, Memory Chip, On-Die ECC, bypass]

**Reactive Profiling**

**Safely** identifies indirect errors using secondary ECC at least as strong as on-die ECC

#### HARP reduces the problem of profiling with on-die ECC to profiling without on-die ECC

#### **Evaluations**

1. HARP improves coverage and performance relative to two state-of-the-art baseline profiling algorithms
   - E.g., 20.6-62.1% faster to achieve 99th-percentile coverage for 2-5 raw-bit errors per on-die ECC word
2. HARP **outperforms** the best-performing baseline in a case study of mitigating data-retention errors
   - E.g., 3.7x faster given a per-bit error probability of 0.75

#### We conclude that HARP overcomes all three profiling challenges

#### **Core Contributions**

#### Many Ways to Exploit Commodity DRAM

- Reduce timing/voltage margins (e.g., access and refresh timings)
- Use system-level error mitigations (e.g., ECC, redundancy, replication)
- Use security enhancements (e.g., RowHammer and cold-boot defenses)

Targets: cost, **security**, reliability, **performance**, **energy/power**

Unfortunately, adopting these proposals typically relies on unavailable information about DRAM reliability characteristics (e.g., design characteristics, testing practices, error behavior)

#### Source of the Problem

Commodity DRAM specifications do not provide this information by design

- Unfortunately, the opportunity cost of preserving this status quo is increasing
  - Technology scaling exacerbates refresh, RowHammer, etc.
  - Many old and new proposals for leveraging this opportunity

Proposal: revisit DRAM specifications to improve information transparency

#### **Two-Step Plan for Transparency**

- No change to DRAM hardware or design
- Just provide information so that system designers can make better-informed decisions and reason about their designs

#### 1. Short-term: convey basic information

- Whatever the manufacturers feel is practical to provide
- E.g., basic design properties that can be reverse-engineered

#### 2. Long-term: rethink DRAM standards

- Incorporate transparency of reliability-related topics
- E.g., error models, testing guidelines

#### **Core Contributions**

#### **Thesis Statement**

In a **general** sense in the recommendations

#### **Future Research Directions**

- Extending the techniques that we propose to other:
  - ECC types and error mechanisms
  - ECC architectures and memory technologies
- Using the information that our techniques reveal
  - Improved system-level error mitigations
  - Error-tolerant computing
- Devising alternatives to on-die ECC
  - Different on-die ECC architectures
  - System-level error mitigation mechanisms
- Improving transparency of DRAM reliability

#### My Involvements During the Ph.D.

#### Acknowledgments

- Onur Mutlu
- Mattan Erez, Moinuddin Qureshi, Vilas Sridharan, and Christian Weis
- SAFARI group members (from both CMU and ETH)
- Internship mentors and industry sponsors
- Friends and colleagues all over the world
- Family: mum, dad, and sister
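
#### Illustrative Sketch: Direct and Indirect Errors

The distinction between direct and indirect errors in the HARP slides can be illustrated with a small single-error-correcting code. The following is a minimal sketch under stated assumptions, not the on-die ECC of any real chip: it assumes a (7, 4) Hamming code (much shorter than the (136, 128) code that EIN infers), and the helper names `syndrome`, `encode`, `correct`, and `post_correction_errors` are hypothetical. It suggests how one raw error is corrected, while two raw errors cause the decoder to flip a third bit that never failed in the memory chip.

```python
# Assumption: a (7, 4) Hamming code whose parity-check column for position i
# (1-indexed) is the binary representation of i. Illustration only.

def syndrome(word):
    """XOR of the 1-indexed positions of all set bits (0 means no error seen)."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s

def encode(data):
    """Place 4 data bits at positions 3, 5, 6, 7 and set parity at 1, 2, 4."""
    word = [0] * 7
    for pos, bit in zip((3, 5, 6, 7), data):
        word[pos - 1] = bit
    s = syndrome(word)
    for p in (1, 2, 4):
        if s & p:
            word[p - 1] = 1  # each parity bit cancels one bit of the syndrome
    return word

def correct(word):
    """Single-error correction: flip the bit that the syndrome points at."""
    fixed = list(word)
    s = syndrome(fixed)
    if s:
        fixed[s - 1] ^= 1
    return fixed

def post_correction_errors(codeword, raw_error_positions):
    """Inject raw bit flips, decode, and list the positions that are still wrong."""
    stored = list(codeword)
    for pos in raw_error_positions:
        stored[pos - 1] ^= 1
    read = correct(stored)
    return [i + 1 for i in range(7) if read[i] != codeword[i]]

codeword = encode([1, 0, 1, 1])
print(post_correction_errors(codeword, [5]))     # [] : one raw error is corrected
print(post_correction_errors(codeword, [3, 5]))  # [3, 5, 6] : position 6 is indirect
```

In the second call, positions 3 and 5 are direct errors, while position 6 is an indirect error introduced by the decoder's miscorrection. Because this decoder flips at most one bit per word, it adds at most one indirect error per read, which is in line with the slides' statement that indirect errors are upper-bounded by the ECC algorithm.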