BEN GRAS @BJG, KAVEH RAZAVI @KAVEHRAZAVI, CRISTIANO GIUFFRIDA, HERBERT BOS
VRIJE UNIVERSITEIT AMSTERDAM
HARDWEAR.IO 2018

## VUSEC
- Security Research group at VU Amsterdam
- Academic group researching systems software security
- We do software hardening, exploitation
- Hardware attacks, side channels

## OVERVIEW
- Cache attacks
- Cache defences
- TLBleed
- Evaluation
- Reception

# CACHE ATTACKS

## SIDE CHANNELS
Leak secrets outside the regular interface

## INSIDE A CPU

## SIDE CHANNEL ATTACKS ON SHARED RESOURCES
- There are **shared** resources between processes
- RAM, CPU cache, TLB, computational resources ...
- Covert channels
- Sometimes: Side channels (spying)

- AES implementations that use T tables can be attacked
- Each lookup is $T_j[x_i]$ with index $x_i = p_i \oplus k_i$
- $p_i$ is a plaintext byte, $k_i$ a key byte
- Again: secrets are betrayed by memory accesses
- Known plaintext + accesses = key recovery

Not side channel proof version:

```
void
_gcry_mpi_ec_mul_point (mpi_point_t result,
                        gcry_mpi_t scalar, mpi_point_t point,
                        mpi_ec_t ctx)
{
  ...
  for (j=nbits-1; j >= 0; j--)
    {
      /* DOUBLE runs on every iteration */
      _gcry_mpi_ec_dup_point (result, result, ctx);
      /* ADD runs only when the scalar bit is 1 */
      if (mpi_test_bit (scalar, j))
        _gcry_mpi_ec_add_points(result,result,point,ctx);
    }
  ...
}
```

More side channel proof version: [listing missing from source]

# CACHE DEFENCES

## CACHE PARTITIONING
- Figure out page colors
- These map to shared cache sets
- Do not share the same colors across security boundaries
- Kernel arranges this

- Intel CAT: Cache Allocation Technology
  - Intended for predictable performance for VMs
  - Partitions caches by ways
  - Hardware feature

- Intel TSX: Transactional Synchronization Extensions
  - Intended for hardware transactional memory
  - Transaction working set should fit in cache, otherwise auto-abort
  - We can use this as a defence

# TLBLEED
- Other structures than cache shared between threads?
- What about the TLB?
- Documented: TLB has L1iTLB, L1dTLB, and L2TLB
- They have sets and ways
- Not documented: structure

## TLB IS JUST ANOTHER CACHE
- Let's experiment with performance counters
- Try linear structure first
- Try all combinations of ways (set size) and sets (stride)
- The smallest number of ways that produces misses is the actual way count
- The smallest corresponding stride is the number of sets

- For L2TLB: We reverse engineered a more complex hash function
- Skylake XORs 14 bits, Broadwell XORs 16 bits
- The hash can be represented as a matrix, using modulo 2 arithmetic

- Let's experiment with performance counters
- Now we know the structure. Are TLBs shared between hyperthreads?
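The two kinds of set mapping described above can be sketched in C. This is a minimal sketch under stated assumptions, not the authors' tooling: it assumes 4 KiB pages, a power-of-two number of sets, and that the XOR hash takes 2n bits of the virtual page number and XORs the lower n bits with the next n bits to form an n-bit set index (14 bits for n = 7, 16 bits for n = 8). The names `tlb_set_linear` and `tlb_set_xor` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u /* assumption: 4 KiB pages */

/* Linear mapping: the set index is the low bits of the virtual page number.
 * num_sets must be a power of two. */
static inline uint32_t tlb_set_linear(uint64_t vaddr, uint32_t num_sets)
{
    assert(num_sets != 0 && (num_sets & (num_sets - 1)) == 0);
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    return (uint32_t)(vpn & (num_sets - 1));
}

/* XOR mapping (assumed form): fold the low 2n page-number bits into an
 * n-bit set index by XORing the lower n bits with the next n bits. */
static inline uint32_t tlb_set_xor(uint64_t vaddr, unsigned n)
{
    assert(n > 0 && n < 16);
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    uint32_t mask = (1u << n) - 1u;
    return (uint32_t)((vpn & mask) ^ ((vpn >> n) & mask));
}
```

Under these assumptions, two pages collide in a linear TLB when their page numbers differ by a multiple of the number of sets, and in an XOR-hashed TLB when the two n-bit halves of their page numbers XOR to the same value. For example, with n = 7, page numbers 0x00 and 0x81 both map to set 0.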
- Let's experiment with misses when accessing the same set
- We find more TLB properties
- Size, structure, sharing, miss penalty, hash function

| | | L1 dTLB | | | | | L1 iTLB | | | | | L2 sTLB | | | | |
|---------------|------|-----|---|-----|-----|-----|-----|---|------|-----|-----|-----|----|-------|-------|-----|
| Name | year | set | W | pn | hsh | shr | set | W | pn | hsh | shr | set | W | pn | hsh | shr |
| Sandybridge | 2011 | 16 | 4 | 7.0 | lin | ✓ | 16 | 4 | 50.0 | lin | ✗ | 128 | 4 | 16.3 | lin | ✓ |
| Ivybridge | 2012 | 16 | 4 | 7.1 | lin | ✓ | 16 | 4 | 49.4 | lin | ✗ | 128 | 4 | 18.0 | lin | ✓ |
| Haswell | 2013 | 16 | 4 | 8.0 | lin | ✓ | 8 | 8 | 27.4 | lin | ✗ | 128 | 8 | 17.1 | lin | ✓ |
| HaswellXeon | 2014 | 16 | 4 | 7.9 | lin | ✓ | 8 | 8 | 28.5 | lin | ✗ | 128 | 8 | 16.8 | lin | ✓ |
| Skylake | 2015 | 16 | 4 | 9.0 | lin | ✓ | 8 | 8 | 2.0 | lin | ✗ | 128 | 12 | 212.0 | XOR-7 | ✓ |
| BroadwellXeon | 2016 | 16 | 4 | 8.0 | lin | ✓ | 8 | 8 | 18.2 | lin | ✗ | 256 | 6 | 272.4 | XOR-8 | ✓ |
| Coffeelake | 2017 | 16 | 4 | 9.1 | lin | ✓ | 8 | 8 | 26.3 | lin | ✗ | 128 | 12 | 230.3 | XOR-7 | ✓ |

Columns: set = number of sets, W = ways, pn = miss penalty, hsh = hash function (lin = linear), shr = shared between hyperthreads.

- Can we use only latency?
- Map many virtual addresses to the same physical page

- Let's observe EdDSA ECC key multiplication
- Scalar is secret and ADD only happens if there's a 1
- But: we cannot use code information! Only data!

```
void
_gcry_mpi_ec_mul_point (mpi_point_t result,
                        gcry_mpi_t scalar, mpi_point_t point,
                        mpi_ec_t ctx)
{
  ...
  for (j=nbits-1; j >= 0; j--)
    {
      /* DOUBLE runs on every iteration */
      _gcry_mpi_ec_dup_point (result, result, ctx);
      /* ADD runs only when the scalar bit is 1 */
      if (mpi_test_bit (scalar, j))
        _gcry_mpi_ec_add_points(result,result,point,ctx);
    }
  ...
}
```

- Let's find the spatial L1 DTLB separation
- There isn't any
- Too much activity in both blue/green cases

- Monitor a single TLB set and use temporal information
- Use machine learning (SVM classifier) to tell the difference

# EVALUATION

| Microarchitecture | Trials | Success | Median BF |
|-------------------|--------|---------|-----------|
| Skylake | 500 | 0.998 | $2^{1.6}$ |
| Broadwell | 500 | 0.982 | $2^{3.0}$ |
| Coffeelake | 500 | 0.998 | $2^{2.6}$ |
| Total | 1500 | 0.993 | |

- Single trace capture: 1 ms
- Median end-to-end time: 17 s

| Microarchitecture | Trials | Success | Median BF |
|-------------------|--------|---------|-----------|
| Broadwell (CAT) | 500 | 0.960 | $2^{2.6}$ |
| Broadwell | 500 | 0.982 | $2^{3.0}$ |

# RECEPTION
- Intel: same power as cache attacks
- OpenBSD disabled Intel HT
- Widespread media coverage, logo thanks to TheRegister
- Wikipedia

#### CVS: cvs.openbsd.org: src
Mark Kettenis | Tue, 19 Jun 2018 12:30:19 -0700

- CVSROOT:
- Module name:
- Changes by: kette...@cvs.openbsd.org 2018/06/19 13:29:52
- Modified files:
  - sys/arch/amd64/amd64: cpu.c
  - sys/arch/amd64/include: cpu.h
  - sys/kern: kern\_sched.c kern\_sysctl.c
  - sys/sys: sched.h sysctl.h
- Log message: SMT (Simultanious Multi Threading) implementations typically share TLBs and L1 caches between threads. This can make cache timing attacks a lot easier and we strongly suspect that this will make [...]

#### **TLBleed**
From Wikipedia, the free encyclopedia

**TLBleed** is a cryptographic side-channel attack that uses machine lea[...] simultaneous multithreading.<sup>[1][2]</sup> As of June 2018, the attack has only [...] vulnerable to a variant of the attack, but no proof of concept has been [...] The attack led to the OpenBSD project disabling simultaneous multithre[...] theoretically be prevented by preventing tasks with different security of [...]

#### References
1. Williams, Chris (2018-06-22). "Meet TLBleed: A crypto-key-leaking C[...]
2. Varghese, Sam (25 June 2018). "OpenBSD chief de Raadt says [...]

- Practical, reliable, high resolution side channels exist outside the cache
- They bypass defences
- @bjg @kavehrazavi
- @vu5ec
- www.vusec.net
- Thank you for listening