Database Collation Optimization In TiDB | TiFlash

In database systems, a collation specifies how data is sorted and compared: it defines the sorting rules and the case and accent sensitivity of string data.

In a database system, a CHARACTER SET (aka charset) is a set of characters together with their encodings, while a COLLATION defines the character ordering/comparison rules, case sensitivity, and related properties. See Character Sets and Collations in MySQL: MySQL supports multiple character sets, and each character set contains multiple collations (one of which is the default). The TiDB ecosystem defaults to the utf8mb4 character set with the utf8mb4_bin collation. For an OLAP engine like TiFlash, the extra abstraction introduced by character-set rules inevitably has a significant performance impact. This article walks through the process of optimizing collation handling and validates the results on the TPCH and ClickBench benchmarks.

Benchmark

Overall

  • TPCH-100: overall performance improved by about 5.55% (combined with LTO)
  • ClickBench: for the queries TiFlash already supports, overall performance improved by about 21.06%
  • Ossinsight (PingCAP internal test): for scenarios like the string comparison filters it uses, performance improved by about 46.58%~75.76%
  • The current optimizations mainly target BIN collations, with dedicated fast paths for TiDB's defaults UTF8MB4_BIN / UTF8_BIN
    • Note: in the ClickBench dataset, the URL, Title, Referer, SearchPhrase, OriginalURL, OpenstatCampaignID, UTMCampaign, and UTMContent columns all contain values ending with a space (0x20). Under current TiDB semantics the trailing spaces must be trimmed, so processing such data inevitably incurs extra overhead and the gains there are modest.
  • This round of optimization does not cover CI collations

TPCH-100

One TiFlash Store

  • TIFLASH x 1
    • memory limit in bytes: 207405139968
    • cpu cores quota: 40
  • TIFLASH REPLICA x 2
    • ALTER TABLE {...} COMPACT TIFLASH REPLICA
  • set @@tidb_enforce_mpp = on
  • Also compared against builds without the LTO optimization (introduced in TiDB v6.1.0); PGO is currently not applied automatically
  • Test program: go-tpc/query.go · pingcap/go-tpc
Time(s) | Original: rollback all PRs in pingcap/tiflash#5294 from commit a0f9865 | Optimized | Improvement: (Original) / (Optimized) - 1.0 | Note | Original + No LTO (Link Time Optimization) | Improvement: (Original + No LTO) / (Optimized) - 1.0
Q1 | 9.09 | 8.42 | 7.96% | AGG() by multi STR; COLLATION | 10 | 18.76%
Q2 | 2.45 | 2.38 | 2.94% | | 2.52 | 5.88%
Q3 | 5.6 | 5.47 | 2.38% | | 5.6 | 2.38%
Q4 | 6.14 | 6.07 | 1.15% | | 6.24 | 2.80%
Q5 | 13.52 | 13.05 | 3.60% | | 13.52 | 3.60%
Q6 | 1.98 | 1.98 | 0.00% | | 2.01 | 1.52%
Q7 | 6.34 | 6.14 | 3.26% | | 6.51 | 6.03%
Q8 | 8.69 | 8.36 | 3.95% | | 8.93 | 6.82%
Q9 | 38.42 | 38.49 | -0.18% | | 38.82 | 0.86%
Q10 | 6.95 | 6.61 | 5.14% | | 7.58 | 14.67%
Q11 | 1.64 | 1.58 | 3.80% | | 1.71 | 8.23%
Q12 | 4.4 | 4.26 | 3.29% | | 4.46 | 4.69%
Q13 | 8.42 | 7.82 | 7.67% | LIKE(); COLLATION | 8.42 | 7.67%
Q14 | 2.11 | 2.11 | 0.00% | | 2.21 | 4.74%
Q15 | 4.46 | 4.46 | 0.00% | | 4.73 | 6.05%
Q16 | 2.25 | 2.11 | 6.64% | LIKE(); COLLATION | 2.28 | 8.06%
Q17 | 13.32 | 12.78 | 4.23% | | 13.32 | 4.23%
Q18 | 18.09 | 17.41 | 3.91% | | 18.49 | 6.20%
Q19 | 5.54 | 4.66 | 18.88% | COLLATION | 5.6 | 20.17%
Q20 | 2.99 | 2.92 | 2.40% | | 3.02 | 3.42%
Q21 | 24.73 | 24.26 | 1.94% | | 25.4 | 4.70%
Q22 | 1.85 | 1.78 | 3.93% | | 1.91 | 7.30%
SUM | 188.98 | 183.12 | 3.20% | | 193.28 | 5.55%

Two TiFlash Store

  • TIFLASH x 2
    • memory limit in bytes: 207405139968
    • cpu cores quota: 40
  • TIFLASH REPLICA x 2
    • ALTER TABLE {...} COMPACT TIFLASH REPLICA
  • set @@tidb_enforce_mpp = on
Time(s) | Original: rollback all PRs in pingcap/tiflash#5294 from commit a0f9865 | Optimized | Improvement: (Original) / (Optimized) - 1.0
Q1 | 5.4 | 4.97 | 8.65%
Q2 | 1.91 | 1.88 | 1.60%
Q3 | 4.46 | 4.4 | 1.36%
Q4 | 6.61 | 6.48 | 2.01%
Q5 | 11.54 | 11.11 | 3.87%
Q6 | 1.01 | 1.01 | 0.00%
Q7 | 4.9 | 4.73 | 3.59%
Q8 | 8.36 | 8.25 | 1.33%
Q9 | 30.97 | 30.06 | 3.03%
Q10 | 5.87 | 5.4 | 8.70%
Q11 | 1.51 | 1.44 | 4.86%
Q12 | 2.55 | 2.52 | 1.19%
Q13 | 5.57 | 5.34 | 4.31%
Q14 | 1.17 | 1.17 | 0.00%
Q15 | 2.18 | 2.21 | -1.36%
Q16 | 1.24 | 1.21 | 2.48%
Q17 | 9.5 | 9.63 | -1.35%
Q18 | 12.82 | 12.75 | 0.55%
Q19 | 2.99 | 2.52 | 18.65%
Q20 | 2.15 | 2.11 | 1.90%
Q21 | 16.74 | 16.27 | 2.89%
Q22 | 1.01 | 1.01 | 0.00%

ClickBench

  • TIFLASH x 1
    • memory limit in bytes: 207405139968
    • cpu cores quota: 40
  • TIFLASH REPLICA x 2
    • ALTER TABLE {...} COMPACT TIFLASH REPLICA
  • Data source: ClickBench
Time(s) | Original: rollback all PRs in pingcap/tiflash#5294 from commit a0f9865 | Optimized | Improvement: (Original) / (Optimized) - 1.0 | Use collation (Y = yes) | Note
Q1 | 0.276 | 0.277 | -0.36% | |
Q2 | 0.029 | 0.0301 | -3.65% | |
Q3 | 0.0675 | 0.0653 | 3.37% | |
Q4 | 0.1813 | 0.1788 | 1.40% | |
Q5 | 2.285 | 2.255 | 1.33% | |
Q6 | 1.46 | 1.36 | 7.35% | Y | Optimized GROUP BY on a single STR column
Q7 | 0.1689 | 0.1693 | -0.24% | |
Q8 | 0.0391 | 0.0381 | 2.62% | |
Q9 | 1.205 | 1.175 | 2.55% | |
Q10 | 2.005 | 2 | 0.25% | |
Q11 | 0.2722 | 0.2521 | 7.97% | Y | Short-string comparison filter; optimized MEM UTILS primitives
Q12 | 0.2936 | 0.2731 | 7.51% | Y | Same as Q11; optimized GROUP BY on multiple STR columns
Q13 | 1.03 | 0.9916 | 3.87% | Y |
Q14 | 1.96 | 1.87 | 4.81% | Y |
Q15 | 1.105 | 1.06 | 4.25% | Y |
Q16 | 1.08 | 1.025 | 5.37% | |
Q17 | 3.475 | 3.36 | 3.42% | Y |
Q18 | 2.865 | 2.77 | 3.43% | Y |
Q19 | 0 | 0 | ERROR: Out Of Memory Quota! | | extract cannot be pushed down to TiFlash
Q20 | 0.5935 | 0.5773 | 2.81% | |
Q21 | 3.45 | 0.8726 | 295.37% | Y | LIKE expressions: optimized string-search algorithm; AVX2 optimization of MEM UTILS primitives
Q22 | 3.57 | 1.0151 | 251.69% | Y |
Q23 | 6.645 | 1.815 | 266.12% | Y |
Q24 | 6.665 | 4.945 | 34.78% | Y | SELECT * … ORDER BY … LIMIT 10; currently reads the whole table, which dominates the runtime; for small LIMITs, late materialization (fetch primary keys from TiFlash first, then read the rows from TiKV) would help
Q25 | 0.4399 | 0.3675 | 19.70% | Y | Same as Q11
Q26 | 0.2029 | 0.1737 | 16.81% | Y | Same as Q11
Q27 | 0.4221 | 0.3731 | 13.13% | Y | Same as Q11; optimized multi-key sort
Q28 | 1.345 | 1.305 | 3.07% | Y |
Q29 | 0 | 0 | ERROR: Out Of Memory Quota! | Y | regexp_replace cannot be pushed down to TiFlash
Q30 | 9.655 | 9.54 | 1.21% | |
Q31 | 0.8385 | 0.7974 | 5.15% | Y |
Q32 | 1.195 | 1.185 | 0.84% | Y |
Q33 | 6.98 | 6.915 | 0.94% | |
Q34 | 6.16 | 5.945 | 3.62% | Y |
Q35 | 6.115 | 5.815 | 5.16% | Y |
Q36 | 1.385 | 1.37 | 1.09% | |
Q37 | 0.2158 | 0.2122 | 1.70% | Y |
Q38 | 0.1363 | 0.1328 | 2.64% | Y |
Q39 | 0.1134 | 0.1071 | 5.88% | Y |
Q40 | 0.4411 | 0.4261 | 3.52% | Y |
Q41 | 0.0754 | 0.0746 | 1.07% | |
Q42 | 0.0572 | 0.0565 | 1.24% | |
Q43 | 0.1397 | 0.1341 | 4.18% | |
SUM | 76.6384 | 63.3055 | 21.06% | |

[TBD] ClickBench Enhancement

Optimize Aggregation In ClickBench
  • Q11
    • SELECT MobilePhoneModel, COUNT(DISTINCT UserID) AS u FROM hits WHERE MobilePhoneModel <> '' GROUP BY MobilePhoneModel ORDER BY u DESC LIMIT 10;
    • GROUP BY: STR(MobilePhoneModel), INT(UserID)
  • Q12
    • SELECT MobilePhone, MobilePhoneModel, COUNT(DISTINCT UserID) AS u FROM hits WHERE MobilePhoneModel <> '' GROUP BY MobilePhone, MobilePhoneModel ORDER BY u DESC LIMIT 10;
    • GROUP BY: INT(MobilePhone), STR(MobilePhoneModel), INT(UserID)
  • Q14
    • SELECT SearchPhrase, COUNT(DISTINCT UserID) AS u FROM hits WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY u DESC LIMIT 10;
    • GROUP BY: STR(SearchPhrase), INT(UserID)
  • Q15
    • SELECT SearchEngineID, SearchPhrase, COUNT(*) AS c FROM hits WHERE SearchPhrase <> '' GROUP BY SearchEngineID, SearchPhrase ORDER BY c DESC LIMIT 10;
    • GROUP BY: INT(SearchEngineID), STR(SearchPhrase)
  • Q17, Q18
    • GROUP BY: INT(UserID), STR(SearchPhrase)
  • Q23
    • SELECT SearchPhrase, MIN(URL), MIN(Title), COUNT(*) AS c, COUNT(DISTINCT UserID) FROM hits WHERE Title LIKE '%Google%' AND URL NOT LIKE '%.google.%' AND SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;
    • GROUP BY: INT(UserID), STR(SearchPhrase)
    • GROUP BY is not the performance bottleneck here
  • Q34, Q35
    • SELECT URL, COUNT(*) AS c FROM hits GROUP BY URL ORDER BY c DESC LIMIT 10;
    • The classic top-n URL problem. TiFlash's plan aggregates after a multi-node Exchange; if only a single node is considered, the pipeline could be shortened and optimized further.
  • Q40
    • SELECT TraficSourceID, SearchEngineID, AdvEngineID, CASE WHEN (SearchEngineID = 0 AND AdvEngineID = 0) THEN Referer ELSE '' END AS Src, URL AS Dst, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND IsRefresh = 0 GROUP BY TraficSourceID, SearchEngineID, AdvEngineID, Src, Dst ORDER BY PageViews DESC LIMIT 10 OFFSET 1000;
    • GROUP BY: TraficSourceID, SearchEngineID, AdvEngineID, Src, Dst
Late Materialization
  • Q24: SELECT * ... ORDER BY ... LIMIT ... currently reads the whole table, which dominates the runtime; for small LIMITs, late materialization applies (fetch primary keys from TiFlash first, then read the rows from TiKV)
  • Rewritten as select * from hits where _tidb_rowid in (select _tidb_rowid FROM hits WHERE URL LIKE '%google%' ORDER BY EventTime LIMIT 10);
    • Query time drops from 5.07s to 1.28s
    • A 296.09% performance improvement
JIT
  • Mainly targets compute tasks such as aggregations and joins

Ossinsight (PingCAP Internal Test)

  • Simulates SQL similar to what Ossinsight runs, using the TPCH-100 environment
  • This round of optimization does not cover CI collations, so the scenarios where Ossinsight uses UNICODE CI collations in TiFlash benefit little
    • In fact Ossinsight misuses UNICODE collations; it should be using the UTF8 BIN collations
Time(s) | Original: rollback all PRs in pingcap/tiflash#5294 from commit a0f9865 | Optimized | Improvement: (Original) / (Optimized) - 1.0 | Note
Ossinsight-like scenario: string comparison filter (simulated on the tpch-100 dataset)
select count(1) from lineitem where L_SHIPMODE = 'zzzz'; | 1.07 | 0.73 | 46.58% | Optimized short-string comparison filter: varchar(utf8mb4_bin) vs const char(utf8mb4_bin)
select count(1) from lineitem where L_RETURNFLAG = 'R'; | 1.16 | 0.66 | 75.76% |
String sorting
select min(L_SHIPMODE) from lineitem; | 0.93 | 0.76 | 22.37% | AVX2 optimization of the memcmp primitive
select max(L_SHIPMODE) from lineitem; | 1.11 | 0.84 | 32.14% |

Optimization Process

Improve The Performance Of New Collation Related Functions And Executors

Optimize Multi-Key Sort

tiflash#5908

  • The key is devirtualization. The single-key case is straightforward; for multiple keys the options are JIT or manual unrolling. Here templates are used: when there are exactly 2 keys, the comparison is unrolled for a few common types (a minimal sketch follows this list):
    • UInt64
    • Int64
    • StringBin
    • StringBinPadding
    • StringWithCollatorGeneric
  • TODO: with 3 or more keys this approach suffers from severe template bloat; fast paths may need to be added on demand.
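To make the shape of this concrete, here is a minimal C++ sketch of devirtualizing a two-key comparator (illustrative types and names, not TiFlash's actual code): each concrete (K1, K2) pair gets its own template instantiation, so the hot loop contains no virtual calls.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Per-type key accessors; compare() is a plain inlineable method, not virtual.
struct UInt64Key
{
    std::vector<uint64_t> data;
    int compare(size_t a, size_t b) const { return (data[a] > data[b]) - (data[a] < data[b]); }
};

struct StringBinKey
{
    std::vector<std::string_view> data;
    int compare(size_t a, size_t b) const { return data[a].compare(data[b]); }
};

// One template instantiation per concrete (K1, K2) pair: both compare()
// calls can be inlined into the sort loop, with no virtual dispatch.
template <typename K1, typename K2>
void sortByTwoKeys(std::vector<size_t> & perm, const K1 & k1, const K2 & k2)
{
    std::sort(perm.begin(), perm.end(), [&](size_t a, size_t b) {
        if (int res = k1.compare(a, b); res != 0)
            return res < 0;
        return k2.compare(a, b) < 0;
    });
}
```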

Benchmark

  • tpch-10/tpch-100/clickbench
  • tiflash x 1
  • original: 6d0cbc8
  • data: clickbench
  • limit cpu up to 200%

  • SQL
    • SELECT SearchPhrase FROM hits WHERE SearchPhrase <> '' ORDER BY EventTime, SearchPhrase LIMIT 10;
  • Sort keys: UINT64, STR (utf8 collator)
  • Sorting accounts for a small share of the cost
Time(s) Original Optimized
4.56 4.42
4.54 4.53
4.49 4.34
4.56 4.48
4.56 4.42
AVG 4.542 4.438
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 2.34%

  • limit cpu up to 2000%
  • SQL
    • SELECT ROW_NUMBER() OVER w1 FROM PART window w1 AS (PARTITION BY P_MFGR order by P_SIZE);
  • Sort keys: STR (utf8 collator), UINT64
  • Sorting accounts for a large share of the cost
Time(s) Original Optimized
8.16 7.18
8.22 6.94
8.29 7.35
8.34 7.04
8.24 7.37
AVG 8.25 7.176
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 14.97%

  • limit cpu up to 2000%
  • SQL
    • EXPLAIN analyze SELECT ROW_NUMBER() OVER w1 FROM PART window w1 AS (PARTITION BY p_name ORDER BY p_partkey);
  • Sort keys: INT64, STR (utf8 collator)
  • Sorting accounts for about 30% of the cost
Time(s) Original Optimized
9.71 9.28
9.62 9.41
9.76 9.25
9.73 9.26
9.67 9.28
AVG 9.698 9.296
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 4.32%

  • limit cpu up to 2000%
  • SQL
EXPLAIN analyze SELECT ROW_NUMBER() OVER w1 FROM PART window w1 AS (PARTITION BY P_MFGR ORDER BY p_name);
| └─Sort_13 | 20000000.00 | 20000000 | mpp[tiflash] | | tiflash_task:{time:17.6s, loops:309, threads:8} | tpch100_new.part.p_mfgr, tpch100_new.part.p_name, stream_count: 8 | N/A | N/A |
| └─ExchangeReceiver_12 | 20000000.00 | 20000000 | mpp[tiflash] | | tiflash_task:{time:7.18s, loops:1254, threads:8} | stream_count: 8 | N/A | N/A |
  • Sort keys: STR (utf8 collator), STR (utf8 collator)
  • Sorting accounts for a large share of the cost
Time(s): Original sort end | Original sort start | Original (sort_end - sort_start) | Optimized sort end | Optimized sort start | Optimized (sort_end - sort_start)
17.6 | 7.18 | 10.42 | 15.6 | 6.15 | 9.45
16.9 | 6.58 | 10.32 | 15.2 | 5.82 | 9.38
17.8 | 7.07 | 10.73 | 15.7 | 5.95 | 9.75
17.1 | 6.88 | 10.22 | 15.9 | 5.87 | 10.03
18.8 | 8.04 | 10.76 | 16 | 6.07 | 9.93
AVG: Original (sort_end - sort_start) 10.49, Optimized (sort_end - sort_start) 9.708
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 8.06%
Time(s) Original Optimized
18.3 16.34
18.27 16.62
18.7 16.57
18.57 16.52
18.55 16.63
AVG 18.478 16.536
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 11.74%

Optimize Aggregation And Join

tiflash#6135

tiflash#5834

  • The key is devirtualization. TiFlash and ClickHouse code already unrolls many cases by hand:
    • Single Agg/Join keys of common types, multiple integer keys, some Nullable keys, etc.
  • For complex multi-key cases the options are JIT or manual unrolling. ClickHouse already has JIT support for aggregation, disabled by default.
  • Here templates are used: when there are at most 2 keys, the code is unrolled for a few common types (a minimal sketch follows this list)
  • TODO: with 3 or more keys this approach suffers from severe template bloat; fast paths may need to be added on demand.
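A minimal sketch of the idea (hypothetical names, not TiFlash internals): fixing the key shape at compile time lets the hash map work on a concrete struct, so hashing and equality run without per-row virtual collator calls. For BIN collations the string part is just the (trimmed) raw bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// A fixed two-part aggregation key, e.g. (SearchEngineID, SearchPhrase).
// For BIN collations equality on the string part is plain byte comparison.
struct TwoPartKey
{
    uint64_t intKey;
    std::string strKey;

    bool operator==(const TwoPartKey & other) const
    {
        return intKey == other.intKey && strKey == other.strKey;
    }
};

struct TwoPartKeyHash
{
    size_t operator()(const TwoPartKey & k) const
    {
        size_t h = std::hash<uint64_t>{}(k.intKey);
        // Simple hash-combine; production code would use stronger mixing.
        return h ^ (std::hash<std::string>{}(k.strKey) + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2));
    }
};

// One map type per key shape; the hot insert/find loop is fully concrete.
using TwoKeyAggMap = std::unordered_map<TwoPartKey, uint64_t, TwoPartKeyHash>;
```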

Benchmark

  • limit cpu up to 2000%
  • TiFlash x 1
  • TIFLASH REPLICA x 1
  • original commit: 49ca973
Time(s) Original Optimized Improvement: (Original) / (Optimized) - 1.0
ClickBench
Q14 2.075 1.835 13.08%
Q15 1.21 1.155 4.76%
Q17 3.695 3.365 9.81%
Q18 2.925 2.805 4.28%
Q34 6.4 6.085 5.18%
Q35 6.575 6.28 4.70%
TPCH-100
Q1 7.28 7.08 2.82%

  • TiFlash x 1
  • Data: tpch-10
  • original a8c8cb1
  • limit cpu up to 500%
  • SQL: select max(l_comment) from lineitem;
Time(s) Original Optimized
7.71 7.05
7.75 6.96
7.95 7.07
7.61 7.23
7.83 7.07
AVG 7.77 7.076
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 9.81%

  • Data: tpch-100
  • limit cpu up to 2000%
  • SQL: tpch Q1
Time(s) Original Optimized
11.48 10.68
10.96 10.5
11.18 10.66
11.16 10.67
11.18 10.51
AVG 11.192 10.604
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 5.55%

CPU Cache Pitfalls

tiflash#5996

During internal testing, the storage module showed a performance regression (tiflash#5949). The root cause: commit dfac6a5 replaced the default memcpy with __folly_memcpy. The benchmark attached to that commit shows that __folly_memcpy performs worse than the previous implementation on small copies (size < 80 B), but such small copies are only a minor share of the storage workload, so in theory they should not have caused such a large regression.

Non-temporal Memory Copy

Reading the code shows that __folly_memcpy defines a threshold, NON_TEMPORAL_STORE_THRESHOLD (32768, i.e. 32 KB): if the copy size exceeds the threshold and both addresses are aligned, it uses NON-TEMPORAL stores to reduce CPU cache pollution.

Modern CPUs generally use multi-level caches. Taking the AMD Ryzen 9 5900X as an example, its parameters under WSL2 are:

getconf -a | grep CACHE
LEVEL1_ICACHE_SIZE 32768
LEVEL1_ICACHE_ASSOC 8
LEVEL1_ICACHE_LINESIZE 64
LEVEL1_DCACHE_SIZE 32768
LEVEL1_DCACHE_ASSOC 8
LEVEL1_DCACHE_LINESIZE 64
LEVEL2_CACHE_SIZE 524288
LEVEL2_CACHE_ASSOC 8
LEVEL2_CACHE_LINESIZE 64
LEVEL3_CACHE_SIZE 67108864
LEVEL3_CACHE_ASSOC 0
LEVEL3_CACHE_LINESIZE 64
LEVEL4_CACHE_SIZE 0
LEVEL4_CACHE_ASSOC 0
LEVEL4_CACHE_LINESIZE 0

Typically each core has its own L1 and L2 caches, while the L3 cache is shared by all cores. The L1 cache is split into L1i (instructions) and L1d (data); L2 and L3 do not distinguish instructions from data. Per 7-cpu.com/cpu/Zen2 and amd/microarchitectures/zen_3, the approximate data-cache latencies are:

L1 Data Cache Latency:
4 cycles for simple access via pointer
5 cycles for access with complex address calculation (size_t n, *p; n = p[n]).

AMD DOC: 7- or 8-cycle FPU load-to-use latency.

L2 Cache Latency = 12 cycles

L3 Cache Latency = 38 cycles

RAM Latency = 38 cycles + 66 ns

In the vast majority of cases, the multi-level cache significantly speeds up memory reads and writes. But for two known scenarios, keeping the data in the CPU cache brings little benefit:

  • Temporary copies, where the data will not be used again soon
  • Large copies, where the amount of data exceeds the CPU cache; these tend to evict other modules' hot cache lines and pollute the cache

glibc's implementation takes this into account as well. For example, in Debian GLIBC 2.28-10, one path in memcpy checks the copy size: above 60 MB (mallwatch@@GLIBC_2.2.5+0x8), it uses the movntdq instruction to write data from registers to memory. Such instructions carry a Non-temporal Hint so the CPU avoids caching the data. The corresponding movntdqa instruction reads data from memory into registers.

/lib/x86_64-linux-gnu/libc.so.6

GNU C Library (Debian GLIBC 2.28-10) stable release version 2.28.
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 8.3.0.
libc ABIs: UNIQUE IFUNC ABSOLUTE
For bug reporting instructions, please see:
<http://www.debian.org/Bugs/>.
# Debian GNU/Linux 10.9
objdump -r -C -D /lib/x86_64-linux-gnu/libc.so.6

a2232: 48 3b 15 3f e2 11 00 cmp 0x11e23f(%rip),%rdx # 1c0478 <mallwatch@@GLIBC_2.2.5+0x8>
a2239: 0f 87 cc 00 00 00 ja a230b <memcpy@GLIBC_2.2.5+0x2eb>
...
a230b: 4c 8d 14 17 lea (%rdi,%rdx,1),%r10
a230f: 4c 39 d6 cmp %r10,%rsi
a2312: 0f 82 27 ff ff ff jb a223f <memcpy@GLIBC_2.2.5+0x21f>
a2318: 0f 18 8e 80 00 00 00 prefetcht0 0x80(%rsi)
a231f: 0f 18 8e c0 00 00 00 prefetcht0 0xc0(%rsi)
a2326: 0f 10 06 movups (%rsi),%xmm0
a2329: 0f 10 4e 10 movups 0x10(%rsi),%xmm1
a232d: 0f 10 56 20 movups 0x20(%rsi),%xmm2
a2331: 0f 10 5e 30 movups 0x30(%rsi),%xmm3
a2335: 48 83 c6 40 add $0x40,%rsi
a2339: 48 83 ea 40 sub $0x40,%rdx
a233d: 66 0f e7 07 movntdq %xmm0,(%rdi)
a2341: 66 0f e7 4f 10 movntdq %xmm1,0x10(%rdi)
a2346: 66 0f e7 57 20 movntdq %xmm2,0x20(%rdi)
a234b: 66 0f e7 5f 30 movntdq %xmm3,0x30(%rdi)
a2350: 48 83 c7 40 add $0x40,%rdi
a2354: 48 83 fa 40 cmp $0x40,%rdx
a2358: 77 be ja a2318 <memcpy@GLIBC_2.2.5+0x2f8>
a235a: 0f ae f8 sfence
a235d: 0f 11 29 movups %xmm5,(%rcx)
a2360: 0f 11 71 f0 movups %xmm6,-0x10(%rcx)
a2364: 0f 11 79 e0 movups %xmm7,-0x20(%rcx)
a2368: 44 0f 11 41 d0 movups %xmm8,-0x30(%rcx)
a236d: 41 0f 11 23 movups %xmm4,(%r11)
a2371: c3 retq
// test.1.cpp
#include <cstdint>
#include <cstdio>

extern uint64_t mallwatch;

int main()
{
    // Read the word at mallwatch@@GLIBC_2.2.5+0x8, i.e. the non-temporal threshold
    printf("%.2f MB\n", (&mallwatch)[1] / 1024.0 / 1024.0);
    return 0;
}
filename=`mktemp` && clang++ -fPIC test.1.cpp -o ${filename} && ${filename} && rm ${filename}
60.00 MB

Since main memory latency is orders of magnitude higher than CPU cache latency, copying data that is about to be reused with NON-TEMPORAL stores severely hurts performance. This is especially true for OLAP engines, which constantly work with large blocks of memory. So when should NON-TEMPORAL mode be enabled?

  • Common virtualized server CPUs have roughly L1 32KB, L2 256KB, L3 25MB~64MB
  • folly uses a fixed threshold of 32KB
  • Debian GLIBC 2.28-10 uses a global variable as the threshold, set at process initialization to roughly the L3 cache size
    • Other versions instead derive the threshold from the L1 cache size; implementations vary considerably across environments
  • TiFlash does not use this technique by default
    • For most storage and compute paths, the smallest data unit is a Block (usually 8192 rows). Even a single int64 column is at least 8192 * 8 = 64KB; a str column additionally carries at least 2 uint64 fields per row for offset and size, so its total exceeds 8192 * (8 + 8 + 1) = 136KB
    • A reasonable threshold for enabling NON-TEMPORAL inside memcpy is at least larger than the L2 cache, and the exact value should be tuned to the workload
    • In practice, when a large temporary copy is needed, it is better to choose a NON-TEMPORAL implementation explicitly in the logic rather than relying on memcpy (a minimal sketch follows this list)
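A minimal sketch of such an explicit non-temporal copy, assuming x86-64 with AVX2, a 32-byte-aligned dst, and a size that is a multiple of 32 (real code needs head/tail handling):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Streaming stores bypass the cache hierarchy, so they only pay off for
// large one-shot copies whose destination will not be read back soon.
void non_temporal_copy(void * dst, const void * src, size_t size)
{
    auto * d = static_cast<uint8_t *>(dst);
    const auto * s = static_cast<const uint8_t *>(src);
    for (size_t i = 0; i < size; i += 32)
    {
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(s + i));
        _mm256_stream_si256(reinterpret_cast<__m256i *>(d + i), v); // non-temporal hint
    }
    _mm_sfence(); // order the streaming stores before any subsequent access
}
```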

Optimize memcpy

tiflash#6281

avx2_inline_memcpy

  • An inline implementation must stay simple and compact, even at a small cost in performance (for this kind of special-purpose code, compiler-generated assembly is generally worse than hand-written assembly); otherwise a non-inline assembly implementation is the better choice.
  • Set up a fast path for sizes of at most 32 bytes
    • Sizes are split at 16, 8, 4, 2, matching register widths
    • To make benchmark numbers look better, the branches can be ordered to favor larger sizes first; this implementation checks 8~16 before 16~32
    • In real workloads, the branch prediction / jump overhead affected by this ordering is comparatively small
  • Since memcpy requires that the src and dst ranges not overlap, copy one register width from the front and one from the back (see the sketch after this list)
  • For sizes above 256, first align the dst address to 32 bytes (the ymm register width)
  • Software prefetch instructions gain little for data already in the CPU cache; for sequential reads and writes, modern CPUs mostly have hardware prefetchers that handle this
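For illustration, a sketch of one tier of that fast path (the (8, 16] tier; assumptions in the comments, not TiFlash's exact code):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Loading 8 bytes from the front and 8 from the back covers any size in
// (8, 16] without a length loop; the two loads may overlap inside src,
// which is fine because only the src and dst ranges must not overlap.
// The real implementation has analogous tiers for 16/8/4/2 and a ymm tier.
static inline void copy_8_to_16(void * dst, const void * src, size_t size)
{
    uint64_t head;
    uint64_t tail;
    std::memcpy(&head, src, 8);                                       // first 8 bytes
    std::memcpy(&tail, static_cast<const char *>(src) + size - 8, 8); // last 8 bytes
    std::memcpy(dst, &head, 8);
    std::memcpy(static_cast<char *>(dst) + size - 8, &tail, 8);
}
```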

Benchmark of avx2_inline_memcpy

Run on (40 X 2386.24 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x20)
L1 Instruction 32 KiB (x20)
L2 Unified 256 KiB (x20)
L3 Unified 25600 KiB (x2)
Load Average: 6.72, 6.65, 6.07
-------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------------------------------------------------
MemUtilsCopy_1_20_3_true_20000/stl_mempy/iterations:500 446194 ns 446197 ns 500
MemUtilsCopy_1_20_3_true_20000/inline_clickhouse_memcpy/iterations:500 140571 ns 140528 ns 500
MemUtilsCopy_1_20_3_true_20000/sse2_memcpy/iterations:500 150496 ns 150321 ns 500
MemUtilsCopy_1_20_3_true_20000/avx2_memcpy/iterations:500 137616 ns 137586 ns 500
MemUtilsCopy_1_20_3_true_20000/folly_memcpy/iterations:500 181997 ns 182004 ns 500
MemUtilsCopy_1_40_3_true_20000/stl_mempy/iterations:500 456534 ns 456549 ns 500
MemUtilsCopy_1_40_3_true_20000/inline_clickhouse_memcpy/iterations:500 176257 ns 176264 ns 500
MemUtilsCopy_1_40_3_true_20000/sse2_memcpy/iterations:500 178094 ns 178086 ns 500
MemUtilsCopy_1_40_3_true_20000/avx2_memcpy/iterations:500 165250 ns 165256 ns 500
MemUtilsCopy_1_40_3_true_20000/folly_memcpy/iterations:500 186537 ns 186544 ns 500
MemUtilsCopy_1_80_3_true_20000/stl_mempy/iterations:500 476534 ns 476510 ns 500
MemUtilsCopy_1_80_3_true_20000/inline_clickhouse_memcpy/iterations:500 238987 ns 238996 ns 500
MemUtilsCopy_1_80_3_true_20000/sse2_memcpy/iterations:500 302893 ns 302887 ns 500
MemUtilsCopy_1_80_3_true_20000/avx2_memcpy/iterations:500 231339 ns 231348 ns 500
MemUtilsCopy_1_80_3_true_20000/folly_memcpy/iterations:500 200658 ns 200665 ns 500
MemUtilsCopy_1_200_3_true_20000/stl_mempy/iterations:500 660466 ns 660356 ns 500
MemUtilsCopy_1_200_3_true_20000/inline_clickhouse_memcpy/iterations:500 440048 ns 439948 ns 500
MemUtilsCopy_1_200_3_true_20000/sse2_memcpy/iterations:500 427743 ns 427739 ns 500
MemUtilsCopy_1_200_3_true_20000/avx2_memcpy/iterations:500 198850 ns 198876 ns 500
MemUtilsCopy_1_200_3_true_20000/folly_memcpy/iterations:500 243985 ns 243973 ns 500
MemUtilsCopy_1_2000_3_true_20000/stl_mempy/iterations:500 1535333 ns 1535242 ns 500
MemUtilsCopy_1_2000_3_true_20000/inline_clickhouse_memcpy/iterations:500 847673 ns 847547 ns 500
MemUtilsCopy_1_2000_3_true_20000/sse2_memcpy/iterations:500 793617 ns 793608 ns 500
MemUtilsCopy_1_2000_3_true_20000/avx2_memcpy/iterations:500 682924 ns 682915 ns 500
MemUtilsCopy_1_2000_3_true_20000/folly_memcpy/iterations:500 662669 ns 662652 ns 500

When the amount of data copied is large (e.g. more than 2x the L3 cache size), memory I/O becomes the bottleneck and the various implementations perform about the same.

Optimize memcmp / memequal / memchr / strstr

tiflash#5658

  • Implemented avx2_mem_cmp and avx2_mem_equal based on AVX2
    • The previous avx512-based mem_utils::memoryEqual was slower than std::memcmp for small string comparisons, so it was reimplemented on AVX2
  • Implemented the string-search primitives avx2_memchr and avx2_strstr based on AVX2

Implementation Of Mem Utils

avx2_mem_cmp
avx2_mem_equal

  • Set up a fast path for sizes of at most 32
    • This implementation uses a fairly simple switch-case; the sizes could also be partitioned with the branch order tuned as needed
  • For sizes above 256, first align the address to 32 bytes (the ymm register width)
  • mem cmp is slightly more involved than mem equal; both use SIMD instructions to test batches of bytes for equality, but on a mismatch mem cmp must also locate the differing byte and load it into a 32-bit register to subtract (a sketch of the equality path follows this list)
  • The memory is processed sequentially, so prefetching is unnecessary for now
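A minimal sketch of the equality path under these assumptions (AVX2 available; tail handling kept trivial):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstring>

// _mm256_cmpeq_epi8 compares 32 bytes lane-by-lane at once, and
// _mm256_movemask_epi8 collapses the 32 lane results into one bitmask.
bool avx2_equal(const char * a, const char * b, size_t size)
{
    size_t i = 0;
    for (; i + 32 <= size; i += 32)
    {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(b + i));
        // All 32 bytes equal <=> every lane is 0xFF <=> mask is all ones.
        if (_mm256_movemask_epi8(_mm256_cmpeq_epi8(va, vb)) != -1)
            return false;
    }
    return std::memcmp(a + i, b + i, size - i) == 0; // scalar tail
}
```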

avx2_memchr
avx2_strstr

  • There are many algorithms for direct string matching; classics like KMP / Aho–Corasick build state machines. std::string_view::find instead calls memchr to locate the first byte of the target, then uses bcmp to test whether the substring matches, repeating until done.
  • It is fair to assume that, in the common case, the data being matched is meaningful and has some structure, so the extra memory and construction cost of a state-machine algorithm actually makes it less efficient.
    • Non-naive scenarios can be optimized separately, e.g. with the inverted indexes used by search engines.
  • avx2_strstr finds the first target byte with avx2_memchr, then tests equality with avx2_mem_equal, looping until a full match (a sketch of this loop follows this list)
  • avx2_strstr.h#L195-L248: since in the vast majority of cases the target string is small, sizes of at most 16 are wrapped in templates and further inlined to reduce branching
  • Notes on handling memory alignment: avx2_strstr.h#L126-L159
    • Memory allocators request and manage memory from the OS in units of Pages
      • A Page is usually 4KB, and at least one block (512B)
      • If memory address S is valid, then every address in [ALIGN_TO_PAGE_SIZE(S), ALIGN_TO_PAGE_SIZE(S) + PAGE_SIZE) is valid
    • To reduce branches, the address is first aligned down to 32 and the flag bits are computed with AVX2 instructions; the positions introduced by over-reading are then masked off using the alignment offset. The tail handling in avx2_strstr.h#L35-L58 works the same way.
  • avx2_strstr.h#L77-L116 processes batches of 128 bytes, with an optimistic check to filter first
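A scalar sketch of the overall search loop, with std::memchr / std::memcmp standing in for avx2_memchr / avx2_mem_equal:

```cpp
#include <cstddef>
#include <cstring>

// Locate the needle's first byte, verify the rest, and repeat until a
// full match is found or the remaining src is too short to hold the needle.
const char * simple_strstr(const char * src, size_t n, const char * needle, size_t m)
{
    if (m == 0)
        return src;
    const char * end = src + n;
    while (src + m <= end)
    {
        // stand-in for avx2_memchr: find the next candidate position
        const char * p = static_cast<const char *>(std::memchr(src, needle[0], (end - src) - (m - 1)));
        if (p == nullptr)
            return nullptr;
        // stand-in for avx2_mem_equal: verify the remaining m - 1 bytes
        if (std::memcmp(p + 1, needle + 1, m - 1) == 0)
            return p;
        src = p + 1;
    }
    return nullptr;
}
```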

Benchmark Of Mem Utils

  • See the test cases in Benchmark Collation Impact
  • original 8404e656
Time(ns) Original Optimized Improvement: (Original) / (Optimized) - 1.0
CollationEqBench/UTF8MB4_BIN 12428711 6228798 99.54%
CollationEqBench/UTF8_BIN 12956705 6141843 110.96%
CollationEqBench/ASCII_BIN 12625723 6229335 102.68%
CollationEqBench/BINARY 11870078 5837615 103.34%
CollationEqBench/LATIN1_BIN 13768201 6732640 104.50%
CollationLikeBench/UTF8MB4_BIN 37940667 20185747 87.96%
CollationLikeBench/UTF8_BIN 37803575 19914106 89.83%
CollationLikeBench/ASCII_BIN 36860160 17999743 104.78%
CollationLikeBench/BINARY 37449881 17599053 112.79%
CollationLikeBench/LATIN1_BIN 37503432 17675036 112.18%
  • Compare the performance of bcmp from the standard library, mem_utils::memoryEqual, and the custom avx2_mem_equal
  • Compare the performance of std::string_view::find from the standard library and the custom avx2_strstr
Time(ns) STL Original-avx512 Optimized-avx2 Improvement: (STL) / (Optimized) - 1.0 Improvement: (Original) / (Optimized) - 1.0
check mem eq: MemUtilsEqual_${str-size}
MemUtilsEqual_13 4.46 10.3 3.21 38.94% 220.87%
MemUtilsEqual_65 5.25 9.83 4.44 18.24% 121.40%
MemUtilsEqual_100 9.31 11.3 5.32 75.00% 112.41%
MemUtilsEqual_10000 299 377 213 40.38% 77.00%
MemUtilsEqual_100000 3657 4009 3382 8.13% 18.54%
MemUtilsEqual_1000000 62265 53157 52600 18.37% 1.06%
str find: MemUtilsStrStr_${src-str-size}_${needle-str-size}
MemUtilsStrStr_1024_1 30882 21275 45.16%
MemUtilsStrStr_1024_7 34927 21279 64.14%
MemUtilsStrStr_1024_15 39364 23161 69.96%
MemUtilsStrStr_1024_31 40628 29435 38.03%
MemUtilsStrStr_1024_63 37381 26141 43.00%
MemUtilsStrStr_80_1 6130 3977 54.14%
MemUtilsStrStr_80_7 11720 6278 86.68%
MemUtilsStrStr_80_15 11585 5423 113.63%
MemUtilsStrStr_80_31 11467 9530 20.33%
  • Compare memcmp from the standard library with the custom avx2_mem_cmp
Time(ns) STL: (GNU libc) 2.17 Optimized-avx2 Improvement: (STL) / (Optimized) - 1.0
MemUtilsCmp_${str-size}_${loop_times}: check mem-cmp for str for specific times
MemUtilsCmp_2_20 66.5 51.9 28.13%
MemUtilsCmp_13_20 75.3 66 14.09%
MemUtilsCmp_65_20 126 106 18.87%
MemUtilsCmp_100_20 167 106 57.55%
MemUtilsCmp_10000_20 5145 3740 37.57%
MemUtilsCmp_100000_20 81996 68577 19.57%
MemUtilsCmp_1000000_20 1254279 1112721 12.72%
  • Compare LIKE() expression evaluation using avx2_strstr versus std::string_view::find
select count(1) from orders where o_comment like '%pending%deposits%';
Time(s) Original Optimized
10.75 8.72
10.92 8.87
10.98 8.35
10.7 8.64
10.77 8.5
AVG 10.824 8.616
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 25.63%

Optimize String Comparison

String Comparison

tiflash#5299

  • Earlier versions introduced the virtual functions compare / sortKey for collation-aware string comparison and sort-key generation. That approach is inefficient for compute-heavy scenarios.
  • Here the str-column path is devirtualized for TiDB's default BIN-family collations (a minimal sketch follows this list).
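A minimal sketch of what the devirtualization looks like (illustrative, not the actual TiFlash code): the column loop is instantiated with a statically known comparator, e.g. lessColumn<BinComparator>(...), so nothing is dispatched virtually per row.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// An inlineable comparator for the BIN case: plain byte comparison.
struct BinComparator
{
    static int compare(std::string_view a, std::string_view b) { return a.compare(b); }
};

template <typename Cmp>
void lessColumn(const std::vector<std::string_view> & a,
                const std::vector<std::string_view> & b,
                std::vector<uint8_t> & result)
{
    result.resize(a.size());
    for (size_t i = 0; i < a.size(); ++i)
        result[i] = Cmp::compare(a[i], b[i]) < 0; // no virtual dispatch inside the loop
}
```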

Benchmark

  • Data: tpch-10
  • tiflash x 1
  • limit cpu up to 200%
MySQL [tpch_10]> select count(1) from lineitem where l_comment = 'zzle? slyly regular instruc       ';
+----------+
| count(1) |
+----------+
| 1 |
+----------+
Time(s) Original Optimized NoCollation
2.46 1.56 1.48
2.41 1.56 1.38
2.36 1.47 1.43
2.35 1.55 1.4
2.44 1.49 1.41
AVG 2.404 1.526 1.42
Improvement: Original/NoCollation - 1.0 = 69.30%; Original/Optimized - 1.0 = 57.54%; Optimized/NoCollation - 1.0 = 7.46%

  • SQL: select count(1) from lineitem where l_comment < l_comment;
Time(s) Original Optimized NoCollation
2.32 2.07 1.91
2.27 2.06 1.95
2.33 2.07 2.05
2.33 2.09 1.99
2.23 2.08 2.04
AVG 2.296 2.074 1.988
Improvement: Original/NoCollation - 1.0 = 15.49%; Original/Optimized - 1.0 = 10.70%; Optimized/NoCollation - 1.0 = 4.33%

Constant String Comparison

tiflash#5569

  • Focuses on comparing a str column against a constant str, as in select ... from ... where xxx = 'xxx' ..., e.g. select count(*) from github_events where actor_login != 'zzzzzzz'
  • CollationOperatorOptimized.h#L261-L290 unrolls the computation via templates for constant strs of size [0, 16] (a minimal sketch follows this list)
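A minimal sketch of the template unrolling (illustrative names): dispatch on the constant's length once, then compare with a compile-time-fixed size inside the loop, which the compiler can lower to a few register compares.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <vector>

// The constant's length N is fixed for the whole column, so each per-row
// check is a fixed-size memcmp with no length-dependent branching.
template <size_t N>
void eqConst(const std::vector<std::string_view> & col, const char * c, std::vector<uint8_t> & result)
{
    result.resize(col.size());
    for (size_t i = 0; i < col.size(); ++i)
        result[i] = col[i].size() == N && std::memcmp(col[i].data(), c, N) == 0;
}

// Dispatch happens once, outside the row loop, e.g.:
//   switch (constant.size()) { case 4: eqConst<4>(col, constant.data(), res); break; /* ... */ }
```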

Benchmark

  • tpch-100
  • tiflash x 1
  • limit cpu up to 200%
  • original commit: 30fc64c
  • SQL: select count(1) from lineitem where L_SHIPMODE = 'zzzz';
Time(s) Original Optimized
9.15 7.52
9.33 7.62
9.12 7.58
9.23 7.57
9.14 7.65
AVG 9.194 7.588
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 21.16%

SQL (based on tpch-q10): select count(1) from lineitem where L_RETURNFLAG = 'R';

Time(s) Original Optimized
12.85 8.56
12.87 8.64
12.86 8.45
12.75 8.51
12.76 8.64
AVG 12.818 8.56
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 49.74%

Optimize String Search

tiflash#5489

  • The early implementation (Collator.cpp#L71-L172) decoded characters one by one according to the collation and matched them with a state machine. It involved too many virtual calls, and the algorithm itself was inefficient.
  • Focuses on optimizing how BIN collations evaluate the LIKE() ESCAPE() expressions: CollationStringSearchOptimized.h#L28-L419
    • utf8 characters use an unambiguous prefix encoding, and binary is simpler still, just raw bytes. For case-sensitive matching, the pattern string can be split up front and matched with plain substring searches, which lend themselves to SIMD vectorization (a minimal sketch follows this list).
    • ASCIICaseSensitiveStringSearcher and Volnitsky, inherited from ClickHouse, mainly target search engines and do worse than std::string_view::find() on small strings. Algorithms like Aho–Corasick need complex state machines and, per the analysis above, would likely do even worse.
    • This PR initially used std::string_view::find(); the later avx2_strstr optimization added another 25.63% on top of it.
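A minimal sketch of the splitting idea ('_' wildcards and ESCAPE handling omitted; assumes the pattern starts and ends with '%', as in '%pending%deposits%' -> {"pending", "deposits"}):

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Match the plain parts left to right with successive substring searches;
// each search resumes after the end of the previous match.
bool likeMatch(std::string_view s, const std::vector<std::string> & parts)
{
    size_t pos = 0;
    for (const auto & part : parts)
    {
        size_t found = s.find(part, pos); // stand-in for avx2_strstr
        if (found == std::string_view::npos)
            return false;
        pos = found + part.size();
    }
    return true;
}
```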

Benchmark

  • tpch-100
  • tiflash x 1
  • limit cpu up to 200%
  • original commit: a476307
  • SQL: select count(1) from orders where o_comment like '%pending%deposits%';
Time(s) Original Optimized
35.77 11
33.84 10.88
35.32 11.11
34.82 11.21
34.94 10.97
AVG 34.938 11.034
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 216.64%

Optimize String Sorting

tiflash#5375

  • Devirtualize for BIN collations
    • TiDB implements padding differently from MySQL. MySQL pads by appending trailing spaces; TiDB pads by trimming trailing spaces. The two behave identically in almost all cases; the only exception is a string ending with characters that sort below the space (0x20). For example, 'a' < 'a\t' evaluates to 1 in TiDB, whereas in MySQL it is equivalent to 'a ' < 'a\t' and evaluates to 0.
  • Assuming that in most workloads strings have no trailing spaces, check optimistically and take a fast path: CollatorUtils.h#L55-L61 (a minimal sketch follows this list)
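A minimal sketch of this fast path (illustrative, mirroring the idea rather than the exact code in CollatorUtils.h):

```cpp
#include <string_view>

// TiDB-style PADDING semantics trim trailing spaces (0x20) before comparing.
inline std::string_view rightTrim(std::string_view v)
{
    while (!v.empty() && v.back() == ' ')
        v.remove_suffix(1);
    return v;
}

inline int binPaddingCompare(std::string_view a, std::string_view b)
{
    // Fast path: neither side ends with a space, so compare raw bytes directly.
    if ((a.empty() || a.back() != ' ') && (b.empty() || b.back() != ' '))
        return a.compare(b);
    // Slow path: trim trailing spaces first.
    return rightTrim(a).compare(rightTrim(b));
}
```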

Benchmark

  • tpch-100
  • tiflash x 1
  • limit cpu up to 200%
  • original commit: 97342db
  • SQL: select min(L_SHIPMODE) from lineitem;
Time(s) Original Optimized NoCollation
11.35 9.98 9.81
11.33 9.98 9.88
11.23 10 9.84
11.22 9.82 9.73
11.58 9.95 9.96
AVG 11.342 9.946 9.844
Improvement: Original/NoCollation - 1.0 = 15.22%; Original/Optimized - 1.0 = 14.04%; Optimized/NoCollation - 1.0 = 1.04%

  • SQL select max(L_SHIPMODE) from lineitem;
Time(s) Original Optimized NoCollation
13.56 12.62 12.77
13.74 12.51 12.27
13.35 12.61 12.32
13.63 12.63 12.45
13.52 12.66 12.32
AVG 13.56 12.606 12.426
Improvement: Original/NoCollation - 1.0 = 9.13%; Original/Optimized - 1.0 = 7.57%; Optimized/NoCollation - 1.0 = 1.45%

Others

Hardware Prefetch

Wiki/Cache-Prefetching

Check whether Hardware Prefetch is enabled: ref deater/uarch-configure/intel-prefetch

/* Disable the hardware prefetcher on:        */
/* Core2 */
/* Nehalem, Westmere, SandyBridge, IvyBridge, Haswell and Broadwell */
/* See: https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors */
/* */
/* The key is MSR 0x1a4 */
/* bit 0: L2 HW prefetcher */
/* bit 1: L2 adjacent line prefetcher */
/* bit 2: DCU (L1 Data Cache) next line prefetcher */
/* bit 3: DCU IP prefetcher (L1 Data Cache prefetch based on insn address) */
/* */
/* This code uses the /dev/msr interface, and you'll need to be root. */
/* */
/* by Vince Weaver, vincent.weaver _at_ maine.edu -- 26 February 2016 */

#define CORE2_PREFETCH_MSR 0x1a0
#define NHM_PREFETCH_MSR 0x1a4

#include <errno.h>
#include <fcntl.h>
#include <inttypes.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#include <sys/syscall.h>

static int open_msr(int core) {

    char msr_filename[BUFSIZ];
    int fd;

    sprintf(msr_filename, "/dev/cpu/%d/msr", core);
    fd = open(msr_filename, O_RDONLY);
    if (fd < 0) {
        if (errno == ENXIO) {
            fprintf(stderr, "rdmsr: No CPU %d\n", core);
            exit(2);
        } else if (errno == EIO) {
            fprintf(stderr, "rdmsr: CPU %d doesn't support MSRs\n", core);
            exit(3);
        } else {
            perror("rdmsr:open");
            fprintf(stderr, "Trying to open %s\n", msr_filename);
            exit(127);
        }
    }

    return fd;
}

static long long read_msr(int fd, int which) {

    uint64_t data;

    if (pread(fd, &data, sizeof data, which) != sizeof data) {
        perror("rdmsr:pread");
        exit(127);
    }

    return (long long)data;
}

/* FIXME: should really error out if not an Intel CPU */
static int detect_cpu(void) {

    FILE *fff;

    int family, model = -1;
    char buffer[BUFSIZ], *result;
    char vendor[BUFSIZ];
    int is_core2 = -1;

    fff = fopen("/proc/cpuinfo", "r");
    if (fff == NULL)
        return -1;

    while (1) {
        result = fgets(buffer, BUFSIZ, fff);
        if (result == NULL)
            break;

        if (!strncmp(result, "vendor_id", 8)) {
            sscanf(result, "%*s%*s%s", vendor);

            if (strncmp(vendor, "GenuineIntel", 12)) {
                printf("%s not an Intel chip\n", vendor);
                return -1;
            }
        }

        if (!strncmp(result, "cpu family", 10)) {
            sscanf(result, "%*s%*s%*s%d", &family);
            if (family != 6) {
                printf("Wrong CPU family %d\n", family);
                return -1;
            }
        }

        if (!strncmp(result, "model", 5)) {
            sscanf(result, "%*s%*s%d", &model);
        }
    }

    fclose(fff);

    switch (model) {
    case 26:
    case 30:
    case 31: /* nhm */
        printf("Found Nehalem CPU\n");
        is_core2 = 0;
        break;

    case 46: /* nhm-ex */
        printf("Found Nehalem-EX CPU\n");
        is_core2 = 0;
        break;

    case 37:
    case 44: /* wsm */
        printf("Found Westmere CPU\n");
        is_core2 = 0;
        break;

    case 47: /* wsm-ex */
        printf("Found Westmere-EX CPU\n");
        is_core2 = 0;
        break;

    case 42: /* snb */
        printf("Found Sandybridge CPU\n");
        is_core2 = 0;
        break;

    case 45: /* snb-ep */
        printf("Found Sandybridge-EP CPU\n");
        is_core2 = 0;
        break;

    case 58: /* ivb */
        printf("Found Ivybridge CPU\n");
        is_core2 = 0;
        break;

    case 62: /* ivb-ep */
        printf("Found Ivybridge-EP CPU\n");
        is_core2 = 0;
        break;

    case 60:
    case 69:
    case 70: /* hsw */
        printf("Found Haswell CPU\n");
        is_core2 = 0;
        break;

    case 63: /* hsw-ep */
        printf("Found Haswell-EP CPU\n");
        is_core2 = 0;
        break;

    case 61:
    case 71: /* bdw */
        printf("Found Broadwell CPU\n");
        is_core2 = 0;
        break;

    case 86:
    case 79: /* bdw-DE/EP */
        printf("Found Broadwell-DE/EP CPU\n");
        is_core2 = 0;
        break;

    case 78:
    case 94: /* Skylake */
        printf("Found Skylake CPU\n");
        is_core2 = 0;
        break;

    case 85: /* Skylake / Cascade Lake Server */
        printf("Found Skylake / Cascadelake Server CPU\n");
        is_core2 = 0;
        break;

    case 142:
    case 158: /* Kabylake */
        printf("Found Kabylake CPU\n");
        is_core2 = 0;
        break;

    /* Core 2 */

    case 15:
    case 22:
    case 23:
    case 29: /* core2 */
        printf("Found Core2 CPU\n");
        is_core2 = 1;
        break;

    default:
        printf("Unsupported model %d\n", model);
        is_core2 = -1;
        break;
    }

    return is_core2;
}

/* Show prefetch settings on Nehalem and newer */
static int show_prefetch_nhm(int core) {

    int fd;
    int result;
    int begin, end, c;

    printf("Show all prefetch\n");

    if (core == -1) {
        begin = 0;
        end = 1024;
    } else {
        begin = core;
        end = core;
    }

    for (c = begin; c <= end; c++) {

        fd = open_msr(c);
        if (fd < 0)
            break;

        /* Read original results */
        result = read_msr(fd, NHM_PREFETCH_MSR);

        printf("\tCore %d old : L2HW=%c L2ADJ=%c DCU=%c DCUIP=%c\n", c,
               result & 0x1 ? 'N' : 'Y', result & 0x2 ? 'N' : 'Y',
               result & 0x4 ? 'N' : 'Y', result & 0x8 ? 'N' : 'Y');

        close(fd);
    }

    return 0;
}

int main(int argc, char **argv) {

    int c = detect_cpu();
    if (c < 0) {
        printf("Unsupported CPU type\n");
        return -1;
    }

    show_prefetch_nhm(0);

    return 0;
}