Database Collation Optimization In TiDB | TiFlash

In database systems, a collation specifies how data is sorted and compared: it defines the sorting rules and the case and accent sensitivity of string data.

In a database system, a CHARACTER SET (aka charset) is a set of characters together with their encodings, while a COLLATION defines the character ordering/comparison rules, case sensitivity, and related properties. See Character Sets and Collations in MySQL: MySQL supports multiple character sets, and each character set contains multiple collations (one of which is the default). The TiDB ecosystem defaults to the utf8mb4 character set with the utf8mb4_bin collation. For an OLAP engine like TiFlash, the extra abstraction introduced by character-set rules inevitably has a significant performance impact. This article walks through the process of optimizing collation handling and validates the results on the TPCH and ClickBench benchmarks.

Benchmark

Overall

  • TPCH-100: overall performance improved by about 5.55% (combined with LTO)
  • ClickBench: for the queries TiFlash already supports, overall performance improved by about 21.06%
  • Ossinsight (PingCAP internal test): for scenarios like the string comparison filters it uses, performance improved by about 46.58%~75.76%
  • The current optimizations mainly target BIN collations, with dedicated fast paths for TiDB's defaults UTF8MB4_BIN / UTF8_BIN
    • Note: in the ClickBench dataset, the URL, Title, Referer, SearchPhrase, OriginalURL, OpenstatCampaignID, UTMCampaign, and UTMContent columns all contain values ending with a space (0x20). Under current TiDB semantics the trailing spaces must be trimmed, so processing such data inevitably incurs extra overhead and the gains there are modest.
  • This round of optimization does not cover CI collations

TPCH-100

One TiFlash Store

  • TIFLASH x 1
    • memory limit in bytes: 207405139968
    • cpu cores quota: 40
  • TIFLASH REPLICA x 2
    • ALTER TABLE {...} COMPACT TIFLASH REPLICA
  • set @@tidb_enforce_mpp = on
  • Also compared against builds without the LTO optimization (introduced in TiDB v6.1.0); PGO is currently not applied automatically
  • Test program: go-tpc/query.go · pingcap/go-tpc
Time(s) | Original: rollback all PRs in pingcap/tiflash#5294 from commit a0f9865 | Optimized | Improvement: (Original) / (Optimized) - 1.0 | Note | Original + No LTO (Link Time Optimization) | Improvement: (Original + No LTO) / (Optimized) - 1.0
Q1 | 9.09 | 8.42 | 7.96% | AGG() by multi STR; COLLATION | 10 | 18.76%
Q2 | 2.45 | 2.38 | 2.94% | | 2.52 | 5.88%
Q3 | 5.6 | 5.47 | 2.38% | | 5.6 | 2.38%
Q4 | 6.14 | 6.07 | 1.15% | | 6.24 | 2.80%
Q5 | 13.52 | 13.05 | 3.60% | | 13.52 | 3.60%
Q6 | 1.98 | 1.98 | 0.00% | | 2.01 | 1.52%
Q7 | 6.34 | 6.14 | 3.26% | | 6.51 | 6.03%
Q8 | 8.69 | 8.36 | 3.95% | | 8.93 | 6.82%
Q9 | 38.42 | 38.49 | -0.18% | | 38.82 | 0.86%
Q10 | 6.95 | 6.61 | 5.14% | | 7.58 | 14.67%
Q11 | 1.64 | 1.58 | 3.80% | | 1.71 | 8.23%
Q12 | 4.4 | 4.26 | 3.29% | | 4.46 | 4.69%
Q13 | 8.42 | 7.82 | 7.67% | LIKE(); COLLATION | 8.42 | 7.67%
Q14 | 2.11 | 2.11 | 0.00% | | 2.21 | 4.74%
Q15 | 4.46 | 4.46 | 0.00% | | 4.73 | 6.05%
Q16 | 2.25 | 2.11 | 6.64% | LIKE(); COLLATION | 2.28 | 8.06%
Q17 | 13.32 | 12.78 | 4.23% | | 13.32 | 4.23%
Q18 | 18.09 | 17.41 | 3.91% | | 18.49 | 6.20%
Q19 | 5.54 | 4.66 | 18.88% | COLLATION | 5.6 | 20.17%
Q20 | 2.99 | 2.92 | 2.40% | | 3.02 | 3.42%
Q21 | 24.73 | 24.26 | 1.94% | | 25.4 | 4.70%
Q22 | 1.85 | 1.78 | 3.93% | | 1.91 | 7.30%
SUM | 188.98 | 183.12 | 3.20% | | 193.28 | 5.55%

Two TiFlash Store

  • TIFLASH x 2
    • memory limit in bytes: 207405139968
    • cpu cores quota: 40
  • TIFLASH REPLICA x 2
    • ALTER TABLE {...} COMPACT TIFLASH REPLICA
  • set @@tidb_enforce_mpp = on
Time(s) | Original: rollback all PRs in pingcap/tiflash#5294 from commit a0f9865 | Optimized | Improvement: (Original) / (Optimized) - 1.0
Q1 | 5.4 | 4.97 | 8.65%
Q2 | 1.91 | 1.88 | 1.60%
Q3 | 4.46 | 4.4 | 1.36%
Q4 | 6.61 | 6.48 | 2.01%
Q5 | 11.54 | 11.11 | 3.87%
Q6 | 1.01 | 1.01 | 0.00%
Q7 | 4.9 | 4.73 | 3.59%
Q8 | 8.36 | 8.25 | 1.33%
Q9 | 30.97 | 30.06 | 3.03%
Q10 | 5.87 | 5.4 | 8.70%
Q11 | 1.51 | 1.44 | 4.86%
Q12 | 2.55 | 2.52 | 1.19%
Q13 | 5.57 | 5.34 | 4.31%
Q14 | 1.17 | 1.17 | 0.00%
Q15 | 2.18 | 2.21 | -1.36%
Q16 | 1.24 | 1.21 | 2.48%
Q17 | 9.5 | 9.63 | -1.35%
Q18 | 12.82 | 12.75 | 0.55%
Q19 | 2.99 | 2.52 | 18.65%
Q20 | 2.15 | 2.11 | 1.90%
Q21 | 16.74 | 16.27 | 2.89%
Q22 | 1.01 | 1.01 | 0.00%

ClickBench

  • TIFLASH x 1
    • memory limit in bytes: 207405139968
    • cpu cores quota: 40
  • TIFLASH REPLICA x 2
    • ALTER TABLE {...} COMPACT TIFLASH REPLICA
  • Data source: ClickBench
Time(s) | Original: rollback all PRs in pingcap/tiflash#5294 from commit a0f9865 | Optimized | Improvement: (Original) / (Optimized) - 1.0 | Use collation (Y = yes) | Note
Q1 | 0.276 | 0.277 | -0.36% | |
Q2 | 0.029 | 0.0301 | -3.65% | |
Q3 | 0.0675 | 0.0653 | 3.37% | |
Q4 | 0.1813 | 0.1788 | 1.40% | |
Q5 | 2.285 | 2.255 | 1.33% | |
Q6 | 1.46 | 1.36 | 7.35% | Y | Optimized GROUP BY on a single STR column
Q7 | 0.1689 | 0.1693 | -0.24% | |
Q8 | 0.0391 | 0.0381 | 2.62% | |
Q9 | 1.205 | 1.175 | 2.55% | |
Q10 | 2.005 | 2 | 0.25% | |
Q11 | 0.2722 | 0.2521 | 7.97% | Y | Short-string comparison filter; optimized MEM UTILS primitives
Q12 | 0.2936 | 0.2731 | 7.51% | Y | Same as Q11; optimized GROUP BY on multiple STR columns
Q13 | 1.03 | 0.9916 | 3.87% | Y |
Q14 | 1.96 | 1.87 | 4.81% | Y |
Q15 | 1.105 | 1.06 | 4.25% | Y |
Q16 | 1.08 | 1.025 | 5.37% | |
Q17 | 3.475 | 3.36 | 3.42% | Y |
Q18 | 2.865 | 2.77 | 3.43% | Y |
Q19 | 0 | 0 | ERROR: Out Of Memory Quota! | | extract cannot be pushed down to TiFlash
Q20 | 0.5935 | 0.5773 | 2.81% | |
Q21 | 3.45 | 0.8726 | 295.37% | Y | LIKE expressions: optimized string-search algorithm; AVX2 optimization of MEM UTILS primitives
Q22 | 3.57 | 1.0151 | 251.69% | Y |
Q23 | 6.645 | 1.815 | 266.12% | Y |
Q24 | 6.665 | 4.945 | 34.78% | Y | SELECT * … ORDER BY … LIMIT 10; currently reads the whole table, which dominates the runtime; for small LIMITs, late materialization (fetch primary keys from TiFlash first, then read the rows from TiKV) would help
Q25 | 0.4399 | 0.3675 | 19.70% | Y | Same as Q11
Q26 | 0.2029 | 0.1737 | 16.81% | Y | Same as Q11
Q27 | 0.4221 | 0.3731 | 13.13% | Y | Same as Q11; optimized multi-key sort
Q28 | 1.345 | 1.305 | 3.07% | Y |
Q29 | 0 | 0 | ERROR: Out Of Memory Quota! | Y | regexp_replace cannot be pushed down to TiFlash
Q30 | 9.655 | 9.54 | 1.21% | |
Q31 | 0.8385 | 0.7974 | 5.15% | Y |
Q32 | 1.195 | 1.185 | 0.84% | Y |
Q33 | 6.98 | 6.915 | 0.94% | |
Q34 | 6.16 | 5.945 | 3.62% | Y |
Q35 | 6.115 | 5.815 | 5.16% | Y |
Q36 | 1.385 | 1.37 | 1.09% | |
Q37 | 0.2158 | 0.2122 | 1.70% | Y |
Q38 | 0.1363 | 0.1328 | 2.64% | Y |
Q39 | 0.1134 | 0.1071 | 5.88% | Y |
Q40 | 0.4411 | 0.4261 | 3.52% | Y |
Q41 | 0.0754 | 0.0746 | 1.07% | |
Q42 | 0.0572 | 0.0565 | 1.24% | |
Q43 | 0.1397 | 0.1341 | 4.18% | |
SUM | 76.6384 | 63.3055 | 21.06% | |

[TBD] ClickBench Enhancement

Optimize Aggregation In ClickBench
  • Q11
    • SELECT MobilePhoneModel, COUNT(DISTINCT UserID) AS u FROM hits WHERE MobilePhoneModel <> '' GROUP BY MobilePhoneModel ORDER BY u DESC LIMIT 10;
    • GROUP BY: STR(MobilePhoneModel), INT(UserID)
  • Q12
    • SELECT MobilePhone, MobilePhoneModel, COUNT(DISTINCT UserID) AS u FROM hits WHERE MobilePhoneModel <> '' GROUP BY MobilePhone, MobilePhoneModel ORDER BY u DESC LIMIT 10;
    • GROUP BY: INT(MobilePhone), STR(MobilePhoneModel), INT(UserID)
  • Q14
    • SELECT SearchPhrase, COUNT(DISTINCT UserID) AS u FROM hits WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY u DESC LIMIT 10;
    • GROUP BY: STR(SearchPhrase), INT(UserID)
  • Q15
    • SELECT SearchEngineID, SearchPhrase, COUNT(*) AS c FROM hits WHERE SearchPhrase <> '' GROUP BY SearchEngineID, SearchPhrase ORDER BY c DESC LIMIT 10;
    • GROUP BY: INT(SearchEngineID), STR(SearchPhrase)
  • Q17, Q18
    • GROUP BY: INT(UserID), STR(SearchPhrase)
  • Q23
    • SELECT SearchPhrase, MIN(URL), MIN(Title), COUNT(*) AS c, COUNT(DISTINCT UserID) FROM hits WHERE Title LIKE '%Google%' AND URL NOT LIKE '%.google.%' AND SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;
    • GROUP BY: INT(UserID), STR(SearchPhrase)
    • GROUP BY is not the performance bottleneck here
  • Q34, Q35
    • SELECT URL, COUNT(*) AS c FROM hits GROUP BY URL ORDER BY c DESC LIMIT 10;
    • The classic top-n URL problem. TiFlash's plan aggregates after a multi-node Exchange; if only a single node is considered, the pipeline could be shortened and optimized further.
  • Q40
    • SELECT TraficSourceID, SearchEngineID, AdvEngineID, CASE WHEN (SearchEngineID = 0 AND AdvEngineID = 0) THEN Referer ELSE '' END AS Src, URL AS Dst, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND IsRefresh = 0 GROUP BY TraficSourceID, SearchEngineID, AdvEngineID, Src, Dst ORDER BY PageViews DESC LIMIT 10 OFFSET 1000;
    • GROUP BY: TraficSourceID, SearchEngineID, AdvEngineID, Src, Dst
Late Materialization
  • Q24: SELECT * ... ORDER BY ... LIMIT ... currently reads the whole table, which dominates the runtime; for small LIMITs, late materialization applies (fetch primary keys from TiFlash first, then read the rows from TiKV)
  • Rewritten as select * from hits where _tidb_rowid in (select _tidb_rowid FROM hits WHERE URL LIKE '%google%' ORDER BY EventTime LIMIT 10);
    • Query time drops from 5.07s to 1.28s
    • A 296.09% performance improvement
JIT
  • Mainly targets compute tasks such as aggregations and joins

Ossinsight (PingCAP Internal Test)

  • Simulates SQL similar to what Ossinsight runs, using the TPCH-100 environment
  • This round of optimization does not cover CI collations, so the scenarios where Ossinsight uses UNICODE CI collations in TiFlash benefit little
    • In fact Ossinsight misuses UNICODE collations; it should be using the UTF8 BIN collations
Time(s) | Original: rollback all PRs in pingcap/tiflash#5294 from commit a0f9865 | Optimized | Improvement: (Original) / (Optimized) - 1.0 | Note
Ossinsight-like scenario: string comparison filter (simulated on the tpch-100 dataset)
select count(1) from lineitem where L_SHIPMODE = 'zzzz'; | 1.07 | 0.73 | 46.58% | Optimized short-string comparison filter: varchar(utf8mb4_bin) vs const char(utf8mb4_bin)
select count(1) from lineitem where L_RETURNFLAG = 'R'; | 1.16 | 0.66 | 75.76% |
String sorting
select min(L_SHIPMODE) from lineitem; | 0.93 | 0.76 | 22.37% | AVX2 optimization of the memcmp primitive
select max(L_SHIPMODE) from lineitem; | 1.11 | 0.84 | 32.14% |

Optimization Process

Improve The Performance Of New Collation Related Functions And Executors

Optimize Multi-Key Sort

tiflash#5908

  • The key is devirtualization. The single-key case is straightforward; for multiple keys the options are JIT or manual unrolling. Here templates are used: when there are exactly 2 keys, the comparison is unrolled for a few common types (a minimal sketch follows this list):
    • UInt64
    • Int64
    • StringBin
    • StringBinPadding
    • StringWithCollatorGeneric
  • TODO: with 3 or more keys this approach suffers from severe template bloat; fast paths may need to be added on demand.
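To make the shape of this concrete, here is a minimal C++ sketch of devirtualizing a two-key comparator (illustrative types and names, not TiFlash's actual code): each concrete (K1, K2) pair gets its own template instantiation, so the hot loop contains no virtual calls.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Per-type key accessors; compare() is a plain inlineable method, not virtual.
struct UInt64Key
{
    std::vector<uint64_t> data;
    int compare(size_t a, size_t b) const { return (data[a] > data[b]) - (data[a] < data[b]); }
};

struct StringBinKey
{
    std::vector<std::string_view> data;
    int compare(size_t a, size_t b) const { return data[a].compare(data[b]); }
};

// One template instantiation per concrete (K1, K2) pair: both compare()
// calls can be inlined into the sort loop, with no virtual dispatch.
template <typename K1, typename K2>
void sortByTwoKeys(std::vector<size_t> & perm, const K1 & k1, const K2 & k2)
{
    std::sort(perm.begin(), perm.end(), [&](size_t a, size_t b) {
        if (int res = k1.compare(a, b); res != 0)
            return res < 0;
        return k2.compare(a, b) < 0;
    });
}
```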

Benchmark

  • tpch-10/tpch-100/clickbench
  • tiflash x 1
  • original: 6d0cbc8
  • data: clickbench
  • limit cpu up to 200%

  • SQL
    • SELECT SearchPhrase FROM hits WHERE SearchPhrase <> '' ORDER BY EventTime, SearchPhrase LIMIT 10;
  • Sort keys: UINT64, STR (utf8 collator)
  • Sorting accounts for a small share of the cost
Time(s) Original Optimized
4.56 4.42
4.54 4.53
4.49 4.34
4.56 4.48
4.56 4.42
AVG 4.542 4.438
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 2.34%

  • limit cpu up to 2000%
  • SQL
    • SELECT ROW_NUMBER() OVER w1 FROM PART window w1 AS (PARTITION BY P_MFGR order by P_SIZE);
  • Sort keys: STR (utf8 collator), UINT64
  • Sorting accounts for a large share of the cost
Time(s) Original Optimized
8.16 7.18
8.22 6.94
8.29 7.35
8.34 7.04
8.24 7.37
AVG 8.25 7.176
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 14.97%

  • limit cpu up to 2000%
  • SQL
    • EXPLAIN analyze SELECT ROW_NUMBER() OVER w1 FROM PART window w1 AS (PARTITION BY p_name ORDER BY p_partkey);
  • Sort keys: INT64, STR (utf8 collator)
  • Sorting accounts for about 30% of the cost
Time(s) Original Optimized
9.71 9.28
9.62 9.41
9.76 9.25
9.73 9.26
9.67 9.28
AVG 9.698 9.296
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 4.32%

  • limit cpu up to 2000%
  • SQL
EXPLAIN analyze SELECT ROW_NUMBER() OVER w1 FROM PART window w1 AS (PARTITION BY P_MFGR ORDER BY p_name);
| └─Sort_13 | 20000000.00 | 20000000 | mpp[tiflash] | | tiflash_task:{time:17.6s, loops:309, threads:8} | tpch100_new.part.p_mfgr, tpch100_new.part.p_name, stream_count: 8 | N/A | N/A |
| └─ExchangeReceiver_12 | 20000000.00 | 20000000 | mpp[tiflash] | | tiflash_task:{time:7.18s, loops:1254, threads:8} | stream_count: 8 | N/A | N/A |
  • Sort keys: STR (utf8 collator), STR (utf8 collator)
  • Sorting accounts for a large share of the cost
Time(s): Original sort end | Original sort start | Original (sort_end - sort_start) | Optimized sort end | Optimized sort start | Optimized (sort_end - sort_start)
17.6 | 7.18 | 10.42 | 15.6 | 6.15 | 9.45
16.9 | 6.58 | 10.32 | 15.2 | 5.82 | 9.38
17.8 | 7.07 | 10.73 | 15.7 | 5.95 | 9.75
17.1 | 6.88 | 10.22 | 15.9 | 5.87 | 10.03
18.8 | 8.04 | 10.76 | 16 | 6.07 | 9.93
AVG: Original (sort_end - sort_start) 10.49, Optimized (sort_end - sort_start) 9.708
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 8.06%
Time(s) Original Optimized
18.3 16.34
18.27 16.62
18.7 16.57
18.57 16.52
18.55 16.63
AVG 18.478 16.536
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 11.74%

Optimize Aggregation And Join

tiflash#6135

tiflash#5834

  • The key is devirtualization. TiFlash and ClickHouse code already unrolls many cases by hand:
    • Single Agg/Join keys of common types, multiple integer keys, some Nullable keys, etc.
  • For complex multi-key cases the options are JIT or manual unrolling. ClickHouse already has JIT support for aggregation, disabled by default.
  • Here templates are used: when there are at most 2 keys, the code is unrolled for a few common types (a minimal sketch follows this list)
  • TODO: with 3 or more keys this approach suffers from severe template bloat; fast paths may need to be added on demand.
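A minimal sketch of the idea (hypothetical names, not TiFlash internals): fixing the key shape at compile time lets the hash map work on a concrete struct, so hashing and equality run without per-row virtual collator calls. For BIN collations the string part is just the (trimmed) raw bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// A fixed two-part aggregation key, e.g. (SearchEngineID, SearchPhrase).
// For BIN collations equality on the string part is plain byte comparison.
struct TwoPartKey
{
    uint64_t intKey;
    std::string strKey;

    bool operator==(const TwoPartKey & other) const
    {
        return intKey == other.intKey && strKey == other.strKey;
    }
};

struct TwoPartKeyHash
{
    size_t operator()(const TwoPartKey & k) const
    {
        size_t h = std::hash<uint64_t>{}(k.intKey);
        // Simple hash-combine; production code would use stronger mixing.
        return h ^ (std::hash<std::string>{}(k.strKey) + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2));
    }
};

// One map type per key shape; the hot insert/find loop is fully concrete.
using TwoKeyAggMap = std::unordered_map<TwoPartKey, uint64_t, TwoPartKeyHash>;
```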

Benchmark

  • limit cpu up to 2000%
  • TiFlash x 1
  • TIFLASH REPLICA x 1
  • original commit: 49ca973
Time(s) Original Optimized Improvement: (Original) / (Optimized) - 1.0
ClickBench
Q14 2.075 1.835 13.08%
Q15 1.21 1.155 4.76%
Q17 3.695 3.365 9.81%
Q18 2.925 2.805 4.28%
Q34 6.4 6.085 5.18%
Q35 6.575 6.28 4.70%
TPCH-100
Q1 7.28 7.08 2.82%

  • TiFlash x 1
  • Data: tpch-10
  • original a8c8cb1
  • limit cpu up to 500%
  • SQL: select max(l_comment) from lineitem;
Time(s) Original Optimized
7.71 7.05
7.75 6.96
7.95 7.07
7.61 7.23
7.83 7.07
AVG 7.77 7.076
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 9.81%

  • Data: tpch-100
  • limit cpu up to 2000%
  • SQL: tpch Q1
Time(s) Original Optimized
11.48 10.68
10.96 10.5
11.18 10.66
11.16 10.67
11.18 10.51
AVG 11.192 10.604
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 5.55%

CPU Cache Pitfalls

tiflash#5996

During internal testing, the storage module showed a performance regression (tiflash#5949). The root cause: commit dfac6a5 replaced the default memcpy with __folly_memcpy. The benchmark attached to that commit shows that __folly_memcpy performs worse than the previous implementation on small copies (size < 80 B), but such small copies are only a minor share of the storage workload, so in theory they should not have caused such a large regression.

Non-temporal Memory Copy

Reading the code shows that __folly_memcpy defines a threshold, NON_TEMPORAL_STORE_THRESHOLD (32768, i.e. 32 KB): if the copy size exceeds the threshold and both addresses are aligned, it uses NON-TEMPORAL stores to reduce CPU cache pollution.

Modern CPUs generally use multi-level caches. Taking the AMD Ryzen 9 5900X as an example, its parameters under WSL2 are:

getconf -a | grep CACHE
LEVEL1_ICACHE_SIZE 32768
LEVEL1_ICACHE_ASSOC 8
LEVEL1_ICACHE_LINESIZE 64
LEVEL1_DCACHE_SIZE 32768
LEVEL1_DCACHE_ASSOC 8
LEVEL1_DCACHE_LINESIZE 64
LEVEL2_CACHE_SIZE 524288
LEVEL2_CACHE_ASSOC 8
LEVEL2_CACHE_LINESIZE 64
LEVEL3_CACHE_SIZE 67108864
LEVEL3_CACHE_ASSOC 0
LEVEL3_CACHE_LINESIZE 64
LEVEL4_CACHE_SIZE 0
LEVEL4_CACHE_ASSOC 0
LEVEL4_CACHE_LINESIZE 0

Typically each core has its own L1 and L2 caches, while the L3 cache is shared by all cores. The L1 cache is split into L1i (instructions) and L1d (data); L2 and L3 do not distinguish instructions from data. Per 7-cpu.com/cpu/Zen2 and amd/microarchitectures/zen_3, the approximate data-cache latencies are:

L1 Data Cache Latency:
4 cycles for simple access via pointer
5 cycles for access with complex address calculation (size_t n, *p; n = p[n]).

AMD DOC: 7- or 8-cycle FPU load-to-use latency.

L2 Cache Latency = 12 cycles

L3 Cache Latency = 38 cycles

RAM Latency = 38 cycles + 66 ns

In the vast majority of cases, the multi-level cache significantly speeds up memory reads and writes. But for two known scenarios, keeping the data in the CPU cache brings little benefit:

  • Temporary copies, where the data will not be used again soon
  • Large copies, where the amount of data exceeds the CPU cache; these tend to evict other modules' hot cache lines and pollute the cache

glibc's implementation takes this into account as well. For example, in Debian GLIBC 2.28-10, one path in memcpy checks the copy size: above 60 MB (mallwatch@@GLIBC_2.2.5+0x8), it uses the movntdq instruction to write data from registers to memory. Such instructions carry a Non-temporal Hint so the CPU avoids caching the data. The corresponding movntdqa instruction reads data from memory into registers.

/lib/x86_64-linux-gnu/libc.so.6

GNU C Library (Debian GLIBC 2.28-10) stable release version 2.28.
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 8.3.0.
libc ABIs: UNIQUE IFUNC ABSOLUTE
For bug reporting instructions, please see:
<http://www.debian.org/Bugs/>.
# Debian GNU/Linux 10.9
objdump -r -C -D /lib/x86_64-linux-gnu/libc.so.6

a2232: 48 3b 15 3f e2 11 00 cmp 0x11e23f(%rip),%rdx # 1c0478 <mallwatch@@GLIBC_2.2.5+0x8>
a2239: 0f 87 cc 00 00 00 ja a230b <memcpy@GLIBC_2.2.5+0x2eb>
...
a230b: 4c 8d 14 17 lea (%rdi,%rdx,1),%r10
a230f: 4c 39 d6 cmp %r10,%rsi
a2312: 0f 82 27 ff ff ff jb a223f <memcpy@GLIBC_2.2.5+0x21f>
a2318: 0f 18 8e 80 00 00 00 prefetcht0 0x80(%rsi)
a231f: 0f 18 8e c0 00 00 00 prefetcht0 0xc0(%rsi)
a2326: 0f 10 06 movups (%rsi),%xmm0
a2329: 0f 10 4e 10 movups 0x10(%rsi),%xmm1
a232d: 0f 10 56 20 movups 0x20(%rsi),%xmm2
a2331: 0f 10 5e 30 movups 0x30(%rsi),%xmm3
a2335: 48 83 c6 40 add $0x40,%rsi
a2339: 48 83 ea 40 sub $0x40,%rdx
a233d: 66 0f e7 07 movntdq %xmm0,(%rdi)
a2341: 66 0f e7 4f 10 movntdq %xmm1,0x10(%rdi)
a2346: 66 0f e7 57 20 movntdq %xmm2,0x20(%rdi)
a234b: 66 0f e7 5f 30 movntdq %xmm3,0x30(%rdi)
a2350: 48 83 c7 40 add $0x40,%rdi
a2354: 48 83 fa 40 cmp $0x40,%rdx
a2358: 77 be ja a2318 <memcpy@GLIBC_2.2.5+0x2f8>
a235a: 0f ae f8 sfence
a235d: 0f 11 29 movups %xmm5,(%rcx)
a2360: 0f 11 71 f0 movups %xmm6,-0x10(%rcx)
a2364: 0f 11 79 e0 movups %xmm7,-0x20(%rcx)
a2368: 44 0f 11 41 d0 movups %xmm8,-0x30(%rcx)
a236d: 41 0f 11 23 movups %xmm4,(%r11)
a2371: c3 retq
// test.1.cpp
#include <cstdint>
#include <cstdio>

extern uint64_t mallwatch;

int main()
{
    // Read the word at mallwatch@@GLIBC_2.2.5+0x8, i.e. the non-temporal threshold
    printf("%.2f MB\n", (&mallwatch)[1] / 1024.0 / 1024.0);
    return 0;
}
filename=`mktemp` && clang++ -fPIC test.1.cpp -o ${filename} && ${filename} && rm ${filename}
60.00 MB

Since main memory latency is orders of magnitude higher than CPU cache latency, copying data that is about to be reused with NON-TEMPORAL stores severely hurts performance. This is especially true for OLAP engines, which constantly work with large blocks of memory. So when should NON-TEMPORAL mode be enabled?

  • Common virtualized server CPUs have roughly L1 32KB, L2 256KB, L3 25MB~64MB
  • folly uses a fixed threshold of 32KB
  • Debian GLIBC 2.28-10 uses a global variable as the threshold, set at process initialization to roughly the L3 cache size
    • Other versions instead derive the threshold from the L1 cache size; implementations vary considerably across environments
  • TiFlash does not use this technique by default
    • For most storage and compute paths, the smallest data unit is a Block (usually 8192 rows). Even a single int64 column is at least 8192 * 8 = 64KB; a str column additionally carries at least 2 uint64 fields per row for offset and size, so its total exceeds 8192 * (8 + 8 + 1) = 136KB
    • A reasonable threshold for enabling NON-TEMPORAL inside memcpy is at least larger than the L2 cache, and the exact value should be tuned to the workload
    • In practice, when a large temporary copy is needed, it is better to choose a NON-TEMPORAL implementation explicitly in the logic rather than relying on memcpy (a minimal sketch follows this list)
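A minimal sketch of such an explicit non-temporal copy, assuming x86-64 with AVX2, a 32-byte-aligned dst, and a size that is a multiple of 32 (real code needs head/tail handling):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Streaming stores bypass the cache hierarchy, so they only pay off for
// large one-shot copies whose destination will not be read back soon.
void non_temporal_copy(void * dst, const void * src, size_t size)
{
    auto * d = static_cast<uint8_t *>(dst);
    const auto * s = static_cast<const uint8_t *>(src);
    for (size_t i = 0; i < size; i += 32)
    {
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(s + i));
        _mm256_stream_si256(reinterpret_cast<__m256i *>(d + i), v); // non-temporal hint
    }
    _mm_sfence(); // order the streaming stores before any subsequent access
}
```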

Optimize memcpy

tiflash#6281

avx2_inline_memcpy

  • An inline implementation must stay simple and compact, even at a small cost in performance (for this kind of special-purpose code, compiler-generated assembly is generally worse than hand-written assembly); otherwise a non-inline assembly implementation is the better choice.
  • Set up a fast path for sizes of at most 32 bytes
    • Sizes are split at 16, 8, 4, 2, matching register widths
    • To make benchmark numbers look better, the branches can be ordered to favor larger sizes first; this implementation checks 8~16 before 16~32
    • In real workloads, the branch prediction / jump overhead affected by this ordering is comparatively small
  • Since memcpy requires that the src and dst ranges not overlap, copy one register width from the front and one from the back (see the sketch after this list)
  • For sizes above 256, first align the dst address to 32 bytes (the ymm register width)
  • Software prefetch instructions gain little for data already in the CPU cache; for sequential reads and writes, modern CPUs mostly have hardware prefetchers that handle this
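For illustration, a sketch of one tier of that fast path (the (8, 16] tier; assumptions in the comments, not TiFlash's exact code):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Loading 8 bytes from the front and 8 from the back covers any size in
// (8, 16] without a length loop; the two loads may overlap inside src,
// which is fine because only the src and dst ranges must not overlap.
// The real implementation has analogous tiers for 16/8/4/2 and a ymm tier.
static inline void copy_8_to_16(void * dst, const void * src, size_t size)
{
    uint64_t head;
    uint64_t tail;
    std::memcpy(&head, src, 8);                                       // first 8 bytes
    std::memcpy(&tail, static_cast<const char *>(src) + size - 8, 8); // last 8 bytes
    std::memcpy(dst, &head, 8);
    std::memcpy(static_cast<char *>(dst) + size - 8, &tail, 8);
}
```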

Benchmark of avx2_inline_memcpy

Run on (40 X 2386.24 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x20)
L1 Instruction 32 KiB (x20)
L2 Unified 256 KiB (x20)
L3 Unified 25600 KiB (x2)
Load Average: 6.72, 6.65, 6.07
-------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------------------------------------------------
MemUtilsCopy_1_20_3_true_20000/stl_mempy/iterations:500 446194 ns 446197 ns 500
MemUtilsCopy_1_20_3_true_20000/inline_clickhouse_memcpy/iterations:500 140571 ns 140528 ns 500
MemUtilsCopy_1_20_3_true_20000/sse2_memcpy/iterations:500 150496 ns 150321 ns 500
MemUtilsCopy_1_20_3_true_20000/avx2_memcpy/iterations:500 137616 ns 137586 ns 500
MemUtilsCopy_1_20_3_true_20000/folly_memcpy/iterations:500 181997 ns 182004 ns 500
MemUtilsCopy_1_40_3_true_20000/stl_mempy/iterations:500 456534 ns 456549 ns 500
MemUtilsCopy_1_40_3_true_20000/inline_clickhouse_memcpy/iterations:500 176257 ns 176264 ns 500
MemUtilsCopy_1_40_3_true_20000/sse2_memcpy/iterations:500 178094 ns 178086 ns 500
MemUtilsCopy_1_40_3_true_20000/avx2_memcpy/iterations:500 165250 ns 165256 ns 500
MemUtilsCopy_1_40_3_true_20000/folly_memcpy/iterations:500 186537 ns 186544 ns 500
MemUtilsCopy_1_80_3_true_20000/stl_mempy/iterations:500 476534 ns 476510 ns 500
MemUtilsCopy_1_80_3_true_20000/inline_clickhouse_memcpy/iterations:500 238987 ns 238996 ns 500
MemUtilsCopy_1_80_3_true_20000/sse2_memcpy/iterations:500 302893 ns 302887 ns 500
MemUtilsCopy_1_80_3_true_20000/avx2_memcpy/iterations:500 231339 ns 231348 ns 500
MemUtilsCopy_1_80_3_true_20000/folly_memcpy/iterations:500 200658 ns 200665 ns 500
MemUtilsCopy_1_200_3_true_20000/stl_mempy/iterations:500 660466 ns 660356 ns 500
MemUtilsCopy_1_200_3_true_20000/inline_clickhouse_memcpy/iterations:500 440048 ns 439948 ns 500
MemUtilsCopy_1_200_3_true_20000/sse2_memcpy/iterations:500 427743 ns 427739 ns 500
MemUtilsCopy_1_200_3_true_20000/avx2_memcpy/iterations:500 198850 ns 198876 ns 500
MemUtilsCopy_1_200_3_true_20000/folly_memcpy/iterations:500 243985 ns 243973 ns 500
MemUtilsCopy_1_2000_3_true_20000/stl_mempy/iterations:500 1535333 ns 1535242 ns 500
MemUtilsCopy_1_2000_3_true_20000/inline_clickhouse_memcpy/iterations:500 847673 ns 847547 ns 500
MemUtilsCopy_1_2000_3_true_20000/sse2_memcpy/iterations:500 793617 ns 793608 ns 500
MemUtilsCopy_1_2000_3_true_20000/avx2_memcpy/iterations:500 682924 ns 682915 ns 500
MemUtilsCopy_1_2000_3_true_20000/folly_memcpy/iterations:500 662669 ns 662652 ns 500

When the amount of data copied is large (e.g. more than 2x the L3 cache size), memory I/O becomes the bottleneck and the various implementations perform about the same.

Optimize memcmp / memequal / memchr / strstr

tiflash#5658

  • Implemented avx2_mem_cmp and avx2_mem_equal based on AVX2
    • The previous avx512-based mem_utils::memoryEqual was slower than std::memcmp for small string comparisons, so it was reimplemented on AVX2
  • Implemented the string-search primitives avx2_memchr and avx2_strstr based on AVX2

Implementation Of Mem Utils

avx2_mem_cmp
avx2_mem_equal

  • Set up a fast path for sizes of at most 32
    • This implementation uses a fairly simple switch-case; the sizes could also be partitioned with the branch order tuned as needed
  • For sizes above 256, first align the address to 32 bytes (the ymm register width)
  • mem cmp is slightly more involved than mem equal; both use SIMD instructions to test batches of bytes for equality, but on a mismatch mem cmp must also locate the differing byte and load it into a 32-bit register to subtract (a sketch of the equality path follows this list)
  • The memory is processed sequentially, so prefetching is unnecessary for now
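A minimal sketch of the equality path under these assumptions (AVX2 available; tail handling kept trivial):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstring>

// _mm256_cmpeq_epi8 compares 32 bytes lane-by-lane at once, and
// _mm256_movemask_epi8 collapses the 32 lane results into one bitmask.
bool avx2_equal(const char * a, const char * b, size_t size)
{
    size_t i = 0;
    for (; i + 32 <= size; i += 32)
    {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(b + i));
        // All 32 bytes equal <=> every lane is 0xFF <=> mask is all ones.
        if (_mm256_movemask_epi8(_mm256_cmpeq_epi8(va, vb)) != -1)
            return false;
    }
    return std::memcmp(a + i, b + i, size - i) == 0; // scalar tail
}
```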

avx2_memchr
avx2_strstr

  • There are many algorithms for direct string matching; classics like KMP / Aho–Corasick build state machines. std::string_view::find instead calls memchr to locate the first byte of the target, then uses bcmp to test whether the substring matches, repeating until done.
  • It is fair to assume that, in the common case, the data being matched is meaningful and has some structure, so the extra memory and construction cost of a state-machine algorithm actually makes it less efficient.
    • Non-naive scenarios can be optimized separately, e.g. with the inverted indexes used by search engines.
  • avx2_strstr finds the first target byte with avx2_memchr, then tests equality with avx2_mem_equal, looping until a full match (a sketch of this loop follows this list)
  • avx2_strstr.h#L195-L248: since in the vast majority of cases the target string is small, sizes of at most 16 are wrapped in templates and further inlined to reduce branching
  • Notes on handling memory alignment: avx2_strstr.h#L126-L159
    • Memory allocators request and manage memory from the OS in units of Pages
      • A Page is usually 4KB, and at least one block (512B)
      • If memory address S is valid, then every address in [ALIGN_TO_PAGE_SIZE(S), ALIGN_TO_PAGE_SIZE(S) + PAGE_SIZE) is valid
    • To reduce branches, the address is first aligned down to 32 and the flag bits are computed with AVX2 instructions; the positions introduced by over-reading are then masked off using the alignment offset. The tail handling in avx2_strstr.h#L35-L58 works the same way.
  • avx2_strstr.h#L77-L116 processes batches of 128 bytes, with an optimistic check to filter first
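A scalar sketch of the overall search loop, with std::memchr / std::memcmp standing in for avx2_memchr / avx2_mem_equal:

```cpp
#include <cstddef>
#include <cstring>

// Locate the needle's first byte, verify the rest, and repeat until a
// full match is found or the remaining src is too short to hold the needle.
const char * simple_strstr(const char * src, size_t n, const char * needle, size_t m)
{
    if (m == 0)
        return src;
    const char * end = src + n;
    while (src + m <= end)
    {
        // stand-in for avx2_memchr: find the next candidate position
        const char * p = static_cast<const char *>(std::memchr(src, needle[0], (end - src) - (m - 1)));
        if (p == nullptr)
            return nullptr;
        // stand-in for avx2_mem_equal: verify the remaining m - 1 bytes
        if (std::memcmp(p + 1, needle + 1, m - 1) == 0)
            return p;
        src = p + 1;
    }
    return nullptr;
}
```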

Benchmark Of Mem Utils

  • See the test cases in Benchmark Collation Impact
  • original 8404e656
Time(ns) Original Optimized Improvement: (Original) / (Optimized) - 1.0
CollationEqBench/UTF8MB4_BIN 12428711 6228798 99.54%
CollationEqBench/UTF8_BIN 12956705 6141843 110.96%
CollationEqBench/ASCII_BIN 12625723 6229335 102.68%
CollationEqBench/BINARY 11870078 5837615 103.34%
CollationEqBench/LATIN1_BIN 13768201 6732640 104.50%
CollationLikeBench/UTF8MB4_BIN 37940667 20185747 87.96%
CollationLikeBench/UTF8_BIN 37803575 19914106 89.83%
CollationLikeBench/ASCII_BIN 36860160 17999743 104.78%
CollationLikeBench/BINARY 37449881 17599053 112.79%
CollationLikeBench/LATIN1_BIN 37503432 17675036 112.18%
  • Compare the performance of bcmp from the standard library, mem_utils::memoryEqual, and the custom avx2_mem_equal
  • Compare the performance of std::string_view::find from the standard library and the custom avx2_strstr
Time(ns) STL Original-avx512 Optimized-avx2 Improvement: (STL) / (Optimized) - 1.0 Improvement: (Original) / (Optimized) - 1.0
check mem eq: MemUtilsEqual_${str-size}
MemUtilsEqual_13 4.46 10.3 3.21 38.94% 220.87%
MemUtilsEqual_65 5.25 9.83 4.44 18.24% 121.40%
MemUtilsEqual_100 9.31 11.3 5.32 75.00% 112.41%
MemUtilsEqual_10000 299 377 213 40.38% 77.00%
MemUtilsEqual_100000 3657 4009 3382 8.13% 18.54%
MemUtilsEqual_1000000 62265 53157 52600 18.37% 1.06%
str find: MemUtilsStrStr_${src-str-size}_${needle-str-size}
MemUtilsStrStr_1024_1 30882 21275 45.16%
MemUtilsStrStr_1024_7 34927 21279 64.14%
MemUtilsStrStr_1024_15 39364 23161 69.96%
MemUtilsStrStr_1024_31 40628 29435 38.03%
MemUtilsStrStr_1024_63 37381 26141 43.00%
MemUtilsStrStr_80_1 6130 3977 54.14%
MemUtilsStrStr_80_7 11720 6278 86.68%
MemUtilsStrStr_80_15 11585 5423 113.63%
MemUtilsStrStr_80_31 11467 9530 20.33%
  • Compare memcmp from the standard library with the custom avx2_mem_cmp
Time(ns) STL: (GNU libc) 2.17 Optimized-avx2 Improvement: (STL) / (Optimized) - 1.0
MemUtilsCmp_${str-size}_${loop_times}: check mem-cmp for str for specific times
MemUtilsCmp_2_20 66.5 51.9 28.13%
MemUtilsCmp_13_20 75.3 66 14.09%
MemUtilsCmp_65_20 126 106 18.87%
MemUtilsCmp_100_20 167 106 57.55%
MemUtilsCmp_10000_20 5145 3740 37.57%
MemUtilsCmp_100000_20 81996 68577 19.57%
MemUtilsCmp_1000000_20 1254279 1112721 12.72%
  • Compare LIKE() expression evaluation using avx2_strstr versus std::string_view::find
select count(1) from orders where o_comment like '%pending%deposits%';
Time(s) Original Optimized
10.75 8.72
10.92 8.87
10.98 8.35
10.7 8.64
10.77 8.5
AVG 10.824 8.616
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 25.63%

Optimize String Comparison

String Comparison

tiflash#5299

  • Earlier versions introduced the virtual functions compare / sortKey for collation-aware string comparison and sort-key generation. That approach is inefficient for compute-heavy scenarios.
  • Here the str-column path is devirtualized for TiDB's default BIN-family collations (a minimal sketch follows this list).
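A minimal sketch of what the devirtualization looks like (illustrative, not the actual TiFlash code): the column loop is instantiated with a statically known comparator, e.g. lessColumn<BinComparator>(...), so nothing is dispatched virtually per row.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// An inlineable comparator for the BIN case: plain byte comparison.
struct BinComparator
{
    static int compare(std::string_view a, std::string_view b) { return a.compare(b); }
};

template <typename Cmp>
void lessColumn(const std::vector<std::string_view> & a,
                const std::vector<std::string_view> & b,
                std::vector<uint8_t> & result)
{
    result.resize(a.size());
    for (size_t i = 0; i < a.size(); ++i)
        result[i] = Cmp::compare(a[i], b[i]) < 0; // no virtual dispatch inside the loop
}
```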

Benchmark

  • Data: tpch-10
  • tiflash x 1
  • limit cpu up to 200%
MySQL [tpch_10]> select count(1) from lineitem where l_comment = 'zzle? slyly regular instruc       ';
+----------+
| count(1) |
+----------+
| 1 |
+----------+
Time(s) Original Optimized NoCollation
2.46 1.56 1.48
2.41 1.56 1.38
2.36 1.47 1.43
2.35 1.55 1.4
2.44 1.49 1.41
AVG 2.404 1.526 1.42
Improvement: Original/NoCollation - 1.0 = 69.30%; Original/Optimized - 1.0 = 57.54%; Optimized/NoCollation - 1.0 = 7.46%

  • SQL: select count(1) from lineitem where l_comment < l_comment;
Time(s) Original Optimized NoCollation
2.32 2.07 1.91
2.27 2.06 1.95
2.33 2.07 2.05
2.33 2.09 1.99
2.23 2.08 2.04
AVG 2.296 2.074 1.988
Improvement: Original/NoCollation - 1.0 = 15.49%; Original/Optimized - 1.0 = 10.70%; Optimized/NoCollation - 1.0 = 4.33%

Constant String Comparison

tiflash#5569

  • Focuses on comparing a str column against a constant str, as in select ... from ... where xxx = 'xxx' ..., e.g. select count(*) from github_events where actor_login != 'zzzzzzz'
  • CollationOperatorOptimized.h#L261-L290 unrolls the computation via templates for constant strs of size [0, 16] (a minimal sketch follows this list)
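A minimal sketch of the template unrolling (illustrative names): dispatch on the constant's length once, then compare with a compile-time-fixed size inside the loop, which the compiler can lower to a few register compares.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <vector>

// The constant's length N is fixed for the whole column, so each per-row
// check is a fixed-size memcmp with no length-dependent branching.
template <size_t N>
void eqConst(const std::vector<std::string_view> & col, const char * c, std::vector<uint8_t> & result)
{
    result.resize(col.size());
    for (size_t i = 0; i < col.size(); ++i)
        result[i] = col[i].size() == N && std::memcmp(col[i].data(), c, N) == 0;
}

// Dispatch happens once, outside the row loop, e.g.:
//   switch (constant.size()) { case 4: eqConst<4>(col, constant.data(), res); break; /* ... */ }
```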

Benchmark

  • tpch-100
  • tiflash x 1
  • limit cpu up to 200%
  • original commit: 30fc64c
  • SQL: select count(1) from lineitem where L_SHIPMODE = 'zzzz';
Time(s) Original Optimized
9.15 7.52
9.33 7.62
9.12 7.58
9.23 7.57
9.14 7.65
AVG 9.194 7.588
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 21.16%

SQL (based on tpch-q10): select count(1) from lineitem where L_RETURNFLAG = 'R';

Time(s) Original Optimized
12.85 8.56
12.87 8.64
12.86 8.45
12.75 8.51
12.76 8.64
AVG 12.818 8.56
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 49.74%

Optimize String Search

tiflash#5489

  • The early implementation (Collator.cpp#L71-L172) decoded characters one by one according to the collation and matched them with a state machine. It involved too many virtual calls, and the algorithm itself was inefficient.
  • Focuses on optimizing how BIN collations evaluate the LIKE() ESCAPE() expressions: CollationStringSearchOptimized.h#L28-L419
    • utf8 characters use an unambiguous prefix encoding, and binary is simpler still, just raw bytes. For case-sensitive matching, the pattern string can be split up front and matched with plain substring searches, which lend themselves to SIMD vectorization (a minimal sketch follows this list).
    • ASCIICaseSensitiveStringSearcher and Volnitsky, inherited from ClickHouse, mainly target search engines and do worse than std::string_view::find() on small strings. Algorithms like Aho–Corasick need complex state machines and, per the analysis above, would likely do even worse.
    • This PR initially used std::string_view::find(); the later avx2_strstr optimization added another 25.63% on top of it.
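A minimal sketch of the splitting idea ('_' wildcards and ESCAPE handling omitted; assumes the pattern starts and ends with '%', as in '%pending%deposits%' -> {"pending", "deposits"}):

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Match the plain parts left to right with successive substring searches;
// each search resumes after the end of the previous match.
bool likeMatch(std::string_view s, const std::vector<std::string> & parts)
{
    size_t pos = 0;
    for (const auto & part : parts)
    {
        size_t found = s.find(part, pos); // stand-in for avx2_strstr
        if (found == std::string_view::npos)
            return false;
        pos = found + part.size();
    }
    return true;
}
```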

Benchmark

  • tpch-100
  • tiflash x 1
  • limit cpu up to 200%
  • original commit: a476307
  • SQL: select count(1) from orders where o_comment like '%pending%deposits%';
Time(s) Original Optimized
35.77 11
33.84 10.88
35.32 11.11
34.82 11.21
34.94 10.97
AVG 34.938 11.034
Improvement: AVG(Original) / AVG(Optimized) - 1.0 = 216.64%

Optimize String Sorting

tiflash#5375

  • Devirtualize for BIN collations
    • TiDB implements padding differently from MySQL. MySQL pads by appending trailing spaces; TiDB pads by trimming trailing spaces. The two behave identically in almost all cases; the only exception is a string ending with characters that sort below the space (0x20). For example, 'a' < 'a\t' evaluates to 1 in TiDB, whereas in MySQL it is equivalent to 'a ' < 'a\t' and evaluates to 0.
  • Assuming that in most workloads strings have no trailing spaces, check optimistically and take a fast path: CollatorUtils.h#L55-L61 (a minimal sketch follows this list)
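A minimal sketch of this fast path (illustrative, mirroring the idea rather than the exact code in CollatorUtils.h):

```cpp
#include <string_view>

// TiDB-style PADDING semantics trim trailing spaces (0x20) before comparing.
inline std::string_view rightTrim(std::string_view v)
{
    while (!v.empty() && v.back() == ' ')
        v.remove_suffix(1);
    return v;
}

inline int binPaddingCompare(std::string_view a, std::string_view b)
{
    // Fast path: neither side ends with a space, so compare raw bytes directly.
    if ((a.empty() || a.back() != ' ') && (b.empty() || b.back() != ' '))
        return a.compare(b);
    // Slow path: trim trailing spaces first.
    return rightTrim(a).compare(rightTrim(b));
}
```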

Benchmark

  • tpch-100
  • tiflash x 1
  • limit cpu up to 200%
  • original commit: 97342db
  • SQL: select min(L_SHIPMODE) from lineitem;
Time(s) Original Optimized NoCollation
11.35 9.98 9.81
11.33 9.98 9.88
11.23 10 9.84
11.22 9.82 9.73
11.58 9.95 9.96
AVG 11.342 9.946 9.844
Improvement: Original/NoCollation - 1.0 = 15.22%; Original/Optimized - 1.0 = 14.04%; Optimized/NoCollation - 1.0 = 1.04%

  • SQL select max(L_SHIPMODE) from lineitem;
Time(s) Original Optimized NoCollation
13.56 12.62 12.77
13.74 12.51 12.27
13.35 12.61 12.32
13.63 12.63 12.45
13.52 12.66 12.32
AVG 13.56 12.606 12.426
Improvement: Original/NoCollation - 1.0 = 9.13%; Original/Optimized - 1.0 = 7.57%; Optimized/NoCollation - 1.0 = 1.45%

Others

Hardware Prefetch

Wiki/Cache-Prefetching

Check whether Hardware Prefetch is enabled: ref deater/uarch-configure/intel-prefetch

/* Disable the hardware prefetcher on:        */
/* Core2 */
/* Nehalem, Westmere, SandyBridge, IvyBridge, Haswell and Broadwell */
/* See: https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors */
/* */
/* The key is MSR 0x1a4 */
/* bit 0: L2 HW prefetcher */
/* bit 1: L2 adjacent line prefetcher */
/* bit 2: DCU (L1 Data Cache) next line prefetcher */
/* bit 3: DCU IP prefetcher (L1 Data Cache prefetch based on insn address) */
/* */
/* This code uses the /dev/msr interface, and you'll need to be root. */
/* */
/* by Vince Weaver, vincent.weaver _at_ maine.edu -- 26 February 2016 */

#define CORE2_PREFETCH_MSR 0x1a0
#define NHM_PREFETCH_MSR 0x1a4

#include <errno.h>
#include <fcntl.h>
#include <inttypes.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#include <sys/syscall.h>

static int open_msr(int core) {

    char msr_filename[BUFSIZ];
    int fd;

    sprintf(msr_filename, "/dev/cpu/%d/msr", core);
    fd = open(msr_filename, O_RDONLY);
    if (fd < 0) {
        if (errno == ENXIO) {
            fprintf(stderr, "rdmsr: No CPU %d\n", core);
            exit(2);
        } else if (errno == EIO) {
            fprintf(stderr, "rdmsr: CPU %d doesn't support MSRs\n", core);
            exit(3);
        } else {
            perror("rdmsr:open");
            fprintf(stderr, "Trying to open %s\n", msr_filename);
            exit(127);
        }
    }

    return fd;
}

static long long read_msr(int fd, int which) {

    uint64_t data;

    if (pread(fd, &data, sizeof data, which) != sizeof data) {
        perror("rdmsr:pread");
        exit(127);
    }

    return (long long)data;
}

/* FIXME: should really error out if not an Intel CPU */
static int detect_cpu(void) {

    FILE *fff;

    int family, model = -1;
    char buffer[BUFSIZ], *result;
    char vendor[BUFSIZ];
    int is_core2 = -1;

    fff = fopen("/proc/cpuinfo", "r");
    if (fff == NULL)
        return -1;

    while (1) {
        result = fgets(buffer, BUFSIZ, fff);
        if (result == NULL)
            break;

        if (!strncmp(result, "vendor_id", 8)) {
            sscanf(result, "%*s%*s%s", vendor);

            if (strncmp(vendor, "GenuineIntel", 12)) {
                printf("%s not an Intel chip\n", vendor);
                return -1;
            }
        }

        if (!strncmp(result, "cpu family", 10)) {
            sscanf(result, "%*s%*s%*s%d", &family);
            if (family != 6) {
                printf("Wrong CPU family %d\n", family);
                return -1;
            }
        }

        if (!strncmp(result, "model", 5)) {
            sscanf(result, "%*s%*s%d", &model);
        }
    }

    fclose(fff);

    switch (model) {
    case 26:
    case 30:
    case 31: /* nhm */
        printf("Found Nehalem CPU\n");
        is_core2 = 0;
        break;

    case 46: /* nhm-ex */
        printf("Found Nehalem-EX CPU\n");
        is_core2 = 0;
        break;

    case 37:
    case 44: /* wsm */
        printf("Found Westmere CPU\n");
        is_core2 = 0;
        break;

    case 47: /* wsm-ex */
        printf("Found Westmere-EX CPU\n");
        is_core2 = 0;
        break;

    case 42: /* snb */
        printf("Found Sandybridge CPU\n");
        is_core2 = 0;
        break;

    case 45: /* snb-ep */
        printf("Found Sandybridge-EP CPU\n");
        is_core2 = 0;
        break;

    case 58: /* ivb */
        printf("Found Ivybridge CPU\n");
        is_core2 = 0;
        break;

    case 62: /* ivb-ep */
        printf("Found Ivybridge-EP CPU\n");
        is_core2 = 0;
        break;

    case 60:
    case 69:
    case 70: /* hsw */
        printf("Found Haswell CPU\n");
        is_core2 = 0;
        break;

    case 63: /* hsw-ep */
        printf("Found Haswell-EP CPU\n");
        is_core2 = 0;
        break;

    case 61:
    case 71: /* bdw */
        printf("Found Broadwell CPU\n");
        is_core2 = 0;
        break;

    case 86:
    case 79: /* bdw-DE/EP */
        printf("Found Broadwell-DE/EP CPU\n");
        is_core2 = 0;
        break;

    case 78:
    case 94: /* Skylake */
        printf("Found Skylake CPU\n");
        is_core2 = 0;
        break;

    case 85: /* Skylake / Cascade Lake Server */
        printf("Found Skylake / Cascadelake Server CPU\n");
        is_core2 = 0;
        break;

    case 142:
    case 158: /* Kabylake */
        printf("Found Kabylake CPU\n");
        is_core2 = 0;
        break;

    /* Core 2 */

    case 15:
    case 22:
    case 23:
    case 29: /* core2 */
        printf("Found Core2 CPU\n");
        is_core2 = 1;
        break;

    default:
        printf("Unsupported model %d\n", model);
        is_core2 = -1;
        break;
    }

    return is_core2;
}

/* Show prefetch settings on Nehalem and newer */
static int show_prefetch_nhm(int core) {

    int fd;
    int result;
    int begin, end, c;

    printf("Show all prefetch\n");

    if (core == -1) {
        begin = 0;
        end = 1024;
    } else {
        begin = core;
        end = core;
    }

    for (c = begin; c <= end; c++) {

        fd = open_msr(c);
        if (fd < 0)
            break;

        /* Read original results */
        result = read_msr(fd, NHM_PREFETCH_MSR);

        printf("\tCore %d old : L2HW=%c L2ADJ=%c DCU=%c DCUIP=%c\n", c,
               result & 0x1 ? 'N' : 'Y', result & 0x2 ? 'N' : 'Y',
               result & 0x4 ? 'N' : 'Y', result & 0x8 ? 'N' : 'Y');

        close(fd);
    }

    return 0;
}

int main(int argc, char **argv) {

    int c = detect_cpu();
    if (c < 0) {
        printf("Unsupported CPU type\n");
        return -1;
    }

    show_prefetch_nhm(0);

    return 0;
}