背景
之前一直对索引分析中 type range 有误解
下面是官方文档对 type range 的解释
range
can be used when a key column is compared to a constant using any of the=
,<>
,>
,>=
,<
,<=
,IS NULL
,<=>
,BETWEEN
,LIKE
, orIN()
operators:
1
2
3
4
5
6
7
8
9
10
11 SELECT * FROM tbl_name
WHERE key_column = 10;
SELECT * FROM tbl_name
WHERE key_column BETWEEN 10 and 20;
SELECT * FROM tbl_name
WHERE key_column IN (10,20,30);
SELECT * FROM tbl_name
WHERE key_part1 = 10 AND key_part2 IN (10,20,30);
IN
查询下的 range 什么时候会恶化为 ALL
呢?按照我之前的理解,可能有两点
- 数量太多;这点是一个老生常谈的问题,例如
IN
需要不能传过多的值,否则优化器不会选择走索引 - 范围太大;事实证明,
IN
和范围无关
上述两点中,第二点是错误的,范围太大并不会导致 IN
的索引
type 退化为 ALL
这篇文档就是进行验证和记录答案
理清概念
type
在 EXLAIN 中,type 指的是 join type
The
type
column ofEXPLAIN
output describes how tables are joined. In JSON-formatted output, these are found as values of theaccess_type
property.
对于 type 的好坏这里就不再赘述
从好到坏为:
system
该表只有一行(系统表);特殊的 constconst
该表最多有一个匹配行,该行在查询开始时读取eq_ref
对于前一个表中的每一行组合,都会从该表中读取一行;出现在联表查询时,当联接使用索引的所有部分,并且该索引是 PRIMARY KEY 或 UNIQUE NOT NULL 索引时ref
对于前一个表中的每一个行组合,都会从此表中读取索引值匹配的所有行;出现在联表查询时,如果联接不能根据键值选择一行fulltext
联表查询使用 FULLTEXT 索引ref_or_null
类似于ref
,但 MySQL 会对包含NULL
值的行进行额外搜索index_merge
使用了索引合并(Index Merge)优化unique_subquery
含有eq_ref
的子查询index_subquery
类似unique_subquery
,但索引是非唯一约束range
只检索给定范围内的行,使用索引选择行index
与ALL
相同,只是扫描了索引树;区别在于索引覆盖或者按照索引顺序回表ALL
完整的表扫描
也就是说 type 表达的是 JOIN 下索引的执行类型(单表可以理解为特殊的 JOIN)
rows
The
rows
column indicates the number of rows MySQL believes it must examine to execute the query.For
InnoDB
tables, this number is an estimate, and may not always be exact.
这里有一个疑问,rows 指的是扫描索引的行数还是回表的行数?
官方文档只表达为 must examine to execute
下面这里做一个验证,我有这样一个表
1 | CREATE TABLE `city` ( |
执行下面的分析
1 | EXPLAIN SELECT * FROM city WHERE CountryCode = "BRA" AND name = "Fortaleza"; |
可以看到使用了索引 CountryCode
,type 为
ref
,行数为 250,我们表中 CountryCode = "BRA"
对应的数据就是 250 行
如果我们在索引上加上 name
,可以猜测使用新的索引 rows
应该是 1 行
1 | ALTER TABLE city ADD KEY `idx_CountryCode_name`(`CountryCode`,`name`); |
再次执行分析
1 | EXPLAIN SELECT * FROM city WHERE CountryCode = "BRA" AND name = "Fortaleza"; |
和猜想的没错,rows 其实表示的是回表的行数
如果我们将语句修改为只查询 name 呢?会产生索引覆盖,那么 rows 如何显示?
1 | EXPLAIN SELECT name FROM city WHERE CountryCode = "BRA" AND name = "Fortaleza"; |
可以看到 Extra 告诉我们 Using index
使用了索引覆盖,但是
rows 还是 1
不过不管怎么说,我们可以认为 rows 表达的是回表的行,起码不是代表扫过的索引;事实上从语义上也可以这么理解,索引中也不会有 row 这个概念(BTree)
验证
和数量有关
1 | EXPLAIN SELECT * FROM city WHERE CountryCode IN ("ABW", "AFG", "AGO"); |
可以看到上面这个语句成功走了索引,type 为 range
如果加大 IN
中的值数量
1 | EXPLAIN SELECT * FROM city WHERE CountryCode IN ("ABW", "AFG", "AGO", "AIA", "ALB", "AND", "ANT", "ARE", "ARG", "ARM", "ASM", "ATG", "AUS", "AUT", "AZE", "BDI", "BEL", "BEN", "BFA", "BGD", "BGR", "BHR", "BHS", "BIH", "BLR", "BLZ", "BMU", "BOL", "BRA", "BRB", "BRN", "BTN", "BWA", "CAF", "CAN", "CCK", "CHE", "CHL", "CHN", "CIV", "CMR", "COD", "COG", "COK", "COL", "COM", "CPV", "CRI", "CUB", "CXR", "CYM", "CYP", "CZE", "DEU", "DJI", "DMA", "DNK", "DOM", "DZA", "ECU", "EGY", "ERI", "ESH", "ESP", "EST", "ETH", "FIN", "FJI", "FLK", "FRA", "FRO", "FSM", "GAB", "GBR", "GEO", "GHA", "GIB", "GIN", "GLP", "GMB", "GNB", "GNQ", "GRC", "GRD", "GRL", "GTM", "GUF", "GUM", "GUY", "HKG", "HND", "HRV", "HTI", "HUN", "IDN", "IND", "IRL", "IRN", "IRQ", "ISL", "ISR", "ITA", "JAM", "JOR", "JPN", "KAZ", "KEN", "KGZ", "KHM"); |
可以看到走了 type 为 ALL 的全表扫描
和范围有关
1 | SELECT MIN(CountryCode) FROM city; |
1 | SELECT DISTINCT(CountryCode) FROM city ORDER BY CountryCode LIMIT 110, 1; |
得到 CountryCode
最小值和中间值为 ABW
和
KNA
1 | EXPLAIN SELECT * FROM city WHERE CountryCode IN ("ABW", "KNA"); |
可以发现依然是 range,和 IN
中的范围没有关系
而且扫描行数是 2,这是不是可以说明表中值为 ABW
和
KNA
的数据只有两条呢?(虽然 InnoDB
引擎下的行数不一定准确)
1 | SELECT * FROM city WHERE CountryCode IN ("ABW", "KNA"); |
结果确实是 2 行
结论
为什么和数量有关
Read about Limiting Memory Use for Range Optimization.
When you have a large list of values in an
IN()
predicate, it uses more memory during the query optimization step. This was considered a problem in some cases, so recent versions of MySQL set a max memory limit (it's 8MB by default).If the optimizer finds that it would need more memory than the limit, there is not another condition in your query it can use to optimize, it gives up trying to optimize, and resorts to a table-scan. I infer that your table statistics actually show that the table has ~221 million rows (though table statistics are inexact estimates).
I can't say I know the exact formula to know how much memory is needed for a given list of values, but given your observed behavior, we could guess that it's about 600 bytes per item on average, given that 14k items works and more than that does not work.
You can set
range_optimizer_max_mem_size = 0
to disable the memory limit. This creates a risk of excessive memory use, but it avoids the optimizer "giving up." We set this value on all MySQL instances at my last job, because we couldn't educate the developers to avoid creating huge lists of values in their queries.
总结:
- 优化器认为应该走全表扫描
- 索引使用的内存超出了
range_optimizer_max_mem_size
的限制
range_optimizer_max_mem_size
的默认值为
8388608
即 8 MB
1 | SET SESSION range_optimizer_max_mem_size = 1; |
InnoDB 引擎下 rows 近似值
对于 InnoDB 引擎而言,EXPLAIN 下的 rows 数字是一个估计值,可能并不总是准确的
因为 InnoDB 存储引擎使用了一种称为多版本并发控制(MVCC)的机制来处理并发事务,MVCC 允许多个事务同时读取和修改表中的数据,而不会相互干扰
为了支持 MVCC,InnoDB 使用 Undo Log 用于存储旧版本的数据
当一个事务修改了表中的数据时,InnoDB 会将旧版本的数据存储在回滚段中,以便其他事务可以读取到一致的数据视图
由于回滚段中可能存在多个事务的旧版本数据,InnoDB 无法精确地知道表中的行数。它只能根据回滚段中的数据估计出大致的行数
eq_range_index_dive_limit
eq_range_index_dive_limit
这个参数在控制什么呢?
其实控制的是对扫描行数的预估
简单来说就是根据 eq_range_index_dive_limit
参数设置的阀值来按照不同算法预估影响行数,对于 IN
或
OR
条件中的每个范围段视为一个元组,对于元组数低于
eq_range_index_dive_limit
参数阀值时使用 index dive
算法,对于高于阈值使用 index statistics 算法
- index dive 针对每个元组 dive 到 index 中使用索引完成元组数的估算,类似于使用索引进行实际查询得到影响行数
- index statistics
根据索引的统计数值进行估算,例如索引统计信息计算出每个等值影响 100
条数据,那么
IN
条件中包含 5 个等值则影响5 * 100
条记录
eq_range_index_dive_limit
在 5.7 版本下默认
200
可以看到将 eq_range_index_dive_limit
设置为 1
时,优化器认为超出了阈值,使用 index statistics 进行估算,返回的 rows
是不符合实际情况的估算值
参考
MySQL :: MySQL 5.7 Reference Manual :: 8.8.2 EXPLAIN Output Format
in用不用索引,啥时候能用啥时候不能用,一文说清 - 知乎 (zhihu.com)
MySQL 'IN' operator on large number of values - Stack Overflow