本文转载自微信公众号「DBA随笔」,作者DBA随笔。转载本文请联系DBA随笔公众号。
今天在线上遇到了一个MySQL字符比较的问题,感觉很有意思,专门研究了下,估计大家都没有遇到过,这里跟大家分享一下。
1.背景
背景介绍:
MySQL里面有一张表,根据where条件匹配查询某一条记录的时候,手误输入了一个空格,发现这一条数据仍然能查出来,我建了一个测试表,还原如下:
22:57:02> create table t00 (id int primary key,name varchar(10));
Query OK, 0 rows affected (0.01 sec)
22:57:11> insert into t00 values (1,'aaa'),(2,'bbb');
Query OK, 2 rows affected (0.00 sec)
Records: 2 Duplicates: 0 Warnings: 0
22:57:22> select * from t00 where name='aaa';
+----+------+
| id | name |
+----+------+
| 1 | aaa |
+----+------+
1 row in set (0.00 sec)
22:57:32> select * from t00 where name='aaa ';
+----+------+
| id | name |
+----+------+
| 1 | aaa |
+----+------+
1 row in set (0.00 sec)
插入(1,'aaa')这条记录,使用where='aaa'和'aaa '这两个条件去匹配,居然都能够查到这条记录。
一开始我怀疑是这个8.0.19版本MySQL实例配置有问题,换了一个5.5低版本的MySQL实例,再次测试,还是复现这个问题。看来不是版本上的问题,一定是某种配置的问题。
晚上回到家,又用了自己搭建的一个8.0.22版本的MySQL实例重新执行上面的命令,竟然惊奇的发现,不复现了。。。晕死。8.0.22版本测试的结果是:
23:35:30>>select * from t0;
+------+------+
| id | name |
+------+------+
| 1 | aaa |
| 2 | bbb |
+------+------+
2 rows in set (0.01 sec)
23:35:34>>select * from t0 where name='aaa';
+------+------+
| id | name |
+------+------+
| 1 | aaa |
+------+------+
1 row in set (0.00 sec)
23:35:46>>select * from t0 where name='aaa ';
Empty set (0.00 sec)
2.分析思路
1)为什么'aaa'和'aaa '一样?
首先我用命令在MySQL上检测了一下这两个字符串在MySQL中是否一样:
### MySQL实例一
23:39:09> select 'aaa' = 'aaa ';
+------------------+
| 'aaa' = 'aaa ' |
+------------------+
| 1 |
+------------------+
1 row in set (0.00 sec)
### MySQL实例二
23:35:54>>select 'aaa' = 'aaa ';
+------------------+
| 'aaa' = 'aaa ' |
+------------------+
| 0 |
+------------------+
1 row in set (0.00 sec)
从上面的结果可以看出来,这两个实例上,关于字符的比较规则不一样。
到这里,可能部分同学就已经知道答案了。不过还是往下再看看。
2)比较规则哪里不一样?
我们可以用下面的命令,先看一下utf8相关的字符集下的比较规则,如下:
23:45:18> show collation like 'utf8%';
+----------------------------+---------+-----+---------+----------+---------+---------------+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
+----------------------------+---------+-----+---------+----------+---------+---------------+
| utf8mb4_0900_ai_ci | utf8mb4 | 255 | Yes | Yes | 0 | NO PAD |
| utf8mb4_0900_as_ci | utf8mb4 | 305 | | Yes | 0 | NO PAD |
| utf8mb4_0900_as_cs | utf8mb4 | 278 | | Yes | 0 | NO PAD |
| utf8mb4_0900_bin | utf8mb4 | 309 | | Yes | 1 | NO PAD |
| utf8mb4_bin | utf8mb4 | 46 | | Yes | 1 | PAD SPACE |
| utf8mb4_croatian_ci | utf8mb4 | 245 | | Yes | 8 | PAD SPACE |
| utf8mb4_cs_0900_ai_ci | utf8mb4 | 266 | | Yes | 0 | NO PAD |
| utf8mb4_cs_0900_as_cs | utf8mb4 | 289 | | Yes | 0 | NO PAD |
| utf8_unicode_ci | utf8 | 192 | | Yes | 8 | PAD SPACE |
........
| utf8_vietnamese_ci | utf8 | 215 | | Yes | 8 | PAD SPACE |
+----------------------------+---------+-----+---------+----------+---------+---------------+
103 rows in set (0.00 sec)
在最后一列,我们可以看到一个pad属性,这个属性里面包含2个值,分别是no pad 和pad space。
3)尝试去官方文档中查找这俩属性的意思
果然,不出意外,找到了一些蛛丝马迹:
https://dev.mysql.com/doc/refman/8.0/en/char.html
To determine the pad attribute for a collation, use the INFORMATION_SCHEMA COLLATIONS table, which has a PAD_ATTRIBUTE column.
For nonbinary strings (CHAR, VARCHAR, and TEXT values), the string collation pad attribute determines treatment in comparisons of trailing spaces at the end of strings. NO PAD collations treat trailing spaces as significant in comparisons, like any other character. PAD SPACE collations treat trailing spaces as insignificant in comparisons; strings are compared without regard to trailing spaces.
上面这段话描述的意思大概是:
要确定排序规则的填充属性,请使用 information_schema.collations 表,该表具有 pad_attribute 列。
对于非二进制字符串(char,varchar和text),字符串的填充属性决定了比较字符串末尾空格时的处理方式。
NO PAD 排序规则将尾随空格视为重要的比较,更加严格,就像任何其他字符一样;
PAD SPACE 排序规则在比较中将尾随空格视为无关紧要,比较字符串时不考虑尾随空格,也就是有无空格一个样。
这里我们就可以根据实际使用的比较规则来查看对应的pad属性了:
先看实例一:
### MySQL实例一
00:01:31>show variables like '%colla%';
+-------------------------------+--------------------+
| Variable_name | Value |
+-------------------------------+--------------------+
| collation_connection | utf8_general_ci |
| collation_database | utf8mb4_0900_ai_ci |
| collation_server | utf8mb4_0900_ai_ci |
| default_collation_for_utf8mb4 | utf8mb4_0900_ai_ci |
+-------------------------------+--------------------+
4 rows in set (0.01 sec)
00:01:45>select collation_name,character_set_name,pad_attribute from information_schema.collations where collation_name like 'utf8_gen
eral_ci';
+-----------------+--------------------+---------------+
| collation_name | character_set_name | pad_attribute |
+-----------------+--------------------+---------------+
| utf8_general_ci | utf8 | PAD SPACE |
+-----------------+--------------------+---------------+
1 row in set (0.00 sec)
再来看实例二:
### 实例二
mysql--root@localhost:(none) 23:53:52>>show variables like '%colla%';
+-------------------------------+--------------------+
| Variable_name | Value |
+-------------------------------+--------------------+
| collation_connection | utf8mb4_0900_ai_ci |
| collation_database | utf8mb4_0900_ai_ci |
| collation_server | utf8mb4_0900_ai_ci |
| default_collation_for_utf8mb4 | utf8mb4_0900_ai_ci |
+-------------------------------+--------------------+
4 rows in set (0.00 sec)
00:03:47>>select collation_name,character_set_name,pad_attribute from information_schema.collations where collation_name like 'utf8mb4_0900_ai_ci';
+--------------------+--------------------+---------------+
| collation_name | character_set_name | pad_attribute |
+--------------------+--------------------+---------------+
| utf8mb4_0900_ai_ci | utf8mb4 | NO PAD |
+--------------------+--------------------+---------------+
1 row in set (0.00 sec)
到这里,真相大白。
实例一的连接比较规则是utf8_general_ci,对应的填充规则是pad space属性,代表字符比较过程中,末尾空格不重要,所以加不加空格结果都是一样的;
实例二的连接比较规则是utf8mb4_0900_ai_ci,对应的填充规则是no pad属性,代表字符比较过程中,末尾空格重要,所以加不加空格结果不一样。
3.如何让字符匹配更严格?
1)修改连接的比较规则为utf8mb4_0900_ai_ci,当然,这个修改需要搭配默认字符集
这个方案比较容易理解,不赘述。
2)使用like模糊匹配进行比较
3)where条件之前,添加binary关键字
上述2、3两种方法可见下面的测试:
00:19:13>select * from t00;
+----+------+
| id | name |
+----+------+
| 1 | aaa |
| 2 | bbb |
+----+------+
2 rows in set (0.00 sec)
00:19:18>select * from t00 where name='aaa';
+----+------+
| id | name |
+----+------+
| 1 | aaa |
+----+------+
1 row in set (0.00 sec)
00:19:28>select * from t00 where name='aaa ';
+----+------+
| id | name |
+----+------+
| 1 | aaa |
+----+------+
1 row in set (0.00 sec)
### 下面两种方案,可以防止'aaa '匹配到'aaa'
00:19:31>select * from t00 where name like 'aaa ';
Empty set (0.00 sec)
00:19:57>select * from t00 where binary name = 'aaa ';
Empty set (0.00 sec)
今天文章就到这里吧。