How to Query Dirty Data

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 如何查询到脏数据

| username: Mwkk

[TiDB Usage Environment] Production Environment / Testing / Poc
[TiDB Version]
[Reproduction Path] What operations were performed when the issue occurred
[Encountered Issue: Issue Phenomenon and Impact]
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]
TiDB version 4.0.8, error reported when adding a unique index

Table structure:

CREATE TABLE `operation_custom_tag_entity_new` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `tag_id` bigint(20) DEFAULT NULL,
  `base_id` bigint(20) DEFAULT NULL,
  `entity_type` int(11) DEFAULT NULL COMMENT 'Type Entity Association Table',
  `qipu_id` bigint(20) DEFAULT NULL COMMENT 'Qipu ID',
  `tag_name` varchar(50) DEFAULT NULL COMMENT 'Custom Tag Name',
  `source_sub_type` varchar(100) DEFAULT NULL COMMENT 'Tag Source Type',
  PRIMARY KEY (`id`),
  KEY `idx_base_id` (`base_id`),
  KEY `idx_tag_id` (`tag_id`),
  KEY `idx_qipuId` (`qipu_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin AUTO_INCREMENT=923060000

When querying the qipu_id field for values of ‘’, there are 4 warnings, indicating that some data might have been truncated. I would like to know how to find these 4 dirty data entries.

| username: zhaokede | Original post link

Theoretically, neither optimistic transactions nor pessimistic transactions can result in dirty reads. Should the transaction isolation level be set to this? I haven’t tried it:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

| username: WalterWj | Original post link

Try setting the session-level SQL mode to empty and then check again :thinking:
Also, is your tag_name column empty?

| username: wakaka | Original post link

Binary search based on ID?

| username: Mwkk | Original post link

Setting sql_mode to empty still results in truncated values during queries.

| username: Mwkk | Original post link

It is for the value of null.

| username: WalterWj | Original post link

null doesn’t matter, null doesn’t affect the unique index. The main issue might be if your name column has ‘’ causing problems. Normally, int type is converted to 0. If you insert ‘’

| username: 裤衩儿飞上天 | Original post link

Try using IS NULL. Do not use =''.

| username: Mwkk | Original post link

The current situation

| username: WalterWj | Original post link

Check the id 0 and see.
Check the admin table.

| username: WalterWj | Original post link

If the data volume is not large, you can query like this:

(root@10.102.58.180) [test]>SELECT qipu_id, tag_name, COUNT(*) as cnt
    -> FROM operation_custom_tag_entity_new
    -> GROUP BY qipu_id, tag_name
    -> HAVING cnt > 1;
+---------+----------+-----+
| qipu_id | tag_name | cnt |
+---------+----------+-----+
|       0 |          |   2 |
+---------+----------+-----+
1 row in set (0.01 sec)
| username: Mwkk | Original post link

The default value of tidb_distsql_scan_concurrency is 15. You can adjust it according to your actual situation.

| username: Mwkk | Original post link

This result set has more than 30,000 entries. I guess I’ll have to ask the developers to clean it up bit by bit.

| username: WalterWj | Original post link

That’s it.

| username: Kongdom | Original post link

:flushed: What is this result? I don’t understand why neither is null nor = '' can find anything.