Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tidb有没有数据抽样方法
[TiDB Usage Environment] Production Environment
[TiDB Version] 5.0.4
[Reproduction Path] What operations were performed to cause the problem
I want to sample a certain table. Currently, one method is
But this method is very slow when the table is slightly larger. Is there a faster sampling method?
[Encountered Problem: Problem Phenomenon and Impact]
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
For random sampling of a certain table, you can’t specify which table it is, right? There are hundreds of tables that need to be sampled.
Why not just use limit without adding order by?
Adding order by not only makes the data non-random, but sorting also becomes a performance issue.
You can refer to the usage of MySQL. The ORDER BY RAND()
is the slowest implementation, and the larger the table, the slower it gets. You can try:
SELECT *
FROM `table` AS t1 JOIN (SELECT ROUND(RAND() * (SELECT MAX(id) FROM `table`)) AS id) AS t2
WHERE t1.id >= t2.id
ORDER BY t1.id ASC LIMIT 5;
SELECT * FROM sbtest1
WHERE id >= (SELECT FLOOR( MAX(id) * RAND()) FROM sbtest1
) ORDER BY id LIMIT 1; — Do it this way
Run it 10 times
If it’s PostgreSQL, you can actually use tablesample, and Oracle can use sample, but it seems that TiDB doesn’t have this feature. It is recommended to add it.
This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.