Is there a data sampling method in TiDB?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb有没有数据抽样方法

| username: 大飞飞呀

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.0.4
[Reproduction Path] What operations were performed to cause the problem
I want to sample a certain table. Currently, one method is


But this method is very slow when the table is slightly larger. Is there a faster sampling method?
[Encountered Problem: Problem Phenomenon and Impact]
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: tidb狂热爱好者 | Original post link

Use partitioned tables.

| username: 大飞飞呀 | Original post link

For random sampling of a certain table, you can’t specify which table it is, right? There are hundreds of tables that need to be sampled.

| username: 我是咖啡哥 | Original post link

Why not just use limit without adding order by?
Adding order by not only makes the data non-random, but sorting also becomes a performance issue.

| username: forever | Original post link

You can refer to the usage of MySQL. The ORDER BY RAND() is the slowest implementation, and the larger the table, the slower it gets. You can try:

SELECT *
FROM `table` AS t1 JOIN (SELECT ROUND(RAND() * (SELECT MAX(id) FROM `table`)) AS id) AS t2
WHERE t1.id >= t2.id
ORDER BY t1.id ASC LIMIT 5;
| username: tidb菜鸟一只 | Original post link

SELECT * FROM sbtest1 WHERE id >= (SELECT FLOOR( MAX(id) * RAND()) FROM sbtest1 ) ORDER BY id LIMIT 1; — Do it this way
Run it 10 times
If it’s PostgreSQL, you can actually use tablesample, and Oracle can use sample, but it seems that TiDB doesn’t have this feature. It is recommended to add it.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.