Does TiDB include the functionality to fake statistics?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB是否包含伪造统计信息的功能

| username: realcp1018

TiDB is positioned as a scalable distributed database, with billions to tens of billions of tables in production. Sometimes, when new requirements or problem SQLs arise, relying solely on existing statistics can only roughly estimate the effect of adding new indexes, but it is not enough to confirm 100%. I remember that Oracle has the ability to fake statistics for the optimizer to use (not too sure), does TiDB have plans to add this feature or does it currently have this functionality?

| username: tidb菜鸟一只 | Original post link

TiDB itself follows this logic. When statistics are outdated, it uses a pseudo-random number generator to estimate statistics for generating execution plans. As far as I remember, Oracle only dynamically estimates statistics and generates execution plans when the statistics are empty. If there are statistics available, it generates execution plans based on them. If you want to fake it, you can only export the statistics of a table with identical columns and index information and then import them into the current table to achieve the faking effect.

| username: 啦啦啦啦啦 | Original post link

Since statistics can be imported and exported, it should also be possible to directly modify the exported JSON file and then import it to falsify the data.

| username: realcp1018 | Original post link

Now that you mention it, I suddenly realize that my statement might be wrong. What I meant was to forge an index and then perform an execution plan query based on the new index.

| username: realcp1018 | Original post link

Yes, I think what I need is to simulate creating a new index for a table and generate virtual statistics, then observe the response of the SQL optimizer based on this.

| username: Kongdom | Original post link

Are you referring to this?

| username: 有猫万事足 | Original post link

It feels like two sides of the same coin.

Create a fake index and look at the execution plan on the real statistics.
Or create a real index and look at the execution plan on the fake statistics.

I haven’t tried the content of this document, so I’m not sure if the latter can be achieved.
If the latter can be done, the cost of both is about the same.
Create a real index on an empty table and provide fake statistics.

| username: realcp1018 | Original post link

Here is a rough description of the usage scenario:
I have a table with billions of rows, and I found that a certain SQL execution is causing the IO to be fully utilized. Therefore, after reviewing the histogram and buckets statistics of the relevant columns, I decided to add an index. I can roughly predict that this index might be effective, but I cannot 100% predict the SQL optimizer’s reaction to this index. If I end up spending time creating the index only to find that it does not work as expected, then a lot of effort would have been wasted.
My ideal handling method:
Create a virtual index and collect statistics for this virtual index (either automatically or manually, or even import them), and then observe the optimizer’s reaction to it.

| username: realcp1018 | Original post link

It feels like two sides of the same coin.

Create a fake index and look at the execution plan on the real statistics.
Or create a real index and look at the execution plan on the fake statistics.

I haven’t tried the content of this document, so I’m not sure if the latter can be achieved.
If the latter can be done, the cost of both is about the same.
Create a real index on an empty table and provide fake statistics.

:disguised_face: What you said is indeed a very clever method. You can create a new table and import the statistics!
It would be great if there were built-in virtual indexes and automatically generated statistics.

| username: tidb菜鸟一只 | Original post link

I don’t think it needs to be that complicated. Create an identical table, import 1% of the data from the original table, then create the index you want on this new table. Check if the same SQL queries on the original table show any improvement. If they do, then you can create the same index on the original table.

| username: 有猫万事足 | Original post link

I followed the documentation, and it works.

After exporting the JSON file, change the following fields in the JSON file to point to a fake empty table:

 "database_name": "test",
 "table_name": "t",

Then just load the stats from this JSON file. You might also need to lock the stats. I didn’t lock it initially, and after creating a new index, the statistics changed, so I had to reload the stats again.

After that, you can check if the execution plan uses the index. The operation is not as convenient as a virtual index, but the effect is almost the same.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.