Business Scenario: To support dynamic, user-configurable fields, we created a single table with over 2000 columns in TiDB to store user data. Since TiDB allows at most 512 indexed columns per table (with a maximum of 4096 columns per table), we could not index every column, so we enabled TiFlash for the table. Users can configure any column as a query condition. As the data volume grew, slow SQL queries began to appear (with default settings).
Specific Steps:
Created a single table with over 2000 columns (500 columns each for strings, numbers, dates, and arrays, plus a few indexed columns). Initially, with a small data volume, there were no slow queries.
After some time, we decided to add 500 more columns for each type (strings, numbers, dates, arrays), making 1000 columns for each type. This resulted in slow delete, update, and some select queries, and occasionally TiDB server overload errors.
Deleted the table, recreated it, and enabled TiFlash, restoring each type (strings, numbers, dates, arrays) to 500 columns. As user data grew, slow DELETE, UPDATE, and SELECT queries reappeared, and TiDB server overload errors occasionally occurred.
Is there an optimization solution for our wide table scenario on the server side?
We are also discussing vertical partitioning and data sharding with the business side. The expert in the first reply suggested storing dynamic fields in JSON; shouldn't we consider whether JSON is actually suitable for this scenario?
Create another replica cluster to handle SELECT operations, while the primary cluster handles UPDATE, INSERT, and DELETE operations. It is recommended to use a delete flag instead of deleting data directly. The primary and replica can be synchronized through TiCDC; TiCDC between TiDB clusters is quite stable. It is also advisable to split the table.
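The delete-flag pattern above can be sketched in a few lines, using an in-memory SQLite database as a stand-in for TiDB; the table and column names (`user_profile`, `is_deleted`) are illustrative assumptions, not the poster's actual schema:

```python
import sqlite3

# Stand-in for a TiDB table; names are made up for the example.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_profile (
        user_id    INTEGER PRIMARY KEY,
        name       TEXT,
        is_deleted INTEGER NOT NULL DEFAULT 0  -- delete flag, 0 = live
    )
""")
conn.execute("INSERT INTO user_profile (user_id, name) VALUES (1, 'alice'), (2, 'bob')")

# Instead of DELETE, mark the row; it stays in place and can be
# purged later by a low-traffic batch job.
conn.execute("UPDATE user_profile SET is_deleted = 1 WHERE user_id = 2")

# All reads filter on the flag.
live = conn.execute(
    "SELECT user_id, name FROM user_profile WHERE is_deleted = 0"
).fetchall()
print(live)  # [(1, 'alice')]
```

Soft deletes turn expensive DELETE workloads into cheap index-friendly UPDATEs, at the cost of filtering every read and running a periodic cleanup.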
With this kind of table design, the database can only shrink the data through compression during transmission, and how effective that is can only be determined by testing. Such a wide table is likely to incur significant extra resource overhead.
In an AP scenario, confirm which fields the SELECT queries actually need. It is better to avoid such a wide table as much as possible and split it appropriately; a table like this is also hard to maintain.
You might want to try Titan. I personally haven’t used it, but from the description, it seems to be designed for scenarios with larger values.
Titan is a high-performance single-node key-value storage engine plugin based on RocksDB.
When values are relatively large (above 1 KB or 512 B), Titan outperforms RocksDB for writes, updates, and point reads. However, Titan occupies more disk space and partially sacrifices range-query performance. As SSD prices keep falling, this trade-off becomes increasingly attractive.
By default, TiDB has a 6 MB limit per row, and there are also length limits for indexes. For performance reasons, it is generally not recommended to adjust these limits. You can try testing Titan.
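If you do test Titan, it is enabled per TiKV instance in the configuration file; a minimal sketch (the values shown are starting points, not tuned recommendations):

```toml
# tikv.toml -- enable the Titan engine plugin
[rocksdb.titan]
enabled = true

# Titan settings for the default column family
[rocksdb.defaultcf.titan]
# values at or above this size are separated into blob files
min-blob-size = "1KB"
blob-file-compression = "lz4"
```

`min-blob-size` is the key knob: only values larger than this threshold are moved out of the LSM tree into blob files, so it determines how much of your data actually benefits from Titan.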
TiDB is essentially a distributed relational database, and it has requirements for the number of columns. The more columns there are, the higher the cost of reading a single row of data. As the amount of data increases, performance issues are inevitable.
The problem raised by the original poster is similar to data analysis scenarios such as user profiling, user feature analysis, and advertising tag analysis. The solutions for this situation are as follows:
Vertical partitioning: Split the table vertically based on different business attributes to reduce the number of columns in a single row. Keep it within the reasonable range recommended by the official guidelines; otherwise, various unknown issues may arise.
TiDB natively supports the storage of very large tables. Many users have single tables with tens of billions, hundreds of billions, or even trillions of rows while maintaining good read and write performance. Based on this, you can consider redesigning the table by converting columns to rows, storing each attribute as a separate row. Since queries are keyed by the user dimension, query performance will not suffer with this layout.
The downside of this approach is that the number of rows will expand significantly, but the underlying storage is in region data blocks, which will automatically balance across different machines. Therefore, storage and performance are not greatly affected, and the issue of row expansion can be ignored.
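The column-to-row conversion described above can be sketched in a few lines; the `str_col_*`/`num_col_*` names and the `(user_id, attr_name, attr_value)` layout are assumptions for illustration:

```python
from datetime import date

def to_attribute_rows(user_id, wide_row):
    """Turn one wide {column: value} row into narrow
    (user_id, attr_name, attr_value) rows, skipping NULLs so
    sparse attributes cost no storage at all."""
    return [(user_id, name, value)
            for name, value in wide_row.items()
            if value is not None]

wide = {"str_col_001": "vip",
        "num_col_042": 3.14,
        "date_col_007": date(2024, 1, 1),
        "str_col_999": None}          # unset attribute

rows = to_attribute_rows(1001, wide)
print(len(rows))  # 3 -- the NULL column produces no row
```

A narrow table like this pairs well with a composite primary key on `(user_id, attr_name)`, so all of one user's attributes sit in adjacent regions and a per-user lookup stays a short range scan.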
If you think the single table is too large, you can also divide different user groups based on user characteristics/business attributes and place different users in different tables, which can also appropriately reduce the number of rows in a table.
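Splitting users across tables can be as simple as a deterministic routing function; the `user_profile_NN` naming scheme and shard count here are assumptions:

```python
def route_table(user_id: int, shards: int = 16) -> str:
    """Map a user id to one of `shards` physical tables so each
    table holds roughly 1/shards of the rows."""
    return f"user_profile_{user_id % shards:02d}"

print(route_table(17))  # user_profile_01
```

In practice you would route by a stable business attribute (e.g. the user group mentioned above) rather than a raw modulo, so that users who are queried together land in the same table.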
For dynamic attribute tag storage, consider using JSON as a replacement. Performance should be thoroughly tested and validated in advance.
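With the JSON route, the dynamic fields collapse into a single document column; a minimal application-side sketch (field names are made up), where the final check mirrors a TiDB predicate such as `JSON_EXTRACT(profile, '$.score') > 80`:

```python
import json

# One JSON document replaces 2000+ physical columns.
profile = {"city": "Beijing", "score": 87, "tags": ["vip", "beta"]}

doc = json.dumps(profile)   # value written to the JSON column

row = json.loads(doc)       # value read back
print(row["score"] > 80)    # True
```

TiDB can also index a JSON field through a virtual generated column, which is worth including in the performance validation mentioned above.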
If you are not set on TiDB, consider databases designed for sparse two-dimensional data storage, such as HBase.