How is the storage size of TiDB calculated? I have some doubts, can anyone help?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb的存储大小是怎么计算的,有点疑问,各位大哥大姐帮帮忙

| username: 舞动梦灵

For example, in this diagram, the default is three replicas. So is the actual data only 13TB, or is there really 39TB of data? I ran a test to verify this: in MySQL I created 24GB of data and synchronized it to TiDB through DM. Each TiKV node then showed 8GB of data, 24GB in total, which matches the pattern of the 39TB in the diagram. So is the 24GB of MySQL data compressed down to 8GB in TiDB, or is the 24GB spread evenly across the TiKV nodes, so that the three nodes together hold 24GB?

| username: Fly-bird | Original post link

How many replicas?

| username: 舞动梦灵 | Original post link

Isn’t the default 3 replicas according to the official documentation? I haven’t adjusted any parameters.

| username: tidb菜鸟一只 | Original post link

If TiDB has three replicas, the total capacity ends up roughly the same as MySQL's: TiDB's storage is compressed, which can reduce the data to about one third of MySQL's size, but with three replicas on top of that, the overall capacity comes out basically the same.
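
To compare the logical data size in TiDB with the MySQL side, a rough check is the following SQL (a sketch: it sums the statistics-based size estimates for one schema, here assumed to be named test, and reflects a single uncompressed copy of the data, not the compressed on-disk size):

select table_name, (data_length + index_length)/1024/1024/1024 as size_gb from information_schema.tables where table_schema = 'test';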

| username: 舞动梦灵 | Original post link

If it is really compressed to one-third, then the total capacity of the three replicas would be the same as the original data capacity, which makes sense. I couldn’t understand it before because I didn’t see any mention of a compression feature. So, if using BR for backup, is there compression? For example, in my case of 24G with three replicas of 8G each, how much would the backup be? Should it be 8G or 24G?

| username: tidb菜鸟一只 | Original post link

8G, because BR only backs up the leaders. That is, no matter how many replicas you have, it only backs up the leader replica of each region.
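
For reference, a full BR backup looks roughly like this (the PD address and backup path are just placeholders for your own environment), and the resulting backup size should end up close to the single-replica size discussed above:

tiup br backup full --pd "127.0.0.1:2379" --storage "local:///tmp/backup"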

| username: 舞动梦灵 | Original post link

Understood, thank you. If I scale out a new node and then scale in the TiKV node that holds the leader, will the leader transfer automatically? Also, will the PD leader transfer automatically, or do I need to run a command to transfer it manually?

| username: tidb菜鸟一只 | Original post link

There are two types of leaders. For example, in the case of PD, if you have three nodes, one of them is the leader, and the other two are followers. The other two essentially act as backups for the leader and cannot work independently. Of course, you can connect to them and make requests, but they won’t directly return the request results to you. Instead, they will forward the request to the leader and then return the results to you.

The leader in TiKV is not node-specific but region-specific. A single TiKV node can have numerous regions, some of which are leaders and some are followers. Followers do not provide external services (unless follower read is enabled). When you need to query data from a specific region, you first request PD to find out which TiKV node has the leader for that region, and then you request the leader node of that region on the corresponding TiKV.
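
You can see this from SQL as well. For example (assuming a table named t in the current database), the following shows each region of the table, its leader, and which store the leader is on:

SHOW TABLE t REGIONS;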

As for scaling: if you scale TiKV, after scaling out, region leaders and followers on the other nodes will be rebalanced onto the new TiKV node. If you scale PD, the leader will not transfer automatically after scaling out; it has to be transferred manually. The PD leader only moves to a follower node on its own if the leader node goes down.
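
To transfer the PD leader manually, pd-ctl can be used; a sketch with a placeholder PD address and member name (the first command shows the current leader, the second moves it):

tiup ctl:v7.1.1 pd -u http://127.0.0.1:2379 member leader show
tiup ctl:v7.1.1 pd -u http://127.0.0.1:2379 member leader transfer pd-2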

| username: 舞动梦灵 | Original post link

Bro, awesome explanation. “If you scale out PD, the leader will not transfer automatically after scaling; it has to be transferred manually. The PD leader only moves to a follower node on its own if the leader node goes down.”

Can I understand it like this: for example, I have 3 KV nodes and I scale out one more. If I don’t want to transfer manually, can I just scale in the KV node that holds the leader and shut it down directly? The leader will then transfer to the other follower nodes automatically, without any manual operation from me, right?

| username: tidb菜鸟一只 | Original post link

PD and TiKV are different.
Currently, you have three PDs. You can directly remove or decommission the leader PD, and naturally, another PD follower will become the leader to handle requests. After that, you can add another PD to ensure there are three PD nodes again. (Actually, one PD can also provide normal service, but for production environments, it is recommended to have three nodes for higher security.)
However, if you have three TiKVs and three replicas, you cannot decommission one of them normally because you need to ensure three replicas for each region. You must first add a TiKV node before decommissioning one. Of course, if one TiKV node suddenly fails, the cluster can still provide service because of the majority rule; with three replicas, losing one still leaves two operational. However, this situation is unsafe, and you need to quickly add a TiKV to ensure all regions have three replicas.
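
In tiup terms, that order looks roughly like this (cluster name, topology file, and node address are placeholders): first scale out the new TiKV node, then scale in the old one.

tiup cluster scale-out tidb-test scale-out.yaml
tiup cluster scale-in tidb-test --node 10.0.1.5:20160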

| username: 舞动梦灵 | Original post link

Understood. Basically, it means making sure there are at least three replicas; if there aren’t enough, things still work, but with some risk. If I deploy only one KV server for the test database, it defaults to three replicas, but in reality there is only one replica, right? Unless I specify three different TiKV instances (different ports and data directories) under the KV section of the configuration file, then it would really be three replicas, correct?

| username: tidb菜鸟一只 | Original post link

You set up three replicas, but it shouldn’t work with only one TiKV node. You must configure three TiKV nodes. You can also deploy three TiKV nodes on the same server as long as the ports are different.
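
A minimal sketch of the tikv_servers section of topology.yaml for three instances on one host (the host, ports, and directories are examples) could look like this:

tikv_servers:
  - host: 10.0.1.1
    port: 20160
    status_port: 20180
    data_dir: /tidb-data/tikv-20160
  - host: 10.0.1.1
    port: 20161
    status_port: 20181
    data_dir: /tidb-data/tikv-20161
  - host: 10.0.1.1
    port: 20162
    status_port: 20182
    data_dir: /tidb-data/tikv-20162

For a single-machine test this is fine, but in production you would also want labels so that replicas of the same region are not all placed on the same host.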

| username: 舞动梦灵 | Original post link

On the same machine, using different ports does work; I have tested it. However, during my first deployment I only set up one TiKV. The official website has a quick deployment guide, and for KV it only requires one machine and doesn’t mention multiple ports. I just put one machine in the topology file and it worked; it can be used normally.

| username: tidb菜鸟一只 | Original post link

The default replica for quick deployment is 1 replica, not three replicas…

| username: 舞动梦灵 | Original post link

Does it give any prompt about that? Is it 3 replicas or 1 replica? I followed this method and started successfully with a single KV address.

| username: 像风一样的男子 | Original post link

The default installation uses three replicas. You can use the following command to check the number of replicas:

./pd-ctl config show all | grep max-replicas

| username: zhanggame1 | Original post link

A single-machine deployment with three replicas can start with just one TiKV, and the performance is quite good when running on a single machine.

| username: zhanggame1 | Original post link

Initialization is set to 3 replicas by default. You can also use it normally by setting it to one replica.
To modify:

set config pd 'replication.max-replicas'=1;

To query:

show config where NAME like '%max-replicas%';

| username: tidb菜鸟一只 | Original post link

Didn’t you use tiup playground for quick deployment? That would be a single replica.
Normally your installation should have three replicas, but it shouldn’t succeed with only one TiKV node. You can check how many replicas there are by running:
show config where NAME like '%max-replicas%';

| username: 舞动梦灵 | Original post link

It shows 3 replicas, and all the services for this test are on one machine: TiDB, PD, and TiKV.

tiup cluster deploy tidb-test v7.1.1 ./topology.yaml --user root [-p] [-i /home/root/.ssh/gcp_rsa]