Abnormalities After Expanding TiDB Nodes

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 扩容tidb节点后异常

| username: Leox

There were originally two TiDB nodes. After I scaled in one node and then scaled out a new TiDB node onto a new machine, the new node exhibited some abnormal behavior.

  1. Running sysbench tests separately against each of the two TiDB nodes, the old node had a much higher QPS than the new node, and the new node's run produced many errors (results for the old and new node were attached as screenshots; a sketch of the per-node run is at the end of this post).

  2. When prewarming through the new TiDB node, it reported error 9002 (TiKV server timeout) after a long warm-up.

  3. The TiKV node logs show frequent region scheduling warnings.

  4. Some errors appeared on the new TiDB node.

Is there any solution for this?
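
A minimal sketch of how such a per-node sysbench run might look; the host, port, user, and table/size parameters below are placeholders, not the values actually used in the tests above:

```shell
# Point sysbench at one TiDB node at a time and compare the reported QPS.
# <tidb-host> is the address of the old or the new TiDB node (default TiDB port 4000).
# Assumes the sbtest tables were already created with `prepare`.
sysbench oltp_read_write \
  --mysql-host=<tidb-host> --mysql-port=4000 \
  --mysql-user=root --mysql-db=sbtest \
  --tables=16 --table-size=100000 \
  --threads=32 --time=300 \
  run
```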

| username: Leox | Original post link

I deleted the original cluster outright and, after checking, created a new one, but still ran into the issue above. The error log of the new TiDB node is as follows:

| username: Kongdom | Original post link

If the issue persists after rebuilding, it might be due to hardware configuration problems with the new machine. Are the bandwidth, IOPS, memory, and CPU up to standard or consistent with the original node?
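
As a reference, one quick way to sanity-check disk IOPS on the new machine against the old one is a short fio run; the test file path and sizes below are placeholders:

```shell
# 4K random-write test; run the same command on the old and the new machine
# and compare the reported IOPS. Writes a 1 GiB test file at the given path.
fio --name=randwrite --filename=/path/to/testfile \
    --rw=randwrite --bs=4k --size=1G \
    --ioengine=libaio --direct=1 --iodepth=32 \
    --runtime=60 --time_based --group_reporting
```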

| username: Leox | Original post link

The new node has the same specs as the original node except for the hard drive, but the TiDB node shouldn't be disk-intensive, right? :rofl:

| username: db_user | Original post link

Are the versions of all the components consistent? If the cluster wasn't rebuilt, it might be a caching issue in TiDB itself; if it was rebuilt, it feels like a resource or version issue.

| username: Kongdom | Original post link

Is it a mixed deployment?

| username: Leox | Original post link

The new node is not a mixed deployment.

Across configurations one through three, the machine newly added in configuration three has only one NUMA node, which is half of what the earlier machines have.
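
A quick way to confirm the NUMA layout on each machine, assuming numactl or lscpu is installed:

```shell
# Show how many NUMA nodes the machine has and how CPUs/memory are split across them.
numactl --hardware
# Or, without numactl:
lscpu | grep -i numa
```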

| username: Leox | Original post link

The versions are consistent. After rebuilding, shouldn’t the versions all be the same? :rofl:

| username: h5n1 | Original post link

Check if there are any network issues between the newly added node and the other nodes.

| username: Leox | Original post link

I checked with iperf and it does seem to be a network issue. The network cards are the same, so I’ll try changing the network cable :rofl:
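
A sketch of this kind of check using iperf3 (the original post only says "iperf"; addresses and duration here are placeholders), run between the new node and one of the existing nodes:

```shell
# On one of the existing nodes, start an iperf3 server:
iperf3 -s

# On the new TiDB node, run the client and check the reported bandwidth;
# also test the reverse direction with -R.
iperf3 -c <existing-node-ip> -t 30
iperf3 -c <existing-node-ip> -t 30 -R
```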

| username: Kongdom | Original post link

There are monitoring options in Grafana to view network conditions.
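
For example, the node_exporter dashboards in the cluster's Grafana include network traffic panels; the same data can also be pulled straight from Prometheus. The Prometheus address and instance label below are placeholders, assuming the default ports deployed with TiDB monitoring (9090 for Prometheus, 9100 for node_exporter):

```shell
# Per-NIC receive throughput (bytes/s) on the new node over the last minute.
curl -s 'http://<prometheus-host>:9090/api/v1/query' \
  --data-urlencode 'query=rate(node_network_receive_bytes_total{instance="<new-node-ip>:9100"}[1m])'
```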

| username: Leox | Original post link

Thank you, everyone! It worked normally after changing the network cable. @Kongdom @db_user @h5n1

| username: Leox | Original post link

Got it, thanks!

| username: Kongdom | Original post link

:+1: :+1: :+1:

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.