Abnormalities After Expanding TiDB Nodes

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 扩容tidb节点后异常

| username: Leox

There were originally two TiDB nodes. After I scaled in one node and then scaled out a new TiDB node onto a new machine, the new node exhibited some abnormal behavior.

  1. Running sysbench tests separately against each of the two TiDB nodes, the old node had a much higher QPS than the new node, and the new node's run produced many errors (results for the old and new node were attached as screenshots; a sketch of the per-node run is at the end of this post).

  2. When prewarming through the new TiDB node, it reported error 9002 (TiKV server timeout) after a long warm-up.

  3. The TiKV node logs show frequent region scheduling warnings.

  4. Some errors appeared on the new TiDB node.

Is there any solution for this?
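
A minimal sketch of how such a per-node sysbench run might look; the host, port, user, and table/size parameters below are placeholders, not the values actually used in the tests above:

```shell
# Point sysbench at one TiDB node at a time and compare the reported QPS.
# <tidb-host> is the address of the old or the new TiDB node (default TiDB port 4000).
# Assumes the sbtest tables were already created with `prepare`.
sysbench oltp_read_write \
  --mysql-host=<tidb-host> --mysql-port=4000 \
  --mysql-user=root --mysql-db=sbtest \
  --tables=16 --table-size=100000 \
  --threads=32 --time=300 \
  run
```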

| username: Leox | Original post link

I deleted the original cluster outright and, after checking, created a new one, but still ran into the issue above. The error log of the new TiDB node is as follows:

| username: Kongdom | Original post link

If the issue persists after rebuilding, it might be due to hardware configuration problems with the new machine. Are the bandwidth, IOPS, memory, and CPU up to standard or consistent with the original node?
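
As a reference, one quick way to sanity-check disk IOPS on the new machine against the old one is a short fio run; the test file path and sizes below are placeholders:

```shell
# 4K random-write test; run the same command on the old and the new machine
# and compare the reported IOPS. Writes a 1 GiB test file at the given path.
fio --name=randwrite --filename=/path/to/testfile \
    --rw=randwrite --bs=4k --size=1G \
    --ioengine=libaio --direct=1 --iodepth=32 \
    --runtime=60 --time_based --group_reporting
```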

| username: Leox | Original post link

The new node has the same specs as the original node except for the hard drive, but the TiDB node shouldn't be disk-intensive, right? :rofl:

| username: db_user | Original post link

Are the versions of all the components consistent? If the cluster wasn't rebuilt, it might be a caching issue in TiDB itself; if it was rebuilt, it feels like a resource or version issue.

| username: Kongdom | Original post link

Is it a mixed deployment?

| username: Leox | Original post link

The new node is not a mixed deployment.

Across configurations one through three, the machine newly added in configuration three has only one NUMA node, which is half of what the earlier machines have.
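
A quick way to confirm the NUMA layout on each machine, assuming numactl or lscpu is installed:

```shell
# Show how many NUMA nodes the machine has and how CPUs/memory are split across them.
numactl --hardware
# Or, without numactl:
lscpu | grep -i numa
```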

| username: Leox | Original post link

The versions are consistent. After rebuilding, shouldn’t the versions all be the same? :rofl:

| username: h5n1 | Original post link

Check if there are any network issues between the newly added node and the other nodes.

| username: Leox | Original post link

I checked with iperf and it does seem to be a network issue. The network cards are the same, so I’ll try changing the network cable :rofl:
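
A sketch of this kind of check using iperf3 (the original post only says "iperf"; addresses and duration here are placeholders), run between the new node and one of the existing nodes:

```shell
# On one of the existing nodes, start an iperf3 server:
iperf3 -s

# On the new TiDB node, run the client and check the reported bandwidth;
# also test the reverse direction with -R.
iperf3 -c <existing-node-ip> -t 30
iperf3 -c <existing-node-ip> -t 30 -R
```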

| username: Kongdom | Original post link

There are monitoring options in Grafana to view network conditions.
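
For example, the node_exporter dashboards in the cluster's Grafana include network traffic panels; the same data can also be pulled straight from Prometheus. The Prometheus address and instance label below are placeholders, assuming the default ports deployed with TiDB monitoring (9090 for Prometheus, 9100 for node_exporter):

```shell
# Per-NIC receive throughput (bytes/s) on the new node over the last minute.
curl -s 'http://<prometheus-host>:9090/api/v1/query' \
  --data-urlencode 'query=rate(node_network_receive_bytes_total{instance="<new-node-ip>:9100"}[1m])'
```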

| username: Leox | Original post link

Thank you, everyone! It worked normally after changing the network cable. @Kongdom @db_user @h5n1

| username: Leox | Original post link

Got it, thanks!

| username: Kongdom | Original post link

:+1: :+1: :+1:

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.