TiKV and TiFlash cause 100% CPU usage, status shows N/A after restart, and many rows contain only IDs with all other fields empty

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv,tiflash 导致cpu100%,重启状态为N/A,产生大量只有id,其他字段为空数据

| username: TiDBer_vZ6DLO0F

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.5.2
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Phenomenon and Impact]
TiKV and TiFlash CPU usage reached 100%. After scaling out, some tables contain rows with only IDs; all other fields are empty.

CPU 100% image:

After restarting TiKV and TiFlash, their status shows N/A, so the actual state cannot be determined

Checking system info monitoring, found anomalies in network and TCP:

PD error log:

After scaling out and restarting, we found that the business data was abnormal: a large number of rows contain only IDs, with all other fields empty, which indicates data loss. As shown below:

[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

| username: Jellybean | Original post link

Please check the PD logs to see if there are any anomalies.

| username: TiDBer_vZ6DLO0F | Original post link

The cluster has crashed, and the nodes where TiKV and TiFlash are located have their CPUs at 100%.

| username: zhaokede | Original post link

Has the data replication progress been completed?

| username: 有猫万事足 | Original post link

7 PDs?

Among them, 6 PDs are deployed together with TiKV and TiFlash, which is not recommended.
PDs may struggle to get CPU execution time.
It is recommended to deploy PDs together with TiDB.

I suspect the PDs are failing to start, which is why the entire cluster is not coming up.
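For context on why starved PDs can stall the whole cluster: PD embeds etcd, which needs a Raft majority of its members to be responsive before it can elect a leader. The sketch below only illustrates that arithmetic. It assumes the worst case for the numbers mentioned in this thread (7 PDs, 6 of them on hosts whose CPU is saturated and therefore unresponsive); the function names are illustrative and not part of any TiDB tool.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a Raft majority."""
    return members // 2 + 1


def can_elect_leader(members: int, responsive: int) -> bool:
    """True if enough members are responsive to form a majority."""
    return responsive >= quorum(members)


if __name__ == "__main__":
    total_pd = 7    # PD count mentioned in this thread
    starved_pd = 6  # assumed unresponsive: co-located with TiKV/TiFlash at 100% CPU
    healthy_pd = total_pd - starved_pd
    print("quorum needed:", quorum(total_pd))
    print("leader possible with only the healthy PDs:",
          can_elect_leader(total_pd, healthy_pd))
```

Under these assumptions a single healthy PD cannot form a majority on its own, which would be consistent with the cluster failing to come up.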

| username: xiaoqiao | Original post link

Print out the logs to check the specific situation.

| username: TIDB-Learner | Original post link

It is recommended to deploy TiKV and TiFlash independently. TiDB and PD can be deployed together.
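As a rough way to audit a layout against this advice, here is a minimal Python sketch that flags hosts running component pairs the replies in this thread advise against. The host names and the sample layout are hypothetical (the real topology was only posted as a screenshot), and the list of discouraged pairs is just a summary of the suggestions here, not an official rule set.

```python
from collections import defaultdict

# Pairs that replies in this thread advise against placing on one host.
DISCOURAGED_PAIRS = [("tikv", "tiflash"), ("pd", "tikv"), ("pd", "tiflash")]


def find_colocation_conflicts(instances):
    """Return {host: [(a, b), ...]} for hosts running a discouraged pair.

    `instances` is an iterable of (host, component) tuples.
    """
    by_host = defaultdict(set)
    for host, component in instances:
        by_host[host].add(component.lower())

    conflicts = {}
    for host, components in by_host.items():
        hits = [pair for pair in DISCOURAGED_PAIRS
                if pair[0] in components and pair[1] in components]
        if hits:
            conflicts[host] = hits
    return conflicts


if __name__ == "__main__":
    # Hypothetical layout used only to exercise the check.
    layout = [
        ("host-a", "pd"), ("host-a", "tikv"), ("host-a", "tiflash"),
        ("host-b", "pd"), ("host-b", "tidb"),
    ]
    for host, pairs in find_colocation_conflicts(layout).items():
        print(host, pairs)
```

A host that runs only TiDB and PD, like `host-b` in the sample, is not reported.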

| username: 呢莫不爱吃鱼 | Original post link

Separate TiKV and TiFlash; the CPU is maxed out.

| username: DBAER | Original post link

This is a resource contention issue. TiKV and TiFlash are deployed on the same host. Also, why are there so many PDs? What is the architecture design?

| username: 友利奈绪 | Original post link

TiKV and TiFlash are deployed on the same host, competing for resources.

| username: dba远航 | Original post link

Check if there are any anomalies in PD.

| username: tidb菜鸟一只 | Original post link

Your topology setup is quite puzzling. Deploying PD, TiDB, TiKV, and TiFlash on those machines with 122GB of memory… It would be better to find two machines to deploy TiFlash separately, deploy TiDB and PD on machines with 60GB of memory, and use the remaining 122GB machines to deploy two TiKV instances.

TiFlash can easily consume the entire server’s CPU. If your PD leader is also on this machine, it could cause the PD to crash.

Right now, you can’t even start PD, so TiKV and TiFlash definitely won’t start. You could try starting PD separately to see if it works.

| username: TiDBer_JUi6UvZm | Original post link

TiKV and TiFlash have no resource isolation, and Regions are spread widely across these 6 machines, so TP and AP workloads compete for resources on every host. That is probably the main cause, right?

| username: TiDBer_JUi6UvZm | Original post link

What exactly did you expand in this scaling operation? Was the CPU usage normal before the scaling? Was the business functioning normally? Were there any issues with the data? Additionally, can you identify any errors from the logs?

| username: TiDBer_JUi6UvZm | Original post link

The urgent task is to quickly investigate the scope of the data impact. Check whether there are any backups of the data, whether it needs to be restored, and how to restore it. It is recommended to separate TiKV and TiFlash in the future and to follow the official deployment recommendations.

| username: TiDBer_vZ6DLO0F | Original post link

This deployment topology is hard to describe. After scaling and restarting, the database is back to normal, but data was lost and is currently being replenished. We need to quickly adjust the deployment architecture moving forward. The operations team is currently investigating the cause of the issue, and the impact of the incident is significant. :face_exhaling:

| username: TiDBer_vZ6DLO0F | Original post link

The PD error logs are posted in the problem description above. At the time of the issue, there were a large number of error logs.

| username: TiDBer_vZ6DLO0F | Original post link

I’m very curious about one thing right now: why does TiDB lose underlying data? The data in the table only has IDs, and all other fields are empty. There is no reasonable explanation for this, and I haven’t found any information through searches.

| username: 托马斯滑板鞋 | Original post link

Set the TiFlash replica to 0, then check to see if the data is still there.
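A minimal sketch of what that could look like, written as a small Python helper that only builds the SQL text (it does not connect to any database). The schema and table names are hypothetical placeholders. `ALTER TABLE ... SET TIFLASH REPLICA` and the `information_schema.tiflash_replica` table are standard TiDB features, but please review the statements before running anything on a production cluster.

```python
def tiflash_replica_statements(schema: str, table: str, replicas: int = 0):
    """Build SQL to change a table's TiFlash replica count and inspect the result."""
    if replicas < 0:
        raise ValueError("replicas must be >= 0")
    qualified = f"`{schema}`.`{table}`"
    return [
        # Change the number of TiFlash replicas for the table (0 removes them).
        f"ALTER TABLE {qualified} SET TIFLASH REPLICA {replicas};",
        # Inspect the replica status recorded for this table.
        "SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS "
        "FROM information_schema.tiflash_replica "
        f"WHERE TABLE_SCHEMA = '{schema}' AND TABLE_NAME = '{table}';",
        # Re-read a few rows to see whether the non-ID fields are populated.
        f"SELECT * FROM {qualified} LIMIT 10;",
    ]


if __name__ == "__main__":
    # `app_db` and `orders` are made-up names for illustration.
    for stmt in tiflash_replica_statements("app_db", "orders"):
        print(stmt)
```

If the rows look complete once the table no longer has a TiFlash replica, that would point at the TiFlash copies rather than at the data stored in TiKV.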

| username: tidb菜鸟一只 | Original post link

Try manually executing the business SQL on the TiDB server to see if this situation occurs.
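One way to make that manual check more informative, offered as an assumption rather than something the poster suggested, is to run the same query once per storage engine by switching the session variable `tidb_isolation_read_engines`. The Python sketch below only builds the SQL text; the table and column names are hypothetical.

```python
def engine_comparison_statements(schema, table, id_column, other_columns):
    """Build SQL that counts 'ID-only' rows when reading via TiKV and via TiFlash."""
    if not other_columns:
        raise ValueError("other_columns must not be empty")
    qualified = f"`{schema}`.`{table}`"
    all_null = " AND ".join(f"`{col}` IS NULL" for col in other_columns)
    count_sql = (
        f"SELECT COUNT(*) FROM {qualified} "
        f"WHERE `{id_column}` IS NOT NULL AND {all_null};"
    )
    return [
        # Restrict this session to TiKV reads, then count the suspicious rows.
        "SET SESSION tidb_isolation_read_engines = 'tikv,tidb';",
        count_sql,
        # Restrict this session to TiFlash reads and run the same count.
        "SET SESSION tidb_isolation_read_engines = 'tiflash,tidb';",
        count_sql,
    ]


if __name__ == "__main__":
    # Made-up table and columns for illustration.
    for stmt in engine_comparison_statements(
            "app_db", "orders", "id", ["customer_id", "amount"]):
        print(stmt)
```

If the two counts differ, the problem is more likely in one engine's copy of the data than in the application's SQL.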