PD Leader OOM

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: PD leader oom

| username: TiDBer_yyy

[TiDB Usage Environment] Production Environment
[TiDB Version] 7.1.0
[Reproduction Path] None
[Encountered Problem: Problem Phenomenon and Impact]
The PD leader node hit an OOM, and TiDB's requests to PD failed. Log:

[2024/01/17 15:30:11.960 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=182.652093ms] [expected-duration=100ms] [prefix=] [request="header:<ID:15889696433221208426 > txn:<compare:<target:MOD key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-enterprise-collection\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/prod-tidb-rd-stat-deal-suffix\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-corporate\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-realtime\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-wxpay-datalake\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/ticdc-v504-rd-payment-orders-many-prd\" mod_revision:7048499857  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-alipay\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/ticdc-v504-stat-ads-adi-prd\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/broker-mutex-user-remit-period\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/ticdc-v504-broker-stat-v3-income-record-prd\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-enterprise\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/ticdc-v504-rd-broker-bills-many-prod\" mod_revision:7048499857  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/prod-tidb-broker-vol3\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/ticdc-v504-stat-online-bp-many-prod-c\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-external\" 
mod_revision:7048499856  target:VALUE key:\"/tidb/cdc/default/__cdc_meta__/meta/ticdc-delete-etcd-key-count\" value_size:2 > success:<request_put:<key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-enterprise-collection\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/prod-tidb-rd-stat-deal-suffix\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-corporate\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-realtime\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-wxpay-datalake\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/ticdc-v504-rd-payment-orders-many-prd\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-alipay\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/ticdc-v504-stat-ads-adi-prd\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/broker-mutex-user-remit-period\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/ticdc-v504-broker-stat-v3-income-record-prd\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-enterprise\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/ticdc-v504-rd-broker-bills-many-prod\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/prod-tidb-broker-vol3\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/ticdc-v504-stat-online-bp-many-prod-c\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-external\" value_size:130 >> failure:<>>"] [response=size:190] []
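A log line this long is hard to read by eye. The following is a minimal Python sketch, not an official tool, that pulls out the apply duration and the keys written by the transaction. The function name `summarize_slow_apply` is hypothetical, and the regular expressions assume the exact `[took=...ms]` and `request_put:<key:\"...\"` shapes seen in the line above.

```python
import re


def summarize_slow_apply(log_line):
    """Summarize one 'apply request took too long' log line.

    Assumes durations are printed in milliseconds, as in the line above;
    a value printed in another unit is reported as None.
    """
    took = re.search(r"\[took=([0-9.]+)ms\]", log_line)
    expected = re.search(r"\[expected-duration=([0-9.]+)ms\]", log_line)
    # Keys appear as key:\"...\" inside the quoted request field.
    put_keys = re.findall(r'request_put:<key:\\?"([^"\\]+)', log_line)
    return {
        "took_ms": float(took.group(1)) if took else None,
        "expected_ms": float(expected.group(1)) if expected else None,
        "put_count": len(put_keys),
        # Parent paths of the written keys, to see which component wrote them.
        "key_dirs": sorted({key.rsplit("/", 1)[0] for key in put_keys}),
    }
```

Applied to the line above, this kind of summary makes it easier to see that the slow request is a single transaction writing many keys under `/tidb/cdc/default/default/changefeed/status`.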

| username: tidb菜鸟一只 | Original post link

Is the machine's memory being used up entirely by PD, or are there other processes on it? Is this a mixed deployment?
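One way to answer this is to rank processes by resident memory. Below is a rough Python sketch that parses the text printed by `ps -eo rss,comm` on Linux, where RSS is reported in KiB. The function name `top_memory_processes` is hypothetical and not part of any TiDB tooling.

```python
def top_memory_processes(ps_output, limit=5):
    """Return the largest processes by RSS as (command, rss_mib) pairs.

    ps_output is the text printed by `ps -eo rss,comm`, header included.
    """
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # ignore lines that do not look like "<rss> <command>"
        rss_kib, command = parts
        rows.append((command.strip(), int(rss_kib) / 1024))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]
```

On the PD host, the input could come from `subprocess.run(["ps", "-eo", "rss,comm"], capture_output=True, text=True).stdout`. If `pd-server` is not at the top of the list, another process is the more likely cause.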

| username: wangccsy | Original post link

Is the memory allocation too small?

| username: ShawnYan | Original post link

Is your CDC having issues?

/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-external" value_size:130 >> failure:<>>

| username: TIDB-Learner | Original post link

I suspect TiCDC and PD are deployed together. TiCDC replication may have been interrupted, and when it restarted it ran out of memory, which led to the corresponding errors in PD.

| username: TiDBer_yyy | Original post link

CDC is working normally, and it is deployed on a separate machine.

| username: TiDBer_yyy | Original post link

pd-server is deployed independently.

| username: 逍遥_猫 | Original post link

Could you provide two more lines above and below the error log, along with the cluster configuration?

| username: dba远航 | Original post link

PD is very unlikely to run out of memory. Check the Region situation and whether the concurrency is too high.

| username: Jellybean | Original post link

OOM is rare on PD nodes, and given how PD works it is not something you would normally expect.

Confirm whether other processes are deployed on the same machine and driving up memory usage.

Confirm the data scale of the cluster, the number of Regions, and similar information.

Also, focus on the PD node's log, pd.log, for the period before the restart: check the content before and after the "Welcome" keyword that marks a restart, and see whether anything looks abnormal.
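The log check described above can be sketched in Python as follows. This is only an illustration: the function name `restart_context` and the default window sizes are hypothetical, and it assumes that every PD startup writes a line containing "Welcome" to pd.log.

```python
def restart_context(lines, keyword="Welcome", before=50, after=20):
    """Return one (line_number, window) pair per line containing keyword.

    line_number is 1-based. window holds up to `before` lines preceding the
    match, the matching line itself, and up to `after` lines following it.
    """
    lines = [line.rstrip("\n") for line in lines]
    hits = []
    for index, line in enumerate(lines):
        if keyword in line:
            start = max(0, index - before)
            hits.append((index + 1, lines[start:index + after + 1]))
    return hits
```

For example, `restart_context(open("pd.log", encoding="utf-8", errors="replace"))` returns the lines surrounding each restart, where the file path is a placeholder for the actual log location. The lines just before each marker are the ones most likely to show what PD was doing before it was killed.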

| username: TiDBer_yyy | Original post link

The number of Regions is fairly large, over 2.3 million.
The logs contain only INFO and WARN entries.

At that time, I was checking the number of Regions with the SQL below. I am not sure whether it is related.

SELECT s.store_id, s.address, count(distinct r.REGION_ID) 
FROM INFORMATION_SCHEMA.TIKV_REGION_STATUS as r, INFORMATION_SCHEMA.TIKV_REGION_PEERS as p, INFORMATION_SCHEMA.TIKV_STORE_STATUS as s
WHERE r.REGION_ID = p.REGION_ID AND p.STORE_ID = s.STORE_ID