Execute SQL Panic: TiDB Load mysql.user Failed Due to Out of Memory

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: execute sql panic tidb load mysql.user fail out of memory

| username: Steve阿辉

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.1.2
[Reproduction Path] What operations were performed that caused the issue.

Last night, we made some parameter changes. Two operations were involved, and we are not sure which one caused the issue. From the MySQL client we executed: set GLOBAL tidb_mem_quota_query = 10240000; — about 9.8 MB. We had initially set a larger value, but the query results never changed, so we modified it several times and eventually removed a few zeros.
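For reference, a minimal sketch of that change plus a verification query, with placeholder connection details (host, user, and password are not from the original post):

    # Set the per-query memory quota to 10240000 bytes (~9.8 MB):
    mysql -h <tidb-host> -P 4000 -u root -p -e "SET GLOBAL tidb_mem_quota_query = 10240000;"
    # The GLOBAL value updates immediately, but sessions that are already open keep
    # their old SESSION value, which can make the change look ineffective from an
    # existing client connection:
    mysql -h <tidb-host> -P 4000 -u root -p -e "SELECT @@GLOBAL.tidb_mem_quota_query;"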

After noticing no effect, we went into the cluster configuration and modified two parameters: tidb_mem_quota_query and performance.txn-total-size-limit. After a rolling update, TiDB could not start and kept reporting errors.

Upon further investigation, we found that the performance.txn-total-size-limit was set smaller than tidb_mem_quota_query.

At the time of these operations the cluster had two PDs, two TiDBs, and three TiKVs. The server at IP .71 hosted both a PD and a TiDB, and that TiDB did not update successfully during the rolling update. By then the TiDB instance at IP .70 had already crashed, but we could still connect through .71, although it threw a memory-overflow error.

After restarting the TiDB cluster, PD could not elect a leader. We tried many approaches without success, so we restored the whole cluster to the 2 AM snapshot from the 25th: every disk, including the TiUP machine, except the TiKV data disks.

[Encountered Issue: Symptoms and Impact] Currently, the TiDB node cannot start, while the other nodes start normally. My guess is that either a system table is corrupted or the parameters were persisted into TiKV. We are not sure how to fix this, and the production environment is currently down. Please help us.
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: wakaka | Original post link

How did you modify the parameters? Can’t you use edit-config to make the changes and then reload?
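A minimal sketch of that flow, assuming the cluster is named tidb-prod as mentioned in a later reply:

    tiup cluster edit-config tidb-prod       # opens the topology file in an editor
    tiup cluster reload tidb-prod -R tidb    # pushes the config and restarts the tidb role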

| username: Steve阿辉 | Original post link

We are now going into the backend, editing the parameters, and then reloading, but it still cannot start.

| username: 考试没答案 | Original post link

Is the error still the same as before? If it is different, please post the new error.

| username: 考试没答案 | Original post link

Can you stop the cluster?

Try making changes in the config file.
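A rough sketch of that approach, assuming the cluster name and deploy paths shown in the topology posted later in this thread:

    tiup cluster stop tidb-prod
    # Edit the generated instance config directly on the TiDB host:
    vi /data/tidb-deploy/tidb-4000/conf/tidb.toml
    tiup cluster start tidb-prod

Note that tiup regenerates this file on the next reload, so any manual edit is temporary.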

| username: wakaka | Original post link

Which backend are you referring to? Have you modified it using tiup edit-config?

| username: 考试没答案 | Original post link

By "backend" he might mean the conf/*.toml file. I suggest adding the parameters through edit-config and then restarting the entire cluster to see if it comes up.

| username: 考试没答案 | Original post link

I tested this on my side: the cluster can still start with these parameters, so the failure to start is not caused by the parameters themselves. Setting them very small, though, might be what triggers the OOM behavior.

| username: Steve阿辉 | Original post link

Here is the modified configuration:

global:
  user: tidb
  ssh_port: 22
  ssh_type: builtin
  deploy_dir: /data/tidb-deploy
  data_dir: /data/tidb-data
  os: linux
monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
  deploy_dir: /data/tidb-deploy/monitor-9100
  data_dir: /data/tidb-data/monitor-9100
  log_dir: /data/tidb-deploy/monitor-9100/log
server_configs:
  tidb:
    binlog.enable: false
    binlog.ignore-error: false
    log.slow-threshold: 300
    mem-quota-query: 1073741824
    new_collations_enabled_on_first_bootstrap: false
    performance.txn-total-size-limit: 1073741824
  tikv:
    readpool.coprocessor.use-unified-pool: true
    readpool.storage.use-unified-pool: false
  pd:
    schedule.leader-schedule-limit: 4
    schedule.region-schedule-limit: 2048
    schedule.replica-schedule-limit: 64
  tidb_dashboard: {}
  tiflash: {}
  tiflash-learner: {}
  pump: {}
  drainer: {}
  cdc: {}
  kvcdc: {}
  grafana: {}
tidb_servers:
  - host: 172.16.16.70
    ssh_port: 22
    port: 4000
    status_port: 10080
    deploy_dir: /data/tidb-deploy/tidb-4000
    log_dir: /data/tidb-deploy/tidb-4000/log
    arch: amd64
    os: linux
tikv_servers:
  - host: 172.16.16.80
    ssh_port: 22
    port: 20160
    status_port: 20180
    deploy_dir: /data/tidb-deploy/tikv-20160
    data_dir: /data/tidb-data/tikv-20160
    log_dir: /data/tidb-deploy/tikv-20160/log
    arch: amd64
    os: linux
  - host: 172.16.16.81
    ssh_port: 22
    port: 20160
    status_port: 20180
    deploy_dir: /data/tidb-deploy/tikv-20160
    data_dir: /data/tidb-data/tikv-20160

This is the current configuration.

| username: Steve阿辉 | Original post link

I currently suspect that this parameter is set too low, causing the user permission table (mysql.user) to fail to load. I edited the configuration with tiup cluster edit-config tidb-prod; the configuration is posted in another reply, please take a look. I do not know what to do at the moment.

| username: Steve阿辉 | Original post link

Since we restored the system disks of the TiUP, TiDB, PD, and TiKV nodes, the only thing that could still be affected is the TiKV data disks. We therefore suspect that the persisted parameters were written into TiKV, which is why TiDB fails to start. We are now looking for a way to modify the persisted parameter values stored in TiKV.
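For what it is worth, persisted global variables live in an ordinary system table, so if any TiDB instance will still accept connections, the stored value can be inspected there. A sketch with placeholder connection details:

    # Persisted global variables are stored in the mysql.GLOBAL_VARIABLES table in TiKV:
    mysql -h <tidb-host> -P 4000 -u root -p -e \
      "SELECT VARIABLE_NAME, VARIABLE_VALUE FROM mysql.GLOBAL_VARIABLES WHERE VARIABLE_NAME = 'tidb_mem_quota_query';"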

| username: Steve阿辉 | Original post link

From the official documentation we learned that this configuration-file parameter was removed in version 6.1 and no longer takes effect; it has been migrated to a system variable whose value is persisted in the cluster. See TiDB 6.1.0 Release Notes | PingCAP 文档中心.

In 6.1, tidb_mem_quota_query gained GLOBAL scope, and its value can be persisted to the cluster.

The TiDB configuration item mem-quota-query was removed and replaced by the system variable tidb_mem_quota_query, so the configuration item no longer takes effect; any change must go through the system variable. We are looking for an official way to modify it when TiDB cannot start and the variable therefore cannot be changed through a client.

| username: Minorli-PingCAP | Original post link

You can modify the TOML file directly, but tiup will overwrite it later, so treat that only as a temporary measure.

| username: h5n1 | Original post link

Global variables are persisted in TiKV and can only be modified through TiDB.
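So if one TiDB instance can still be brought up far enough to accept a connection (as was reportedly still possible on the .71 node earlier), the variable can be raised back through SQL. A sketch with placeholder connection details:

    # Restore a sane per-query quota (1 GiB here) through any reachable TiDB instance:
    mysql -h <tidb-host> -P 4000 -u root -p -e "SET GLOBAL tidb_mem_quota_query = 1073741824;"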