Service Instability Caused by Update and Delete Operations

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 更新和删除操作引发的服务不稳定

| username: tandoy

[TiDB Usage Environment] PoC
[TiDB Version] v7.5.0
[Reproduction Path] Update and delete more than 500,000 rows in a single table
[Encountered Issue: Problem Phenomenon and Impact] The TiDB node and the TiFlash node crashed
[Resource Configuration]


[Attachment: Screenshot/Log/Monitoring]
nslookup domain advanced-tidb-pd-1.advanced-tidb-pd-peer.tidb.svc.svc success
starting pd-server …
/pd-server --data-dir=/var/lib/pd --name=advanced-tidb-pd-1 --peer-urls=http://0.0.0.0:2380 --advertise-peer-urls=http://advanced-tidb-pd-1.advanced-tidb-pd-peer.tidb.svc:2380 --client-urls=http://0.0.0.0:2379 --advertise-client-urls=http://advanced-tidb-pd-1.advanced-tidb-pd-peer.tidb.svc:2379 --config=/etc/pd/pd.toml
[2023/12/25 10:21:58.120 +08:00] [INFO] [versioninfo.go:89] [“Welcome to Placement Driver (PD)”]
[2023/12/25 10:21:58.120 +08:00] [INFO] [versioninfo.go:90] [PD] [release-version=v7.5.0]
[2023/12/25 10:21:58.120 +08:00] [INFO] [versioninfo.go:91] [PD] [edition=Community]
[2023/12/25 10:21:58.120 +08:00] [INFO] [versioninfo.go:92] [PD] [git-hash=ef6ba8551e525a700546d6bdb7ad6766115209cc]
[2023/12/25 10:21:58.120 +08:00] [INFO] [versioninfo.go:93] [PD] [git-branch=heads/refs/tags/v7.5.0]
[2023/12/25 10:21:58.120 +08:00] [INFO] [versioninfo.go:94] [PD] [utc-build-time=“2023-11-16 10:31:04”]
[2023/12/25 10:21:58.120 +08:00] [INFO] [metricutil.go:86] [“disable Prometheus push client”]
[2023/12/25 10:21:58.120 +08:00] [INFO] [server.go:249] [“PD config”] [config=“{"client-urls":"http://0.0.0.0:2379","peer-urls":"http://0.0.0.0:2380","advertise-client-urls":"http://advanced-tidb-pd-1.advanced-tidb-pd-peer.tidb.svc:2379","advertise-peer-urls":"http://advanced-tidb-pd-1.advanced-tidb-pd-peer.tidb.svc:2380","name":"advanced-tidb-pd-1","data-dir":"/var/lib/pd","force-new-cluster":false,"enable-grpc-gateway":true,"initial-cluster":"advanced-tidb-pd-1=http://advanced-tidb-pd-1.advanced-tidb-pd-peer.tidb.svc:2380","initial-cluster-state":"new","initial-cluster-token":"pd-cluster","join":"","lease":3,"log":{"level":"info","format":"text","disable-timestamp":false,"file":{"filename":"","max-size":0,"max-days":0,"max-backups":0},"development":false,"disable-caller":false,"disable-stacktrace":false,"disable-error-verbose":true,"sampling":null,"error-output-path":""},"max-concurrent-tso-proxy-streamings":5000,"tso-proxy-recv-from-client-timeout":"1h0m0s","tso-save-interval":"3s","tso-update-physical-interval":"50ms","enable-local-tso":false,"metric":{"job":"advanced-tidb-pd-1","address":"","interval":"15s"},"schedule":{"max-snapshot-count":64,"max-pending-peer-count":64,"max-merge-region-size":20,"max-merge-region-keys":0,"split-merge-interval":"1h0m0s","switch-witness-interval":"1h0m0s","enable-one-way-merge":"false","enable-cross-table-merge":"true","patrol-region-interval":"10ms","max-store-down-time":"30m0s","max-store-preparing-time":"48h0m0s","leader-schedule-limit":4,"leader-schedule-policy":"count","region-schedule-limit":2048,"witness-schedule-limit":4,"replica-schedule-limit":64,"merge-schedule-limit":8,"hot-region-schedule-limit":4,"hot-region-cache-hits-threshold":3,"store-limit":{},"tolerant-size-ratio":0,"low-space-ratio":0.8,"high-space-ratio":0.7,"region-score-formula-version":"v2","scheduler-max-waiting-operator":5,"enable-remove-down-replica":"true","enable-replace-offline-replica":"true","enable-make-up-replica":"true","enable-remove-extra-r
eplica":"true","enable-location-replacement":"true","enable-debug-metrics":"false","enable-joint-consensus":"true","enable-tikv-split-region":"true","schedulers-v2":[{"type":"balance-region","args":null,"disable":false,"args-payload":""},{"type":"balance-leader","args":null,"disable":false,"args-payload":""},{"type":"balance-witness","args":null,"disable":false,"args-payload":""},{"type":"hot-region","args":null,"disable":false,"args-payload":""},{"type":"transfer-witness-leader","args":null,"disable":false,"args-payload":""}],"schedulers-payload":null,"hot-regions-write-interval":"10m0s","hot-regions-reserved-days":7,"max-movable-hot-peer-size":512,"enable-diagnostic":"true","enable-witness":"false","slow-store-evicting-affected-store-ratio-threshold":0.3,"store-limit-version":"v1"},"replication":{"max-replicas":3,"location-labels":"host","strictly-match-label":"false","enable-placement-rules":"true","enable-placement-rules-cache":"false","isolation-level":""},"pd-server":{"use-region-storage":"true","max-gap-reset-ts":"24h0m0s","key-type":"table","runtime-services":"","metric-storage":"","dashboard-address":"auto","trace-region-flow":"true","flow-round-by-digit":3,"min-resolved-ts-persistence-interval":"1s","server-memory-limit":0,"server-memory-limit-gc-trigger":0.7,"enable-gogc-tuner":"false","gc-tuner-threshold":0.6,"block-safe-point-v1":"false"},"cluster-version":"0.0.0","labels":{},"quota-backend-bytes":"8GiB","auto-compaction-mode":"periodic","auto-compaction-retention-v2":"1h","TickInterval":"500ms","ElectionInterval":"3s","PreVote":true,"max-request-bytes":157286400,"security":{"cacert-path":"","cert-path":"","key-path":"","cert-allowed-cn":null,"SSLCABytes":null,"SSLCertBytes":null,"SSLKEYBytes":null,"redact-info-log":false,"encryption":{"data-encryption-method":"plaintext","data-key-rotation-period":"168h0m0s","master-key":{"type":"plaintext","key-id":"","region":"","endpoint":"","path":""}}},"label-property":null,"WarningMsgs":null,"DisableStrictReconfi
gCheck":false,"HeartbeatStreamBindInterval":"1m0s","LeaderPriorityCheckInterval":"1m0s","dashboard":{"tidb-cacert-path":"","tidb-cert-path":"","tidb-key-path":"","public-path-prefix":"","internal-proxy":true,"enable-telemetry":false,"enable-experimental":false},"replication-mode":{"replication-mode":"majority","dr-auto-sync":{"label-key":"","primary":"","dr":"","primary-replicas":0,"dr-replicas":0,"wait-store-timeout":"1m0s","pause-region-split":"false"}},"keyspace":{"pre-alloc":null,"wait-region-split":true,"wait-region-split-timeout":"30s","check-region-split-interval":"50ms"},"controller":{"degraded-mode-wait-duration":"0s","ltb-max-wait-duration":"30s","request-unit":{"read-base-cost":0.125,"read-per-batch-base-cost":0.5,"read-cost-per-byte":0.0000152587890625,"write-base-cost":1,"write-per-batch-base-cost":1,"write-cost-per-byte":0.0009765625,"read-cpu-ms-cost":0.3333333333333333}}}”]
[2023/12/25 10:21:58.124 +08:00] [INFO] [apiutil.go:391] [“register REST path”] [path=/pd/api/v1]
[2023/12/25 10:21:58.124 +08:00] [INFO] [apiutil.go:391] [“register REST path”] [path=/pd/api/v2/]
[2023/12/25 10:21:58.124 +08:00] [INFO] [apiutil.go:391] [“register REST path”] [path=/autoscaling]
[2023/12/25 10:21:58.124 +08:00] [INFO] [distro.go:51] [“using distribution strings”] [strings={}]
[2023/12/25 10:21:58.125 +08:00] [INFO] [apiutil.go:391] [“register REST path”] [path=/dashboard/api/]
[2023/12/25 10:21:58.125 +08:00] [INFO] [apiutil.go:391] [“register REST path”] [path=/dashboard/]
[2023/12/25 10:21:58.125 +08:00] [INFO] [registry.go:92] [“restful API service registered successfully”] [prefix=advanced-tidb-pd-1] [service-name=MetaStorage]
[2023/12/25 10:21:58.125 +08:00] [INFO] [apiutil.go:391] [“register REST path”] [path=/resource-manager/api/v1/]
[2023/12/25 10:21:58.125 +08:00] [INFO] [registry.go:92] [“restful API service registered successfully”] [prefix=advanced-tidb-pd-1] [service-name=ResourceManager]
[2023/12/25 10:21:58.125 +08:00] [INFO] [etcd.go:117] [“configuring peer listeners”] [listen-peer-urls=“[http://0.0.0.0:2380]”]
[2023/12/25 10:21:58.125 +08:00] [INFO] [systimemon.go:30] [“start system time monitor”]
[2023/12/25 10:21:58.128 +08:00] [INFO] [etcd.go:127] [“configuring client listeners”] [listen-client-urls=“[http://0.0.0.0:2379]”]
[2023/12/25 10:21:58.128 +08:00] [INFO] [etcd.go:611] [“pprof is enabled”] [path=/debug/pprof]
[2023/12/25 10:21:58.128 +08:00] [INFO] [etcd.go:305] [“starting an etcd server”] [etcd-version=3.4.21] [git-sha=“Not provided (use ./build instead of go build)”] [go-version=go1.21.3] [go-os=linux] [go-arch=amd64] [max-cpu-set=8] [max-cpu-available=8] [member-initialized=true] [name=advanced-tidb-pd-1] [data-dir=/var/lib/pd] [wal-dir=] [wal-dir-dedicated=] [member-dir=/var/lib/pd/member] [force-new-cluster=false] [heartbeat-interval=500ms] [election-timeout=3s] [initial-election-tick-advance=true] [snapshot-count=100000] [snapshot-catchup-entries=5000] [initial-advertise-peer-urls=“[http://advanced-tidb-pd-1.advanced-tidb-pd-peer.tidb.svc:2380]”] [listen-peer-urls=“[http://0.0.0.0:2380]”] [advertise-client-urls=“[http://advanced-tidb-pd-1.advanced-tidb-pd-peer.tidb.svc:2379]”] [listen-client-urls=“[http://0.0.0.0:2379]”] [listen-metrics-urls=“”] [cors=“[]“] [host-whitelist=”[]”] [initial-cluster=] [initial-cluster-state=new] [initial-cluster-token=] [quota-backend-bytes=8589934592] [max-request-bytes=157286400] [max-concurrent-streams=4294967295] [pre-vote=true] [initial-corrupt-check=false] [corrupt-check-time-interval=0s] [auto-compaction-mode=periodic] [auto-compaction-retention=1h0m0s] [auto-compaction-interval=1h0m0s] [discovery-url=] [discovery-proxy=]
[2023/12/25 10:21:58.128 +08:00] [WARN] [server.go:297] [“exceeded recommended request limit”] [max-request-bytes=157286400] [max-request-size=“157 MB”] [recommended-request-bytes=10485760] [recommended-request-size=“10 MB”]
2023-12-25 10:21:58.128809 W | pkg/fileutil: check file permission: directory “/var/lib/pd” exist, but the permission is “drwxr-xr-x”. The recommended permission is “-rwx------” to prevent possible unprivileged access to the data.
[2023/12/25 10:21:58.138 +08:00] [INFO] [backend.go:80] [“opened backend db”] [path=/var/lib/pd/member/snap/db] [took=9.273019ms]
[2023/12/25 10:21:58.426 +08:00] [INFO] [server.go:462] [“recovered v2 store from snapshot”] [snapshot-index=200002] [snapshot-size=“11 kB”]
[2023/12/25 10:21:58.427 +08:00] [INFO] [kvstore.go:388] [“restored last compact revision”] [meta-bucket-name=meta] [meta-bucket-name-key=finishedCompactRev] [restored-compact-revision=242132]
[2023/12/25 10:21:58.435 +08:00] [INFO] [server.go:480] [“recovered v3 backend from snapshot”] [backend-size-bytes=2879488] [backend-size=“2.9 MB”] [backend-size-in-use-bytes=2301952] [backend-size-in-use=“2.3 MB”]
[2023/12/25 10:21:58.819 +08:00] [INFO] [raft.go:586] [“restarting local member”] [cluster-id=9d2429267b258a82] [local-member-id=f428869ae0a40756] [commit-index=275152]
[2023/12/25 10:21:58.821 +08:00] [INFO] [raft.go:1523] [“f428869ae0a40756 switched to configuration voters=(9617113101266017370 15951949195044304961 17593459944074774358)”]
[2023/12/25 10:21:58.821 +08:00] [INFO] [raft.go:706] [“f428869ae0a40756 became follower at term 3”]
[2023/12/25 10:21:58.821 +08:00] [INFO] [raft.go:389] [“newRaft f428869ae0a40756 [peers: [8576d96975b75c5a,dd60b5469db1b841,f428869ae0a40756], term: 3, commit: 275152, applied: 200002, lastindex: 275153, lastterm: 3]”]
[2023/12/25 10:21:58.822 +08:00] [INFO] [capability.go:76] [“enabled capabilities for version”] [cluster-version=3.4]
[2023/12/25 10:21:58.822 +08:00] [INFO] [cluster.go:256] [“recovered/added member from store”] [cluster-id=9d2429267b258a82] [local-member-id=f428869ae0a40756] [recovered-remote-peer-id=8576d96975b75c5a] [recovered-remote-peer-urls=“[http://advanced-tidb-pd-2.advanced-tidb-pd-peer.tidb.svc:2380]”]
[2023/12/25 10:21:58.822 +08:00] [INFO] [cluster.go:256] [“recovered/added member from store”] [cluster-id=9d2429267b258a82] [local-member-id=f428869ae0a40756] [recovered-remote-peer-id=dd60b5469db1b841] [recovered-remote-peer-urls="[http://advanced-tidb-pd-0.advanced

| username: 饭光小团 | Original post link

Did the TiDB node run out of memory (OOM)?

| username: 像风一样的男子 | Original post link

Check with tiup list to see if pd1 is down. Have the two PDs elected a leader?

| username: tandoy | Original post link

Probably, yes. Most often it is the machine that a query client is connected to that goes down.

| username: tandoy | Original post link

Shouldn't PD be deployed with an odd number of nodes?

| username: tidb菜鸟一只 | Original post link

Is your PD down? Did you only deploy 2 PD nodes when you set it up?

| username: tidb狂热爱好者 | Original post link

The official documentation says to delete 1,000 rows at a time, right?
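
A minimal sketch of what deleting in batches could look like from a client, assuming a Python DB-API connection to TiDB and a `DELETE ... LIMIT` statement. The function name `delete_in_batches`, the table name, and the filter are hypothetical; the batch size of 1,000 is simply the figure mentioned above, not a tuned value.

```python
def delete_in_batches(conn, table, where, batch_size=1000, max_batches=None):
    """Delete rows matching `where` in small, separately committed batches.

    `conn` is assumed to be a DB-API 2.0 connection (cursor/commit).
    `table` and `where` are assumed to be trusted strings written by the
    operator; they are NOT escaped here.
    Returns the total number of rows deleted.
    """
    sql = f"DELETE FROM {table} WHERE {where} LIMIT {int(batch_size)}"
    total = 0
    batches = 0
    while max_batches is None or batches < max_batches:
        cur = conn.cursor()
        try:
            cur.execute(sql)
            affected = cur.rowcount  # rows removed by this batch
        finally:
            cur.close()
        conn.commit()  # each batch is its own small transaction
        total += affected
        batches += 1
        if affected < batch_size:
            # A short batch means no matching rows are left.
            break
    return total
```

Each batch commits on its own, so no single transaction has to hold all 500,000 rows in memory. A call might look like `delete_in_batches(conn, "orders", "created_at < '2023-01-01'")`, where `orders` and the filter are placeholders for the real table and condition.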

| username: 小龙虾爱大龙虾 | Original post link

That's to be expected. When TiDB is deployed on k8s, avoid making large changes in one go.

| username: 像风一样的男子 | Original post link

Yes, a leader needs to be elected, and two nodes will cause a split-brain scenario.
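
For reference, the majority arithmetic behind the odd-number recommendation can be sketched as follows. PD embeds etcd (the log above shows `etcd-version=3.4.21`), which uses Raft, so a leader needs votes from a majority of members. The helper names `quorum` and `tolerated_failures` are made up for this illustration. Note that under this arithmetic, a two-member group that loses one member cannot elect a leader at all, rather than ending up with two leaders.

```python
def quorum(members: int) -> int:
    """Smallest number of votes that forms a majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(f"{n} PD members: quorum={quorum(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Two members tolerate no more failures than one, and four no more than three, which is why an even member count adds cost without adding availability.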

| username: tandoy | Original post link

After restarting a node, updating around 400,000 rows causes the cluster to crash again. Should we switch from a k8s deployment to a physical machine deployment?

| username: 像风一样的男子 | Original post link

On k8s the server resources are shared and there is additional network overhead, so it definitely performs much worse than a dedicated server.

| username: 像风一样的男子 | Original post link

Check the monitoring to see which component has failed, and first investigate where the problem lies.

| username: tandoy | Original post link

We are running a mixed deployment. Typically, one TiDB and one PD go down; TiDB then restarts by itself and PD goes offline.

| username: TIDB-Learner | Original post link

In a mixed deployment, deleting and modifying a large amount of data caused TiDB, TiFlash, and other nodes to crash and restart. It appears to be due to insufficient resources.

| username: dba远航 | Original post link

I suggest checking the monitoring status.

| username: zhanggame1 | Original post link

It looks like there is not enough memory.

| username: 春风十里 | Original post link

The maximum size of a single transaction that TiDB supports is 10 GB. What is the estimated size of your 500,000 records?
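
A rough way to answer that question is to multiply the row count by an average row size. The sketch below does only that arithmetic; the 1 KiB row size, the two indexes, and the 64-byte index entries are invented placeholder numbers, and the estimate ignores any internal overhead TiDB adds per key.

```python
def estimate_txn_bytes(rows, avg_row_bytes, index_count=0, avg_index_entry_bytes=0):
    """Rough payload estimate: row data plus one entry per index per row."""
    return rows * (avg_row_bytes + index_count * avg_index_entry_bytes)


LIMIT_BYTES = 10 * 1024**3  # the 10 GB figure quoted above, taken as GiB

# Hypothetical workload: 500,000 rows of about 1 KiB with two secondary indexes.
estimate = estimate_txn_bytes(
    rows=500_000,
    avg_row_bytes=1024,
    index_count=2,
    avg_index_entry_bytes=64,
)
print(f"estimated size: {estimate / 1024**2:.1f} MiB "
      f"({estimate / LIMIT_BYTES:.1%} of the limit)")
```

Replacing the placeholder sizes with figures measured on the real table shows whether the transaction size itself is plausibly the problem, or whether memory pressure on the nodes is the more likely cause.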