Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: pd处理hot region时panic (PD panics when handling hot regions)
[TiDB Usage Environment] Production Environment
[TiDB Version] TiKV v6.1.0
[Reproduction Path] The environment experienced a power outage; some TiKV nodes failed and kept restarting, and the PD nodes periodically panicked and restarted.
[Encountered Problem: Symptoms and Impact]
After the TiKV cluster restarted following the power outage, some nodes could not recover. PD can serve requests but restarts periodically (every ten to twenty minutes), with panic messages related to handling hot regions. Because tikv-ctl bad-ssts needs a live PD connection, each PD restart aborts the bad-SST scan, making it impossible to repair the TiKV nodes any further.
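For reference, a typical bad-ssts invocation looks roughly like the following; the data directory and PD endpoint here are placeholders, not the actual values from this environment:

```
# Run against the stopped TiKV instance's data directory;
# the scan talks to PD, so it aborts whenever PD restarts mid-run.
tikv-ctl --data-dir /path/to/tikv/data bad-ssts --pd http://127.0.0.1:2379
```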
[Attachment: Screenshot/Log/Monitoring] PD panic log:
```
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1bc477e]
goroutine 488 [running]:
github.com/tikv/pd/server.(*Handler).packHotRegions(0xc0005ada70, 0x1a38d66?, {0x2708d13, 0x4})
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/handler.go:1050 +0x37e
github.com/tikv/pd/server.(*Handler).PackHistoryHotReadRegions(0xc001f3ee30?)
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/handler.go:1005 +0x3e
github.com/tikv/pd/server/storage.(*HotRegionStorage).pullHotRegionInfo(0xc0008b8980)
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/storage/hot_region_storage.go:258 +0x2e
github.com/tikv/pd/server/storage.(*HotRegionStorage).backgroundFlush(0xc0008b8980)
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/storage/hot_region_storage.go:218 +0x195
created by github.com/tikv/pd/server/storage.NewHotRegionsStorage
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/storage/hot_region_storage.go:159 +0x21b
```
The environment experienced a power outage…
Can the PD node still provide services normally? Or is there a problem with the PD as well?
Have there been many user reports related to power outages?
PD can provide services normally, but the tikv-ctl tool needs to connect to PD, and if PD restarts partway through, the tool fails.
From the error logs, it seems like a bug in PD.
It’s already in production, don’t you have a UPS to keep it running?
First, you need to ensure that PD is working properly. If PD is not functioning correctly, fix PD first…
Restarting at irregular intervals is abnormal, isn’t it? The key issue is that I don’t know how to fix this error; I’ve just stopped PD scheduling for now.
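(For anyone hitting the same thing: one common way to pause PD scheduling is to zero the schedule limits via pd-ctl. A rough sketch, assuming a placeholder PD endpoint of http://127.0.0.1:2379; record the old values first so they can be restored afterwards.)

```
# Record the current values before changing anything
pd-ctl -u http://127.0.0.1:2379 config show
# Disable the main background schedulers by setting their limits to 0
pd-ctl -u http://127.0.0.1:2379 config set region-schedule-limit 0
pd-ctl -u http://127.0.0.1:2379 config set replica-schedule-limit 0
pd-ctl -u http://127.0.0.1:2379 config set merge-schedule-limit 0
pd-ctl -u http://127.0.0.1:2379 config set hot-region-schedule-limit 0
```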
How many PD nodes do you have? Did all of them lose power? Try transferring the PD leader.
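A leader transfer is a pd-ctl one-liner; a minimal sketch, where the endpoint and the target member name `pd-2` are placeholders:

```
# List members to identify the current leader and a healthy target
pd-ctl -u http://127.0.0.1:2379 member
# Hand leadership over to the named member
pd-ctl -u http://127.0.0.1:2379 member leader transfer pd-2
```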
Did you get it sorted out in the end?
It might be related to the following issue.
Issue opened 02:42 AM 13 Sep 22 UTC, closed 11:33 AM 13 Sep 22 UTC. Labels: type/bug, severity/major, affects-6.1.
## Bug Report
### What did you do?
### What did you expect to see?…
### What did you see instead?

```
[2022/09/11 13:18:38.927 +00:00] [INFO] [operator_controller.go:450] ["add operator"] [region-id=86007] [operator="\"rule-split-region {split: region 86007 use policy USEKEY and keys [7480000000000000FF5800000000000000F8]} (kind:split, region:86007(61, 39), createAt:2022-09-11 13:18:38.92786507 +0000 UTC m=+116.187335222, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, size:1, steps:[split region with policy USEKEY])\""] [additional-info="{\"region-end-key\":\"7480000000000000FF5900000000000000F8\",\"region-start-key\":\"7480000000000000FF5700000000000000F8\"}"]
[2022/09/11 13:18:38.928 +00:00] [INFO] [operator_controller.go:652] ["send schedule command"] [region-id=86007] [step="split region with policy USEKEY"] [source=create]
[2022/09/11 13:18:38.928 +00:00] [FATAL] [log.go:72] [panic] [recover="\"invalid memory address or nil pointer dereference\""] [stack="github.com/tikv/pd/pkg/logutil.LogPanic\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/pkg/logutil/log.go:72\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:838\nruntime.panicmem\n\t/usr/local/go/src/runtime/panic.go:220\nruntime.sigpanic\n\t/usr/local/go/src/runtime/signal_unix.go:818\ngithub.com/tikv/pd/server/schedule/checker.(*RuleChecker).fixLooseMatchPeer\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/schedule/checker/rule_checker.go:242\ngithub.com/tikv/pd/server/schedule/checker.(*RuleChecker).fixRulePeer\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/schedule/checker/rule_checker.go:156\ngithub.com/tikv/pd/server/schedule/checker.(*RuleChecker).CheckWithFit\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/schedule/checker/rule_checker.go:118\ngithub.com/tikv/pd/server/schedule/checker.(*Controller).CheckRegion\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/schedule/checker/checker_controller.go:96\ngithub.com/tikv/pd/server/cluster.(*coordinator).patrolRegions\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/cluster/coordinator.go:149"]
```
### What version of PD are you using (`pd-server -V`)?
```
/ # /pd-server -V
Release Version: v6.1.1
Edition: Community
Git Commit Hash: 4ab9c0ef123441a0ef279bf9d2e36d1abe4a14c1
Git Branch: heads/refs/tags/v6.1.1
UTC Build Time: 2022-08-23 08:25:18
```
The 6.1.2 release notes mention that this issue has been fixed.
I suggest upgrading to 6.1.2 and giving it a try.
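For a TiUP-managed cluster, the upgrade would look roughly like this; the cluster name is a placeholder:

```
# Rolling upgrade of all components to v6.1.2
tiup cluster upgrade <cluster-name> v6.1.2
```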
It is recommended to use the LTS version in the production environment.
Learn the rules for TiDB version releases:
“Development Milestone Releases (DMR) are released approximately every two months. If an LTS release occurs, the DMR release time will be postponed by two months. DMR introduces new features, improvements, and fixes. However, TiDB does not provide patch versions based on DMR, and any related bugs will be gradually fixed in subsequent version series.”
The problem with DMR versions is that they receive no subsequent patches, so they are not recommended for production environments.
This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.