Panic Occurs When PD Handles Hot Regions

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: pd处理hot region时panic

| username: Timber

[TiDB Usage Environment] Production Environment
[TiDB Version] TiKV v6.1.0
[Reproduction Path] The environment lost power. Afterwards some TiKV nodes failed and kept restarting, and the PD nodes panicked and restarted periodically.
[Encountered Problem: Symptoms and Impact]
After the TiKV cluster restarted following the power outage, some nodes could not recover. PD can still serve requests, but it restarts periodically (every ten to twenty minutes) with a panic related to handling hot regions. Because tikv-ctl needs PD, running tikv-ctl bad-ssts to detect bad SST files fails whenever PD restarts mid-run, so the TiKV nodes cannot be repaired any further.

[Attachment: Screenshot/Log/Monitoring] PD panic log:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1bc477e]

goroutine 488 [running]:
github.com/tikv/pd/server.(*Handler).packHotRegions(0xc0005ada70, 0x1a38d66?, {0x2708d13, 0x4})
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/handler.go:1050 +0x37e
github.com/tikv/pd/server.(*Handler).PackHistoryHotReadRegions(0xc001f3ee30?)
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/handler.go:1005 +0x3e
github.com/tikv/pd/server/storage.(*HotRegionStorage).pullHotRegionInfo(0xc0008b8980)
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/storage/hot_region_storage.go:258 +0x2e
github.com/tikv/pd/server/storage.(*HotRegionStorage).backgroundFlush(0xc0008b8980)
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/storage/hot_region_storage.go:218 +0x195
created by github.com/tikv/pd/server/storage.NewHotRegionsStorage
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/storage/hot_region_storage.go:159 +0x21b

| username: xfworld

The environment experienced a power outage… :rofl:

Can the PD nodes still serve requests normally, or is PD affected as well?

| username: Timber

Do you get a lot of user reports about power outages? :smiling_face_with_tear:
PD can serve requests normally, but tikv-ctl needs to connect to PD, and if PD restarts while the tool is running, the tool fails.
From the error logs, it seems like a bug in PD.

| username: xfworld

It’s already in production, don’t you have a UPS to keep it running?

First, you need to ensure that PD is working properly. If PD is not functioning correctly, fix PD first…

| username: Timber

Isn’t restarting at irregular intervals abnormal in itself? The key issue is that I don’t know how to fix this error. For now I have just stopped PD scheduling.

| username: tidb菜鸟一只

How many PD nodes do you have? Did all of them lose power? Try switching the PD leader to another node.

| username: Anna

Wow, fix PD first.

| username: redgame

Try a normal restart.

| username: zhanggame1

Did you get it sorted out in the end?

| username: 有猫万事足

It might be related to the following issue.

The 6.1.2 release notes mention that this issue has been fixed.

I suggest upgrading to 6.1.2 and giving it a try.

| username: Timber

I opened an issue on GitHub: "pd panic if peer has no leader" (tikv/pd, Issue #6647). Someone more experienced pointed out that it had already been fixed by the pull request "statistics: show stores of peers in pd-ctl output" (tikv/pd, Pull Request #5330, by lhy1024). Looking at the changes, that PR does fix this error. The fix is included in v6.3.0.
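
To illustrate the failure mode the issue title describes, here is a minimal sketch in Go. The types and the function `packHotRegionsSketch` are hypothetical, simplified stand-ins and are not PD's actual code or the actual change in the pull request; the sketch only assumes that a region can temporarily have no leader (for example after a power outage) and that dereferencing the missing leader is what triggers the nil pointer panic.

```go
package main

// Peer, Region and HotRegionRecord are hypothetical stand-ins,
// not PD's real types.
type Peer struct {
	ID      uint64
	StoreID uint64
}

type Region struct {
	ID     uint64
	Leader *Peer // nil while the region has no leader
}

type HotRegionRecord struct {
	RegionID      uint64
	LeaderStoreID uint64
}

// packHotRegionsSketch converts regions into records. It skips nil
// regions and regions without a leader instead of dereferencing a nil
// pointer, which is the kind of guard that avoids this class of panic.
func packHotRegionsSketch(regions []*Region) []HotRegionRecord {
	records := make([]HotRegionRecord, 0, len(regions))
	for _, r := range regions {
		if r == nil || r.Leader == nil {
			continue // leaderless region: nothing safe to read here
		}
		records = append(records, HotRegionRecord{
			RegionID:      r.ID,
			LeaderStoreID: r.Leader.StoreID,
		})
	}
	return records
}
```

Without the `r.Leader == nil` check, reading `r.Leader.StoreID` on a leaderless region would raise the same kind of "invalid memory address or nil pointer dereference" runtime error seen in the PD log.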

| username: Timber

v6.3.0 has a fix.

| username: 有猫万事足

It is recommended to use the LTS version in the production environment.

“Development Milestone Releases (DMR) are released approximately every two months. If an LTS release occurs, the DMR release time will be postponed by two months. DMR introduces new features, improvements, and fixes. However, TiDB does not provide patch versions based on DMR, and any related bugs will be gradually fixed in subsequent version series.”

The problem with DMR versions is that they receive no follow-up patch releases, so they are not recommended for production environments.
