A TiKV Node in a TiDB Cluster Fails to Start with a PermissionDenied Error

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb集群中某一台tikv启动不了,报权限问题PermissionDenied

| username: 飞-田鼠

【TiDB Environment】Testing
【TiDB Version】v5.4.0
【Encountered Problem】
tikv.log log content:
[2022/08/12 19:26:02.636 +08:00] [FATAL] [lib.rs:465] ["called Result::unwrap() on an Err value: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"] [backtrace=" 0: tikv_util::set_panic_hook::{{closure}}
at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/lib.rs:464:18
1: std::panicking::rust_panic_with_hook
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:626:17
2: std::panicking::begin_panic_handler::{{closure}}
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:519:13
3: std::sys_common::backtrace::__rust_end_short_backtrace
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:141:18
4: rust_begin_unwind
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:515:5
5: core::panicking::panic_fmt
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/panicking.rs:92:14
6: core::result::unwrap_failed
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/result.rs:1599:5
7: core::result::Result<T,E>::unwrap
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/result.rs:1281:23
server::server::TiKVServer::check_conflict_addr
at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/server/src/server.rs:370:22
server::server::run_tikv
at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/server/src/server.rs:155:9
8: tikv_server::main
at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/cmd/tikv-server/src/main.rs:190:5
9: core::ops::function::FnOnce::call_once
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/ops/function.rs:227:5
std::sys_common::backtrace::__rust_begin_short_backtrace
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:125:18
10: main
11: __libc_start_main
12:
"] [location=components/server/src/server.rs:370] [thread_name=main]

【Reproduction Path】
The cluster has 4 machines in total. Only the TiKV instance on the management host fails to start; the other 3 machines start normally. Adding TiFlash to the management host triggers the same permission error, and that service cannot start either. It looks like a disk access permission problem for the user, yet other services on the same host, such as TiDB and PD, start normally and write their logs as expected. It's very strange. Does anyone know the reason?

| username: xfworld

Were the user and its permissions created through tiup?

All the TiFlash instances are in Tombstone status and can be cleaned up.

| username: forever

Check whether the directory permissions on node 235 match those on the other nodes.

| username: 飞-田鼠

The permissions are the same. I added the node by scaling out. The directories and logs were all created, but the process could not start, and the log shows the error above. The same happens with TiFlash, which is why its status is Tombstone. It can only be removed with --force, and several attempts gave the same result.

| username: 飞-田鼠

The permissions are the same, and even setting them to 777 does not help. I don't understand why.

| username: 飞-田鼠

All were created using tiup, and TiFlash also encountered this permission issue, causing it to fail to start.

| username: 飞-田鼠

I’ll bump this up. Has anyone encountered this before?

| username: xfworld

How about taking two new machines, adding them to the cluster as new nodes, and trying again? (Just take the nodes that had issues offline.)

| username: 飞-田鼠

I think this is a permission issue with the tidb user. When I run run_tikv.sh under tikv-deploy directly as root, it executes and the status shows as Up. Files are also generated in the /tikv-data directory, but they are owned by root. The process cannot be started as the tidb user. Is there something wrong with the tidb user's permissions?
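
The situation described above, where a run as root leaves root-owned files inside a directory that the tidb user must write to later, can be checked with a short script. The following is a minimal sketch, not part of TiUP or TiKV; the helper name find_foreign_owned is hypothetical, and it simply walks a directory tree and lists every entry whose owner differs from the expected UID.

```python
import os


def find_foreign_owned(root, expected_uid):
    """Return paths under `root` (including `root`) not owned by `expected_uid`.

    Hypothetical helper for illustration only. Directories that the current
    user cannot read are skipped silently, which is os.walk's default.
    """
    mismatched = []
    if os.lstat(root).st_uid != expected_uid:
        mismatched.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat reports a symlink's own owner instead of its target's.
            if os.lstat(path).st_uid != expected_uid:
                mismatched.append(path)
    return mismatched
```

For example, calling find_foreign_owned("/tikv-data", pwd.getpwnam("tidb").pw_uid) after importing pwd would list whatever the root run left behind, assuming the account is named tidb and the data directory is /tikv-data as in the post.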

| username: xfworld

That is still an authorization issue. By default, the tiup command is executed as root; it creates the tidb user along with the corresponding data and log directories, and grants the appropriate permissions and groups in one pass.

If both root and tidb are involved, the ownership has become inconsistent. Does that make sense?

| username: 飞-田鼠

I installed TiDB under the tidb account, and tiup was also run under the tidb account. All files and directories belong to tidb as well. Everything works with the tidb account except TiKV and TiFlash, which fail to start; all other processes are normal.

| username: 飞-田鼠

It has started successfully, although I can't fully explain why. I did two things. First, I ran chmod +777 -R /tmp, which sets everything under /tmp to 777. Second, in /etc/security/limits.conf I changed the stack limits for the tidb user to 10485760:
tidb soft stack 10485760
tidb hard stack 10485760
I suspect the cause was a permission problem under /tmp: TiKV and TiFlash probably need to work in some folder under /tmp when they start. For TiKV that folder is known to be 1002_TIKV_LOCK_FILES. However, before I changed the permissions, /tmp itself was already 777, and no 1002_TIKV_LOCK_FILES directory existed for TiKV. I'm recording this here so that anyone who runs into the same problem can give it a try.
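
The guess above, that startup fails because the tidb user cannot create or open something under the temporary directory, can be tested directly before changing any permissions. The following is a minimal sketch under that assumption. probe_dir and probe_startup_dirs are hypothetical helpers, not TiKV or TiUP tools; the directory name 1002_TIKV_LOCK_FILES is taken from the post, and the use of TMPDIR with a /tmp fallback is an assumption about where the process looks.

```python
import os
import stat
import tempfile


def probe_dir(path):
    """Report owner, mode and real writability of `path` for the current user."""
    report = {"path": path, "exists": os.path.isdir(path)}
    if not report["exists"]:
        return report
    st = os.stat(path)
    report["owner_uid"] = st.st_uid
    report["mode"] = oct(stat.S_IMODE(st.st_mode))
    try:
        # Creating a real file is a stronger check than reading mode bits:
        # it also catches ACLs, read-only mounts and similar restrictions.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
        report["can_create_file"] = True
    except OSError as err:
        report["can_create_file"] = False
        report["error"] = str(err)
    return report


def probe_startup_dirs(lock_dir_name="1002_TIKV_LOCK_FILES"):
    """Probe the temp directory and the lock directory named in the post."""
    # Assumption: the temp directory is $TMPDIR if set, otherwise /tmp.
    tmp = os.environ.get("TMPDIR", "/tmp")
    return [probe_dir(tmp), probe_dir(os.path.join(tmp, lock_dir_name))]
```

Run as the tidb user, a result with can_create_file set to False (or a lock directory owned by another UID) would point at a specific path to fix, which is narrower than applying 777 to all of /tmp.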

| username: alfred

It is generally a user permission issue; setting 777 is a bit excessive.
