TiKV fails to start: nofile limit too low, and the configured value does not take effect

translator_bot · June 22, 2024, 9:49pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv启动失败，nofile数量太少，配置后不生效

| username: TiDBer_ZfFjmcZo

[TiDB Usage Environment] Production Environment / Testing / Poc
[TiDB Version] v6.4.0
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Issue Phenomenon and Impact] tikv failed to start
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: xfworld

Have you tried rebooting the system after changing the configuration?

| username: TiDBer_ZfFjmcZo

I tried that, and it didn't work.

| username: xfworld

How did you install the cluster? Didn’t you use tiup?

| username: 我是咖啡哥

Check whether this file contains the following line:

session required pam_limits.so

| username: 我是咖啡哥

For the settings in limits.conf to take effect, the pam_limits.so module must be enabled in the PAM configuration used at login.
Check that the /etc/pam.d/login file contains the following line (add it if it is missing):
session required /lib64/security/pam_limits.so

Note:
Many people edit limits.conf without testing it, and only later find that it has not taken effect.
The usual cause is that this line is missing or uses the wrong module path for the system:

For 32-bit systems:
session required /lib/security/pam_limits.so
For 64-bit systems:
session required /lib64/security/pam_limits.so
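
A minimal sketch of that check, assuming a Linux host with PAM configuration under /etc/pam.d. Only /etc/pam.d/login comes from the post above; the other two paths and the has_pam_limits helper are illustrative assumptions, and which file matters depends on the distribution and on how the tidb user's sessions are started.

```shell
#!/bin/sh
# has_pam_limits FILE (hypothetical helper): succeeds if FILE contains an
# uncommented "session ... pam_limits.so" line.
has_pam_limits() {
    grep -Eq '^[[:space:]]*-?session[[:space:]]+[^#]*pam_limits\.so' "$1"
}

# /etc/pam.d/login is from the post above; sshd and system-auth are assumed
# examples and may not exist on every distribution.
for f in /etc/pam.d/login /etc/pam.d/sshd /etc/pam.d/system-auth; do
    [ -r "$f" ] || continue
    if has_pam_limits "$f"; then
        echo "$f: pam_limits.so is enabled"
    else
        # The module may still be pulled in through an include in this file.
        echo "$f: no active pam_limits.so session line in this file"
    fi
done
```

A file that reports no active line may still load the module indirectly through an include, so treat the output as a starting point rather than a verdict.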

| username: tidb菜鸟一只

Run ulimit -a as the tidb user to check whether the nofile setting has actually been applied.
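
A small sketch of that check, meant to be run in a shell opened as the tidb user. The has_enough helper and the REQUIRED variable are hypothetical names; the default of 82920 is the figure from the TiKV log that is quoted later in this thread, so adjust it to whatever your own log reports.

```shell
#!/bin/sh
# has_enough LIMIT REQUIRED (hypothetical helper): succeeds if LIMIT is
# "unlimited" or a number greater than or equal to REQUIRED.
has_enough() {
    [ "$1" = "unlimited" ] && return 0
    [ "$1" -ge "$2" ]
}

# 82920 is the value quoted from the log in this thread (an assumption for
# any other cluster); override it with REQUIRED=... if needed.
required="${REQUIRED:-82920}"
soft="$(ulimit -Sn)"   # soft limit: the one the process actually runs with
hard="$(ulimit -Hn)"   # hard limit: the ceiling the soft limit can be raised to
echo "soft nofile: $soft, hard nofile: $hard, required: $required"

if has_enough "$soft" "$required"; then
    echo "soft nofile limit is sufficient"
else
    echo "soft nofile limit is too low"
fi
```

Note that this only reports the limits of the shell it runs in. A process started some other way than through that login shell may run with different limits.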

| username: TiDBer_ZfFjmcZo

It was installed using tiup.

| username: xfworld

Is there only this TiKV service instance on this node? Or are there other services as well?

I’m quite puzzled as to why the number of file handles needs to be so high…
65536 is already a lot, but the logs indicate it needs 82920…

| username: TiDBer_ZfFjmcZo

It has taken effect. The output of ulimit -n shows 100000.

| username: TiDBer_ZfFjmcZo

I’m not sure. Isn’t 82920 the number of handles TiKV requires? Elasticsearch, for example, requires more than 65535 handles, so I assumed TiKV would be similar.

| username: xfworld

When you install with tiup, it tunes these parameters for you. It also has a self-check function, and if the parameters are not set correctly, you cannot proceed to the next step…

I suggest you first check the environment of this node. Without a clear picture of that, there is not much more we can do.

| username: ohammer

You can first execute tiup cluster check to check for potential risks in the cluster.

| username: TiDBer_ZfFjmcZo

Tried it, none of them worked.

| username: TiDBer_ZfFjmcZo

I ran the check, and everything passed except for the CPU mode. I also changed the nofile value for TiDB, which had been set automatically to 1,000,000, to 100,000. That should be fine, right?

| username: ohammer

Looking more closely, the issue here seems to be the number of file descriptors rather than the number of open files. The two are different, as explained in this article: "The difference between open file descriptors and open files on Linux" (CSDN blog). You can run sysctl -a | grep file and check the value it returns, which should be the 65536 shown in your logs. To change it, either edit the sysctl.conf file or apply the change immediately with the echo command: echo 6553560 > /proc/sys/fs/file-max.
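
To see the two kinds of limit side by side, here is a small sketch, assuming a Linux host with /proc mounted. fs.file-max is the system-wide ceiling on open file handles, while ulimit -n is the per-process limit. The pct_used helper is a hypothetical name used only for illustration.

```shell
#!/bin/sh
# pct_used ALLOCATED MAX (hypothetical helper): integer percentage of the
# system-wide handle limit that is currently allocated.
pct_used() {
    echo $(( $1 * 100 / $2 ))
}

# Per-process limit for this shell (what limits.conf and ulimit control).
echo "per-process nofile (soft): $(ulimit -Sn)"

# System-wide limit (what fs.file-max in sysctl.conf controls).
if [ -r /proc/sys/fs/file-nr ]; then
    # file-nr holds three fields: allocated handles, unused allocated
    # handles, and the maximum (the same value as fs.file-max).
    read -r allocated unused max < /proc/sys/fs/file-nr
    echo "system-wide: $allocated of $max handles allocated ($(pct_used "$allocated" "$max")%)"
fi
```

If the per-process value is below what the log asks for while the system-wide value is far above it, raising fs.file-max alone is unlikely to help.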

| username: TiDBer_ZfFjmcZo

65536 is the nofile value I configured earlier, which is why it appears. I also checked the settings in /etc/sysctl.conf, and they are all set to 102400. I did not find the value 65536 anywhere else.

| username: ohammer

I see. In any case, the deployment documentation does require both of these settings, so please check them again. I’ll try changing the parameters in a test environment.

| username: system

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.