TiSpark GC Not Working Properly

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tispark gc 无法正常工作 (TiSpark GC is not working properly)

| username: TiDBer_KkruFifg

【TiDB Usage Environment】Production Environment / Testing / PoC
【TiDB Version】
【Reproduction Path】What operations were performed when the issue occurred
【Encountered Issues: Issue Phenomenon and Impact】
【Resource Configuration】Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots / Logs / Monitoring】

  • Cluster Information
    tidb version v5.1.4
    tispark version v3.2.2, upgraded from 2.5

  • Encountered Issues
Update performance on the cluster is poor. After analysis and troubleshooting, we found that GC is not working properly. Below is the GC information:
    tikv_gc_last_run_time | 20231211-15:35:17 +0800
    tikv_gc_safe_point | 20231109-15:34:12 +0800
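    For reference, these GC variables live in the mysql.tidb system table and can be read from any SQL client; a minimal sketch (host, port, and user are placeholders):

    # tikv_gc_last_run_time and tikv_gc_safe_point are among the rows returned
    mysql -h <tidb-host> -P 4000 -u root -p -e "SELECT VARIABLE_NAME, VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME LIKE 'tikv_gc%'"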

Below is the information found using pd-ctl:
» service-gc-safepoint
{
  "service_gc_safe_points": [
    {
      "service_id": "gc_worker",
      "expired_at": 9223372036854775807,
      "safe_point": 446241575371538432
    },
    {
      "service_id": "ticdc",
      "expired_at": 1702363646,
      "safe_point": 446236079548530688
    },
    {
      "service_id": "tispark_161d033e-79ed-4d86-920e-2421541cbcee",
      "expired_at": 1702277526,
      "safe_point": 445841422866710913
    },
    {
      "service_id": "tispark_31b3b228-3e1c-4719-b103-8313a8d54ed6",
      "expired_at": 1702277538,
      "safe_point": 445786124389712129
    },
    {
      "service_id": "tispark_36ce4519-3041-4d11-a23f-db53a6f484eb",
      "expired_at": 1702277568,
      "safe_point": 445991471693693035
    },
    {
      "service_id": "tispark_4c225448-330f-448f-bd32-9a858963888c",
      "expired_at": 1702277584,
      "safe_point": 445518167438000131
    },
    {
      "service_id": "tispark_af3b67b4-8e0c-42b2-8d0f-86bd77091e6a",
      "expired_at": 1702277569,
      "safe_point": 445991471995158529
    },
    {
      "service_id": "tispark_c42ced32-87be-4ac5-b035-9c75fd8e1e3e",
      "expired_at": 1702277571,
      "safe_point": 445991472873340930
    },
    {
      "service_id": "tispark_cbe81364-7e10-402c-b41d-709469892090",
      "expired_at": 1702277571,
      "safe_point": 445518384558768521
    },
    {
      "service_id": "tispark_f58f0569-6e9b-41b6-93f8-c311ee179142",
      "expired_at": 1702277564,
      "safe_point": 445536642717974864
    },
    {
      "service_id": "tispark_fdd4b490-6675-4168-b812-7c25b79d6858",
      "expired_at": 1702277580,
      "safe_point": 445517726367612929
    }
  ],
  "gc_safe_point": 445517726367612929
}

Below are the timestamp conversions for the TSOs above:
tso 446241575371538432 # 2023-12-11 14:35:17.378 +0800 +08
tso 446236079548530688 # 2023-12-11 08:45:52.477 +0800 +08
tso 445991471693693035 # 2023-11-30 13:34:07.562 +0800 +08
tso 445991471995158529 # 2023-11-30 13:34:08.712 +0800 +08
tso 445991472873340930 # 2023-11-30 13:34:12.062 +0800 +08
tso 445841422866710913 # 2023-11-23 22:34:16.712 +0800 +08
tso 445786124389712129 # 2023-11-21 11:58:29.763 +0800 +08
tso 445536642717974864 # 2023-11-10 11:36:52.712 +0800 +08
tso 445517726367612929 # 2023-11-09 15:34:12.562 +0800 +08
tso 445518167438000131 # 2023-11-09 16:02:15.112 +0800 +08
tso 445518384558768521 # 2023-11-09 16:16:03.362 +0800 +08
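
Note that the cluster-wide gc_safe_point (445517726367612929, i.e. 2023-11-09 15:34:12) is simply the minimum of all service safepoints, here pinned by the stale tispark_fdd4b490 entry. If pd-ctl is not at hand, a TSO can also be decoded manually; a minimal shell sketch (GNU date assumed):

# A TSO's low 18 bits are a logical counter; the rest is a physical
# timestamp in milliseconds since the Unix epoch
tso=445517726367612929
ms=$(( tso >> 18 ))
date -d "@$(( ms / 1000 ))" '+%Y-%m-%d %H:%M:%S %z'   # 2023-11-09 15:34:12 +0800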

Below are some tidb logs:
[2023/12/11 15:56:18.370 +08:00] [INFO] [gc_worker.go:292] ["[gc worker] there's already a gc job running, skipped"] ["leaderTick on"=630cd6166400013]
[2023/12/11 15:57:17.373 +08:00] [INFO] [gc_worker.go:292] ["[gc worker] there's already a gc job running, skipped"] ["leaderTick on"=630cd6166400013]
[2023/12/11 15:58:17.424 +08:00] [INFO] [gc_worker.go:292] ["[gc worker] there's already a gc job running, skipped"] ["leaderTick on"=630cd6166400013]
[2023/12/11 15:59:17.662 +08:00] [INFO] [gc_worker.go:292] ["[gc worker] there's already a gc job running, skipped"] ["leaderTick on"=630cd6166400013]
[2023/12/11 16:00:17.450 +08:00] [INFO] [gc_worker.go:292] ["[gc worker] there's already a gc job running, skipped"] ["leaderTick on"=630cd6166400013]
[2023/12/11 16:00:18.957 +08:00] [INFO] [gc_worker.go:1046] ["[gc worker] finish resolve locks"] [uuid=630cd6166400013] [safePoint=445517726367612929] [regions=1450275]
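
Whether that long-running round ever completes can be watched in the same log; a quick sketch, assuming the default tidb.log location:

# Repeated "skipped" lines with no eventual "finish resolve locks" or
# safe point advance mean the current GC round is stuck
grep '\[gc worker\]' tidb.log | tail -n 50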

Here are a few questions; please take a look, thank you:

  1. How can I find the corresponding changefeed-id from the service_id output by pd-ctl? Knowing this would make task deletion easier. (See the sketch after this list.)
  2. How can TiSpark tasks be deleted? There seems to be no way to do it. This has also been raised on asktug without a solution; please take a look, as it would help others hitting the same issue.
  3. Why does GC default to retaining data for 24 hours, yet data from a month ago still has not been GC'd? Below is the explanation from the official documentation:
    TiCDC FAQs | PingCAP Docs
    The second behavior above was added in TiCDC v4.0.13 and later versions. Its purpose is to prevent a TiCDC replication task from stalling for so long that the upstream TiKV cluster's GC safepoint cannot advance, retaining too many old data versions and degrading the upstream cluster's performance.
    Column - TiSpark v3.0.3 & v3.1.3 Released | TiDB Community
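
On question 1: the pd-ctl output above has only a single shared "ticdc" entry, so mapping it to a changefeed means listing the changefeeds and comparing their checkpoint TSOs; a minimal sketch (the PD address is a placeholder):

# The changefeed whose checkpoint TSO matches the "ticdc" service
# safepoint (446236079548530688) is the one holding it back
tiup cdc:v5.1.0 cli changefeed list --pd=http://<pd-host>:2379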
| username: dba远航 | Original post link

From the repeated "[gc worker] there's already a gc job running, skipped" lines, it can be concluded that GC is not failing to run; rather, one GC round has been running continuously without finishing. A large volume of changes may be preventing it from ever completing.

| username: Jellybean | Original post link

Regarding the issue of TiSpark holding back the GC safe point, see whether this post helps:

| username: TiDBer_KkruFifg | Original post link

Hello, I have read the post you shared carefully, but it does not provide a solution. Specifically:
We use the tispark jar bundled into our Spark jobs. How can the tispark task be deleted in this setup?
Can it be removed through pd-ctl or some other method? Thanks.
I know how to delete CDC tasks; the official documentation gives clear instructions.
But for tispark tasks, the official documentation says nothing.
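
For comparison, the documented TiCDC deletion is a one-liner (changefeed-id and PD address are placeholders); nothing analogous is documented for TiSpark:

# Remove a TiCDC changefeed; no TiSpark equivalent is documented
tiup cdc:v5.1.0 cli changefeed remove --pd=http://<pd-host>:2379 --changefeed-id=<changefeed-id>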

| username: TiDBer_KkruFifg | Original post link

  • pd-ctl does not support deleting service safepoints in version 5.1.4 (see the hedged sketch after this list)

  • The tispark service_id is randomly generated, so nothing can be traced back from it

  • Please advise how to delete the tispark task so that tikv_gc_safe_point can advance normally, thank you
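
A hedged note: later pd-ctl releases reportedly support deleting a service safepoint directly; whether your PD version has it is an assumption to verify, not a guarantee:

» service-gc-safepoint delete tispark_fdd4b490-6675-4168-b812-7c25b79d6858
# reportedly removes the stale TiSpark entry so gc_safe_point can advance
# (not available in v5.1.4, as noted above)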

| username: TiDBer_KkruFifg | Original post link

@neilshen Does TiSpark have a command similar to the CDC one shown below?

tiup cdc:v5.1.0 cli --pd=<PD_ADDRESS> unsafe reset

| username: Jellybean | Original post link

TiSpark is just a thin connector layer between TiKV and the Spark cluster. The tasks you submit run on the Spark side, so looking up and deleting tasks is also done there, on the client side.

Are you using YARN to manage Spark cluster tasks? If so, go to the YARN master node or the management UI, find the corresponding application ID, and run yarn application -kill <application-id> to terminate the task. The process is similar for tasks submitted with spark-submit.
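
A minimal sketch of that flow (the application id below is hypothetical):

# Find the TiSpark job among running YARN applications, then kill it
yarn application -list -appStates RUNNING | grep -i spark
yarn application -kill application_1700000000000_0042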

| username: tidb狂热爱好者 | Original post link

Unsafe reset seems to be for resetting CDC.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.