Issues with TiDB Upgrades

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb升级问题 (TiDB upgrade questions)

| username: l940399478

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.1.4
If an error that cannot be resolved right away occurs during the TiDB upgrade, can I roll back by replacing each component's bin directory with the one from the original version and restarting the cluster? What exactly changes during a TiDB upgrade?

| username: l940399478 | Original post link

Planning to upgrade from v5.1.4 to v6.5.2.

| username: 像风一样的男子 | Original post link

Check the official upgrade guide.

Also, why is it 6.5.2 instead of 6.5.5?

| username: l940399478 | Original post link

A previous cluster was upgraded to 6.5.2 because 6.5.5 was not available at the time. To keep versions consistent, we are only going to 6.5.2 this time as well.

| username: l940399478 | Original post link

I have done upgrades before, but I never really understood what actually changes during the process. On the surface, it just looks like the files in the bin directory get replaced.

| username: Fly-bird | Original post link

The cluster can be upgraded automatically; during the upgrade, the existing installation packages are backed up automatically. You can also download and install the packages manually, but that is more troublesome.
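
For reference, a minimal sketch of an automatic rolling upgrade with TiUP; the cluster name and target version below are placeholders, and you should verify the exact steps against the official upgrade guide for your versions:

```shell
# Update TiUP itself and the cluster component before upgrading
tiup update --self
tiup update cluster

# Check the current topology, component versions, and node health
tiup cluster display <cluster-name>

# Rolling upgrade to the target version (nodes are restarted one by one)
tiup cluster upgrade <cluster-name> v6.5.2

# Confirm the new version after the upgrade completes
tiup cluster display <cluster-name>
```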

| username: 啦啦啦啦啦 | Original post link

Doing that is not recommended. If you need a rollback plan, it is better to set up a new cluster on the target version, use TiCDC to synchronize data to it, and then switch traffic over; that way you can still roll back if something goes wrong.
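
As a rough illustration of that rollback plan, assuming the new cluster is already running the target version; the PD address, sink URI, and changefeed ID below are placeholders, and flag names can differ between TiCDC versions:

```shell
# Create a changefeed that replicates from the old cluster to the new one
tiup cdc cli changefeed create \
  --pd=http://<old-pd-ip>:2379 \
  --sink-uri="mysql://<user>:<password>@<new-tidb-ip>:4000/" \
  --changefeed-id="upgrade-sync"

# Check replication progress (checkpoint) before switching traffic over
tiup cdc cli changefeed query \
  --pd=http://<old-pd-ip>:2379 \
  --changefeed-id="upgrade-sync"
```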

| username: tidb菜鸟一只 | Original post link

With that method, after upgrading from 5.1 to 6.5 you probably cannot just swap the TiFlash component back to the old binaries. If you have spare machines, it is recommended to set up two clusters and upgrade with TiCDC synchronization between them. If not, back up the data first, so that if you do have to roll back you can restore from the backup, which is the safer option.

| username: xmlianfeng | Original post link

I upgraded a production cluster from 5.0.6 to 6.1.5 and then to 6.5.3, and it almost went badly wrong.
If the cluster's data volume is not large, I suggest doing what the previous poster said: create a new cluster, restore the data into it with BR, and then use TiCDC to catch up on incremental changes.
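
A rough sketch of that flow with BR; the PD addresses and storage path are placeholders, and the BR version needs to be compatible with both clusters, so treat this only as an outline:

```shell
# Take a full backup of the old cluster to shared storage (S3, NFS, etc.)
tiup br backup full \
  --pd "<old-pd-ip>:2379" \
  --storage "s3://<bucket>/<path>"

# Restore the backup into the new cluster running the target version
tiup br restore full \
  --pd "<new-pd-ip>:2379" \
  --storage "s3://<bucket>/<path>"

# Afterwards, create a TiCDC changefeed with --start-ts set to the backup
# timestamp (printed in BR's output) to replicate the incremental changes.
```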

| username: 像风一样的男子 | Original post link

If the data is not large, make a backup in advance and schedule a downtime for the upgrade.

| username: TiDBer_小阿飞 | Original post link

Improvements

  • TiDB
    • Improved the speed of TRUNCATE operations on partitioned tables with Placement Rules #43070 @Lloyd-Pottiger
    • Avoided unnecessary Stale Read retries after resolving locks #43659 @you06
    • Used leader read to reduce latency when encountering DataIsNotReady errors in Stale Read #765 @Tema
    • Added Stale Read OPS and Stale Read MBps metrics for monitoring hit rate and traffic in Stale Read #43325 @you06
  • TiKV
    • Used gzip compression for check_leader requests to reduce traffic #14839 @cfzjywxk
  • PD
    • Used a separate gRPC connection for PD Leader election to prevent interference from other requests #6403 @rleungx
  • Tools
    • TiCDC
      • Optimized TiCDC’s handling of DDL to prevent it from blocking unrelated DML events and reduce memory usage #8106 @asddongmen
      • Adjusted the Decoder interface to add a new method AddKeyValue #8861 @3AceShowHand
      • Optimized the directory structure when a DDL event occurs while synchronizing data to object storage #8890 @CharlesCheung96
      • Supported synchronization to Kafka-on-Pulsar downstream #8892 @hi-rustin
      • Supported OAuth protocol verification when synchronizing data to Kafka #8865 @hi-rustin
      • Optimized the handling of UPDATE statements when synchronizing data using Avro or CSV protocols by splitting them into DELETE and INSERT statements, allowing users to obtain the old value from the DELETE statement #9086 @3AceShowHand
      • Added a configuration item insecure-skip-verify to control whether to set the authentication algorithm when enabling TLS in the Kafka synchronization scenario #8867 @hi-rustin
      • Optimized DDL synchronization operations in TiCDC to reduce the impact of DDL operations on downstream latency #8686 @hi-rustin
      • Optimized how TiCDC sets the upstream GC TTL when synchronization tasks fail #8403 @charleszheng44
    • TiDB Lightning
      • Added a retry mechanism when encountering unknown RPC errors during data import #43291 @D3Hunter
    • TiDB Binlog
      • Optimized the method of obtaining table information to reduce Drainer’s initialization time and memory usage #1137 @lichunzhu

Bug Fixes

  • TiDB
    • Fixed an issue with incorrect results in min, max queries #43805 @wshwsh12
    • Fixed an issue with incorrect execution plans when pushing down window function calculations to TiFlash #43922 @gengliqi
    • Fixed an issue where queries using CTE could cause TiDB to hang #43749 #36896 @guo-shaoge
    • Fixed an issue where using the AES_DECRYPT expression could cause a runtime error: index out of range SQL error #43063 @lcwangchao
    • Fixed an issue where the SHOW PROCESSLIST statement could not display the TxnStart of long-running subqueries #40851 @crazycs520
    • Fixed an issue where PD isolation could cause running DDL to be blocked #44014 #43755 #44267 @wjhuang2016
    • Fixed an issue where TiDB could panic when using UNION queries combining views and temporary tables #42563 @lcwangchao
    • Fixed an issue with Placement Rule behavior under partitioned tables, ensuring deleted partition Placement Rules can be correctly set and recycled #44116 @lcwangchao
    • Fixed an issue where truncating a partitioned table could invalidate the partition’s Placement Rule #44031 @lcwangchao
    • Fixed an issue where TiCDC could lose some row changes during table renaming #43338 @tangenta
    • Fixed an issue where DDL job history records could be lost after importing tables using BR #43725 @tangenta
    • Fixed an issue where JSON_OBJECT could error out in certain situations #39806 @YangKeao
    • Fixed an issue where clusters in IPv6 environments could not query some system views #43286 @Defined2014 @nexustar
    • Fixed an issue where ID allocation for AUTO_INCREMENT columns could be blocked for a long time when PD member addresses changed #42643 @tiancaiamao
    • Fixed an issue where TiDB could send duplicate requests to PD when recycling placement rules, causing a large number of full config reset logs in PD #33069 @tiancaiamao
    • Fixed an issue where the SHOW PRIVILEGES command displayed an incomplete list of privileges #40591 @CbcWestwolf
    • Fixed an issue where ADMIN SHOW DDL JOBS LIMIT returned incorrect results #42298 @CbcWestwolf
    • Fixed an issue where user creation failed when tidb_auth_token users were checked under password strength verification #44098 @CbcWestwolf
    • Fixed an issue where partitions could not be found when inner joining tables in dynamic pruning mode #43686 @mjonss
    • Fixed an issue where MODIFY COLUMN on partitioned tables could output Data Truncated related errors #41118 @mjonss
    • Fixed an issue where TiDB addresses were displayed incorrectly in IPv6 environments #43260 @nexustar
    • Fixed an issue where CTE results could be incorrect under predicate pushdown #43645 @winoros
    • Fixed an issue where using common table expressions (CTE) with non-correlated subqueries could lead to incorrect results #44051 @winoros
    • Fixed an issue where Join Reorder could cause incorrect results for Outer Join #44314 @AilinKid
    • Fixed an issue where resolving locks in pessimistic transactions could affect transaction correctness when the first statement of the transaction was retried #42937 @MyonKeminta
    • Fixed an issue where residual pessimistic locks in pessimistic transactions could affect data correctness during GC resolve lock in rare cases #43243 @MyonKeminta
    • Fixed an issue where batch cop scan detail information was inaccurate during execution #41582 @you06
    • Fixed an issue where TiDB could not read data updates when using Stale Read and PREPARE statements simultaneously #43044 @you06
    • Fixed an issue where LOAD DATA statements could incorrectly report assertion failed #43849 @you06
    • Fixed an issue where Stale Read could not fallback to the leader when the coprocessor encountered region data not ready #43365 @you06
  • TiKV
    • Fixed a file handle leak issue in Continuous Profiling #14224 @tabokie
    • Fixed an issue where PD crashes could prevent PITR from progressing #14184 @YuJuncen
    • Fixed an issue where encryption Key ID conflicts could cause old keys to be deleted #14585 @tabokie
    • Fixed an issue where autocommit and point get replica read could break linear consistency #14715 @cfzjywxk
    • Fixed an issue where cumulative Lock records could degrade performance when upgrading clusters from lower versions to v6.5 or higher #14780 @MyonKeminta
    • Fixed an issue where TiDB Lightning could cause SST file leaks #14745 @YuJuncen
    • Fixed an issue where potential conflicts between encryption keys and raft log file deletions could prevent TiKV from starting #14761 @Connor1996
  • TiFlash
    • Fixed a performance degradation issue with the TableScan operator on partitioned tables during Region migration #7519 @Lloyd-Pottiger
    • Fixed an issue where queries could error out when GENERATED columns and TIMESTAMP or TIME types coexisted in TiFlash #7468 @Lloyd-Pottiger
    • Fixed an issue where large update transactions could cause TiFlash to repeatedly error and restart #7316 @JaySon-Huang
    • Fixed an issue where INSERT SELECT statements could error out with “Truncate error cast decimal as decimal” when reading data from TiFlash #7348 @windtalker
    • Fixed an issue where queries with large data on the Join build side and many small string type columns could consume more memory than necessary #7416 @yibin87
  • Tools
    • Backup & Restore (BR)
      • Fixed an issue where BR’s error message “resolve lock timeout” was misleading and masked the actual error when backup failed #43236 @YuJuncen
    • TiCDC
      • Fixed an OOM issue when the number of tables reached 50,000 #7872 @sdojjy
      • Fixed an issue where TiCDC could hang when upstream TiDB experienced OOM #8561 @overvenus
      • Fixed an issue where TiCDC could hang during PD network isolation or PD Owner node restart #8808 #8812 #8877 @asddongmen
      • Fixed a timezone setting issue in TiCDC #8798 @hi-rustin
      • Fixed an issue where checkpoint lag increased when upstream TiKV nodes crashed #8858 @hicqu
      • Fixed an issue where synchronization to downstream MySQL could fail after executing FLASHBACK CLUSTER TO TIMESTAMP in upstream TiDB #8040 @asddongmen
      • Fixed an issue where EXCHANGE PARTITION operations in upstream were not properly synchronized to downstream when synchronizing data to object storage #8914 @CharlesCheung96
      • Fixed an issue where the sorter component could use excessive memory in certain special scenarios, leading to OOM #8974 @hicqu
      • Fixed an issue where TiCDC queried downstream metadata too frequently, causing high load on downstream when the downstream was Kafka #8957 #8959 @hi-rustin
      • Fixed an issue where the message body was logged when synchronization failed due to large Kafka messages #9031 @darraes
      • Fixed an issue where TiCDC nodes could panic during downstream Kafka rolling restarts #9023

| username: 路在何chu | Original post link

Indeed, upgrading by migrating to a new cluster is relatively safe.

| username: zhanggame1 | Original post link

It is possible that the on-disk data format changes in some versions, not just the contents of the bin directory.

| username: wakaka | Original post link

No. Although it looks as if only the bin files are replaced, some logic and the underlying storage structures also change, so the upgrade cannot simply be rolled back.

| username: 喵父666 | Original post link

It’s better to migrate and upgrade.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.