PCTA Study Notes and Exercises

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: PCTA学习笔记练习

| username: TiDBer-lucy


1. What are the driving forces behind the development of data technology?

a. Data volume growth due to business development

b. The demand for “data interaction efficiency and data model diversity” driven by scenario innovation

c. Development of hardware and cloud computing

2. What problems did Google’s three classic papers — GFS, BigTable, and MapReduce — solve respectively?

GFS solved the distributed file system problem

BigTable solved the distributed key-value storage problem

MapReduce solved how to perform computation and analysis on distributed file systems and KV storage

3. The essence of data technology stacks and data products is:

To serve different business scenarios by combining a relatively fixed set of basic technical elements and making trade-offs (choices and balances) among data technologies and architectures

4. What are the main challenges of distributed technology?

  1. How to maximize the implementation of divide and conquer

  2. How to achieve global consistency

  3. How to perform fault tolerance and partial failure handling

  4. How to deal with unreliable networks and network partitions

5. What theories are related to TiKV?

Google Spanner’s multi-raft-group replica mechanism design

An implementation of the online DDL algorithm from the Google F1 paper

6. Characteristics of the distributed transaction model

Decentralized two-phase commit

Google Percolator transaction model

TiKV supports a complete transaction KV API

Default optimistic transaction model (version 3.0+ also supports pessimistic transaction model)

Default isolation level: Snapshot Isolation

7. What is a Coprocessor?

A coprocessor. It is the module in TiKV that reads data and performs computation close to the storage, allowing TiDB to push down operators such as filters and aggregations. Every TiKV storage node has a Coprocessor.
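As a sketch of what pushdown looks like in practice (the host, port, and table `test.t1` below are hypothetical placeholders), an `EXPLAIN` shows which operators run as `cop[tikv]` tasks on the Coprocessor:

```shell
# Hypothetical connection details; adjust host/port/credentials to your cluster.
# In the plan output, operators whose task column reads cop[tikv] are executed
# by the TiKV Coprocessor rather than by tidb-server.
mysql -h 127.0.0.1 -P 4000 -u root -e \
  "EXPLAIN SELECT COUNT(*) FROM test.t1 WHERE c1 > 100;"
```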

8. How to build an Online DDL algorithm?

a. There is no concept of sharding in TiDB, so the entire DDL process completes very quickly

b. Divide the DDL process into several states such as Public, Delete-only, Write-only, etc., synchronize and achieve consistency among multiple nodes in each state, and finally complete the entire DDL.

9. TiFlash row-column data synchronization?

TiFlash joins the Multi-Raft group as a Raft Learner, receiving data asynchronously. It does not participate in Raft elections, places very little extra load on TiKV, and avoids the split-brain problem in the replication group.

10. What are the characteristics of the MPP introduced to solve the AP computing power problem?

TiDB-Server acts as the entry node, deciding whether to compute via MPP mode based on cost

TiFlash acts as the MPP computing node

Subsequent introduction of MPP row storage reading capability

11. What HTAP features have been implemented?

TiDB introduced a real-time updated columnar engine, solving resource isolation and improving AP efficiency

Introduced MPP model on column storage, achieving SQL join pushdown and parallel processing

Achieved timeliness through Raft-based replication

Integrated big data ecosystem, such as TiSpark

TiDB’s cost-based optimizer (CBO) maintains cost models for both row and columnar storage and collects statistics from the different engines, so it can uniformly choose the best execution path

12. Six goals of distributed relational database design:

Horizontal scalability, strong consistency and high availability, standard SQL and transaction model, cloud-native, HTAP, ecosystem compatibility

13. The main reason for promoting the separation of computing and storage architecture is?

Hardware, especially the network

Storage Engine

14. Characteristics of the TiKV storage engine?

a. Finer-grained elastic scaling

b. High concurrent read and write

c. No data loss or errors

d. Multi-replica ensures consistency and high availability

e. Supports distributed transactions

The LSM-tree structure essentially trades space for lower write latency, replacing random writes with sequential writes

RocksDB is a very mature LSM-tree storage engine, supporting batch writes and lock-free snapshot reads

Range sharding can more efficiently scan data records

Range sharding can easily achieve automatic splitting and merging

Elasticity comes first: shards can be freely scheduled as needed

15. The architecture of the TiKV storage engine from bottom to top is?

RocksDB → Raft → Transaction → TiKV API → Coprocessor API

16. Multi-Raft makes replication groups more discrete

Roles include leader, follower, learner

Strong leader mode, read and write on the leader

Follower Read is supported since version 4.0

Region-based Multi-Raft allows a single table to have multiple write points at the same time. TiKV’s scheduling mechanism can recognize the physical topology of each node (IDC, rack, host) and apply scheduling constraints and placement bindings accordingly.

Computing Engine

17. What are the frontend functions of TiDB-Server?

a. Managing connections and account permissions

b. MySQL protocol encoding and decoding

c. Independent SQL execution

d. Database/table metadata and system variables

18. What are the backend functions of TiDB-Server?

a. Garbage collection (GC)

b. Executing DDL

c. Statistics management

d. SQL optimizer and executor

19. The architecture of TiDB-Server from top to bottom is?

MySQL Protocol Layer → Core SQL Layer (TiDB SQL Layer) → Storage Interface Layer (KV API, DistSQL API)


20. Automatic sharding technology is the basis of elastic scaling

a. Globally ordered KV map

b. Automatic sharding based on an equal-size strategy (96 MiB by default)

c. Each shard is a continuous KV, addressed by Start/End key

d. Fixed seek cost for each shard

e. Region is the smallest unit of replication and scheduling

Automatic splitting and merging

Default Region size: 96 MiB

A Region splits when it grows beyond 144 MiB

Regions smaller than 20 MiB are candidates for merging
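One way to observe these Regions on a running cluster (the table name `test.t1` below is a hypothetical placeholder) is the `SHOW TABLE ... REGIONS` statement, which lists each Region’s start/end key, leader, and approximate size:

```shell
# Hypothetical connection details and table; each output row describes one
# Region of the table, including START_KEY/END_KEY and the approximate size
# that split/merge decisions are based on.
mysql -h 127.0.0.1 -P 4000 -u root -e "SHOW TABLE test.t1 REGIONS\G"
```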

Cluster Management

21. TiDB Cluster startup sequence

PD → TiKV → TiDB → TiFlash

22. Which of the following statements about TiUP are correct? (B, D)

A. TiUP is a package manager introduced from TiDB 5.0

B. TiDB, PD, and TiKV components can all be managed by TiUP

C. When executing TiUP, the command and component cannot appear simultaneously

D. TiUP can not only be responsible for TiDB deployment but also for TiDB Cluster management
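A minimal sketch of TiUP acting as both a package manager and a cluster manager (the cluster name, version, and topology file below are placeholders):

```shell
# Package-manager side: install a component and run it; the command and
# component can appear together on one line.
tiup install bench
tiup bench tpcc --help

# Cluster-manager side: deploy and operate a whole TiDB cluster.
tiup cluster deploy demo v6.5.0 topology.yaml
tiup cluster start demo       # starts components in dependency order (PD first)
tiup cluster display demo     # show the status of every component
```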

23. What clients/GUI can connect to the TiDB database?

MySQL client, Mycli, MySQL Workbench, Navicat, phpMyAdmin
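For example, connecting with the standard MySQL client (host and credentials are placeholders; 4000 is tidb-server’s default port):

```shell
# TiDB speaks the MySQL wire protocol, so any MySQL-compatible client works.
mysql -h 127.0.0.1 -P 4000 -u root -p
```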

System Parameters/Cluster Parameters

24. Which of the following statements about TiDB system parameters are correct? (ACE)

A. System parameters have different scopes

B. System parameters refer to parameters about TiDB, TiKV, and PD that are persisted in KV

C. System parameters can be modified through the MySQL client

D. SESSION-level system parameter modifications need to reconnect to take effect

E. GLOBAL-level system parameters, after modification, are invalid for the current session

F. GLOBAL-level system parameters, after modification, are effective for the current session
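A sketch of the scope behavior, using `tidb_distsql_scan_concurrency` as an example of a system variable with both SESSION and GLOBAL scope (connection details are placeholders):

```shell
# SESSION change takes effect immediately for the current connection
# (no reconnect needed).
mysql -h 127.0.0.1 -P 4000 -u root -e \
  "SET SESSION tidb_distsql_scan_concurrency = 30;"

# GLOBAL change affects only sessions opened afterwards, not the session
# that issued it.
mysql -h 127.0.0.1 -P 4000 -u root -e \
  "SET GLOBAL tidb_distsql_scan_concurrency = 30;"
```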

25. Which of the following statements about TiDB cluster configuration parameters are correct? (A)

A. TiKV node cluster configuration parameter modifications need to restart the node to take effect

B. Cluster configuration parameters have scope ranges

C. Cluster configuration parameters can be modified using the MySQL client

D. Cluster configuration parameters are persisted in the configuration files of TiDB, TiKV, and PD, and some are persisted in KV storage

E. None of the above

User Management

26. Differences between roles and users include?

a. Roles are locked

b. Roles have no passwords

c. Stored in the mysql.user table

d. After a user logs in, they must run `SET ROLE ALL` to enable the roles granted to them

27. Which of the following statements about roles and user accounts are correct? (D)

A. Roles and user accounts are both composed of names and hostnames, and both can log in to the database

B. When creating a role, if the hostname is not specified, the default is localhost

C. When creating a user, if the hostname is %, it is meaningless

D. Roles have no passwords
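A sketch of the difference in SQL (role, user, and database names are placeholders): a role is created without a password and cannot log in; it only takes effect after being granted to a user and enabled:

```shell
# A role has no password and cannot log in; it is a named bundle of privileges.
mysql -h 127.0.0.1 -P 4000 -u root -e "
  CREATE ROLE 'app_read'@'%';
  GRANT SELECT ON app_db.* TO 'app_read'@'%';
  CREATE USER 'alice'@'%' IDENTIFIED BY 'secret';
  GRANT 'app_read'@'%' TO 'alice'@'%';"

# After logging in, the user must enable the granted roles explicitly.
mysql -h 127.0.0.1 -P 4000 -u alice -psecret -e \
  "SET ROLE ALL; SELECT CURRENT_ROLE();"
```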

File Management

28. Which of the following statements about TiDB database files are correct? (C)

A. Only TiDB and TiKV nodes have data information, PD nodes do not store data information

B. TiKV and TiDB configuration files are stored in the /tidb-deploy folder of PD nodes

C. KV metadata of PD nodes is stored in the /tidb-data directory

D. None of the above


29. Which of the following statements about TiDB monitoring are correct? (C)

A. Prometheus + Grafana + Dashboard is built into the PD component, providing monitoring and alerting services

B. TiDB’s alert handling requires additional third-party plugins

C. TiDB’s alert items are divided into three levels: emergency, critical, and warning

D. None of the above


30. Are the operations for scaling and shrinking TiDB/TiKV/PD/TiFlash basically the same? (B)

A. Correct

B. Incorrect

31. The command tiup cluster clean ${cluster-name} --log is used to clean cluster data while retaining log information? (B)

A. Correct

B. Incorrect
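The flag semantics are the other way around (cluster name is a placeholder): `--log` removes logs while retaining data, `--data` removes data while retaining logs, and `--all` removes both:

```shell
tiup cluster clean demo --log    # clean log files only; cluster data is retained
tiup cluster clean demo --data   # clean data only; log files are retained
tiup cluster clean demo --all    # clean both data and logs
```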

32. Which of the following statements about upgrading the TiDB cluster is incorrect? (C)

A. If the evict leader wait time is too long when upgrading the TiDB cluster, it can be resolved with the --force parameter

B. When upgrading the TiDB cluster, you can stop the cluster, upgrade the cluster, and then start the cluster

C. Upgrading the TiDB cluster must be done offline

D. Upgrading the TiDB cluster does not support rollback
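A sketch of the two upgrade paths with TiUP (cluster name and version are placeholders):

```shell
# Online (rolling) upgrade: leaders are evicted from each TiKV node before
# it restarts.
tiup cluster upgrade demo v6.5.0

# If evicting leaders takes too long, --force skips the wait; this causes
# query jitter during the upgrade.
tiup cluster upgrade demo v6.5.0 --force

# Offline upgrade: stop the cluster, upgrade with --offline, then start it.
tiup cluster stop demo
tiup cluster upgrade demo v6.5.0 --offline
tiup cluster start demo
```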

Backup and Restore

33. Which of the following statements about hot backup are correct? (BD)

A. Hot backup requires the database to be in read-only mode

B. Hot backup does not lock any user’s read and write operations

C. Compared to cold backup, hot backup is faster

D. In the TiDB ecosystem tools, BR’s backup is a hot backup

34. Regarding logical backup and physical backup, which of the following statements are correct? (C)

A. Both logical backup and physical backup can be applied to data migration between heterogeneous databases

B. Logical backup is more suitable for large data backups compared to physical backup

C. Both BR tools and operating system copies are physical backups

D. Logical backup is often more efficient than physical backup

35. Which of the following statements about BR are incorrect? (ABC)

A. BR tools are logical backups

B. When restoring with BR tools, the database needs to be set to read-only mode, otherwise data inconsistency may occur

C. When restoring BR tool backups, TiCDC can be used simultaneously to synchronize to downstream TiDB or MySQL databases

D. When the backup data volume is large, and the database cannot be stopped while requiring backup efficiency, BR tools have more advantages than Dumpling and operating system copies

36. Which backup method is supported by BR tools? (A)

A. Hot backup, physical backup

B. Cold backup, physical backup

C. Hot backup, logical backup

D. Cold backup, logical backup
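A sketch of a BR hot physical backup and restore (the PD address and storage path are placeholders):

```shell
# Full backup: copies SST data directly from TiKV while the cluster
# stays online (a hot, physical backup).
tiup br backup full --pd "127.0.0.1:2379" --storage "local:///tmp/backup"

# Full restore from the same storage location.
tiup br restore full --pd "127.0.0.1:2379" --storage "local:///tmp/backup"
```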

37. Dumpling is suitable for the following scenarios:

  1. Exporting small amounts of data

  2. Scenarios requiring the export of SQL statements or CSV, which can be migrated to heterogeneous databases or systems

  3. Scenarios where export efficiency is not critical, since reading and converting data makes it slower than a physical export

38. Dumpling is not suitable for the following scenarios:

  1. Scenarios that need to export the raw SST files (key-value pairs) directly from the TiDB database

  2. Incremental backup: Dumpling currently supports only full export; incremental export is not possible

  3. Scenarios with large data volumes
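A sketch of a Dumpling logical export (connection details and the output path are placeholders):

```shell
# Export as SQL files, 256 MiB per file; use --filetype csv for CSV output
# when migrating to heterogeneous databases or systems.
tiup dumpling -h 127.0.0.1 -P 4000 -u root \
  --filetype sql -o /tmp/dump -F 256MiB
```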

39. Dumpling required permissions

Minimum required permissions





40. Which of the following scenarios are not suitable for Dumpling tool export? (AC)

A. Incremental backup

B. Exporting data to MySQL database

C. Scenarios with large data volumes requiring hot backup

D. Only exporting 2 tables with millions of records

| username: shigp_TIDBER | Original post link

The summary is very detailed and took quite some effort.