"Become a TiFlash Contributor in Ten Minutes: Essential Knowledge for TiFlash Function Pushdown"

translator_bot · June 23, 2024, 12:45pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 【十分钟成为 TiFlash Contributor】TiFlash 函数下推必知必会

| username: luzizhuo

Author: Huang Haisheng, TiFlash R&D Engineer

Since TiFlash was open-sourced, it has attracted wide attention from the community. Many enthusiasts have learned about the design principles behind TiFlash through source code reading activities, and many are eager to contribute to it. That is why we created the “Become a TiFlash Contributor in Ten Minutes” series, in which we will share everything about TiFlash, from principles to practice!

This article provides detailed information about TiFlash pushdown functions. We have also selected some related issues: https://github.com/pingcap/tiflash/issues/5092. We hope you can complete these challenges after reading this article and earn TiDB Contributor exclusive souvenirs!

Background Knowledge

As an essential part of the TiDB HTAP system, TiFlash receives and executes operators pushed down by TiDB. Sometimes, operators like Projection and Selection contain functions, meaning that to push down these operators, TiFlash must support executing the functions within them.

As shown in the figure above, if an operator contains a function not supported by TiFlash, a series of operators cannot be pushed down to TiFlash for execution. To maximize the parallel computing capabilities of TiFlash MPP, we need TiFlash to support all functions of TiDB. Seemingly trivial function support is a crucial part of TiDB HTAP!

Step-by-Step Guide to Pushdown Functions

1. Confirm the Behavior of the Function to be Pushed Down

The function is pushed down by TiDB to be executed by TiFlash, so the logic executed in TiFlash must be consistent with TiDB, including:

Main logic
Return value type
Exception handling
etc.

For example, the sqrt function in TiDB always returns float64. Even if the parameter is of Decimal type, TiDB evaluates it internally through evalReal. In contrast, floor and ceil determine the return value type based on the parameter’s type and size.

Generally, it is relatively simple for TiFlash to be consistent with TiDB. However, for some special inputs, special attention is needed during implementation. For example, what should sqrt of a negative number return: NaN, Null, or an exception?
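To make the question concrete, here is a minimal standalone sketch (not TiFlash code) of one possible answer. MySQL documents SQRT of a negative number as NULL, and the sketch assumes the pushed-down function should behave the same way. The name sqrtLikeMySQL and the use of std::optional to model NULL are illustrative assumptions. Confirm the actual behavior against TiDB's implementation before relying on it.

```cpp
#include <cassert>
#include <cmath>
#include <optional>

// Models a nullable DOUBLE: std::nullopt stands for SQL NULL.
using NullableDouble = std::optional<double>;

// Hypothetical helper, not part of TiFlash.
// Returns NULL for a NULL or negative input, otherwise the square root.
inline NullableDouble sqrtLikeMySQL(const NullableDouble & arg)
{
    if (!arg.has_value() || *arg < 0)
        return std::nullopt;
    return std::sqrt(*arg);
}
```

Writing the special cases down this explicitly before coding makes it easier to compare them one by one with the TiDB side.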

Therefore, before actual development, it is essential to thoroughly review how TiDB implements this function.

2. Map TiDB Function to TiFlash Function

TiDB identifies functions using tipb::ScalarFuncSig, while TiFlash uses func_name as the identifier.

In TiFlash code, we use a mapping table to map tipb::ScalarFuncSig to func_name.

The second step in pushing down a new function is to assign a func_name to the function in TiFlash and add a mapping from tipb::ScalarFuncSig to func_name in the corresponding mapping table.

Typically, SQL functions are divided into window functions, aggregate functions, distinct aggregate functions, and scalar functions. TiFlash maintains a mapping table for each type of function, as follows:

window_func_map
- For window functions
agg_func_map
- For regular aggregate functions
distinct_agg_func_map
- For distinct aggregate functions
scalar_func_map
- For general scalar functions
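The idea of such a mapping table can be sketched as follows. This is a simplified standalone illustration, not the TiFlash source: DemoFuncSig is a hypothetical stand-in for tipb::ScalarFuncSig, and the func_name strings are made up for the example.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for tipb::ScalarFuncSig; the real enum comes from tipb.
enum class DemoFuncSig
{
    Sqrt,
    CeilReal,
    FloorReal,
    Unmapped,
};

// Sketch of a signature-to-name table in the spirit of scalar_func_map.
// The names on the right are illustrative, not the ones TiFlash actually uses.
inline const std::unordered_map<DemoFuncSig, std::string> & demoScalarFuncMap()
{
    static const std::unordered_map<DemoFuncSig, std::string> map = {
        {DemoFuncSig::Sqrt, "sqrt"},
        {DemoFuncSig::CeilReal, "ceil"},
        {DemoFuncSig::FloorReal, "floor"},
    };
    return map;
}

// Returns the mapped func_name, or an empty string when the signature has no entry.
inline std::string lookupFuncName(DemoFuncSig sig)
{
    const auto & map = demoScalarFuncMap();
    const auto it = map.find(sig);
    if (it == map.end())
        return {};
    return it->second;
}
```

Supporting a new function then starts with a single new entry in the table that matches the function's category.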

3. Register TiFlash Function

After mapping tipb::ScalarFuncSig to func_name, the function pushed down by TiDB will find the corresponding builder in TiFlash based on func_name. The TiFlash Function will then execute the function logic in the actual execution flow.

Currently, there are two ways to implement Function Builder in TiFlash: reuse function and create function directly.

Reuse Function

Use this approach when the new function can be expressed in terms of existing functions. For example, ifNull(arg1, arg2) can be rewritten as if(isNull(arg1), arg2, arg1). Writing a dedicated ifNull implementation would be time-consuming, whereas the rewrite reuses logic that already exists.

In TiFlash, DAGExpressionAnalyzerHelper::function_builder_map records which functions are reused and how to reuse them.

Add a corresponding DAGExpressionAnalyzerHelper::FunctionBuilder and add the mapping <func_name, FunctionBuilder> in DAGExpressionAnalyzerHelper::function_builder_map.

Refer to other FunctionBuilder implementations in DAGExpressionAnalyzerHelper for specific implementation details.
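The ifNull rewrite mentioned above can be sketched on a toy expression tree. Everything here (Expr, makeExpr, buildIfNull, toString) is a hypothetical simplification for illustration. The real FunctionBuilder works on TiFlash's own expression and action types and has a different signature.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy expression node: a leaf (column or literal) when args is empty, otherwise a call.
struct Expr
{
    std::string name;
    std::vector<std::shared_ptr<Expr>> args;
};

using ExprPtr = std::shared_ptr<Expr>;

inline ExprPtr makeExpr(std::string name, std::vector<ExprPtr> args = {})
{
    return std::make_shared<Expr>(Expr{std::move(name), std::move(args)});
}

// Hypothetical builder: rewrites ifNull(a, b) into if(isNull(a), b, a).
inline ExprPtr buildIfNull(const ExprPtr & call)
{
    assert(call->name == "ifNull" && call->args.size() == 2);
    const ExprPtr & a = call->args[0];
    const ExprPtr & b = call->args[1];
    return makeExpr("if", {makeExpr("isNull", {a}), b, a});
}

// Renders an expression as text so the rewrite is easy to inspect.
inline std::string toString(const ExprPtr & e)
{
    if (e->args.empty())
        return e->name;
    std::string out = e->name + "(";
    for (std::size_t i = 0; i < e->args.size(); ++i)
    {
        if (i != 0)
            out += ", ";
        out += toString(e->args[i]);
    }
    return out + ")";
}
```

The builder adds no new computation logic. It only rearranges calls to functions that are already implemented and tested.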

Create Function Directly

Use this approach when no existing function can be reused. Implement the function code under dbms/src/Functions. Functions are usually grouped by category there; for example, String-related functions are in FunctionString.

Then call factory.registerFunction to register the function implementation class in FunctionFactory. factory.registerFunction is usually grouped together, so it should be easy to find.
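For readers unfamiliar with the factory pattern, the registration step can be sketched as follows. This is a minimal standalone model, assuming a name-to-creator map. DemoFunction, DemoSqrt, and DemoFunctionFactory are hypothetical names, and the real FunctionFactory and IFunction in TiFlash are considerably richer.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical minimal function interface.
struct DemoFunction
{
    virtual ~DemoFunction() = default;
    virtual std::string getName() const = 0;
};

struct DemoSqrt : DemoFunction
{
    static constexpr const char * name = "sqrt";
    std::string getName() const override { return name; }
};

// Sketch of a registry that maps func_name to a creator for the implementation class.
class DemoFunctionFactory
{
public:
    using Creator = std::function<std::unique_ptr<DemoFunction>()>;

    // Registers Function under its static name and rejects duplicates.
    template <typename Function>
    void registerFunction()
    {
        Creator creator = []() -> std::unique_ptr<DemoFunction> { return std::make_unique<Function>(); };
        const bool inserted = creators.emplace(std::string(Function::name), std::move(creator)).second;
        if (!inserted)
            throw std::logic_error(std::string("function already registered: ") + Function::name);
    }

    // Returns a new instance, or nullptr when nothing is registered under func_name.
    std::unique_ptr<DemoFunction> tryGet(const std::string & func_name) const
    {
        const auto it = creators.find(func_name);
        if (it == creators.end())
            return nullptr;
        return it->second();
    }

private:
    std::unordered_map<std::string, Creator> creators;
};
```

In this model, the func_name chosen in the mapping step is the key that connects the pushed-down signature to the implementation class.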

4. Develop Function on TiFlash Side

Next, develop the main body of the function on the TiFlash side. If existing TiFlash functions cannot be reused, you need to inherit the IFunction interface to develop a function. Fortunately, ClickHouse already has many ready-made functions, but since they may not be compatible with TiDB/MySQL, they are left under Functions for future use.

When inheriting IFunction to implement a function, first check if there is an existing ClickHouse function with the same semantics under Functions. Modify it to meet TiDB/MySQL compatibility and incorporate it into the TiFlash Function system.

If there is no suitable ClickHouse function, develop a vectorized function from scratch. Although developing vectorized functions is relatively challenging, you can find some patterns and development paradigms from other functions.

TiFlash vs. TiDB

There are differences in vectorized function implementation between TiFlash and TiDB. Contributors who have participated in TiDB contributions should note:

Differences between C++ and Golang
- TiFlash heavily uses C++ templates, especially for data type-related code.
Differences in vectorized function systems between TiFlash and TiDB
- The design and usage of expression-related classes differ significantly from TiDB.
  - IDataType
  - IColumn
- The combination of parameter Column types (vector and const) grows exponentially. For example, a function with two parameters has four combinations:
  - vector, const
  - vector, vector
  - const, vector
  - const, const

These differences make function development in TiFlash somewhat challenging and quite different from TiDB. Refer to the implementation of other functions in the Functions directory, such as FunctionSubStringIndex. You will gain many insights while developing functions.
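The vector/const combinations listed above can be handled with one templated loop plus a runtime dispatch. The sketch below is a standalone simplification: DemoColumn is a hypothetical stand-in for IColumn, and addColumns is an invented example function, not something from TiFlash.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical simplified column: a full vector of values, or one constant value
// that logically repeats `rows` times.
struct DemoColumn
{
    std::vector<double> data; // `rows` values for a vector column, exactly one for a const column
    std::size_t rows = 0;
    bool is_const = false;

    static DemoColumn makeVector(std::vector<double> values)
    {
        DemoColumn col;
        col.rows = values.size();
        col.data = std::move(values);
        return col;
    }

    static DemoColumn makeConst(double value, std::size_t rows)
    {
        DemoColumn col;
        col.data = {value};
        col.rows = rows;
        col.is_const = true;
        return col;
    }
};

// One loop body shared by all four cases. The const-ness of each side is a template
// parameter, so the index choice is resolved at compile time.
template <bool left_const, bool right_const>
std::vector<double> addImpl(const DemoColumn & l, const DemoColumn & r, std::size_t rows)
{
    std::vector<double> result(rows);
    for (std::size_t i = 0; i < rows; ++i)
        result[i] = l.data[left_const ? 0 : i] + r.data[right_const ? 0 : i];
    return result;
}

// Dispatches on the runtime column kinds to one of the four instantiations.
inline std::vector<double> addColumns(const DemoColumn & l, const DemoColumn & r)
{
    assert(l.rows == r.rows);
    if (l.is_const && r.is_const)
        return addImpl<true, true>(l, r, l.rows);
    if (l.is_const)
        return addImpl<true, false>(l, r, l.rows);
    if (r.is_const)
        return addImpl<false, true>(l, r, l.rows);
    return addImpl<false, false>(l, r, l.rows);
}
```

With three parameters the same scheme needs eight instantiations, which is why existing functions are worth studying for ways to keep the dispatch manageable.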

Reference Function Implementations

5. Pushdown Function on TiDB Side

The pushdown function is initiated from the TiDB side, so TiDB also needs some modifications to enable function pushdown. In expression/expression.go, scalarExprSupportedByFlash determines which functions can be pushed down to TiFlash. The TiDB planner decides whether an operator can be pushed down to TiFlash based on scalarExprSupportedByFlash.

For example, to push down the sqrt function to TiFlash, find the scalarExprSupportedByFlash function in TiDB’s expression/expression.go. You will see that all functions that can be pushed down are hard-coded into various switch cases. Add the sqrt function to the switch case.

6. Verify Function Pushdown

After completing the development on both TiDB and TiFlash sides, verify the entire pushdown process locally.

Deploy Local Cluster

Method 1: Use TiUP to Deploy Locally Built TiDB and TiFlash Binaries

First, build the TiFlash and TiDB binaries locally, then use TiUP to start a cluster for testing:

tiup playground nightly --db.binpath ${my_tidb} --tiflash.binpath ${my_tiflash}

By default, this will start a cluster with 1 PD, 1 TiKV, 1 TiDB, and 1 TiFlash. The nightly version is the daily build of the master branch. Use db.binpath and tiflash.binpath to specify the locally built TiDB and TiFlash. Refer to Quickly Deploy TiDB Cluster Locally for more details.

Method 2: Debug Function Execution Process in IDE and Replace TiDB and TiFlash Using Kill

First, start a TiDB, TiKV, TiFlash, and PD cluster locally. Follow the official documentation to install TiUP and start the cluster using playground:

tiup playground nightly

By default, this will start a cluster with 1 PD, 1 TiKV, 1 TiDB, and 1 TiFlash. The nightly version is the daily build of the master branch.

Then replace them with the locally built TiDB and TiFlash.
TiFlash

Run ps -ef | grep tiflash to find the TiFlash process, which should look like this:

xzx 11238 11028 52 20:20 pts/0 00:00:05 /home/xzx/.tiup/components/tiflash/v5.0.0-nightly-20210706/tiflash/tiflash server --config-file=/home/xzx/.tiup/data/ScRdWJM/tiflash-0/tiflash.toml

Note the process ID 11238 and the arguments that follow the tiflash binary: server --config-file=/home/xzx/.tiup/data/ScRdWJM/tiflash-0/tiflash.toml.

Then kill 11238 and start the locally built TiFlash using server --config-file=/home/xzx/.tiup/data/ScRdWJM/tiflash-0/tiflash.toml.

TiDB

Similar to TiFlash, find the TiUP TiDB process, kill the original process, and start TiDB with the corresponding parameters.

Verify Pushdown Process

Use queries like explain select sum(sqrt(x)) from test to see if the function is pushed down to TiFlash for computation.

Create TiFlash replica:

create table test.t (xxx);
-- Since usually only one node is started locally, set TiFlash replica to 1
alter table test.t set tiflash replica 1;

Test SQL can be like this:

-- Prefer MPP
set tidb_enforce_mpp=1;
-- Force to use only TiFlash
set tidb_isolation_read_engines='tiflash';
explain select xxxfunc(a) from t;

If the function is pushed down to TiFlash, the explain result will show the Projection operator containing the function on the TiFlash side. TiFlash replica creation takes a little time, so if the function does not appear at first, run the explain statement again. If it is still not pushed down after a long time, there might be an issue.

After the explain SQL executes successfully, remove the explain and execute the SQL to see the effect.

7. Testing

After submitting the PR, the GitHub CI for TiFlash will start an actual TiDB, TiFlash, PD, and TiKV cluster to automatically run unit and integration tests. Contributors need to prepare the test code in advance.

Integration Testing

For function pushdown, usually add a set of tests in the integration-test. Create a func.test for the new pushdown function under tests/fullstack-test/expr, referring to other function tests in the same directory, such as substring_index.test.

Unit Testing

Format

TiFlash function unit tests are placed under dbms/src/Functions/test. The naming format is usually gtest_${func_name}.cpp.

The unit test template is as follows:

#include <TestUtils/FunctionTestUtils.h>
#include <TestUtils/TiFlashTestBasic.h>

namespace DB::tests
{
class {gtest_name} : public DB::tests::FunctionTest
{
};

TEST_F({gtest_name}, {gtest_unit_name})
try
{
    const String & func_name = {function_name};

    // case1
    ASSERT_COLUMN_EQ(
        {output_result},
        executeFunction(
            func_name,
            {input_1},
            {input_2},
            ...,
            {input_n}));
    // case2
    ...
    // case3
    ...
}
CATCH

TEST_F({gtest_name}, {gtest_unit_name2})...
TEST_F({gtest_name}, {gtest_unit_name3})...
...

} // namespace DB::tests

Refer to other function unit tests in the directory and make appropriate adjustments.

FunctionTestUtils is a common class for function testing, providing various commonly used methods such as CreateColumn. If you find other reusable methods while writing gtests, you can add them here.

Content

For a function like function(arg_1, arg_2, arg_3, … arg_n), a TiFlash function unit test should at least include the following parts:

Data Types

For each arg_i’s supported types, test Type and Nullable(Type). Although theoretically, all arg_i should support DataTypeNullable(DataTypeNothing), TiDB rarely uses DataTypeNullable(DataTypeNothing), so note related bugs if encountered.

Column Types

For each arg_i’s type:

If the type is not nullable, test two forms of columns:
- ColumnVector
- ColumnConst
If the type is nullable, test three forms of columns:
- ColumnVector
- ColumnConst(ColumnNullable(non-null value))
- ColumnConst(ColumnNullable(null value))
If the type is DataTypeNullable(DataTypeNothing), test two forms of columns:
- ColumnVector
- ColumnConst(ColumnNullable(null value))
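Because every argument has several column forms, the number of cases multiplies quickly. One way to avoid listing cases by hand is to generate all combinations and loop over them in the test. The sketch below is a standalone illustration that only produces labels. The helper names columnForms and allFormCombinations are hypothetical and are not part of FunctionTestUtils.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Column forms from the list above for one argument, depending on nullability.
inline std::vector<std::string> columnForms(bool nullable)
{
    if (nullable)
        return {
            "ColumnVector",
            "ColumnConst(ColumnNullable(non-null value))",
            "ColumnConst(ColumnNullable(null value))",
        };
    return {"ColumnVector", "ColumnConst"};
}

// Builds every combination of forms across all arguments (a cartesian product).
// arg_nullable[i] says whether argument i has a nullable type.
inline std::vector<std::vector<std::string>> allFormCombinations(const std::vector<bool> & arg_nullable)
{
    // Start with a single empty combination and extend it one argument at a time.
    std::vector<std::vector<std::string>> combos(1);
    for (bool nullable : arg_nullable)
    {
        std::vector<std::vector<std::string>> next;
        for (const auto & prefix : combos)
        {
            for (const auto & form : columnForms(nullable))
            {
                auto extended = prefix;
                extended.push_back(form);
                next.push_back(std::move(extended));
            }
        }
        combos = std::move(next);
    }
    return combos;
}
```

For a function with one non-nullable and one nullable argument this yields 2 x 3 = 6 combinations, each of which deserves at least one assertion.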

Boundary Values

Some common boundary value examples are:

Numeric types (int, double, decimal, etc.): max/min values, 0 value, null value
String types: empty string, non-ASCII characters like Chinese, null value, with/without collation
Date types: zero date, dates before 1970-01-01, daylight saving time, null value

For specific functions, construct boundary values based on their specific implementation.
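For numeric arguments, the boundary inputs can be produced generically. The sketch below is a standalone illustration in which std::optional models a nullable value. The helper numericBoundaryValues is a hypothetical name, not an existing TiFlash test utility.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Boundary inputs for a numeric argument: lowest value, highest value, zero, and NULL.
// lowest() is used instead of min() because min() is the smallest positive value
// for floating-point types.
template <typename T>
std::vector<std::optional<T>> numericBoundaryValues()
{
    return {
        std::optional<T>(std::numeric_limits<T>::lowest()),
        std::optional<T>(std::numeric_limits<T>::max()),
        std::optional<T>(T{0}),
        std::nullopt,
    };
}
```

Function-specific boundaries, such as the sign change for sqrt, still have to be added by hand.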

Return Value Types

Ensure TiFlash function return value types are consistent with MySQL/TiDB according to MySQL documentation.

Note:

Decimal types in TiFlash have four internal representations: Decimal32, Decimal64, Decimal128, and Decimal256. Test all four for all Decimal types.
The possible types for each arg_i should be based on the types TiDB might push down. Considering the difficulty of obtaining this information, write tests based on the types currently supported by TiFlash.
Some TiDB pushdown functions have function signatures containing type information, such as EQInt, EQReal, EQString, EQDecimal, EQTime, EQDuration, EQJson for a = b. Although a and b can be int/real/string/decimal/time/duration/json, TiDB ensures a and b are of the same type when pushing down. For now, only test equal functions for the same type, like int = int, decimal = decimal.
For functions with potentially infinite input parameters (e.g., case when), ensure the minimum loop unit is tested.
Expect to find many bugs during testing. Fix easy-to-fix bugs while testing. For difficult or uncertain bugs, open an issue and comment out the corresponding test.
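To cover all four Decimal representations mentioned in the note above, it helps to pick one precision per representation. The sketch below assumes the precision limits 9, 18, 38, and 65 for Decimal32, Decimal64, Decimal128, and Decimal256. These limits are an assumption in this example and should be checked against the TiFlash source. The helper names are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Assumed maximum precision per internal representation; verify against the TiFlash source.
struct DecimalKind
{
    const char * name;
    int max_precision;
};

inline const std::vector<DecimalKind> & decimalKinds()
{
    static const std::vector<DecimalKind> kinds = {
        {"Decimal32", 9},
        {"Decimal64", 18},
        {"Decimal128", 38},
        {"Decimal256", 65},
    };
    return kinds;
}

// Picks the narrowest representation that can hold the given precision.
// Returns an empty string when the precision is out of range.
inline std::string decimalKindFor(int precision)
{
    if (precision < 1)
        return {};
    for (const auto & kind : decimalKinds())
    {
        if (precision <= kind.max_precision)
            return kind.name;
    }
    return {};
}

// One representative precision per representation, usable as test inputs.
inline std::vector<int> representativePrecisions()
{
    std::vector<int> result;
    for (const auto & kind : decimalKinds())
        result.push_back(kind.max_precision);
    return result;
}
```

Running each Decimal test case once per representative precision exercises every internal representation.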

Common Issues

Even if a function returns null, assign a meaningful value to its corresponding nestedColumn

In TiFlash function implementations, there is an overloadable function: useDefaultImplementationForNulls. For most functions, if no special handling for null is needed, return true. This way, no null-related considerations are needed when implementing the function. The principle is that IExecutableFunction::defaultImplementationForNulls will extract the nestedColumn of the nullable column and pass it to the function, and the nestedColumn is always of a not-null type.
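The default null handling described above can be modeled in a few lines. This is a simplified standalone sketch, assuming a nullable column is a nested value vector plus a null map. DemoNullableColumn, addOneNested, and executeWithDefaultNulls are hypothetical names, and the real implementation handles multiple arguments and other column kinds.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical simplified nullable column: nested values plus a null map (1 means NULL).
struct DemoNullableColumn
{
    std::vector<double> nested;
    std::vector<std::uint8_t> null_map;
};

// The actual function logic only ever sees the not-null nested values.
inline std::vector<double> addOneNested(const std::vector<double> & nested)
{
    std::vector<double> result;
    result.reserve(nested.size());
    for (double v : nested)
        result.push_back(v + 1);
    return result;
}

// Sketch of a default null wrapper: strip the null map, run the function on the
// nested column, then reattach the null map to the result.
inline DemoNullableColumn executeWithDefaultNulls(const DemoNullableColumn & input)
{
    assert(input.nested.size() == input.null_map.size());
    return DemoNullableColumn{addOneNested(input.nested), input.null_map};
}
```

Note that in this model the function also runs on rows that are marked NULL, so the nested value stored at such a row must be something the function can process safely.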

For functions requiring special null handling, like concat_ws, which needs

translator_bot · June 23, 2024, 12:45pm

| username: ddhe9527 | Original post link

Support.

translator_bot · June 23, 2024, 12:45pm

| username: luzizhuo | Original post link

Welcome to claim issues and submit PRs together~

translator_bot · June 23, 2024, 12:45pm

| username: 西伯利亚狼 | Original post link

Support it.