TiDB in the AI Era: Simplified

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: ai时代的tidb,简单

| username: tidb狂热爱好者

In the AI era, TiDB has introduced AI capabilities of its own, in the form of vector functions. The world is moving so fast that I can barely keep up.

mysql> CREATE TABLE vec_tbl(content_id INT, vec VECTOR);
Query OK, 0 rows affected (1.24 sec)

mysql> CREATE TABLE vec_ref_id(id INT, content VARCHAR(512));
Query OK, 0 rows affected (0.86 sec)

# insert

….
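
Purely for illustration (the actual insert statements are elided above), the tables would be loaded with 5-dimensional vectors to match the queries below:

mysql> INSERT INTO vec_tbl VALUES (1, '[8.7, 5.7, 7.7, 9.8, 1.4]');  -- example row only
mysql> INSERT INTO vec_ref_id VALUES (1, 'hello world');  -- example row only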

# TOPK

mysql> SELECT content_id, vec_cosine_distance(vec, '[8.7, 5.7, 7.7, 9.8, 1.4]') as distance
-> FROM vec_tbl ORDER BY distance LIMIT 3;
+------------+-------------------------+
| content_id | distance                |
+------------+-------------------------+
| 1          | 0.000018636164854313186 |
| 4          | 0.046720443178103865    |
| 5          | 0.17025852918553686     |
+------------+-------------------------+
3 rows in set (0.26 sec)
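
As a reminder of the standard definition (not specific to TiDB's output above), cosine distance is 1 minus cosine similarity:

distance(a, b) = 1 - (a · b) / (|a| * |b|)

So the near-zero distance for content_id 1 means its stored vector points in almost exactly the same direction as the query vector.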

# JOIN

mysql> SELECT vri.id, vri.content, VEC_Cosine_Distance(vt.vec, '[1,2,3,4,5]') AS distance
-> FROM vec_tbl vt
-> JOIN vec_ref_id vri ON vt.content_id = vri.id
-> ORDER BY distance;
+------+-------------+---------------------+
| id   | content     | distance            |
+------+-------------+---------------------+
| 5    | content 5   | 0.04972827044697814 |
| 3    | content 3   | 0.06925255631855243 |
| 4    | content 4   | 0.12677426832319694 |
| 2    | content 2   | 0.18577333207371582 |
| 1    | hello world | 0.2564100235012563  |
+------+-------------+---------------------+
5 rows in set (0.29 sec)

With vectors, there’s no need to write custom code for movie or book recommendations anymore; a single SQL statement can handle it.

KNN (K Nearest Neighbors)

  • Exactly finds the K nearest vectors among all N vectors
  • Time complexity is O(N·K): a brute-force scan of every vector (see the sketch after the ANN list below)

ANN (Approximate Nearest Neighbors)

  • Approximate results, so recall becomes an issue
  • Returns K approximately optimal results, which map onto SQL’s TopN operator
  • Expected query latency needs to stay close to TP-level (< 100 ms)
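
To make the exact-search baseline concrete, here is a minimal numpy sketch (illustrative names only, not TiDB’s implementation): it scans every one of the N vectors, which is exactly the full-scan cost that the ANN index families below try to avoid.

import numpy as np

def knn_cosine(query, vectors, k):
    # Cosine distance = 1 - cosine similarity, computed for all N rows at once
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    dists = 1.0 - sims
    # argpartition extracts the k smallest distances without a full sort
    idx = np.argpartition(dists, k)[:k]
    return idx[np.argsort(dists[idx])]

# Example: 10,000 random 5-dimensional vectors, shaped like vec_tbl above
vecs = np.random.rand(10000, 5)
print(knn_cosine(np.array([8.7, 5.7, 7.7, 9.8, 1.4]), vecs, 3))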

Hash-based

  • LSH: multiple hash functions make similar items collide into the same buckets; not well suited to high-dimensional data

Tree-based

  • KD-Tree, Annoy

Graph-based

  • HNSW, DiskANN

Cluster-based

  • IVF (inverted-file indexes, e.g. IVFFlat / IVF-PQ)
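
TiDB’s own vector index belongs to the graph-based family (HNSW). As a sketch of the syntax documented for TiDB versions that ship vector search (check your version’s docs; the indexed column needs a fixed dimension, and per the docs the index is built on TiFlash replicas):

mysql> CREATE TABLE vec_tbl_indexed(
    ->   content_id INT,
    ->   vec VECTOR(5),  -- sketch only: the dimension must be fixed for an index
    ->   VECTOR INDEX idx_vec ((VEC_COSINE_DISTANCE(vec)))
    -> );

Such an index is reportedly used only for queries in the TopN shape: ORDER BY the same distance function plus a LIMIT.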

Databases have always served statistical needs. Before machine learning, when I worked on a download site, we used SQL to show the top 10 recent downloads, top 10 all-time downloads, and top 10 downloads per category in the left-hand sidebar. With vector functions, we can now also show “users who liked this software also downloaded xxx,” applying the classic beer-and-diapers association idea via vector recommendations.

import numpy as np

# One user's preference vector: ratings across 3 features
user_pref = np.array([5, 1, 3])

# A random matrix of 10,000 movies with the same 3 features.
# randint(5, ...) draws integers 0-4, so the +1 shifts the scale to 1-5.
movies = np.random.randint(5, size=(3, 10000)) + 1

We used numpy’s array function to create these matrices; user_pref and movies are our data. We can check their dimensions with the shape attribute, as shown below:
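
print(user_pref.shape)  # (3,)        one preference vector with 3 features
print(movies.shape)     # (3, 10000)  3 features x 10,000 movies

A dot product of user_pref with movies yields one score per movie. To see how that cost grows with catalog size, we can time it at increasing scales: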

import time

# Time one full scoring pass (a dot product) at increasing scales
for num_movies in (10000, 100000, 1000000, 10000000, 100000000):
    movies = np.random.randint(5, size=(3, num_movies)) + 1
    now = time.time()
    np.dot(user_pref, movies)
    print(time.time() - now, "seconds to run", num_movies, "movies")

Output result:

0.000160932540894 seconds to run 10000 movies
0.00121188163757 seconds to run 100000 movies
0.0105860233307 seconds to run 1000000 movies
0.096577167511 seconds to run 10000000 movies
4.16197991371 seconds to run 100000000 movies

In other words, we previously needed Python and numpy to write the recommendation code, compute and sort the scores (the timings above grow roughly linearly, as a full scan should), and then write the results into TiDB as a cache. Now a single SQL statement handles it:

SELECT vri.id, vri.content, VEC_Cosine_Distance(vt.vec, '[1,2,3,4,5]') AS distance
FROM vec_tbl vt JOIN vec_ref_id vri ON vt.content_id = vri.id ORDER BY distance;
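
In practice, a recommendation query would also cap the result set with LIMIT, which is what lets the optimizer run it as the TopN operator described earlier (whether a vector index is used depends on the plan; this variant is a sketch):

SELECT vri.id, vri.content, VEC_Cosine_Distance(vt.vec, '[1,2,3,4,5]') AS distance
FROM vec_tbl vt JOIN vec_ref_id vri ON vt.content_id = vri.id
ORDER BY distance LIMIT 10;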
| username: oceanzhang | Original post link

There is still a long way to go.

| username: zhang_2023 | Original post link

The development is quite fast.

| username: 数据库真NB | Original post link

This is not AI, at most it integrates a few algorithms.

| username: shigp_TIDBER | Original post link

This still requires a lot of verification and proof, right?

| username: ShawnYan | Original post link

Well, it’s not just AI.

| username: Kongdom | Original post link

When will the 550W be released?

| username: ShawnYan | Original post link

Cutting-edge content needs to be quickly implemented, tested, and improved.

| username: Jack-li | Original post link

Forge ahead

| username: 洪七表哥 | Original post link

Looking forward to more algorithm support.

| username: TiDBer_JUi6UvZm | Original post link

Got it!

| username: ZZH-QK | Original post link

:+1: Are you also expanding into vector databases?

| username: 呢莫不爱吃鱼 | Original post link

It feels like there’s still a long way to go.

| username: jiayou64 | Original post link

:+1: :+1:

| username: zhaokede | Original post link

:+1:

| username: 随缘天空 | Original post link

Is this AI?

| username: xiaoqiao | Original post link

Awesome

| username: wangkk2024 | Original post link

Awesome

| username: dba远航 | Original post link

Welcome creative ideas