[TiDBer Chat Session 115] TiDB Supports Vector Functions, What Do You Want to Use It For?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 【TiDBer 唠嗑茶话会 115】TiDB 支持向量功能,你最想拿它做什么?

| username: Billmay表妹

This week, I saw many friends sharing @hey-hoho’s article in their WeChat Moments.

Everyone is very excited about TiDB’s vector support feature.

Currently, TiDB Serverless already supports vector functionality!

How to try it

Waitlist application: https://tidb.cloud/ai

Hands-on experience: https://tidbcloud.com

This Topic:

What do you most want to do with TiDB’s vector support feature?
Do you have a need for TiDB’s vector support?
In what scenarios would you use this feature?
What are your thoughts or suggestions on the future launch of TiDB’s vector functionality?

An explanation of vectors from @Icemap in reply #57:

I’ll start the discussion here. We are currently using the vector feature in a specific scenario: RAG.
RAG stands for retrieval-augmented generation. It is currently the most common way to build AI applications and is very widely used.

For example, if you have used ChatGPT, you may have been struck by its ability to talk nonsense with a completely straight face. This is the so-called “hallucination” of an LLM, a case where the model loses its grounding in the facts.
So how can we solve this problem in a simple way? The answer is RAG. We first retrieve some reliable material related to the question, give the LLM that reliable context as augmentation, and then let it generate the answer. This can significantly improve the accuracy of the responses.

Here is a more specific example. We are working on tidb.ai, hoping to use large models and our own documentation to answer questions about TiDB. The steps can be as follows:

  1. The user asks a question about TiDB
  2. Search for relevant TiDB documents based on the question
  3. Use the documents to fill in the LLM’s Prompt
  4. Let the LLM generate the output in the required format

This raises a question: how do we find the TiDB documents relevant to a given question? This is where another AI capability, Embedding, comes in. An embedding model turns a piece of text into a feature vector, and we can compare the distance between the embedding vectors of two pieces of text: the closer the distance, the more relevant the two texts are to each other. So we can optimize the process above as follows (a minimal code sketch follows the list):

  1. Pre-embed the TiDB documents and store each piece of text together with its vector in TiDB
  2. When a user asks a question about TiDB, embed the question and use TiDB’s vector distance calculation to find the most relevant documents
  3. Use the documents to fill in the LLM’s Prompt
  4. Let the LLM generate the output in the required format
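
Here is a minimal sketch of steps 1 and 2 above in Python, assuming TiDB Serverless's VECTOR column type and its VEC_COSINE_DISTANCE() function from the vector search beta, accessed through pymysql. The table name, the toy hash-based embed_text(), and the connection placeholders are illustrative assumptions, not how tidb.ai is actually implemented.

```python
import hashlib
import pymysql

DIM = 8  # toy dimension; real embedding models produce hundreds to thousands of dimensions

def embed_text(text: str) -> str:
    """Toy stand-in for a real embedding model: hash words into a small vector.
    Replace with a real model (OpenAI, BGE, ...) in practice. Returns the
    '[x,y,...]' string form that TiDB accepts as a vector literal."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM] += 1.0
    return "[" + ",".join(str(x) for x in vec) + "]"

# Connection parameters are placeholders; TiDB Serverless requires TLS.
conn = pymysql.connect(host="<tidb-serverless-host>", port=4000,
                       user="<user>", password="<password>",
                       database="rag_demo", ssl={"ca": "<path-to-ca-cert>"})

with conn.cursor() as cur:
    # Step 1: pre-embed the documents and store text + vector in TiDB.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS tidb_docs (
            id BIGINT PRIMARY KEY AUTO_INCREMENT,
            content TEXT,
            embedding VECTOR(8)
        )""")
    doc = "TiDB is a distributed SQL database that is compatible with MySQL."
    cur.execute("INSERT INTO tidb_docs (content, embedding) VALUES (%s, %s)",
                (doc, embed_text(doc)))
    conn.commit()

    # Step 2: embed the user's question and fetch the closest documents by cosine distance.
    question = "Is TiDB compatible with MySQL?"
    cur.execute("""
        SELECT content, VEC_COSINE_DISTANCE(embedding, %s) AS dist
        FROM tidb_docs ORDER BY dist LIMIT 3""",
                (embed_text(question),))
    context = "\n\n".join(row[0] for row in cur.fetchall())

# Steps 3 and 4: fill the retrieved documents into the prompt and call the LLM (not shown).
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```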

This is one of the reasons we need the vector feature. Beyond that, how to generate the embeddings and how to retrieve are further optimization points, and that is the part we are working on now. But TiDB’s vector feature is indispensable here. Here is a comparison chart of using an external vector database versus TiDB’s built-in vectors.

Of course, this is just one use case for vectors. We look forward to hearing more use cases from everyone.

Participation Rewards:

Participate in the discussion and get 30 points & experience!

Event Time:

2024.4.19 - 2024.4.25

| username: 随缘天空 | Original post link

I hope there will be a general introduction to the vector functions, covering usage scenarios and advantages.

| username: 魔人逗逗 | Original post link

I feel the main use cases are image search, or combining embeddings for contextual semantic search (an embedding essentially converts text into a multi-dimensional array of numbers, and because the conversion takes the relationships between characters into account, the resulting vector implicitly carries the context).

Previously, this area seemed mainly related to multimedia search, such as image search—converting images into multi-dimensional vectors, i.e., floating-point arrays, and then querying image similarity. After the rise of large models, there have been more scenarios combining embedding + knowledge base for contextual vector search, such as RAG.

Speaking just of the embedding + knowledge base scenario, many platforms such as dify now support “hybrid search” for knowledge bases, that is, keyword full-text search combined with vector search. Personally, I feel each has its pros and cons. ① For natural-language questions, full-text search hits the knowledge base less often, but is precise when it does hit; vector search hits more often but is relatively less precise (this also depends on the chosen vector index and distance metric). ② Full-text search matches individual words or characters and does not combine well with the semantics of the whole sentence; when text is vectorized, on the other hand, the algorithm generally takes the context of the whole sentence or paragraph into account, so the resulting vector carries that contextual information.
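
To make the "keyword full-text search + vector search" idea concrete, here is a toy score-fusion sketch in plain Python. The documents, the made-up embeddings, the keyword_score() helper, and the 0.3/0.7 weights are all illustrative assumptions; a real system would use a proper full-text engine (e.g. BM25) and a real embedding model.

```python
# Toy illustration of hybrid search: fuse a keyword full-text score with a
# vector similarity score and rank documents by the combined score.
import math

def keyword_score(query: str, doc: str) -> float:
    """Fraction of query words that literally appear in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0

def cosine_sim(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

# Pretend these small vectors came from an embedding model.
docs = {
    "TiDB Serverless now supports vector search for AI apps.": [0.9, 0.1, 0.2],
    "How to cook rice properly.":                              [0.1, 0.8, 0.3],
}
query, query_vec = "does tidb support vector search", [0.85, 0.15, 0.25]

# Note how the keyword score misses "support" vs "supports" while the
# vector score still ranks the relevant document clearly higher.
for doc, doc_vec in docs.items():
    fused = 0.3 * keyword_score(query, doc) + 0.7 * cosine_sim(query_vec, doc_vec)
    print(f"{fused:.3f}  {doc}")
```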

To put it bluntly, any form of data can be converted into vectors by some algorithm, and that is what makes vector search (approximate nearest neighbor, ANN) possible. The two most important pieces are actually the vectorization algorithm and the vector search algorithm (index structure + distance metric).
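
A tiny example of why the distance-metric half of that matters: the same made-up vectors rank differently under L2 (Euclidean) distance and cosine distance, so the metric you pick has to match how the embeddings were produced and normalized. The numbers below are purely illustrative.

```python
# The same candidate vectors can rank differently under different distance metrics.
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_dist(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

query = [1.0, 1.0]
candidates = {"A": [2.0, 2.0], "B": [1.0, 0.5]}

for name, vec in candidates.items():
    print(name, "L2:", round(l2(query, vec), 3), "cosine:", round(cosine_dist(query, vec), 3))
# A points in the same direction as the query (cosine distance 0) but is far away in L2;
# B is close in L2 but points in a different direction, so the two metrics disagree.
```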

The above are my personal understandings of vector functionality and may not necessarily align with TiDB’s views.

| username: 边城元元 | Original post link

Multimodal data similarity retrieval

| username: YuchongXU | Original post link

Retrieval.

| username: Fly-bird | Original post link

You can try to make a graph database to store images and audio, but I’m not sure if the storage space and query efficiency have been optimized.

| username: ShawnYan | Original post link

Refactor GIS database with vector database?

| username: tony5413 | Original post link

It seems I need to work harder. I’ll give it a try after some time. Mark!!!

| username: 望海崖2084 | Original post link

I don’t know much about this yet, I still need to learn.

| username: 大飞哥online | Original post link

Study and learn.

| username: Jellybean | Original post link

Vector search has opened up a vast new field for TiDB, greatly enriching the product’s features and covering more user scenario needs.

Obviously, this provides more database options for scenarios involving more complex data storage and search, such as images and videos.

| username: 像风一样的男子 | Original post link

Looking forward to spatial functions like GIS and other features.

| username: Myth | Original post link

Large model

| username: DBAER | Original post link

I don’t know much about it and need to deepen my understanding in this area.

| username: yulei7633 | Original post link

I need to work twice as hard; I don’t even know what a vector is yet.

| username: coderv | Original post link

Search function

| username: Kongdom | Original post link

:yum: Just follow TiDB~

| username: twentycui | Original post link

More suitable for image and text data retrieval.

| username: tidb狂热爱好者 | Original post link

Help me predict stocks.