KevinLuo · 6 min read · Mar 12, 2024

Introduction to RAG and Agent Workflows

Foreword

The wave of large language models (LLMs) has swept almost every industry, but in professional or vertical scenarios, general-purpose models run into a lack of domain knowledge. A RAG-based solution is often a better choice than expensive post-training or SFT (supervised fine-tuning). This article starts from the RAG architecture, walks through the relevant technical details, and illustrates the key steps with practical examples.

What is RAG?

Retrieval-Augmented Generation (RAG) has become one of the hottest LLM application patterns. After this year's wave of large models, most of us have a good sense of their capabilities, but when we apply them to real business scenarios, we find that general-purpose base models often cannot meet actual business needs, mainly for the following reasons:

Knowledge limitations: A model's knowledge comes entirely from its training data, and the mainstream large models (ChatGPT, Gemini, Llama 2, Claude 2, …) are trained almost exclusively on publicly available web data. Real-time, non-public, or offline data is out of reach, so the model simply does not have that knowledge.

Hallucination: All AI models are ultimately based on mathematical probability, and a model's output is essentially the result of a chain of numerical operations; large models are no exception. They will sometimes state nonsense with a straight face, especially on topics where the model has little or no knowledge. Such hallucinations are hard to catch, because spotting them requires the user to have domain knowledge of their own.

Data security: For enterprises, data security is paramount, and no enterprise is willing to risk a leak by uploading its private data to a third-party platform for training. Applications that rely entirely on a general-purpose model therefore have to trade off data security against effectiveness.

RAG is an effective solution to these problems.

RAG architecture

The architecture of RAG is shown in the figure. In simple terms, RAG retrieves relevant knowledge and integrates it into the prompt, so that the large model can refer to that knowledge and give a reasonable answer. The core of RAG can therefore be summarized as "retrieval + generation": the former relies on the efficient storage and retrieval capabilities of a vector database to recall the target knowledge; the latter uses the large model and prompt engineering to make good use of the recalled knowledge and generate the target answer.

RAG advanced architecture

The complete RAG application process consists of two main phases:

Data preparation stage: data extraction → text segmentation → vectorization (embedding) → data storage

Application stage: user question → data retrieval (recall) → prompt injection → LLM answer generation

Let's take a closer look at the technical details and caveats of each step:

Data Preparation Phase:

Data preparation is generally an offline process whose main job is to vectorize private-domain data, build indexes, and store them in a database. It mainly consists of data extraction, text segmentation, vectorization, and data storage.

Data extraction:

Data loading: load data in multiple formats from different data sources and normalize it into a single, consistent representation.

Data processing: including data filtering, compression, formatting, etc.

Metadata acquisition: Extract key information from the data, such as file name, title, and time.
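
As a concrete illustration, here is a minimal sketch of the extraction step in Python. It assumes plain-text files in a local folder; the Document class and load_documents helper are hypothetical names introduced here, and a real pipeline would add loaders for PDF, HTML, and other formats.

import os
import time
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)

def load_documents(folder: str) -> list[Document]:
    """Load .txt files and capture basic metadata (file name, title, time)."""
    docs = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue  # data filtering: keep only plain-text files in this sketch
        path = os.path.join(folder, name)
        with open(path, encoding="utf-8") as f:
            text = f.read().strip()
        docs.append(Document(
            text=text,
            metadata={
                "file_name": name,                                    # file name
                "title": text.splitlines()[0][:80] if text else "",   # first line as title
                "modified": time.ctime(os.path.getmtime(path)),       # timestamp
            },
        ))
    return docs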

Text Segmentation:

Two main factors drive text segmentation: 1) the token limit of the embedding model; 2) the impact of semantic integrity on overall retrieval quality. Some common ways to split text are as follows:

Sentence segmentation: split at the granularity of sentences to preserve each sentence's complete semantics. Common delimiters include periods, exclamation marks, question marks, and line breaks.

Fixed-length segmentation: split the text into chunks of a fixed length (for example, 256 or 512 tokens) according to the embedding model's token limit. This loses a lot of semantic information, which is usually mitigated by adding some overlap at the head and tail of adjacent chunks, as in the sketch below.
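
Here is a minimal sketch of fixed-length segmentation with overlap. Splitting on whitespace stands in for a real tokenizer (you would normally use the embedding model's own tokenizer); the function name and parameter values are illustrative.

def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    tokens = text.split()        # crude whitespace "tokens" for illustration
    step = chunk_size - overlap  # adjacent chunks share `overlap` tokens
    chunks = []
    for start in range(0, len(tokens), step):
        piece = tokens[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(tokens):
            break  # the last chunk reached the end of the text
    return chunks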

Vectorization (embedding):

Vectorization is the process of converting text into vectors, and it directly affects the quality of subsequent retrieval. The common embedding models listed below can satisfy most requirements, but for special scenarios (for example, rare words or domain-specific terms), or to further optimize quality, you can fine-tune an open-source embedding model or train an embedding model tailored to your own scenario.

Examples:

  1. ChatGPT Embedding: provided by OpenAI and called through an API; see https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
  2. M3E: a powerful open-source embedding model, available in several versions (m3e-small, m3e-base, m3e-large), that supports fine-tuning and on-premise deployment; see https://huggingface.co/moka-ai/m3e-base
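
For example, the m3e-base model linked above can be used through the sentence-transformers library. This is a minimal sketch; the sample chunks are made up for illustration.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("moka-ai/m3e-base")  # downloaded from Hugging Face
chunks = [
    "The robot vacuum runs for several hours on a single charge.",  # toy chunks
    "The dustbin should be emptied after every few cleaning runs.",
]
embeddings = model.encode(chunks, normalize_embeddings=True)  # one vector per chunk
print(embeddings.shape)  # (2, 768) for m3e-base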

Data Storage:

Building an index on the vectorized data and writing it to a database makes up the data storage step. Databases well suited to RAG scenarios include FAISS, Chroma, Elasticsearch (ES), and Milvus. In general, choose a database based on factors such as the business scenario, hardware, and performance requirements.
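
As a sketch of this step with FAISS (one of the options above): build a flat index over the normalized embeddings from the previous snippet and persist it to disk. The file name is arbitrary.

import faiss
import numpy as np

vectors = np.asarray(embeddings, dtype="float32")  # FAISS expects float32
index = faiss.IndexFlatIP(vectors.shape[1])  # inner product = cosine on normalized vectors
index.add(vectors)                           # write the vectors into the index
faiss.write_index(index, "docs.faiss")       # persist alongside the chunk texts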

Application Phase:

In the application stage, we use efficient retrieval to recall the knowledge most relevant to the user's question and integrate it into the prompt; the large model then generates an answer with reference to both the question and the retrieved knowledge. The key steps are data retrieval and prompt injection.

Data retrieval

Common data retrieval methods include similarity (vector) search and full-text search; depending on retrieval quality, several methods can be combined to improve the recall rate.
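
Continuing the FAISS sketch above, similarity search embeds the user's question with the same model and recalls the nearest chunks; the query text and the value of k are illustrative.

query = "How long does the robot vacuum run on one charge?"
query_vec = model.encode([query], normalize_embeddings=True).astype("float32")
scores, ids = index.search(query_vec, 2)            # top-k most similar chunks
retrieved = [chunks[i] for i in ids[0] if i != -1]  # -1 marks "no result"
print(retrieved)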

Prompt injection

As the direct input to the large model, the prompt is one of the key factors affecting the accuracy of the model's output. In RAG scenarios, a prompt generally includes a task description, background knowledge (the retrieved text), and a task instruction (usually the user's question); additional instructions can be added to optimize the model's output for the task scenario and the model being used. A prompt for a simple Q&A scenario looks like this:

#【Description】
You are a professional customer-service assistant. Please answer the 【Question】 with reference to the 【Background】 below.
#【Background】
{content} - the relevant text retrieved for this question
#【Question】
How long is the battery life of the Roborock Robot Vacuum Cleaner P10?

Prompt design has methods but no fixed syntax; it depends largely on personal experience, and in practice you will often need to tune the prompt against the large model's actual output.
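
Putting it together, here is a minimal sketch that injects the retrieved text into the template above and calls a chat model through the OpenAI Python client; the model name is an assumption for illustration, and any chat model would do.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
background = "\n".join(retrieved)  # chunks recalled in the retrieval step
prompt = f"""#【Description】
You are a professional customer-service assistant. Please answer the 【Question】 with reference to the 【Background】 below.
#【Background】
{background}
#【Question】
How long is the battery life of the Roborock Robot Vacuum Cleaner P10?"""
response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name for this sketch
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)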

Hope you can make sense of all of this~ RAG really is important~

Okay, see you.
