In RAG, how the data is processed matters a great deal. In traditional RAG, the first processing step is chunking: dividing a document into smaller pieces. If the text is divided poorly, sentences get cut off in the middle, which leads to incoherent retrieval results and misunderstandings by the AI.
Taking 302.AI's Change Log as an example, here is the source text that needs to be chunked:
2024.9.5
[Omni Toolbox] Now when debugging in API Market, there's no need to manually enter API Key, the system will automatically fill it in
[Tools Market] AI Document Editor now supports one-click generation of extra-long documents and in-interface AI chat function
[Chat-bot] Now supports pplx-8b-online, pplx-70b-online, pplx-405b-online, which focuses on providing useful, up-to-date, and accurate responses, from Perplexity
[API Market] Now supports pplx-8b-online, pplx-70b-online, pplx-405b-online, from Perplexity
[API Market] DeepL added an interface for translating into any language
[Help Center] Updated several commonly used tool integration tutorials, such as Lobe-Chat, Immersive Translation, etc.
2024.9.3
[Management Backend] Omni Toolbox and Chatbot now support the option to display account balance, configurable in advanced options, disabled by default
[API Market] Video generation added Minimax's text-to-video
[API Market] Image generation added artistic QR code generation, from 302.AI
[Drawing-bot] Midjourney price reduced to 50% of the original
Existing Problems
Slice problem one: Sentence interruption
If we only consider the number of characters when slicing, cutting every 100 characters, it's very likely that sentences will be cut off in the middle. For example:
2024.9.5
[Omni Toolbox] Now when debugging in API Market, there's no need to manually enter API Key, the system will automatically fill it in
[Tools Market] AI Document Editor now supports one-click generation of extra-long documents and in-interface AI chat function
[Chat-bot] Now supports pplx-8b-online, pplx-70b-online, pplx-405b-online, which focuses on providing useful, up-to-date, and accurate responses, from Perplexity
[API Market] Now supports pplx-8b-online, pplx-70b-online, pplx-405b-online, from Perplexity
[API Market] DeepL added an
interface for translating into any language
[Help Center] Updated several commonly used tool integration tutorials, such as Lobe-Chat, Immersive Translation, etc.
2024.9.3
[Management Backend] Omni Toolbox and Chatbot now support the option to display account balance, configurable in advanced options, disabled by default
[API Market] Video generation added Minimax's text-to-video
[API Market] Image generation added artistic QR code generation, from 302
.AI
[Drawing-bot] Midjourney price reduced to 50% of the original
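Here the text has been cut in the middle of a sentence twice: "DeepL added an" is separated from "interface for translating into any language", and "302" is separated from ".AI".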
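The character-count slicing described above can be sketched in a few lines. This is a minimal illustration, not 302.AI's implementation; the function name chunk_by_chars and the size of 27 are assumptions chosen only so that the cut lands where it does in the example.

```python
def chunk_by_chars(text: str, size: int) -> list[str]:
    """Cut text every `size` characters, ignoring sentence and line boundaries."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


# One changelog entry from the source text.
line = "[API Market] DeepL added an interface for translating into any language"

# A size of 27 is hypothetical; any fixed size will eventually land mid-sentence.
for piece in chunk_by_chars(line, 27):
    print(repr(piece))
```

Because the function only looks at character positions, nothing stops a boundary from falling between "an" and "interface".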
Slice problem two: Paragraph interruption
Sentence interruption is a basic problem that is easy to solve: before cutting, we can check punctuation marks to determine whether the sentence is complete. Paragraph interruption is more complex. In the previous example, even if sentence completeness is respected, the text might still be divided into two parts:
2024.9.5
[Omni Toolbox] Now when debugging in API Market, there's no need to manually enter API Key, the system will automatically fill it in
[Tools Market] AI Document Editor now supports one-click generation of extra-long documents and in-interface AI chat function
[Chat-bot] Now supports pplx-8b-online, pplx-70b-online, pplx-405b-online, which focuses on providing useful, up-to-date, and accurate responses, from Perplexity
[API Market] Now supports pplx-8b-online, pplx-70b-online, pplx-405b-online, from Perplexity
[API Market] DeepL added an interface for translating into any language
[Help Center] Updated several commonly used tool integration tutorials, such as Lobe-Chat, Immersive Translation, etc.
2024.9.3
[Management Backend] Omni Toolbox and Chatbot now support the option to display account balance, configurable in advanced options, disabled by default
[API Market] Video generation added Minimax's text-to-video
[API Market] Image generation added artistic QR code generation, from 302.AI
[Drawing-bot] Midjourney price reduced to 50% of the original
This seems fine, but suppose the user asks:
What is the update on September 5th?
In that case, only the first part will be retrieved, because the second part does not contain any information related to September 5th. As a result, the content "[Help Center] Updated several commonly used tool integrations..." in the second part would be missed.
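A sentence-aware slicer, and the paragraph problem it still leaves behind, can be sketched as follows. This is an assumption-laden illustration: split_units and pack_units are hypothetical names, the punctuation rule is deliberately simple (it would mis-split abbreviations inside a sentence), and the 100-character limit is only an example value.

```python
import re


def split_units(text: str) -> list[str]:
    """Split into complete units: a unit ends at . ! or ? followed by
    whitespace, or at a line break."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p and p.strip()]


def pack_units(units: list[str], max_chars: int) -> list[str]:
    """Greedily pack whole units into chunks of at most max_chars.
    A unit is never cut; an oversized unit becomes its own chunk."""
    chunks, current = [], ""
    for unit in units:
        candidate = unit if not current else current + "\n" + unit
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = unit
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# A shortened excerpt of the changelog, for illustration only.
changelog = "\n".join([
    "2024.9.5",
    "[API Market] DeepL added an interface for translating into any language",
    "[Help Center] Updated several commonly used tool integration tutorials",
    "2024.9.3",
    "[API Market] Video generation added Minimax's text-to-video",
])

chunks = pack_units(split_units(changelog), max_chars=100)
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---\n{chunk}")
```

No entry is cut in the middle, yet the Help Center entry ends up in a chunk that does not contain the "2024.9.5" line it belongs to, which is exactly the paragraph interruption described above.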
Solution
Solution 1: Set Adjacent Text Overlap
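In 302.AI, this setting is found under Advanced Settings in the knowledge base.
With this setting, adjacent slices share a certain amount of overlapping text. In the example below, the last four lines of the first slice are repeated at the start of the second slice: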
2024.9.5
[Omni Toolbox] Now when debugging in API Market, there's no need to manually enter API Key, the system will automatically fill it in
[Tools Market] AI Document Editor now supports one-click generation of extra-long documents and in-interface AI chat function
[Chat-bot] Now supports pplx-8b-online, pplx-70b-online, pplx-405b-online, which focuses on providing useful, up-to-date, and accurate responses, from Perplexity
[API Market] Now supports pplx-8b-online, pplx-70b-online, pplx-405b-online, from Perplexity
[API Market] DeepL added an interface for translating into any language
[Help Center] Updated several commonly used tool integration tutorials, such as Lobe-Chat, Immersive Translation, etc.
2024.9.3
[API Market] Now supports pplx-8b-online, pplx-70b-online, pplx-405b-online, from Perplexity
[API Market] DeepL added an interface for translating into any language
[Help Center] Updated several commonly used tool integration tutorials, such as Lobe-Chat, Immersive Translation, etc.
2024.9.3
[Management Backend] Omni Toolbox and Chatbot now support the option to display account balance, configurable in advanced options, disabled by default
[API Market] Video generation added Minimax's text-to-video
[API Market] Image generation added artistic QR code generation, from 302.AI
[Drawing-bot] Midjourney price reduced to 50% of the original
The following lines appear in both fragments:
[API Market] Now supports pplx-8b-online, pplx-70b-online, pplx-405b-online, from Perplexity
[API Market] DeepL added an interface for translating into any language
[Help Center] Updated several commonly used tool integration tutorials, such as Lobe-Chat, Immersive Translation, etc.
2024.9.3
If a user asks:
When was the integration tutorial updated?
Since the keyword "integration tutorial" appears in both fragments, both fragments will be recalled and given to the AI, which can then accurately determine the specific time.
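The overlap idea can be sketched as a sliding window. This is a minimal illustration under stated assumptions: the overlap here is counted in lines for readability (the actual 302.AI setting may be measured differently), chunk_with_overlap is a hypothetical name, and a plain substring match stands in for real retrieval.

```python
def chunk_with_overlap(units: list[str], per_chunk: int, overlap: int) -> list[list[str]]:
    """Sliding window over units; each chunk repeats the last `overlap`
    units of the previous chunk."""
    if per_chunk <= 0 or not 0 <= overlap < per_chunk:
        raise ValueError("need per_chunk > 0 and 0 <= overlap < per_chunk")
    step = per_chunk - overlap
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + per_chunk])
        if start + per_chunk >= len(units):
            break  # the last window already reaches the end
    return chunks


lines = [
    "2024.9.5",
    "[Omni Toolbox] Now when debugging in API Market, there's no need to manually enter API Key, the system will automatically fill it in",
    "[Tools Market] AI Document Editor now supports one-click generation of extra-long documents and in-interface AI chat function",
    "[Chat-bot] Now supports pplx-8b-online, pplx-70b-online, pplx-405b-online, which focuses on providing useful, up-to-date, and accurate responses, from Perplexity",
    "[API Market] Now supports pplx-8b-online, pplx-70b-online, pplx-405b-online, from Perplexity",
    "[API Market] DeepL added an interface for translating into any language",
    "[Help Center] Updated several commonly used tool integration tutorials, such as Lobe-Chat, Immersive Translation, etc.",
    "2024.9.3",
    "[Management Backend] Omni Toolbox and Chatbot now support the option to display account balance, configurable in advanced options, disabled by default",
    "[API Market] Video generation added Minimax's text-to-video",
    "[API Market] Image generation added artistic QR code generation, from 302.AI",
    "[Drawing-bot] Midjourney price reduced to 50% of the original",
]

# 8 lines per chunk with 4 lines of overlap reproduces the two fragments above.
chunks = chunk_with_overlap(lines, per_chunk=8, overlap=4)

# Stand-in for retrieval: which chunks mention the keyword?
hits = [i for i, c in enumerate(chunks) if any("integration tutorial" in ln for ln in c)]
print(len(chunks), hits)
```

With these values the first fragment carries the "2024.9.5" date together with the Help Center entry, and the keyword matches both fragments.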
Solution 2: Increase Slice Length
There's a saying, "Great effort yields miracles," and RAG is no exception. RAG was designed to reduce the burden on large models by shrinking the context they have to process. But models are becoming increasingly powerful, and some now have context lengths of up to 2M tokens (Gemini-1.5-pro), so it is possible to slice less finely, feed more content to the large model, and let the AI process it.
In 302.AI, this setting is found under Advanced Settings in the knowledge base.
When the AI's answer doesn't contain the information you're looking for, you can increase this parameter; 500, 1000, and 2000 are all possible values. With a large enough value, the example above would be cut into just one fragment, ensuring the AI doesn't miss any information:
2024.9.5
[Omni Toolbox] Now when debugging in API Market, there's no need to manually enter API Key, the system will automatically fill it in
[Tools Market] AI Document Editor now supports one-click generation of extra-long documents and in-interface AI chat function
[Chat-bot] Now supports pplx-8b-online, pplx-70b-online, pplx-405b-online, which focuses on providing useful, up-to-date, and accurate responses, from Perplexity
[API Market] Now supports pplx-8b-online, pplx-70b-online, pplx-405b-online, from Perplexity
[API Market] DeepL added an interface for translating into any language
[Help Center] Updated several commonly used tool integration tutorials, such as Lobe-Chat, Immersive Translation, etc.
2024.9.3
[Management Backend] Omni Toolbox and Chatbot now support the option to display account balance, configurable in advanced options, disabled by default
[API Market] Video generation added Minimax's text-to-video
[API Market] Image generation added artistic QR code generation, from 302.AI
[Drawing-bot] Midjourney price reduced to 50% of the original
However, note that the longer the slice, the more context the AI has to process, and costs increase accordingly. Finding the right balance requires testing gradually with your own data.
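The trade-off can be made concrete with some rough arithmetic. The numbers below are assumptions for illustration only: a document of about 1,100 characters (roughly the size of the changelog excerpt) and a hypothetical retriever that passes up to 3 fragments to the model.

```python
import math


def chunk_count(text_len: int, chunk_size: int) -> int:
    """Number of fragments when cutting text_len characters into
    pieces of at most chunk_size characters."""
    return math.ceil(text_len / chunk_size)


def max_context_chars(chunk_size: int, top_k: int) -> int:
    """Upper bound on characters sent to the model when top_k fragments
    are retrieved (actual usage also depends on the prompt and the model)."""
    return chunk_size * top_k


TEXT_LEN = 1100  # assumed document length in characters
TOP_K = 3        # hypothetical number of fragments passed to the model

for size in (500, 1000, 2000):
    print(size, chunk_count(TEXT_LEN, size), max_context_chars(size, TOP_K))
```

Larger slices mean fewer fragments and less risk of splitting related lines, but the upper bound on context per question grows in proportion to the slice length.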
Solution 3: Coarse Slicing + Fine Slicing
Sometimes, when slices are too coarse, they put higher demands on the model, and less capable models may struggle to find the key points. In this case, we can slice the file twice: once coarsely and once finely. When the model retrieves, it recalls text at both granularities, like a page of a book with the key points highlighted.
For fine slicing, we recommend semantic slicing (provided by Jina.ai), which identifies paragraph boundaries and similar structure more reliably.
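The two-pass idea can be sketched as an index that stores the same text at two granularities. This is a simplified stand-in, not the 302.AI or Jina.ai implementation: fine slicing is done per line instead of by semantics, keyword counting replaces vector retrieval, and build_index and retrieve are hypothetical names.

```python
def build_index(text: str, coarse_lines: int) -> list[tuple[str, str]]:
    """Index the same text twice: coarse chunks of several lines,
    plus one fine chunk per line."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    index = []
    for start in range(0, len(lines), coarse_lines):
        index.append(("coarse", "\n".join(lines[start:start + coarse_lines])))
    for ln in lines:
        index.append(("fine", ln))
    return index


def retrieve(index: list[tuple[str, str]], query: str, top_k: int = 2) -> list[tuple[str, str]]:
    """Return up to top_k matching chunks per granularity, ranked by how
    many query words they contain (a stand-in for embedding similarity)."""
    terms = set(query.lower().split())

    def score(entry: tuple[str, str]) -> int:
        return sum(1 for t in terms if t in entry[1].lower())

    results = []
    for level in ("coarse", "fine"):
        ranked = sorted((e for e in index if e[0] == level), key=score, reverse=True)
        results.extend(e for e in ranked[:top_k] if score(e) > 0)
    return results


# A shortened excerpt of the changelog, for illustration only.
changelog = "\n".join([
    "2024.9.5",
    "[API Market] DeepL added an interface for translating into any language",
    "[Help Center] Updated several commonly used tool integration tutorials",
    "2024.9.3",
    "[API Market] Video generation added Minimax's text-to-video",
])

results = retrieve(build_index(changelog, coarse_lines=3), "DeepL interface")
for level, chunk in results:
    print(f"[{level}]\n{chunk}\n")
```

The coarse hit supplies the surrounding context, including the date line, while the fine hit points at the exact entry, which is the "page with highlighted points" effect described above.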
In 302.AI, this setting is found under Advanced Settings in the knowledge base.
Summary
This article introduces only a few simple data processing methods. In production environments, the quality of the data itself is just as important: if the data is ambiguous or its paragraphs are jumbled, slicing optimizations alone won't solve the problem. Ultimately, you'll need to try different approaches with your own data to find what works best for you.