Context matters when using Large Language Models like GPT-4 and Claude, especially when discussing specialized topics. The key to effective prompting often lies in Retrieval Augmented Generation (RAG), where content, such as an SEC filing, is broken down into manageable text chunks. These chunks are then converted into vector embeddings for easier retrieval. Traditional splitting methods, which segment text by token count or sentence, often fall short, leaving developers to craft their own labor-intensive solutions.
Enter Neum AI's new feature: context-aware text splitting. This feature allows for custom strategies that better suit specific documents. It's a game-changer for consistent datasets like templated contracts or user-uploaded files, enhancing both retrieval quality and overall application performance.
In this blog, we will showcase how the text splitter works and share a tutorial to start using it. We will introduce neumai-tools, an open-source Python module with tools to pre-process documents, and show how context-aware text splitting is available inside the Neum AI pre-processing playground.
How does it work?
We start with a collection of documents that generally follow a given template, like contracts or FAQs. We will generate a splitting strategy that we can apply across all of the documents. The goal is that the strategy provides a better result than blindly splitting by sentence or number of tokens.
We will take a couple of the documents to use as a sample. Given that the documents are similar, we can pick any two that are a good approximation, or even use a template if one exists (e.g. a master contract or spec template). Once we have those, we can use LLMs to analyze the documents and help generate a strategy.
We will use a multi-shot prompt system to ensure that we apply our thinking across different steps and yield the best result possible. As a pre-processing step, removing any covers, tables of contents, or abstracts helps ensure we focus our analysis on the meatiest parts of the document.
Chunking strategy
For the first prompt we generate a strategy to split the documents. This is the most expensive and time-consuming step, so we want to use a high-quality model that can analyze the documents and provide a good approximation. The output of this step is a high-quality outline of the steps to take, along with any obvious markings or formatting that we can parse against.
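To make this concrete, here is a minimal sketch of what the strategy prompt could look like. The prompt wording, the use of the OpenAI client, and the gpt-4 model choice are illustrative assumptions, not the exact internals of neumai-tools.

```python
# Sketch: ask a strong model to propose a chunking strategy for a sample document.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_chunking_strategy(sample_text: str, model: str = "gpt-4") -> str:
    """Ask a high-quality model to outline a splitting strategy for the sample."""
    prompt = (
        "Analyze the following document and describe a strategy for splitting it "
        "into self-contained chunks. Call out any headings, numbering, or other "
        "formatting markers that can be parsed programmatically.\n\n"
        f"{sample_text}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```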
Chunking code
Once we have the chunking strategy established, we then use a second prompt to help generate the code to be applied to the text. With this code, we can easily run subsequent documents through the same set of transformations.
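A hedged sketch of that second prompt, again using an illustrative OpenAI call; the prompt text is an assumption, while split_text_into_chunks mirrors the function name the tooling produces.

```python
# Sketch: turn the chunking strategy into runnable splitting code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_chunking_code(strategy: str, sample_text: str, model: str = "gpt-4") -> str:
    """Ask the model to translate the strategy into a Python splitting function."""
    prompt = (
        "Write a Python function named split_text_into_chunks(text) that applies "
        "the following chunking strategy and returns a list of string chunks:\n\n"
        f"{strategy}\n\n"
        "Here is a sample document to design against:\n\n"
        f"{sample_text}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```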
Chunking runtime
After the code is generated, we check that it is correct and runnable. If it has any issues, we can regenerate or fix it.
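A minimal sketch of that check, assuming the generated code defines split_text_into_chunks as described above; the decision to regenerate is left to the caller.

```python
# Sketch: verify that the generated code loads and can split the sample text.
def validate_chunking_code(code: str, sample_text: str) -> bool:
    """Return True if the generated code runs and produces at least one chunk."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # compile and load the generated function
        chunks = namespace["split_text_into_chunks"](sample_text)
        return isinstance(chunks, list) and len(chunks) > 0
    except Exception:
        return False  # signal the caller to regenerate or fix the code
```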
Example outputs
The end result of the process is a piece of code that we can use to split up text documents that follow a similar structure. For example, this is what the process yields for a couple of sample documents:
Q&A Documents
For this case, we have a document organized into questions and answers. The smart splitter identified the format and divides the document so that each question and its answer stay together in the same chunk.
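As an illustration (not the verbatim generated output), code for this kind of document might look like the sketch below, assuming questions and answers are prefixed with Q: and A: markers.

```python
# Illustrative sketch: keep each question together with its answer in one chunk.
import re

def split_text_into_chunks(text: str) -> list[str]:
    """Split on question markers so each chunk is one question-answer pair."""
    pairs = re.split(r"\n(?=Q:)", text)
    return [pair.strip() for pair in pairs if pair.strip()]
```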
Contracts (ex. SAFE)
For this case, we have a standard SAFE contract. The smart splitter identified the format and generated several regular expressions to identify sections, paragraphs, and sentences within the text, then generates chunks out of them.
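Again purely as an illustration, generated code for a contract might split on numbered section headings; the actual regular expressions depend on the sample document.

```python
# Illustrative sketch: split a contract into chunks at numbered section headings.
import re

def split_text_into_chunks(text: str) -> list[str]:
    """Split on numbered section headings (e.g. '1. ', '2. ') into chunks."""
    sections = re.split(r"\n(?=\d+\.\s)", text)
    return [section.strip() for section in sections if section.strip()]
```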
You can try out the smart chunker yourself directly on the Neum AI pre-processing playground by choosing it in the text splitting section.
Integrating smart splitting into your flow
To get started, we will need to install the pip package for neumai-tools. This package includes several utilities for pre-processing documents as part of a RAG data pipeline. We will also install langchain and unstructured[all-docs] to use them in our examples.
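The install step looks roughly like this, using the package names mentioned above:

```
pip install neumai-tools langchain "unstructured[all-docs]"
```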
Once installed, we can implement code that leverages the semantic_chunking_code and semantic_chunking utilities:
- semantic_chunking_code: Outputs the code generated by the system based on a sample piece of a document (up to 2000 tokens). The code comes out ready to be executed, wrapped in a function called split_text_into_chunks.
- semantic_chunking: Takes as input the generated code and the full set of documents to be split by it. It outputs a list of Document objects that contain the text chunks and can be used to generate embeddings.
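Here is a minimal sketch of generating the splitter code from a sample document; the import path (neumai_tools) and the exact function signature are assumptions based on the descriptions above.

```python
# Sketch: generate splitter code from a representative sample document.
from neumai_tools import semantic_chunking_code

# Read a representative sample (kept under ~2000 tokens) to base the strategy on.
with open("sample_contract.txt") as f:
    sample_text = f.read()

# Returns Python source that defines split_text_into_chunks.
splitter_code = semantic_chunking_code(sample_text)
print(splitter_code)
```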
We now have the splitter_code generated from the sample text we provided. We can take that splitter code and apply it across other pieces of text or documents. In this case we will use LangChain loaders to get the text out of a document and pass it on to semantic_chunking.
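A sketch of that flow, assuming LangChain's UnstructuredFileLoader and the semantic_chunking signature described above; splitter_code carries over from the previous step.

```python
# Sketch: apply the generated splitter code to another document of the same shape.
from langchain.document_loaders import UnstructuredFileLoader
from neumai_tools import semantic_chunking

# Load the raw text of another document that follows the same template.
documents = UnstructuredFileLoader("another_contract.pdf").load()
texts = [doc.page_content for doc in documents]

# splitter_code comes from the previous step (the semantic_chunking_code output).
chunks = semantic_chunking(splitter_code, texts)
```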
This code returns a list of Document objects with the generated chunks.
Conclusion
Pre-processing continues to be a key step in creating great generative AI applications that are grounded on your own data. We believe that by leveraging intelligence we can simplify pre-processing while increasing the quality of the results at scale. Simple tools like the ones above can help steer you in that direction. Please share any feedback you have as you try out these methods.
Outside of pre-processing, scaling data pipelines for vector embeddings continues to be a challenge. As you move past initial experimentation, check out Neum AI as a platform to help you scale your applications while keeping quality up and latency and cost down. Neum AI provides access through the platform to capabilities like context-aware text splitting and more. Stay tuned to our social media (Twitter and LinkedIn) and Discord for more updates.