How to use AI for searching and grouping unstructured data
Imagine you have a lot of unstructured data, be it business documents, social media interactions, scanned communications, or videos. You want to reliably search and group similar content. Or you need to extract specific information to enrich your database. In either case, semantic embeddings can help you reach your goals and go a step further than keyword-based methods.
Fast and reliable access to information is of paramount importance for all organizations. Without it, processes and communication suffer or fail altogether.
Organizations realized this already at the beginning of the computer era in the 1980s, when the popularity of relational databases grew significantly thanks to efficient querying, indexing, and more affordable storage. Relational databases, which store data in a set of related tables, are the de facto standard for storing and accessing information. The tabular format is often sufficient to capture a business's critical data, such as information on orders or customers. With the help of query languages such as SQL or Datalog, one can query the database in a structured manner to access the desired information.
The languages used for setting up and searching relational data sources are well optimized, easily accessible, and fast. One could think that searching for desired information is a solved problem. There is one issue, however: an estimated 80-90% of data is unstructured. Examples include document collections such as invoices, records, and emails, or rich media like photos and videos. For such data, the relational model is no longer suitable, and reliable search becomes much harder or even impossible. Yet this kind of data holds salient information that can be used for:
- imposing structure on the unstructured data dump
- process optimization (e.g., customer support)
- compliance assurance
- enriching the tabular data
and much more.
In this post, you will learn to tackle the searching and grouping of unstructured text documents. However, a big chunk of the methodology is similar for different data types. We will cover more topics around unstructured documents in the upcoming articles.
Classic approach: lexical search
In a nutshell, lexical search ranks the documents in a collection by how many words they share with the query: the larger the query-document overlap, the higher the rank. There are some bells and whistles, such as query filtering and augmentation, that often improve the results, but we will not get into them in this article. If you are interested, there are plenty of great resources available.
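As a sketch, a bare-bones lexical ranker can be as simple as counting shared words between the query and each document. Real engines use weighted schemes such as TF-IDF or BM25; the function names here are our own:

```python
def lexical_score(query, document):
    """Count how many distinct query words appear in the document."""
    return len(set(query.lower().split()) & set(document.lower().split()))

def lexical_rank(query, documents):
    """Order documents by descending word overlap with the query."""
    return sorted(documents, key=lambda d: lexical_score(query, d), reverse=True)

docs = [
    "quarterly revenue report",
    "contract between technology companies",
    "binding agreement between Apple and Samsung",
]
ranked = lexical_rank("contract between technology companies", docs)
# The exact-match document wins; the Apple/Samsung agreement barely registers,
# sharing only the word "between" with the query.
```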
What’s the problem with lexical search?
The most significant issue with this approach is the inability to issue queries in natural language. The results rely on exact word matches, so semantically similar documents that use different wording will not be returned.
Let’s look at an example. Say that we want to look for a contract between some parties and issue the query “contract between technology companies”. If the document we’re looking for is titled “binding agreement between Apple and Samsung”, the lexical search might fail: the two phrases share hardly any words.
Expressing the same thing in different words is also challenging for classical search engines. This can lead to situations where the search engine yields completely different results for semantically equivalent queries.
Image search and query misunderstandings stemming from word order are two other prevalent issues with classic search engines.
So how can we tackle the problems described above? The answer is semantic embeddings!
Semantic embeddings for search and grouping
The main idea underlying semantic search is simple: represent text as a list of numbers, i.e. a vector. This vector is called a semantic embedding. Conceptually, each number in the vector stands for some characteristic of the text. This then allows you to compare different embeddings with each other using a distance measure.
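Concretely, embeddings are typically compared with cosine similarity. A minimal sketch, using tiny hand-made vectors rather than real model outputs:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy three-dimensional "embeddings"; real ones have hundreds of dimensions.
v_contract = [0.9, 0.1, 0.3]
v_agreement = [0.8, 0.2, 0.25]   # semantically close to v_contract
v_recipe = [-0.7, 0.9, -0.2]     # unrelated content
```

With these vectors, `v_contract` is far more similar to `v_agreement` than to `v_recipe`, which is exactly the behaviour lexical search lacked.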
Let’s look at an example. Say we have vectors that represent product reviews. Each vector has two dimensions: sentiment and informativeness. Now let’s say we want to look for critical yet informative reviews to improve our product. We can then query for vectors with low sentiment and high informativeness.
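With such two-dimensional toy vectors, that query is just a filter on the two dimensions. The review texts, their scores, and the thresholds below are all made up for illustration:

```python
# Hypothetical review embeddings: (sentiment, informativeness), both in [0, 1].
reviews = {
    "Love it, five stars!": (0.95, 0.10),
    "Battery dies after two hours of video playback": (0.20, 0.90),
    "Meh": (0.40, 0.05),
    "Screen flickers at low brightness settings": (0.15, 0.85),
}

def critical_informative(reviews, max_sentiment=0.3, min_informativeness=0.7):
    """Select reviews with low sentiment and high informativeness."""
    return [text for text, (sent, info) in reviews.items()
            if sent <= max_sentiment and info >= min_informativeness]

selected = critical_informative(reviews)
# Only the two critical-but-detailed reviews survive the filter.
```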
This approach can be easily extended to a query-based search. Given a vector representation of a query (or even a document), we can look for the most similar vector representation of a document from our document collection.
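In code, query-based search reduces to a nearest-neighbour lookup among the document vectors. A brute-force sketch with toy vectors follows; large collections would use an approximate index such as FAISS or Annoy instead:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def most_similar(query_vec, doc_vecs):
    """Index of the document embedding closest to the query embedding."""
    return max(range(len(doc_vecs)), key=lambda i: cosine(query_vec, doc_vecs[i]))

# Toy 2-D embeddings standing in for real model outputs.
doc_vecs = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]]
best = most_similar([1.0, 0.0], doc_vecs)
# best == 0: the first document points in almost the same direction as the query.
```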
Naturally, we can re-use this approach for grouping. Given a collection of vectors, we can use one of many clustering algorithms to find groups, such as customer segments, within the data.
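As a sketch, here is a minimal k-means pass over toy vectors; in practice you would reach for a library implementation such as scikit-learn's KMeans rather than rolling your own:

```python
import math
import random

def kmeans(vectors, k, iterations=20, seed=0):
    """Minimal k-means: assign points to the nearest centroid, then recompute."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iterations):
        # Assignment step: each vector joins its closest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: math.dist(v, centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster went empty
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return clusters

# Two well-separated toy groups, e.g. two customer segments.
points = [[0.1, 0.0], [0.0, 0.2], [5.0, 5.1], [5.2, 4.9]]
clusters = kmeans(points, k=2)
# The two points near the origin end up together, as do the two far ones.
```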
Note that we cannot (easily) interpret each dimension of the vector we obtain from a real model.
Still, pretty neat, right?
The question you might have now is how we get these semantic embeddings. Glad you asked!
The embeddings can be obtained using sentence-encoder models that return a fixed-size vector, usually with dimensionality between 128 and 1024. These are most often variations of the deep learning language model called BERT. But don’t worry if that doesn’t tell you much.
The sentence-encoder models are optimized with a straightforward objective. Imagine you have a lot of similar sentence pairs. Given one sentence from a pair, the model should predict which sentence, out of a set of randomly sampled candidates, was paired with it in our dataset. This way the model learns semantic similarity! If you want to learn more, there are great publications on this topic.
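A simplified illustration of that objective: turn the anchor sentence's similarities to all candidates into probabilities with a softmax. Training then pushes the true pair's probability toward 1; the similarity numbers below are made up:

```python
import math

def softmax(scores):
    """Convert raw similarity scores into probabilities summing to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Similarities of one anchor sentence to: [its true pair, negative, negative].
similarities = [0.9, 0.1, -0.3]
probs = softmax(similarities)
# A well-trained encoder assigns the highest probability to the true pair.
```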
Even though semantic embeddings are extremely useful, they still suffer from limitations.
Notably, very short and very long documents might not yield accurate embeddings. The former contain too little information, while the latter have the opposite problem: long documents might cover many different topics, which muddles the resulting vector. One remedy is to process the text in chunks and then smartly combine the resulting vectors.
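A simple version of that remedy splits the text into fixed-size word chunks and averages the chunk embeddings. Mean pooling is just one of several possible combination strategies:

```python
def chunk_words(text, chunk_size=128):
    """Split a long document into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def mean_pool(vectors):
    """Combine per-chunk embeddings by averaging each dimension."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

chunks = chunk_words("one two three four five", chunk_size=2)
# chunks: ["one two", "three four", "five"]
# Pretend each chunk was embedded by a sentence encoder, then pool:
pooled = mean_pool([[1.0, 3.0], [3.0, 5.0]])
# pooled: [2.0, 4.0]
```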
Another limitation is speed. Sentence-encoding models rely on transformer architectures that need GPUs for fast inference. For larger datasets, one must ensure appropriate resources are available. One should also consider a hybrid of lexical and semantic search; sometimes it is the only feasible option.
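A hybrid ranker can be as simple as a weighted blend of the two scores. The blending weight and the assumption that both score lists are already normalized to [0, 1] are ours:

```python
def hybrid_rank(docs, lexical_scores, semantic_scores, alpha=0.5):
    """Rank documents by a weighted blend of lexical and semantic scores.

    Assumes both score lists are normalized to [0, 1];
    alpha controls how much weight the lexical side gets.
    """
    combined = [alpha * lex + (1 - alpha) * sem
                for lex, sem in zip(lexical_scores, semantic_scores)]
    order = sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)
    return [docs[i] for i in order]

docs = ["doc A", "doc B", "doc C"]
# doc A wins lexically, doc B semantically; the blend decides the final order.
ranked = hybrid_rank(docs, [1.0, 0.0, 0.5], [0.2, 0.9, 0.6])
# ranked: ["doc A", "doc C", "doc B"]
```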
In this post, we covered the motivation for processing unstructured data, the methodology and weaknesses of the classical approach, and semantic embeddings together with their limitations.
Semantic embeddings are a universal tool that can enhance performance in search, clustering, and other tasks not covered here, such as summarization.
At Zenit, we make extensive use of the latest tools for dealing with unstructured data. If your business wants to extract knowledge from unstructured data, please reach out and we will figure it out together!