Part 2: The (Even More) Final Problem: A Story of NLP-Generated Relationship Graphs

A series of blog posts by Lucas Zurbuchen

Part 2: Crafting a Complex Network with NLP

In the first part, we saw how to collect the relevant data for extracting entities from the available documents. Now let's look at how NLP connects those entities into a complex network.

There is both bad and good news about using extracted events as the base data for our relationship graph (RG). The bad news is that an extracted event by itself doesn't reveal any relationships. The good news is that recent advances in Large Language Models (LLMs) and GenAI can deliver higher quality, speed, and flexibility in entity extraction than was possible before. The wide variety of prompts GenAI can answer accurately, plus its multilingual capabilities, makes it a great fit for entity extraction. GenAI can be slower than non-AI NLP tools, but Herlock mitigates some of this by currently using faster LLMs at the expense of having to write simpler prompts. At the current pace of LLM progress, though, speed may soon stop being much of an issue.

Using GenAI, Herlock can examine the extracted data to pull out the information in the RG's ontology (the set of categories and relations in the network). This includes people, organizations, locations, and the relationships between these categories in each exchange. GenAI is also useful for collecting more detailed information about exchanges between entities, such as how the people felt about them, what they were speaking about, and how they were carried out.
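A minimal sketch of what this extraction step could look like. The prompt template, category names, and JSON schema below are assumptions for illustration, not Herlock's actual implementation, and the model reply is stubbed in place of a real LLM API call:

```python
import json

# Hypothetical ontology categories (Herlock's real schema may differ).
CATEGORIES = {"person", "organization", "location"}

# Hypothetical extraction prompt asking the model for structured JSON.
EXTRACTION_PROMPT = """Extract entities and relationships from the text below.
Return JSON of the form:
{{"entities": [{{"name": "...", "category": "..."}}],
  "relations": [{{"source": "...", "type": "...", "target": "..."}}]}}
Text: {text}"""

def parse_extraction(llm_output: str):
    """Parse the model's JSON reply into entity and relation sets."""
    data = json.loads(llm_output)
    entities = {(e["name"], e["category"])
                for e in data["entities"]
                if e["category"] in CATEGORIES}  # drop off-ontology categories
    relations = {(r["source"], r["type"], r["target"])
                 for r in data["relations"]}
    return entities, relations

# Stubbed model reply standing in for a real LLM call:
prompt = EXTRACTION_PROMPT.format(text="John Doe emailed Acme Corp...")
reply = ('{"entities": [{"name": "John Doe", "category": "person"},'
         ' {"name": "Acme Corp", "category": "organization"}],'
         ' "relations": [{"source": "John Doe", "type": "emailed",'
         ' "target": "Acme Corp"}]}')
entities, relations = parse_extraction(reply)
```

In a real pipeline the reply would come from whichever LLM API is in use, with retries for malformed JSON.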

Collecting this information isn't too difficult, but it requires good prompt engineering. Specifically, it means knowing how to balance what you want as an answer against what GenAI is trained to answer. Micromanaging GenAI in your prompt can degrade performance, because each additional rule you impose risks the model having less training data to back up its answer. So, despite its current abilities, there is a limit to what GenAI can reliably do for you, and knowing that limit lets you craft prompts that get the best answers.

After collecting all of the entities and relationships that shape our ontology, we return the graph with the sets of extracted entities and relations.

Now, we can merge entities that represent the same thing, using both the newly extracted entities and the relationship graph built from all other documents. This prevents, say, "John Doe" and the acronym "JD" from appearing as different entities in the RG even though they refer to the same person. Usually, we can carry out this step reliably without LLMs, using well-known string-distance algorithms such as Hamming or Levenshtein distance. For locations, we can also use geocoding APIs to tell whether two places are the same. However, this doesn't work for everything, so we use GenAI as a safety net: it double-checks the remaining entities and returns any matches the simpler methods missed.
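The string-distance check above can be sketched in a few lines. This is the textbook dynamic-programming Levenshtein distance; the merge threshold of 2 edits is an illustrative assumption, not a value from Herlock:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def likely_same_entity(a: str, b: str, max_dist: int = 2) -> bool:
    """Treat near-identical names (e.g. typos) as the same entity."""
    return levenshtein(a.lower(), b.lower()) <= max_dist
```

Here "Jon Doe" vs. "John Doe" is one insertion apart and would merge, while "JD" vs. "John Doe" is far apart in edit distance and would fall through to the GenAI safety net described above.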

After collecting all sets of matches across our ontology's categories, it's just a matter of updating the RG so that all of the graph's relations point to a single entity among those that represent the same thing. Which one of these entities, you may ask? Since this can be a very nuanced question, it's a perfect task for GenAI as well, showing how far we can leverage LLMs to create an effective RG of a case.

The step of weighting entities isn't as complex as entity extraction or merging, but it's essential to representing the graph in a logical and visually appealing way. Weights are added to the edges (based on how frequently that relation arises in the text) so that thicker edges in the RG signify stronger relationships. At the same time, weights are added to the nodes (based on the sum of all outgoing edge weights) to display the relative importance of each entity in the case. Lastly, if there is more than one type of relationship between two entities, we can use GenAI to find their common theme, which keeps edge labels in the visualization from becoming too long.
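The weighting scheme described above can be sketched directly from relation mentions. Counting edge weight per (source, target) pair is a simplifying assumption here; the relation-type themes would be handled separately by GenAI as noted:

```python
from collections import Counter

def weight_graph(relation_mentions):
    """Edge weight = mention count; node weight = sum of outgoing edge weights."""
    edge_weights = Counter((src, dst) for src, _, dst in relation_mentions)
    node_weights = Counter()
    for (src, dst), w in edge_weights.items():
        node_weights[src] += w  # outgoing edges only, per the scheme above
    return edge_weights, node_weights

mentions = [("John Doe", "emailed", "Acme Corp"),
            ("John Doe", "emailed", "Acme Corp"),
            ("John Doe", "visited", "London")]
edges, nodes = weight_graph(mentions)
```

With these mentions, the John Doe to Acme Corp edge carries weight 2 and the John Doe node carries weight 3, so it renders larger than the others.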

We now have a presentable graph, but depending on the case and the number of documents, it can be far too complicated to navigate. Next week we'll add some hierarchical concepts that solve this problem.
