Knowledge graphs have been taking the world of data science and engineering by storm, and for good reason. They make your data more transparent and connected, and they can improve the accuracy of machine learning pipelines. This blog post walks through some example use cases of graphs in the data space.
For the uninitiated, I’d like to give a brief introduction to the world of graphs. In essence, knowledge graphs describe datapoints and the relations between them. This may seem simple, and on the surface it really is. But if we think about it a little more, we see that it lets us get much more out of our data. Instead of just having isolated datapoints, we get a bunch of free information about any datapoint from its neighbourhood in the graph.
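To make the idea of a neighbourhood concrete, here is a minimal sketch that stores a graph as (subject, relation, object) triples and looks up everything attached to one node. The names and relations are made up for illustration and are not data from this post.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, relation, object); not data from this post.
TRIPLES = [
    ("Alice", "WORKS_AT", "Acme"),
    ("Bob", "WORKS_AT", "Acme"),
    ("Acme", "LOCATED_IN", "Ghent"),
]


def build_neighbourhoods(triples):
    """Map every node to the set of (relation, neighbour) pairs it touches."""
    neighbours = defaultdict(set)
    for subject, relation, obj in triples:
        neighbours[subject].add((relation, obj))
        # Store the reverse direction too, so lookups work from either end.
        neighbours[obj].add((f"inverse:{relation}", subject))
    return neighbours


graph = build_neighbourhoods(TRIPLES)
# Everything we learn about "Acme" for free, just from its neighbourhood.
print(sorted(graph["Acme"]))
```

A single lookup on "Acme" returns both its location and the people who work there, which is the kind of free context that isolated rows in a table do not give us.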
Generally, whenever we are working with data, we are working with a LOT of data, and it’s difficult to get a clear overview. Summary statistics can be of great help in cases such as this, but they are not always sufficient, nor as expressive as we would like. This is where graph visualisations can be of great value. Think for example about how you would give a visual summary of a supply chain. A graph-like representation would be very intuitive here.
It’s not only supply chains that can benefit from the graph. Let’s look at an example of exploratory free-text analysis via the graph. Imagine you are presented with 10 000 articles and you want to extract the most important insights. Reading every single one of them is clearly not an option. Maybe we could use some automatic summarisation procedure. But, generally, the quality of such summaries is less than perfect, and we would still have to read 10 000 of them. We would like a more automatic option that can quickly give basic insights. This is where named entity linking enters the picture.
Let’s examine how this works via a simple example. Imagine we want to process the following sentence:
We can use an AI algorithm to process this to:
Here, we see the algorithm extract 3 entities: The Eiffel Tower, Paris and France. We also find 3 Wikidata ids (the Q-codes to the right) that link to additional information about the entities. For example, for France we find:
So, now as part of our pipeline, we automatically find that France is a country, which could prove very useful if we specifically want to extract information about countries.
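As a rough sketch of that filtering step, the snippet below keeps only the entities of a wanted type. Both tables are hand-written stand-ins: nothing is fetched from Wikidata, the type labels are simplified (Wikidata usually stores several "instance of" values per entity), and the function name is our own.

```python
# Hand-written stand-in for an entity linker's output. The ids are the
# Wikidata ids of these three entities; nothing is looked up live.
LINKED_ENTITIES = [
    {"mention": "Eiffel Tower", "wikidata_id": "Q243"},
    {"mention": "Paris", "wikidata_id": "Q90"},
    {"mention": "France", "wikidata_id": "Q142"},
]

# Simplified, hand-typed "instance of" labels per id (an assumption for
# illustration; the real Wikidata entries are richer than this).
INSTANCE_OF = {
    "Q243": {"tower"},
    "Q90": {"city"},
    "Q142": {"country"},
}


def entities_of_type(entities, instance_of, wanted_type):
    """Return the mentions whose linked id carries the wanted type label."""
    return [
        entity["mention"]
        for entity in entities
        if wanted_type in instance_of.get(entity["wikidata_id"], set())
    ]


countries = entities_of_type(LINKED_ENTITIES, INSTANCE_OF, "country")
print(countries)
```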
With this procedure, we extract all the entities and their respective Wikidata ids from the articles and link them to the article. Our general graph structure looks like the following:
where the central node is an article and all the other nodes are entities. Here, we already annotated the entities that are recognised as ‘person’ by their Wikidata id in yellow.
With this, we could already get some summary statistics. For example, we can now already answer the question: “In how many articles do certain people or companies occur?” But for this, we didn’t really need the graph yet. We’d like to go further in our analysis.
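Such a count needs nothing more than the article-to-entity links. A minimal sketch, with placeholder article and entity names:

```python
from collections import Counter

# Hypothetical article -> entity links; all names are placeholders.
ARTICLE_ENTITIES = {
    "article-1": ["Person A", "Company X", "Person A"],
    "article-2": ["Person A", "Person B"],
    "article-3": ["Company X"],
}


def article_counts(article_entities):
    """Count in how many distinct articles each entity occurs."""
    counts = Counter()
    for entities in article_entities.values():
        # set() makes sure an entity counts once per article,
        # even when it is mentioned several times.
        counts.update(set(entities))
    return counts


counts = article_counts(ARTICLE_ENTITIES)
print(counts.most_common())
```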
If two entities occur in the same article, we link them via a ‘co-occurrence’ relationship. We then examine only the ‘person’ entities and the article co-occurrences between them. This yields an interesting subgraph.
However, it is still very difficult to gain insights from this representation. We run some graph algorithms on this mess to make it clearer. All graph algorithms that we use are implemented natively in the Neo4j Graph Data Science (GDS) library.
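The co-occurrence step can be sketched in a few lines of plain Python. The article and person names are placeholders, and weighting each edge by the number of shared articles is our own choice for the illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical article -> 'person' entity links; all names are placeholders.
ARTICLE_PERSONS = {
    "article-1": ["Person A", "Person B", "Person C"],
    "article-2": ["Person A", "Person B"],
    "article-3": ["Person D"],
}


def co_occurrence_edges(article_persons):
    """Return {(person, person): number of articles in which both occur}."""
    weights = Counter()
    for persons in article_persons.values():
        # sorted(set(...)) gives every unordered pair one canonical key.
        for pair in combinations(sorted(set(persons)), 2):
            weights[pair] += 1
    return weights


edges = co_occurrence_edges(ARTICLE_PERSONS)
print(dict(edges))
```

An article with a single person, like "article-3" here, produces no edge, so that person ends up as an isolated node in the subgraph.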
First, we use the weakly connected components algorithm to extract the largest connected subgraph of the ‘person’ nodes. We expect the largest subgraph to contain the most important people, and so the ‘person’ nodes that we threw away should be less important for our analysis.
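For intuition, here is a plain-Python sketch of what this step computes: a breadth-first search that finds the largest connected component of an undirected graph. It is not the GDS implementation, and the node names are placeholders.

```python
from collections import defaultdict, deque


def largest_component(nodes, edges):
    """Return the largest connected component of an undirected graph."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    best = set()
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        if len(component) > len(best):
            best = component
    return best


# Placeholder graph: one component of three nodes, one of two, one isolated.
NODES = ["A", "B", "C", "D", "E", "F"]
EDGES = [("A", "B"), ("B", "C"), ("D", "E")]
print(sorted(largest_component(NODES, EDGES)))
```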
Next, we run the PageRank algorithm on this subgraph in order to find some measure of importance for every person in the network.
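To show the idea behind the score, the sketch below runs PageRank by power iteration on a small undirected graph. It is a textbook-style version, not the GDS implementation, which may scale its scores differently. The damping factor of 0.85 is a commonly used value and the node names are placeholders.

```python
def pagerank(adjacency, damping=0.85, iterations=50):
    """Power-iteration PageRank for an undirected graph.

    `adjacency` maps each node to a list of neighbours and must be
    symmetric: if b is listed under a, a must be listed under b.
    """
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by isolated nodes is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not adjacency[node])
        new_rank = {}
        for node in nodes:
            # Each neighbour passes on an equal share of its own rank.
            incoming = sum(
                rank[other] / len(adjacency[other]) for other in adjacency[node]
            )
            new_rank[node] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


# Placeholder star graph: "hub" co-occurs with everyone else.
STAR = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
scores = pagerank(STAR)
print(max(scores, key=scores.get))
```

In this version the scores sum to one, and the node that is linked to everyone else ends up with the highest score.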
Lastly, we use the Louvain clustering algorithm to subdivide the people in the network into distinct clusters.
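Louvain works by greedily searching for a partition with high modularity. The full algorithm is too long to show here, so the sketch below only computes the modularity of a given partition on an unweighted, undirected graph, which is the quantity Louvain tries to maximise. The graph and the cluster labels are made up for illustration.

```python
def modularity(edges, community_of):
    """Modularity of a partition of an unweighted, undirected graph.

    Q = sum over communities c of (L_c / m - (d_c / (2 * m)) ** 2), where
    L_c is the number of edges inside c, d_c the summed degree of its nodes
    and m the total number of edges.
    """
    m = len(edges)
    internal = {}
    degree_sum = {}
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        # Every edge adds one to the degree of each of its endpoints.
        degree_sum[ca] = degree_sum.get(ca, 0) + 1
        degree_sum[cb] = degree_sum.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(
        internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Placeholder graph: two triangles joined by a single bridge edge (c, d).
EDGES = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
TWO_CLUSTERS = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
ONE_CLUSTER = {node: 0 for node in "abcdef"}

print(modularity(EDGES, TWO_CLUSTERS) > modularity(EDGES, ONE_CLUSTER))
```

Splitting the two triangles into separate clusters scores higher than lumping everything together, which is why a modularity-based method would separate them.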
We can visualise this in Neo4j Bloom. We set the node sizes proportional to the PageRank score and give each Louvain cluster its own node colour.
This yields the following network:
where each cluster represents some well-defined group with (relatively) few unexpected outliers. We find for example that the purple cluster is a cluster of Dutch people, mostly singers.
Amazing how we extracted this simply from a bunch of free-text articles. We can now also manually detect some anomalies. For example, Jay-Z occurs in the ‘classical musicians’ cluster.
He seems to occur in an article together with Beethoven, Bach and Mahler. We can find the article in the database to see where this comes from through a simple Cypher query.
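A lookup of this kind could look roughly like the sketch below. The schema in the Cypher string (the `Article` and `Entity` labels, the `MENTIONS` relationship and the `name` and `title` properties) is an assumption for illustration and not necessarily the layout of our database. The Python function does the same lookup on an in-memory mapping with placeholder article ids, so the logic can be tried without Neo4j.

```python
# Assumed schema: (:Article)-[:MENTIONS]->(:Entity {name}). Labels,
# relationship type and property names are hypothetical.
CYPHER_SKETCH = """
MATCH (a:Article)-[:MENTIONS]->(e:Entity)
WHERE e.name IN $names
WITH a, count(DISTINCT e) AS hits
WHERE hits = size($names)
RETURN a.title
"""


def articles_mentioning_all(article_entities, names):
    """Return the ids of articles that mention every entity in `names`."""
    wanted = set(names)
    return sorted(
        article
        for article, entities in article_entities.items()
        if wanted <= set(entities)
    )


# Toy stand-in for the database; article ids are placeholders.
ARTICLE_ENTITIES = {
    "article-1": ["Jay-Z", "Beethoven", "Bach", "Mahler"],
    "article-2": ["Jay-Z"],
    "article-3": ["Beethoven", "Bach"],
}
print(articles_mentioning_all(ARTICLE_ENTITIES, ["Jay-Z", "Beethoven"]))
```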
We find an article about how AI is used in music. AI has been used to extend the oeuvres of classical musicians, as well as create voice replicas of a lot of musicians. One of those is Jay-Z and apparently he started a lawsuit because someone made a deepfake with his voice. Very interesting!
In any case, we find a representation of our free text data that is a lot easier to understand and get insights from. Different graphs can be created and extended with Wikidata, according to your interests.
We now know how a graph can help people with their data. Let’s now check out how we can use the graph to make computers understand and process data. We do this by looking at the CORA dataset. This is a dataset containing academic papers classified into 7 classes. We get bag-of-words input vectors and the citations between the papers, and use them to predict these classes.
We first train some machine learning models (random forest, MLP, support vector machine, k-nearest neighbour) on the input vectors without taking the citation network into consideration. After that, we train these same models again, extending the features with an embedding of the citation network structure. We made the embedding in 128-dimensional space. The embedding tries to pull nodes close together when they share a link (a citation) and push them far apart when they share no link. When we visualise the embedding using t-SNE and PCA, we find the following 2 figures respectively.
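Extending the features simply means appending each paper's embedding vector to its bag-of-words vector. The sketch below does this for three made-up papers, with a 2-dimensional embedding instead of 128 dimensions, and classifies with a bare 1-nearest-neighbour rule. The values, labels and helper names are all illustrative.

```python
import math


def extend_features(bag_of_words, embedding):
    """Concatenate each paper's bag-of-words vector with its graph embedding."""
    return {
        paper: list(bag_of_words[paper]) + list(embedding[paper])
        for paper in bag_of_words
    }


def nearest_neighbour_label(query, train_features, train_labels):
    """Return the label of the training paper closest to `query`."""
    closest = min(
        train_features,
        key=lambda paper: math.dist(query, train_features[paper]),
    )
    return train_labels[closest]


# Made-up data: all three papers use exactly the same words, so the
# bag-of-words vectors alone cannot tell them apart.
BAG_OF_WORDS = {"p1": [1, 0, 1], "p2": [1, 0, 1], "p3": [1, 0, 1]}
EMBEDDING = {"p1": [0.9, 0.1], "p2": [-0.8, 0.2], "p3": [0.7, 0.0]}
LABELS = {"p1": "class A", "p2": "class B"}

features = extend_features(BAG_OF_WORDS, EMBEDDING)
train = {paper: features[paper] for paper in LABELS}
print(nearest_neighbour_label(features["p3"], train, LABELS))
```

In this toy case the word counts are identical, so only the embedding part of the vector lets the classifier place "p3" next to "p1".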
We see that there clearly is some structure here that separates and clusters the papers according to their field of study (annotated as ‘label’ in the legend). We can therefore hope that adding these features yields better results. We run the ML models again with these added features and also run the graph-native deep learning models GCN and GraphSAGE. Their papers can be found via the following links (GCN, GraphSAGE). We compare the performance, expressed as accuracy.
GCN and GraphSAGE perform massively better than the other algorithms. There is a trade-off here though. It’s a lot harder to understand why GCN or GraphSAGE made its classification decisions than it is for, say, a random forest classifier (RFC). And we see that the RFC with the help of the embedded features also reaches comparable accuracy. For all of the standard ML techniques, we find that the graph embedding yields a performance increase!
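To give a feel for what a graph-native model computes, here is a plain-Python sketch of a single GCN layer following the propagation rule from the GCN paper: normalise the adjacency matrix with added self-loops, mix each node's features with those of its neighbours, then apply a weight matrix and a ReLU. It is an illustration only, not the code behind our results, and the tiny matrices are made up.

```python
import math


def normalised_adjacency(adjacency_matrix):
    """Return D^-1/2 (A + I) D^-1/2 for a symmetric 0/1 matrix A.

    Assumes the diagonal of A is zero (no self-loops in the input).
    """
    n = len(adjacency_matrix)
    with_loops = [
        [adjacency_matrix[i][j] + (1 if i == j else 0) for j in range(n)]
        for i in range(n)
    ]
    degree = [sum(row) for row in with_loops]
    return [
        [with_loops[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adjacency_matrix, features, weights):
    """One GCN layer: ReLU(normalised adjacency @ features @ weights)."""
    propagated = matmul(normalised_adjacency(adjacency_matrix), features)
    return [[max(0.0, value) for value in row] for row in matmul(propagated, weights)]


# Made-up example: two papers that cite each other, one-hot input features.
ADJACENCY = [[0, 1], [1, 0]]
FEATURES = [[1.0, 0.0], [0.0, 1.0]]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
print(gcn_layer(ADJACENCY, FEATURES, IDENTITY))
```

After one layer, each paper's representation is an average over itself and the paper it is linked to, which is how information flows along citations.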
We have covered a lot of ground here. But even more can be done through the power of the graph. As we saw before, graph features can really improve the performance of ML algorithms. The possibilities with this are immense. If there is any linked structure in your data, it’s definitely worth representing it as a graph and then either turning the graph into ML features or running a graph-native algorithm on it, depending on your use case.
In the first part of the post, we looked at how the graph can yield quick insights into free text. But this isn’t the only area where graphs give a clear picture of the data. We could for example represent a supply chain network as a graph and use centrality algorithms and visualisation to figure out possible bottlenecks and improvements.
We explained what the knowledge graph is and showcased its power as a data science tool in different settings. The use cases of graphs are plentiful. If you think the graph can help you in your data-driven processes, we’d love to figure out the best way to integrate this. Don’t hesitate to contact us at email@example.com!
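As a small illustration of the supply chain idea, the sketch below computes betweenness centrality with Brandes' algorithm on a made-up directed network. Betweenness is just one possible choice of centrality, and the node names are placeholders. A node that lies on many shortest routes between other nodes gets a high score, which makes it a candidate bottleneck.

```python
from collections import deque


def betweenness(adjacency):
    """Brandes' algorithm for an unweighted, directed graph.

    `adjacency` maps every node (including sinks) to a list of successors.
    """
    score = {node: 0.0 for node in adjacency}
    for source in adjacency:
        stack = []
        predecessors = {node: [] for node in adjacency}
        paths = {node: 0 for node in adjacency}      # number of shortest paths
        distance = {node: -1 for node in adjacency}  # -1 means not reached yet
        paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            current = queue.popleft()
            stack.append(current)
            for nxt in adjacency[current]:
                if distance[nxt] < 0:
                    distance[nxt] = distance[current] + 1
                    queue.append(nxt)
                if distance[nxt] == distance[current] + 1:
                    paths[nxt] += paths[current]
                    predecessors[nxt].append(current)
        # Walk back from the farthest nodes and accumulate dependencies.
        dependency = {node: 0.0 for node in adjacency}
        while stack:
            node = stack.pop()
            for pred in predecessors[node]:
                dependency[pred] += paths[pred] / paths[node] * (1 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    return score


# Made-up supply chain: two suppliers feed one warehouse that serves two shops.
SUPPLY_CHAIN = {
    "supplier_a": ["warehouse"],
    "supplier_b": ["warehouse"],
    "warehouse": ["shop_1", "shop_2"],
    "shop_1": [],
    "shop_2": [],
}
centrality = betweenness(SUPPLY_CHAIN)
print(max(centrality, key=centrality.get))
```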