In this article we show why and how graph databases are useful for traceability use cases. We discuss the value proposition of granular traceability and showcase the power of graph databases such as Neo4j in the context of a food traceability use case.
Traceability is the ability to trace all processes from procurement of raw materials to production, consumption and disposal, to clarify when, where and by whom a product was produced. It enables process industries such as automotive, electronics, pharmaceuticals and food to monitor where their products have been delivered (= trace forward), while downstream companies and consumers can understand where products come from (= trace back).
In a nutshell, traceability allows companies to
Gartner named traceability one of the top 10 strategic technology trends for 2020. In their words:
Enterprise architecture and technology innovation leaders must take responsibility for introducing technologies and best practices that increase transparency and traceability to manage a wide range of social, legal and commercial risks.
To facilitate this transition, it is important that we choose appropriate technologies for the job to guarantee speed, scalability and robustness of our traceability pipeline. In the following sections we will showcase by example why graph databases are a crucial technology for these systems.
Neo4j is a graph database: it places emphasis primarily on the relationships between data points rather than on the data points themselves. In relational database systems, multi-hop relationships are established at query time by performing table joins. In graph databases such as Neo4j, relationships are stored explicitly and established at insert time. This makes them performant for traversals over deeply nested datasets.
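To make the difference concrete, here is a minimal sketch in plain Python. It is not how Neo4j or any relational engine is implemented; the toy data, the link table and all function names are hypothetical. It only illustrates the idea that a join resolves each hop by matching keys at query time, whereas a graph stores the neighbours with the node when the relationship is inserted.

```python
# Hypothetical toy data: names and structure are illustrative only.

# Relational style: relationships live in a separate link table and are
# resolved at query time by matching keys (a join) for every hop.
made_from_rows = [
    ("flour_1", "grain_1"),  # (product, source)
    ("bread_1", "flour_1"),
    ("bread_2", "flour_1"),
]


def sources_via_join(product):
    """One hop: scan the link table for rows matching the product key."""
    return [src for prod, src in made_from_rows if prod == product]


def trace_back_relational(product):
    """Multi-hop trace back: repeat the join until nothing new is found."""
    found, frontier = [], [product]
    while frontier:
        current = frontier.pop()
        for src in sources_via_join(current):
            found.append(src)
            frontier.append(src)
    return found


# Graph style: each node keeps direct references to its neighbours,
# written once at insert time, so a hop is a lookup rather than a search.
graph = {}


def add_relationship(product, source):
    graph.setdefault(product, []).append(source)
    graph.setdefault(source, [])


for prod, src in made_from_rows:
    add_relationship(prod, src)


def trace_back_graph(product):
    """Multi-hop trace back by following stored neighbour references."""
    found, frontier = [], [product]
    while frontier:
        current = frontier.pop()
        for src in graph[current]:
            found.append(src)
            frontier.append(src)
    return found


print(trace_back_relational("bread_1"))
print(trace_back_graph("bread_1"))
```

Both functions return the same ingredients; the difference is where the work happens. In the first, every hop searches the link table again, while in the second each hop follows references that were stored up front.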
In our context, we primarily want to know which ingredients were used to create a given food product (trace back). This means we want to find all the processes to which our final product is related. Let's discover how we can accomplish this in Neo4j!
We've set up a data stream containing production and distribution data for bread. The process starts with the harvest of grain. This grain is processed into flour, which is baked into bread. The bread is then distributed to vending machines, where customers make their purchases. The figure below shows the schema of this data; the processes in the schema are encircled. We start with a Grainbatch on the left and trace production and consumption through to the purchases by customers on the right.
As we will see below, Neo4j enables us to easily query this kind of deeply nested data.
Let's imagine the scenario where a hardworking farmer harvests a batch of grain, let's say Grainbatch 1. Unbeknownst to him, some chemical company in the area has been dumping their waste products in the vicinity, contaminating the soil and the Grainbatch that he harvested. As such, all the bread that is produced from this Grainbatch is contaminated and we want to find all the customers that bought bread originating from the Grainbatch to alert them and hopefully prevent them from consuming it.
In databases where relations are not first-class citizens, writing a query for this kind of scenario is quite cumbersome. We could easily make an error in the query that would leave some customers completely out in the cold.
In Neo4j, however, this query is a piece of cake. We can use the following one-liner to find all the customers that have products containing grain from Grainbatch 1:
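As an illustration of what such a query does, the sketch below walks a toy in-memory graph in Python. Everything in it is an assumption made for the example: the node labels, the ids, the customer names and the simplified chain (vending machines are left out). The Cypher string is likewise only a hypothetical shape for a variable-length path query, not the query from the demo.

```python
from collections import deque

# Hypothetical Cypher for the scenario; the labels and the `id` property
# are assumptions, not the schema of the original demo.
CYPHER_SKETCH = (
    "MATCH (g:Grainbatch {id: 1})-[*]->(c:Customer) RETURN DISTINCT c"
)

# The same idea on a toy graph: (label, id) -> list of downstream nodes.
EDGES = {
    ("Grainbatch", 1): [("Flourbatch", 1)],
    ("Grainbatch", 2): [("Flourbatch", 2)],
    ("Flourbatch", 1): [("Bread", 1), ("Bread", 2)],
    ("Flourbatch", 2): [("Bread", 3)],
    ("Bread", 1): [("Purchase", 1)],
    ("Bread", 2): [("Purchase", 2)],
    ("Bread", 3): [("Purchase", 3)],
    ("Purchase", 1): [("Customer", "alice")],
    ("Purchase", 2): [("Customer", "bob")],
    ("Purchase", 3): [("Customer", "carol")],
}


def reachable_customers(start, edges):
    """Breadth-first walk collecting every Customer downstream of start."""
    seen = {start}
    queue = deque([start])
    customers = set()
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt in seen:
                continue
            seen.add(nxt)
            if nxt[0] == "Customer":
                customers.add(nxt[1])
            queue.append(nxt)
    return customers


affected = reachable_customers(("Grainbatch", 1), EDGES)
print(sorted(affected))
```

The traversal never names the intermediate steps. It only states the start node and the kind of node we want to reach, which is the property the variable-length pattern in the Cypher sketch expresses.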
An important feature is that the above query stays consistent if the schema for our data ends up changing:
If, for some bread types, we end up using flour that is first refined in a flour refinery process, we get a schema as in the figure below. Our original path (Query 2) still returns the right customers, and customers that have products containing refined flour are now also returned through the new path (Query 1). The query syntax does not change, yet it now covers both paths.
With other database solutions, we would have to adapt our query for all the different paths to find the customers. It's extremely easy to make a mistake here by forgetting some niche path(s) in the query. In Neo4j, the original query keeps functioning without fault. It is very robust in this context.
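The robustness argument can be sketched in Python with a toy graph. The node labels, ids and customer names are hypothetical, and the traversal function only stands in for a variable-length path query. The point is that the function is written once and is not touched when the refinery step is added.

```python
def downstream(start, edges, target_label):
    """All nodes with target_label reachable from start, any path length."""
    seen, stack, hits = {start}, [start], set()
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
                if nxt[0] == target_label:
                    hits.add(nxt[1])
    return hits


# Original schema: grain -> flour -> bread -> customer.
original = {
    ("Grainbatch", 1): [("Flour", 1)],
    ("Flour", 1): [("Bread", 1)],
    ("Bread", 1): [("Customer", "alice")],
}

# Schema change: some flour now passes through a refinery step first.
extended = {node: list(targets) for node, targets in original.items()}
extended[("Flour", 1)].append(("RefinedFlour", 1))
extended[("RefinedFlour", 1)] = [("Bread", 2)]
extended[("Bread", 2)] = [("Customer", "bob")]

# The same call, unchanged, against both versions of the data.
before = downstream(("Grainbatch", 1), original, "Customer")
after = downstream(("Grainbatch", 1), extended, "Customer")
print(sorted(before), sorted(after))
```

A query that spelled out each hop explicitly would have to be rewritten for the longer path. Here the customer reached through the refined flour shows up without any change to the traversal.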
Since relations are first-class citizens in Neo4j, querying over long paths like this is very time-efficient compared to other database solutions.
A database would not be worth much if it could not scale. Luckily, a lot of work has gone into making Neo4j highly scalable, with proof-of-concept graphs of more than a trillion relationships.
Neo4j's graph model brings a unique way of viewing data. Its integrated Graph Data Science library allows us to uncover patterns in data that would otherwise stay invisible.
Within the context of traceability, these insights could be used to get a better understanding of the entire supply chain, allowing us to quickly predict and mitigate potential bottlenecks, optimize route planning and more.
In this short article, we discussed traceability as a solution for the increased desire for transparency from consumers and as one of the main drivers for the next supply chain era.
We found that graph databases are an excellent technological option in this context, offering high robustness, speed and ease of use. This reduces potential mistakes by data engineers and facilitates strong analytics and transparency over the entire supply chain. Besides, using a graph database helps to anticipate issues and drives continuous improvement.
If you would like to see the entire food traceability demo, contact us and we will send you a link.
Curious to see how traceability can be used for your supply chain? We'd be delighted to give you more info, so let's have a (digital) coffee: firstname.lastname@example.org