The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
These structures were uncovered from leaked financial documents and were analyzed by the journalists. They extracted the metadata of documents using Apache Solr and Tika, then connected all the information together using the leaked databases, creating a graph of nodes and edges in Neo4j and made it accessible using Linkurious’ visualization application.
In this post, we look at the graph data model used by the ICIJ and show how to construct it using Cypher in Neo4j. We dissect an example from the leaked data, recreating it using Cypher, and show how the model could be extended.