A wordcloud/wordmesh generator that allows users to extract keywords from text, and create a simple and interpretable wordcloud.
Why word-mesh?
Most popular open-source wordcloud generators (word_cloud, d3-cloud, echarts-wordcloud) focus more on the aesthetics of the visualization than on effectively conveying textual features. word-mesh strikes a balance between the two and uses the various statistical, semantic and grammatical features of the text to inform visualization parameters.
Features:
- keyword extraction: In addition to word-frequency-based extraction techniques, word-mesh supports graph-based methods like textrank, sgrank and bestcoverage.
- word clustering: Words can be grouped together on the canvas based on their semantic similarity, co-occurrence frequency, and other properties.
- keyword filtering: Extracted keywords can be filtered based on their POS tags or whether they are named entities.
- font colors and font sizes: These can be set based on word frequency, POS tags, or ranking algorithm score.
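As a rough illustration of the frequency-based side of these features, the sketch below extracts the most frequent words from a text and scales font sizes linearly with their counts. It uses only the Python standard library and is not word-mesh's API: the function names `top_keywords` and `font_sizes`, the tiny stopword list, and the point-size range are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}


def top_keywords(text, n=5):
    """Return the n most frequent non-stopword tokens with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)


def font_sizes(keywords, min_size=10, max_size=40):
    """Linearly scale keyword counts into point sizes in [min_size, max_size]."""
    if not keywords:
        return {}
    values = [count for _, count in keywords]
    lowest, highest = min(values), max(values)
    span = highest - lowest
    sizes = {}
    for word, count in keywords:
        if span == 0:
            # All keywords are equally frequent: give them the same size.
            sizes[word] = max_size
        else:
            scaled = min_size + (count - lowest) * (max_size - min_size) / span
            sizes[word] = round(scaled)
    return sizes


if __name__ == "__main__":
    sample = "graph graph graph cloud cloud mesh"
    keywords = top_keywords(sample, n=3)
    print(keywords)
    print(font_sizes(keywords))
```

A graph-based ranker such as textrank would replace the counts with ranking scores, and the same scaling step could then be reused unchanged.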
Parses sentences written in natural language and extracts structured information.
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
These structures were uncovered from leaked financial documents and analyzed by the journalists. They extracted the metadata of the documents using Apache Solr and Tika, connected the information together using the leaked databases to create a graph of nodes and edges in Neo4j, and made it accessible using Linkurious’ visualization application.
In this post, we look at the graph data model used by the ICIJ and show how to construct it using Cypher in Neo4j. We dissect an example from the leaked data, recreating it using Cypher, and show how the model could be extended.
A tutorial on how to count hapaxes (words which occur only once in a text or corpus) using NLTK.
Some alternatives mentioned:
- Pattern : Python package for datamining the WWW which includes submodules for language processing and machine learning
- Polyglot : language library focusing on "massive multilingual applications"
- spaCy : an "industrial strength" NLP library focused on performance with a streamlined API
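Counting hapaxes, as described for the NLTK tutorial above, comes down to building a frequency table and keeping the words whose count is exactly one. NLTK's `FreqDist` class provides a `hapaxes()` method for this. The sketch below is not the tutorial's code: it shows the same idea with the standard library's `Counter` so that it runs without extra dependencies. The helper name `hapaxes` and the simple regex tokenizer are assumptions made for the example.

```python
import re
from collections import Counter


def hapaxes(text):
    """Return the sorted list of words that occur exactly once in text."""
    # Naive tokenizer (an assumption): lowercase runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == 1)


if __name__ == "__main__":
    print(hapaxes("the cat sat on the mat"))
```

A real corpus would call for a proper tokenizer, since punctuation handling and case folding both change which words count as occurring only once.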
by Otis Gospodnetić [Otis is a Lucene, Solr, and Elasticsearch expert and co-author of "Lucene in Action" (1st and 2nd editions). He is also the founder and CEO of Sematext.] “Solr or Elasticsearch?”...well, at least that is the common question I hear from Sematext’s consulting services clients and prospects. Which one…
Programme de Sociologie Pragmatique, Expérimentale et Réflexive sur Ordinateur (© Doxa 1995-2012)
pywsd - Python Implementations of Word Sense Disambiguation (WSD) Technologies.
open-source-search-engine - A distributed open source search engine and spider written in C/C++ for Linux on Intel/AMD. From gigablast dot com, which has binaries for download. See the README.md file at the very bottom of this page for instructions.
searx - A privacy-respecting, hackable metasearch engine
This post is a checklist for optimizing Elasticsearch’s configurations to deliver maximum performance, based on lessons we learned with log management.
Extract text from images with this free online OCR tool. No registration, no email.
Free online OCR service that converts scanned images, faxes, screenshots, PDF documents, and ebooks to text. It can process 58 languages and supports layout analysis.
newspaper - News extraction, article extraction and content curation in Python. Built with multithreading, 10+ languages, NLP, ML, and more!