Search: [Text_Analysis_&_Keywords_Extraction]

word-mesh

A wordcloud/wordmesh generator that allows users to extract keywords from text, and create a simple and interpretable wordcloud.

Why word-mesh?

Most popular open-source wordcloud generators (word_cloud, d3-cloud, echarts-wordcloud) focus more on the aesthetics of the visualization than on effectively conveying textual features. word-mesh strikes a balance between the two and uses the various statistical, semantic and grammatical features of the text to inform visualization parameters.

Features:

keyword extraction: In addition to word-frequency-based extraction techniques, word-mesh supports graph-based methods like textrank, sgrank and bestcoverage.
word clustering: Words can be grouped together on the canvas based on their semantic similarity, co-occurrence frequency, and other properties.
keyword filtering: Extracted keywords can be filtered based on their POS tags or on whether they are named entities.
font colors and font sizes: These can be set based on the following criteria: word frequency, POS tags, ranking algorithm score.
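
As a rough illustration of the simplest case described above (frequency-based extraction, with font size driven by word frequency), here is a minimal sketch in plain Python. It does not use word-mesh's own API; the function names, the tiny stopword list and the font-size range are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for"}


def keyword_frequencies(text, top_n=5):
    """Return the top_n most frequent non-stopword tokens with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top_n)


def font_sizes(keywords, min_size=10, max_size=40):
    """Linearly map each keyword's frequency onto [min_size, max_size]."""
    if not keywords:
        return {}
    freqs = [freq for _, freq in keywords]
    lo, hi = min(freqs), max(freqs)
    span = hi - lo
    sizes = {}
    for word, freq in keywords:
        # If every keyword has the same frequency, give them all the max size.
        scale = (freq - lo) / span if span else 1.0
        sizes[word] = round(min_size + scale * (max_size - min_size))
    return sizes


if __name__ == "__main__":
    sample = "graph graph graph cloud cloud word"
    print(font_sizes(keyword_frequencies(sample, top_n=3)))
```

A graph-based ranker such as textrank would replace the frequency count with a score computed over a word co-occurrence graph; the mapping from score to font size could stay the same.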

TagCloud · NLP · Text_Analysis_&_Keywords_Extraction · TextRank/tags_&_keywords_extraction/text_summarization · DataVisualization · Python

July 17, 2018 09:24:19 AM GMT+02:00 * · permalink

·

https://github.com/mukund109/word-mesh

GitHub - snipsco/snips-nlu: Snips Python library to extract meaning from text

Snips NLU parses sentences written in natural language and extracts structured information from them.

Text_Analysis_&_Keywords_Extraction · Python · NLP

March 26, 2018 04:55:12 PM GMT+02:00 · permalink

·

https://github.com/snipsco/snips-nlu

Apache Tika – Apache Tika

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

ApacheTika · Search_Engine · OpenSource · OCR_TextExtraction · Java · Text_Analysis_&_Keywords_Extraction

November 7, 2017 01:57:26 PM GMT+01:00 · permalink

·

https://tika.apache.org

Analyzing the Panama Papers with Neo4j: Data Models, Queries & More

These structures were uncovered from leaked financial documents and were analyzed by the journalists. They extracted the metadata of documents using Apache Solr and Tika, then connected all the information together using the leaked databases, creating a graph of nodes and edges in Neo4j and making it accessible through Linkurious’ visualization application.
In this post, we look at the graph data model used by the ICIJ and show how to construct it using Cypher in Neo4j. We dissect an example from the leaked data, recreating it using Cypher, and show how the model could be extended.

Neo4j · graph · OCR_TextExtraction · Text_Analysis_&_Keywords_Extraction · journalisme · network · DataVisualization · ApacheTika

November 6, 2017 05:40:19 PM GMT+01:00 * · permalink

·

https://neo4j.com/blog/analyzing-panama-papers-neo4j/

A First Exercise in Natural Language Processing with Python: Counting Hapaxes - The Cat's Whisker

A tutorial on how to count hapaxes (words which occur only once in a text or corpus) using NLTK.
Some alternatives mentioned:

Pattern : Python package for data mining the WWW which includes submodules for language processing and machine learning
Polyglot : language library focusing on "massive multilingual applications"
spaCy : an "industrial strength" NLP library focused on performance with a streamlined API
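
To make the idea of a hapax concrete, here is a minimal sketch that uses only the Python standard library. The linked tutorial uses NLTK; the regex tokenizer and the function name below are assumptions made for this example, not the tutorial's code.

```python
import re
from collections import Counter


def hapaxes(text):
    """Return the sorted list of words that occur exactly once in text."""
    # Naive tokenization (an assumption): lowercase runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == 1)


if __name__ == "__main__":
    print(hapaxes("The cat saw the other cat run"))
```

A real pipeline would normally swap the regex for a proper tokenizer and possibly a lemmatizer, so that inflected forms of the same word are not counted as separate hapaxes.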

Python · NLTK · Tuto · Text_Analysis_&_Keywords_Extraction · Prog

September 15, 2017 05:51:06 PM GMT+02:00 · permalink

·

http://catswhisker.xyz/log/2017/9/7/a_first_excercise_in_natural_language_processing_with_python_counting_hapaxes/

GitHub - mozilla/fathom: A framework for extracting meaning from web pages

Text_Analysis_&_Keywords_Extraction · TextRank/tags_&_keywords_extraction/text_summarization

April 28, 2017 11:10:44 AM GMT+02:00 · permalink

·

https://github.com/mozilla/fathom

MyScript - Text & equation recognition

Text_Analysis_&_Keywords_Extraction

December 14, 2015 01:42:06 PM GMT+01:00 · permalink

·

https://webdemo.myscript.com/#/home

How-to: Index Scanned PDFs at Scale Using Fewer Than 50 Lines of Code - Tesseract & Leptonica OCR tools, Apache Spark, Apache Hadoop, Apache Solr, Apache HBase

Text_Analysis_&_Keywords_Extraction

October 30, 2015 10:45:03 PM GMT+01:00 · permalink

·

http://blog.cloudera.com/blog/2015/10/how-to-index-scanned-pdfs-at-scale-using-fewer-than-50-lines-of-code/

Solr vs. Elasticsearch — How to Decide? | Sematext Blog

by Otis Gospodnetić [Otis is a Lucene, Solr, and Elasticsearch expert and co-author of "Lucene in Action" (1st and 2nd editions). He is also the founder and CEO of Sematext. See full bio below.] “Solr or Elasticsearch?”...well, at least that is the common question I hear from Sematext’s consulting services clients and prospects. Which one…

Text_Analysis_&_Keywords_Extraction · Search_Engine

February 25, 2015 05:09:52 PM GMT+01:00 · permalink

·

http://blog.sematext.com/2015/01/30/solr-elasticsearch-comparison/

>>Doxa - Prospero - Marlowe <<

Programme de Sociologie Pragmatique, Expérimentale et Réflexive sur Ordinateur (Program for Pragmatic, Experimental and Reflexive Sociology on Computer) (© Doxa 1995-2012)

Text_Analysis_&_Keywords_Extraction

January 5, 2015 11:28:48 PM GMT+01:00 · permalink

·

http://prosperologie.org/

alvations/pywsd - Word Sense Disambiguation

pywsd - Python Implementations of Word Sense Disambiguation (WSD) Technologies.

Text_Analysis_&_Keywords_Extraction

November 14, 2014 10:03:40 PM GMT+01:00 · permalink

·

https://github.com/alvations/pywsd

gigablast/open-source-search-engine

open-source-search-engine - A distributed open source search engine and spider written in C/C++ for Linux on Intel/AMD. From gigablast dot com, which has binaries for download. See the README.md file at the very bottom of this page for instructions.

Text_Analysis_&_Keywords_Extraction · Search_Engine

October 14, 2014 01:26:21 AM GMT+02:00 · permalink

·

https://github.com/gigablast/open-source-search-engine

asciimoo/searx - hackable metasearch engine

searx - A privacy-respecting, hackable metasearch engine

Text_Analysis_&_Keywords_Extraction · Search_Engine

October 13, 2014 06:03:06 PM GMT+02:00 · permalink

·

https://github.com/asciimoo/searx

Nine Tips on Configuring Elasticsearch for High Performance

This post is a checklist for optimizing Elasticsearch’s configurations to deliver maximum performance, based on lessons we learned with log management.

Text_Analysis_&_Keywords_Extraction · Search_Engine

August 25, 2014 12:11:18 PM GMT+02:00 · permalink

·

https://www.loggly.com/blog/nine-tips-configuring-elasticsearch-for-high-performance/

tesseract-ocr - An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google. - Google Project Hosting

Text_Analysis_&_Keywords_Extraction · OCR_TextExtraction

August 3, 2014 11:47:45 PM GMT+02:00 · permalink

·

https://code.google.com/p/tesseract-ocr/

Free online OCR

Extract text from images with this free online OCR tool. No registration, no email.

Text_Analysis_&_Keywords_Extraction · OCR_TextExtraction

August 3, 2014 11:47:32 PM GMT+02:00 · permalink

·

http://www.free-ocr.com/

Free Online OCR - Convert JPEG, PNG, GIF, BMP, TIFF, PDF, DjVu to Text

Free online OCR service that converts scanned images, faxes, screenshots, PDF documents and ebooks to text. It can process 58 languages and supports layout analysis.

Text_Analysis_&_Keywords_Extraction · OCR_TextExtraction

August 3, 2014 11:46:54 PM GMT+02:00 · permalink

·

http://www.newocr.com/