SentiWordNet
april 2011 by jonty
"SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity"
sentiment
nlp
wordnet
language
parsing
ai
text
from delicious
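A minimal sketch of the three-score idea in the description above. The `SynsetSentiment` class, the synset id and the numbers are hypothetical and are not part of any SentiWordNet API or data file; the only fact relied on is that SentiWordNet defines objectivity as 1 - (positivity + negativity), so the three scores sum to 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynsetSentiment:
    """Hypothetical container for one synset's scores (not a SentiWordNet API)."""

    synset_id: str
    positivity: float
    negativity: float

    @property
    def objectivity(self) -> float:
        # Objectivity is whatever is left over, so the three scores sum to 1.
        return 1.0 - (self.positivity + self.negativity)


# Illustrative values only, not taken from the SentiWordNet data files.
example = SynsetSentiment("example.a.01", positivity=0.625, negativity=0.125)
scores = (example.positivity, example.negativity, example.objectivity)
```

Storing only two scores and deriving the third keeps the sum-to-one property from drifting out of sync.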
Pattern
february 2011 by jonty
"Pattern is a web mining module for the Python programming language. It bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics) and data visualization (graph networks)."
python
datamining
nlp
web
data
parsing
text
language
twitter
google
wikipedia
sentiment
analysis
flickr
lsa
wordnet
ngram
html
dom
parser
graph
visualisation
from delicious
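The tf-idf and cosine similarity metrics named in the description can be sketched in plain Python. This is not Pattern's API: `tfidf_vectors` and `cosine` are hypothetical helper names, tokenization is a plain whitespace split, and the weighting (relative term frequency times log(N/df)) is one common variant assumed here for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


docs = [
    "web mining with python",
    "python web mining tools",
    "graph network visualization",
]
vectors = tfidf_vectors(docs)
```

Here `cosine(vectors[0], vectors[1])` is positive because the first two documents share terms, while `cosine(vectors[0], vectors[2])` is zero because they share none.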
construct
january 2011 by jonty
"Construct is a python library for parsing and building of data structures (binary or textual). It is based on the concept of defining data structures in a declarative manner, rather than procedural code: more complex constructs are composed of a hierarchy of simpler ones. It's the first library that makes parsing fun, instead of the usual headache it is today."
python
parser
parsing
binary
datastructures
data
structure
from delicious
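To illustrate what "declarative rather than procedural" means in the description above, here is a rough sketch built on the standard-library `struct` module. It does not use Construct and does not mirror its API; the `HEADER` layout and the `parse` and `build` helpers are hypothetical names chosen for this example.

```python
import struct

# The layout is declared as data: (field name, struct format code) pairs.
HEADER = [("magic", "4s"), ("width", "H"), ("height", "H")]


def _format(fields):
    # ">" selects big-endian byte order with no padding.
    return ">" + "".join(code for _, code in fields)


def parse(fields, data):
    """Turn raw bytes into a dict, driven only by the declaration."""
    fmt = _format(fields)
    values = struct.unpack(fmt, data[: struct.calcsize(fmt)])
    return dict(zip((name for name, _ in fields), values))


def build(fields, values):
    """Turn a dict back into raw bytes using the same declaration."""
    return struct.pack(_format(fields), *(values[name] for name, _ in fields))


raw = b"IMG0\x00\x10\x00\x08"
parsed = parse(HEADER, raw)
rebuilt = build(HEADER, parsed)
```

Because one declaration drives both directions, parsing and building cannot disagree about the layout, which is the property the description emphasises.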
Docsplit
december 2010 by jonty
"Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)"
ruby
pdf
document
parsing
ocr
documents
data
processing
split
from delicious
The Infinite Monkeywrench
december 2010 by jonty
"... is a collection of tools to download, clean, process, and package datasets from a variety of sources (HTML, RSS, XML, CSV, &c) into a variety of formats (XML, CSV, Excel, JSON, SQL, YAML, &c). Interacting with IMW is as simple as creating a YAML file which describes the workflow involved in processing the data and feeding it to the imw command line program."
data
ruby
processing
process
parsing
csv
yaml
xml
json
rss
html
format
parser
lxml - Working with links
november 2010 by jonty
"There are several methods on elements that allow you to see and modify the links in a document."
python
parsing
lxml
linkextractor
link
beautifulsoup
OpenNLP
june 2010 by jonty
"OpenNLP is an organizational center for open source projects related to natural language processing. Its primary role is to encourage and facilitate the collaboration of researchers and developers on such projects. We'll also try to keep a fairly up-to-date list of useful links related to NLP software in general. OpenNLP also hosts a variety of java-based NLP tools which perform sentence detection, tokenization, pos-tagging, chunking and parsing, named-entity detection, and coreference using the OpenNLP Maxent machine learning package."
nlp
java
linguistics
opennlp
machinelearning
language
parsing
text
Latent Semantic Analysis (LSA) Tutorial
march 2010 by jonty
Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), literally means analyzing documents to find the underlying meaning or concepts of those documents. If each word meant only one concept, and each concept was described by only one word, then LSA would be easy, since there would be a simple mapping from words to concepts.
tutorial
analysis
text
lsa
parsing
keyword
extraction
clustering
Salmon Run: Summarization with Lucene
august 2009 by jonty
My LuceneSummarizer tokenizes the input into paragraphs, and the paragraphs into sentences, then writes each sentence out to an in-memory Lucene index. It then computes the term frequency map of the index to find the most frequent words in the document, takes the top few terms and hits the index with a BooleanQuery to find the most relevant sentences. The top few sentences (ordered by docId) thus found constitute the summary.
java
ai
nlp
lucene
summarisation
search
text
algorithms
parsing
summarise
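The steps in the description above (split into sentences, count term frequencies, pick the top terms, select the sentences that match them, output in document order) can be sketched without Lucene. This is a plain-Python stand-in, not the author's LuceneSummarizer: the `summarize` function, the tiny stopword set and the hit-count scoring that replaces the BooleanQuery are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "for", "all", "of", "and", "to", "in", "is"}


def _tokens(sentence):
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]


def summarize(text, num_terms=3, num_sentences=2):
    # Paragraphs first, then sentences; the list index plays the role of docId.
    sentences = []
    for paragraph in re.split(r"\n\s*\n", text):
        for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            if s.strip():
                sentences.append(s.strip())

    # Term frequency map over the whole document.
    tf = Counter(w for s in sentences for w in _tokens(s))
    top_terms = {term for term, _ in tf.most_common(num_terms)}

    # Score each sentence by how many top terms it contains.
    scored = []
    for doc_id, s in enumerate(sentences):
        hits = sum(1 for w in _tokens(s) if w in top_terms)
        if hits:
            scored.append((hits, doc_id))

    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:num_sentences]
    # Emit the chosen sentences in their original order.
    return [sentences[i] for i in sorted(doc_id for _, doc_id in best)]


text = (
    "Lucene builds an index. The index stores terms.\n\n"
    "Cats sleep all day. Lucene queries the index for terms."
)
summary = summarize(text)
```

Sorting the selected sentences by their position rather than by score keeps the summary readable as a shortened version of the original text.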
http://rapidxml.sourceforge.net/
april 2009 by jonty
RapidXml is an attempt to create the fastest XML parser possible, while retaining usability, portability and reasonable W3C compatibility. It is an in-situ parser written in modern C++, with parsing speed approaching that of the strlen function executed on the same data.
programming
performance
c++
boost
xml
parser
parsing