jonty + parsing   10

SentiWordNet
"SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity"
sentiment  nlp  wordnet  language  parsing  ai  text  from delicious
april 2011 by jonty
Pattern
"Pattern is a web mining module for the Python programming language. It bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics) and data visualization (graph networks)."
python  datamining  nlp  web  data  parsing  text  language  twitter  google  wikipedia  sentiment  analysis  flickr  lsa  wordnet  ngram  html  dom  parser  graph  visualisation  from delicious
february 2011 by jonty
construct
"Construct is a python library for parsing and building of data structures (binary or textual). It is based on the concept of defining data structures in a declarative manner, rather than procedural code: more complex constructs are composed of a hierarchy of simpler ones. It's the first library that makes parsing fun, instead of the usual headache it is today."
python  parser  parsing  binary  datastructures  data  structure  from delicious
january 2011 by jonty
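A minimal sketch of the declarative idea described above, using only the standard library's struct module rather than Construct's own API. The HEADER layout and the parse/build helpers are hypothetical names chosen for illustration: the record is described once as data, and both parsing and building are derived from that description.

```python
import struct

# Hypothetical record layout, declared as data: (field name, struct format code).
# This is NOT Construct's API, only an illustration of the declarative approach.
HEADER = [
    ("magic", "4s"),    # 4 raw bytes
    ("version", "H"),   # unsigned 16-bit integer
    ("length", "I"),    # unsigned 32-bit integer
]


def _format(fields, byte_order="<"):
    """Join the per-field codes into one struct format string (little-endian, no padding)."""
    return byte_order + "".join(code for _, code in fields)


def parse(fields, data):
    """Parse bytes into a dict keyed by the declared field names."""
    fmt = _format(fields)
    values = struct.unpack(fmt, data[:struct.calcsize(fmt)])
    return dict(zip((name for name, _ in fields), values))


def build(fields, record):
    """Build bytes from a dict, using the same declaration as parse()."""
    return struct.pack(_format(fields), *(record[name] for name, _ in fields))


raw = build(HEADER, {"magic": b"DEMO", "version": 2, "length": 1024})
record = parse(HEADER, raw)
```

Because the layout lives in one list, adding a field changes the parser and the builder together.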
Doc⚡split
"Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)"
ruby  pdf  document  parsing  ocr  documents  data  processing  split  from delicious
december 2010 by jonty
The Infinite Monkeywrench
"... is a collection of tools to download, clean, process, and package datasets from a variety of sources (HTML, RSS, XML, CSV, &c) into a variety of formats (XML, CSV, Excel, JSON, SQL, YAML, &c). Interacting with IMW is as simple as creating a YAML file which describes the workflow involved in processing the data and feeding it to the imw command line program."
data  ruby  processing  process  parsing  csv  yaml  xml  json  rss  html  format  parser 
december 2010 by jonty
lxml - Working with links
"There are several methods on elements that allow you to see and modify the links in a document."
python  parsing  lxml  linkextractor  link  beautifulsoup 
november 2010 by jonty
OpenNLP
"OpenNLP is an organizational center for open source projects related to natural language processing. Its primary role is to encourage and facilitate the collaboration of researchers and developers on such projects. We'll also try to keep a fairly up-to-date list of useful links related to NLP software in general. OpenNLP also hosts a variety of java-based NLP tools which perform sentence detection, tokenization, pos-tagging, chunking and parsing, named-entity detection, and coreference using the OpenNLP Maxent machine learning package."
nlp  java  linguistics  opennlp  machinelearning  language  parsing  text 
june 2010 by jonty
Latent Semantic Analysis (LSA) Tutorial
Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), literally means analyzing documents to find the underlying meaning or concepts of those documents. If each word only meant one concept, and each concept was only described by one word, then LSA would be easy since there is a simple mapping from words to concepts.
tutorial  analysis  text  lsa  parsing  keyword  extraction  clustering 
march 2010 by jonty
Salmon Run: Summarization with Lucene
My LuceneSummarizer tokenizes the input into paragraphs, and the paragraphs into sentences, then writes each sentence out to an in-memory Lucene index. It then computes the term frequency map of the index to find the most frequent words found in the document, takes the top few terms and hits the index with a BooleanQuery to find the most relevant sentences. The top few sentences (ordered by docId) thus found constitute the summary.
java  ai  nlp  lucene  summarisation  search  text  algorithms  parsing  summarise 
august 2009 by jonty
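The approach described above can be sketched roughly in plain Python, without Lucene. This is an assumption-laden stand-in, not the author's LuceneSummarizer: the stopword list, the naive sentence splitter, and the scoring (count of top-term occurrences per sentence) are all hypothetical simplifications of the index query step.

```python
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "that", "on", "for", "with", "as", "are", "was"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]


def tokenize(sentence):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]


def summarize(text, num_terms=5, num_sentences=2):
    sentences = split_sentences(text)
    # Term frequency over the whole document.
    term_freq = Counter(w for s in sentences for w in tokenize(s))
    top_terms = {term for term, _ in term_freq.most_common(num_terms)}
    # Score each sentence by how many top-term occurrences it contains.
    scored = [(sum(1 for w in tokenize(s) if w in top_terms), i)
              for i, s in enumerate(sentences)]
    # Take the best-scoring sentences (ties go to the earlier sentence) ...
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:num_sentences]
    # ... then restore document order, as the description orders by docId.
    keep = sorted(i for score, i in best if score > 0)
    return [sentences[i] for i in keep]


text = ("Lucene is a search library. Lucene builds an index for search. "
        "I had toast for breakfast. The index makes search fast.")
summary = summarize(text, num_terms=3, num_sentences=2)
```

Restoring document order at the end keeps the summary readable, since the selected sentences appear in the sequence the author wrote them.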
http://rapidxml.sourceforge.net/
RapidXml is an attempt to create the fastest XML parser possible, while retaining usability, portability and reasonable W3C compatibility. It is an in-situ parser written in modern C++, with parsing speed approaching that of the strlen function executed on the same data.
programming  performance  c++  boost  xml  parser  parsing 
april 2009 by jonty
