joelcarranza + nlp 11
Maui - Multi-purpose automatic topic indexing
january 2012 by joelcarranza
Maui automatically identifies main topics in text documents. Depending on the task, topics are tags, keywords, keyphrases, vocabulary terms, descriptors, index terms or titles of Wikipedia articles.
nlp
january 2012 by joelcarranza
ReVerb - Open Information Extraction Software
september 2011 by joelcarranza
ReVerb is a program that automatically identifies and extracts binary relationships from English sentences. ReVerb is designed for Web-scale information extraction, where the target relations cannot be specified in advance and speed is important.
nlp
september 2011 by joelcarranza
SentiWordNet
may 2011 by joelcarranza
SentiWordNet is a lexical resource for opinion mining. SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity. SentiWordNet is described in detail in the papers:
nlp
may 2011 by joelcarranza
Corpus-based word frequency lists, collocates, and n-grams
may 2011 by joelcarranza
This site contains what we believe is the most accurate frequency data of English. It contains word frequency lists of the top 60,000 words (lemmas) in English, collocates lists (looking at nearby words to see word meaning and use), and n-grams (the frequency of all two and three-word sequences in the corpora).
nlp
may 2011 by joelcarranza
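As a rough illustration of what the n-gram lists in the entry above contain, here is a minimal sketch that counts two- and three-word sequences in a token list. The function name `ngram_counts` and the sample sentence are assumptions made for this example; the site's own corpus processing is not described here and is certainly more involved.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens in the list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy input, invented for illustration; a real corpus would be tokenized properly.
tokens = "the cat sat on the mat and the cat slept".split()

bigrams = ngram_counts(tokens, 2)   # two-word sequences
trigrams = ngram_counts(tokens, 3)  # three-word sequences

# Show the most frequent bigrams first.
for gram, count in bigrams.most_common(3):
    print(" ".join(gram), count)
```

Whitespace splitting stands in for real tokenization and lemmatization, which the frequency lists described above would need.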
Natural Language Toolkit
april 2011 by joelcarranza
Analyzing Text with the Natural Language Toolkit
nlp
python
ebook
thlinx
april 2011 by joelcarranza
Corpus of Contemporary American English (COCA)
april 2011 by joelcarranza
The Corpus of Contemporary American English (COCA) is the largest freely-available corpus of English, and the only large and balanced corpus of American English. It was created by Mark Davies of Brigham Young University in 2008, and it is now used by tens of thousands of users every month (linguists, teachers, translators, and other researchers). COCA is also related to other large corpora that we have created or modified, including the British National Corpus (our architecture and interface), the 100 million word TIME Corpus (1920s-2000s), and the new 400 million word Corpus of Historical American English (COHA; 1810-2009).
The corpus contains more than 410 million words of text and is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts. It includes 20 million words each year from 1990-2010 and the corpus is also updated once or twice a year (the most recent texts are from Summer 2010). Because of its design, it is perhaps the only corpus of English that is suitable for looking at current, ongoing changes in the language (see the 2010 article in Literary and Linguistic Computing).
nlp
thlinx
april 2011 by joelcarranza
Google Ngram Viewer
april 2011 by joelcarranza
Here are the datasets backing the Google Books Ngram Viewer. These datasets were generated in July 2009; we will update these datasets as our book scanning continues, and the updated versions will have distinct and persistent version identifiers (20090715 for the current set).
Each of the links below will directly download a fragment of the given corpus. For instance, the first hundred links below collectively comprise the 1-gram (i.e., individual words) counts for English, as collected from Google's scanned books around July 15, 2009. Details on the corpus construction can be found in the Science article written by J.B. Michel et al. but are abbreviated here.
nlp
thlinx
april 2011 by joelcarranza
YAGO2
april 2011 by joelcarranza
YAGO2 is a huge semantic knowledge base, derived from Wikipedia, WordNet and GeoNames. Currently, YAGO2 has knowledge of more than 10 million entities (like persons, organizations, cities, etc.) and contains more than 80 million facts about these entities.
In YAGO2, we made an effort to treat time and location data as first-class citizens, extending the basic triple model with special fields for time and location for querying. Also, we took special care to consistently attach temporal and spatial data to all facts where it is semantically meaningful and where time and location can be derived from Wikipedia. Unlike many other automatically assembled knowledge bases, YAGO2 has a confirmed accuracy of 95%.
via:mootPoint
nlp
thlinx
april 2011 by joelcarranza
WordNet
april 2011 by joelcarranza
WordNet® is a large lexical database of English, developed under the direction of George A. Miller (Emeritus). Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing.
nlp
thlinx
april 2011 by joelcarranza
Word Frequency Counts - words frequency programming | Ask MetaFilter
april 2011 by joelcarranza
Can I compute how frequently a word occurs in general English text? I have a list of about 2000 words, and I want to sort it with the most common words first.
nlp
thlinx
april 2011 by joelcarranza
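One way to approach the question in the entry above is to look each word up in a reference frequency table and sort on that count. This is only a sketch: `sort_by_frequency`, `reference_freq`, and the counts in it are hypothetical, and in practice the table would be loaded from a frequency list such as the corpus-based ones bookmarked on this page.

```python
def sort_by_frequency(words, freq):
    """Return words ordered most common first.

    Words missing from the table count as frequency 0 and sort last;
    ties are broken alphabetically so the order is deterministic.
    """
    return sorted(words, key=lambda w: (-freq.get(w.lower(), 0), w))

# Hypothetical counts, made up for illustration only.
reference_freq = {"the": 1000, "house": 50, "zebra": 2}

my_words = ["zebra", "the", "quokka", "house"]
ranked = sort_by_frequency(my_words, reference_freq)
print(ranked)
```

Lowercasing before lookup is a design choice that assumes the reference table is keyed on lowercase forms.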
Natural Language Processing for the Working Programmer
november 2010 by joelcarranza
CC licensed ebook on working with text
ebook
nlp
thlinx
license:cc
november 2010 by joelcarranza