joelcarranza + thlinx 8
wiki.dbpedia.org : Datasets
april 2011 by joelcarranza
The DBpedia data set uses a large multi-domain ontology which has been derived from Wikipedia. The DBpedia data set currently describes 3.5 million “things” with over half a billion “facts” (January 2010).
data
thlinx
Natural Language Toolkit
april 2011 by joelcarranza
Analyzing Text with the Natural Language Toolkit
nlp
python
ebook
thlinx
Corpus of Contemporary American English (COCA)
april 2011 by joelcarranza
The Corpus of Contemporary American English (COCA) is the largest freely-available corpus of English, and the only large and balanced corpus of American English. It was created by Mark Davies of Brigham Young University in 2008, and it is now used by tens of thousands of users every month (linguists, teachers, translators, and other researchers). COCA is also related to other large corpora that we have created or modified, including the British National Corpus (our architecture and interface), the 100 million word TIME Corpus (1920s-2000s), and the new 400 million word Corpus of Historical American English (COHA; 1810-2009).
The corpus contains more than 410 million words of text and is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts. It includes 20 million words each year from 1990-2010 and the corpus is also updated once or twice a year (the most recent texts are from Summer 2010). Because of its design, it is perhaps the only corpus of English that is suitable for looking at current, ongoing changes in the language (see the 2010 article in Literary and Linguistic Computing).
nlp
thlinx
Google Ngram Viewer
april 2011 by joelcarranza
Here are the datasets backing the Google Books Ngram Viewer. These datasets were generated in July 2009; we will update these datasets as our book scanning continues, and the updated versions will have distinct and persistent version identifiers (20090715 for the current set).
Each of the links below will directly download a fragment of the given corpus. For instance, the first hundred links below collectively comprise the 1-gram (i.e., individual words) counts for English, as collected from Google's scanned books around July 15, 2009. Details on the corpus construction can be found in the Science article written by J.B. Michel et al. but are abbreviated here.
nlp
thlinx
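A minimal sketch of how the 1-gram fragments described above might be tallied once downloaded. It assumes each line is tab-separated as ngram, year, match count, page count, volume count; that layout is an assumption about the 20090715 files and should be checked against the dataset page. The function name `total_counts` and the sample rows are made up for illustration.

```python
import io
from collections import defaultdict

def total_counts(lines):
    """Sum match counts per n-gram across all years.

    Assumes tab-separated fields: ngram, year, match_count,
    page_count, volume_count (an assumption, not confirmed here).
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip lines that do not fit the assumed layout
        ngram, _year, match_count, _pages, _volumes = fields
        totals[ngram] += int(match_count)
    return dict(totals)

# Made-up sample rows standing in for a downloaded fragment.
sample = io.StringIO(
    "example\t1978\t313\t215\t85\n"
    "example\t1979\t183\t147\t77\n"
    "other\t1979\t1000\t900\t100\n"
)
totals = total_counts(sample)
print(totals)
```

With a real fragment, the `io.StringIO` object would be replaced by an open file handle; the function reads line by line, so it does not need the whole file in memory.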
YAGO2
april 2011 by joelcarranza
YAGO2 is a huge semantic knowledge base, derived from Wikipedia, WordNet and GeoNames. Currently, YAGO2 has knowledge of more than 10 million entities (like persons, organizations, cities, etc.) and contains more than 80 million facts about these entities.
In YAGO2, we made an effort to treat time and location data as first-class citizens, extending the basic triple model with special fields for time and location for querying. Also, we took special care to consistently attach temporal and spatial data to all facts where it is semantically meaningful and where time and location can be derived from Wikipedia. Unlike many other automatically assembled knowledge bases, YAGO2 has a confirmed accuracy of 95%.
via:mootPoint
nlp
thlinx
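One way to picture the "triple plus time and location" idea from the YAGO2 description is a record with two optional extra fields. This is only an illustrative sketch: the `Fact` class, its field names, and the sample facts are hypothetical and do not reflect YAGO2's actual schema or data.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Fact:
    """A subject-predicate-object triple with optional time and location."""
    subject: str
    predicate: str
    obj: str
    time: Optional[str] = None      # hypothetical field, e.g. a year
    location: Optional[str] = None  # hypothetical field, e.g. a place name

def facts_at(facts, location):
    """Return the facts whose location field matches `location`."""
    return [f for f in facts if f.location == location]

# Invented sample facts, not taken from YAGO2.
facts = [
    Fact("Alice", "bornIn", "Springfield", time="1970", location="Springfield"),
    Fact("Alice", "worksFor", "ExampleCorp"),
    Fact("Bob", "bornIn", "Shelbyville", time="1980", location="Shelbyville"),
]
springfield_facts = facts_at(facts, "Springfield")
print(springfield_facts)
```

Facts without a meaningful time or place simply leave the extra fields as `None`, which mirrors the description's point that temporal and spatial data are attached only where they make sense.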
WordNet
april 2011 by joelcarranza
WordNet® is a large lexical database of English, developed under the direction of George A. Miller (Emeritus). Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing.
nlp
thlinx
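The synset structure described above (groups of synonyms linked by semantic relations) can be sketched with a small in-memory graph. The entries below are toy data with invented identifiers, not actual WordNet contents, and `hypernym_chain` and `synonyms` are hypothetical helper names; for real lookups a WordNet interface such as the one in NLTK would be the likely choice.

```python
# Toy synsets: each has a list of lemmas and an optional "is-a" parent.
synsets = {
    "dog": {"lemmas": ["dog", "domestic_dog"], "hypernym": "canine"},
    "canine": {"lemmas": ["canine", "canid"], "hypernym": "carnivore"},
    "carnivore": {"lemmas": ["carnivore"], "hypernym": None},
}

def hypernym_chain(synset_id):
    """Follow hypernym links upward, returning the ids visited in order."""
    chain = []
    current = synset_id
    while current is not None:
        chain.append(current)
        current = synsets[current]["hypernym"]
    return chain

def synonyms(word):
    """Return the other lemmas in every toy synset that contains `word`."""
    found = []
    for entry in synsets.values():
        if word in entry["lemmas"]:
            found.extend(lemma for lemma in entry["lemmas"] if lemma != word)
    return found

chain = hypernym_chain("dog")
print(chain)
print(synonyms("canine"))
```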
Word Frequency Counts - words frequency programming | Ask MetaFilter
april 2011 by joelcarranza
Can I compute how frequently a word occurs in general English text? I have a list of about 2000 words, and I want to sort it with the most common words first.
nlp
thlinx
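A rough sketch of one answer to the word-frequency question above: count tokens in some reference text, then sort the word list by those counts. The function name `rank_by_frequency` and the tiny sample text are assumptions for illustration; in practice the counts would come from a large corpus or a published frequency list.

```python
import re
from collections import Counter

def rank_by_frequency(words, reference_text):
    """Return `words` sorted so the most frequent in `reference_text` come first."""
    # Lowercase and split into simple alphabetic tokens.
    tokens = re.findall(r"[a-z']+", reference_text.lower())
    counts = Counter(tokens)
    # Words missing from the text count as 0; sorted() is stable,
    # so ties keep their original relative order.
    return sorted(words, key=lambda w: counts[w.lower()], reverse=True)

# Tiny made-up reference text standing in for a real corpus.
sample_text = "the cat sat on the mat and the dog sat too"
ranked = rank_by_frequency(["dog", "sat", "the", "zebra"], sample_text)
print(ranked)
```

The quality of the ranking depends entirely on how representative the reference text is of general English.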
Natural Language Processing for the Working Programmer
november 2010 by joelcarranza
CC licensed ebook on working with text
ebook
nlp
thlinx
license:cc