Lucid Imagination » Accessing words around a positional match in Lucene
Given a term match in a document, what’s the best way to get a window of words around that match?
lucene  programming  informationretrieval  ngram 
25 days ago
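For the Lucene question above, the windowing idea itself can be sketched without any Lucene API. The snippet below is only an illustration on a plain token list: the function name `context_windows`, the `width` parameter, and the sample sentence are assumptions made for this example, and in a real index the match positions would come from the index rather than a linear scan.

```python
def context_windows(tokens, term, width=2):
    """Return a (left, match, right) tuple for every occurrence of `term`.

    `tokens` is an already tokenized document; `width` is the number of
    words to keep on each side of the match (hypothetical parameter).
    """
    windows = []
    for pos, token in enumerate(tokens):
        if token.lower() == term.lower():
            # Clamp the left edge at 0; slicing clamps the right edge itself.
            left = tokens[max(0, pos - width):pos]
            right = tokens[pos + 1:pos + 1 + width]
            windows.append((left, token, right))
    return windows


# Hypothetical sample document.
tokens = "the quick brown fox jumps over the lazy dog".split()
windows = context_windows(tokens, "fox")
```

Matches near the start or end of the document simply get a shorter window on that side.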
The Intelius Nickname Collection: Quantitative Analyses from Billions of Public Records
Although first names and nicknames in the United States have been well documented, there has been almost no quantitative analysis on the usage and association of these names amongst themselves. In this paper we introduce the Intelius Nickname Collection, a quantitative compilation of millions of name-nickname associations based on information gathered from billions of public records. To the best of our knowledge, this is the largest collection of its kind, making it a natural resource for tasks such as coreference resolution, record linkage, named entity recognition, people and expert search, information extraction, demographic and sociological studies, etc. The collection will be made freely available.
names  people  research  nlp  informationextraction  paper 
4 weeks ago
UBY
UBY is a large-scale lexical-semantic resource for natural language processing (NLP) based on the ISO standard Lexical Markup Framework (LMF). UBY combines a wide range of information from expert-constructed and collaboratively constructed resources for English and German. Currently, UBY holds structurally and semantically interoperable versions of nine resources in two languages:

English WordNet, Wiktionary, Wikipedia, FrameNet and VerbNet,
German Wikipedia, Wiktionary and GermaNet, and multilingual OmegaWiki.
nlp  resources  research  wordnet  wikipedia  wiktionary  german  english 
8 weeks ago
WikiTrust
WikiTrust is an open-source, on-line reputation system for Wikipedia authors and content. WikiTrust is hosted by the Institute for Scalable Scientific Data Management at the School of Engineering of the University of California, Santa Cruz.

To use WikiTrust, you need to install a Firefox add-on, and then visit one of the Wikipedias on which it is active (currently, the English, French, German, or Polish Wikipedias). You will see a WikiTrust tab. If you click on it, you will see the text of the Wikipedia, colored according to the degree to which it has been revised by high-reputation authors:

High-reputation text, revised by many high-reputation authors, will appear over a white background.
Low-reputation text, which has not benefitted yet from revision by multiple, high-reputation users, is displayed over an orange background: the more intense the orange, the lower the reputation of text.

In this way, WikiTrust will help you spot recent, unrevised changes to Wikipedia pages. Furthermore, if you ALT-click on a word, you will be taken to the diff where that word (in that context) was first introduced in the article: this enables you to trace the text back to its authors.
wikipedia  trust  authorship  interesting 
10 weeks ago
brat rapid annotation tool
brat is a web-based tool for text annotation; that is, for adding notes to existing text documents.

brat is designed in particular for structured annotation, where the notes are not freeform text but have a fixed form that can be automatically processed and "interpreted" by a computer.
annotation  research  nlp  corpus  tools 
11 weeks ago
Pattern | CLiPS
Pattern is a web mining module for the Python programming language.

It bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics), clustering and classification (k-means, KNN, SVM), and data visualization (graph networks).

The module is bundled with 30+ example scripts and 350+ unit tests.
datamining  nlp  python  library  webmining  textmining 
11 weeks ago
kiama - A Scala library for language processing - Google Project Hosting
Kiama is a Scala library for language processing. It enables convenient analysis and transformation of structured data. The programming styles supported by the library are based on well-known formal language processing paradigms, including attribute grammars, tree rewriting, abstract state machines, and pretty printing.

Kiama is a project of the Programming Languages Research Group in the Department of Computing at Macquarie University and is led by Tony Sloane (inkytonik on GMail and Twitter). Other participants at Macquarie are Dominic Verity and the PLRG group students.

Collaborators on the Kiama project include the Software Engineering Research Group at the Delft University of Technology in The Netherlands, notably Eelco Visser and his student Lennart Kats.
library  scala  research  opensource  parser 
february 2012
Sylvester UGC Tokenizer
Sylvester UGC Tokenizer is a simple tool that is capable of splitting noisy text into segments, such as words, punctuation blocks, URLs, smileys, and so on. Most tokenizers were made to handle clean text and can corrupt noisy messages (e.g. Twitter posts). We use a text classification approach, achieving significantly better results.
tokenizer  usergeneratedcontent  research  library  python  nlp  twitter 
january 2012
SVM-Light Support Vector Machine
SVMlight is an implementation of Support Vector Machines (SVMs) in C. The main features of the program are the following:

fast optimization algorithm
working set selection based on steepest feasible descent
"shrinking" heuristic
caching of kernel evaluations
use of folding in the linear case
solves classification and regression problems. For multivariate and structured outputs use SVMstruct.
solves ranking problems (e.g. learning retrieval functions in STRIVER search engine).
computes XiAlpha-estimates of the error rate, the precision, and the recall
efficiently computes Leave-One-Out estimates of the error rate, the precision, and the recall
includes algorithm for approximately training large transductive SVMs (TSVMs) (see also Spectral Graph Transducer)
can train SVMs with cost models and example dependent costs
allows restarts from specified vector of dual variables
handles many thousands of support vectors
handles several hundred-thousands of training examples
supports standard kernel functions and lets you define your own
uses sparse vector representation
machinelearning  svm  research  library  c  java 
january 2012
GATE.ac.uk - projects/neon/termraider.html
The idea behind TermRaider is the automated domain-specific provision of term candidates. It is implemented as part of the GATE Web Services plugin in the NeOn toolkit.
nlp  software  tools  gate  research  term_detection 
december 2011
Comment #15 : Bug #432785 : Bugs : eCryptfs
How to disable encrypted swap to re-enable resume from hibernate
ubuntu  linux  administration  encryption  hibernate 
december 2011
Publikative.org » Blog Archive » Background: The Extremism Theory
In summary: the concept of extremism was introduced without any clearly identifiable justification; it is highly contested in academia, but from the state's point of view it has its justification. The term says nothing about the content of the ideologies behind it; that is supposed to be supplied by extensions such as right-wing, left-wing, or foreigner extremism. The idea that right-wing extremism is a phenomenon of a political “fringe” does not do justice to the complex causes of right-wing extremism.
politics  germany  extremism 
november 2011
N-grams: corpus based (COCA, COHA, Spanish, Portuguese)
These n-grams are based on the largest publicly-available, genre-balanced corpus of English -- the 425 million word Corpus of Contemporary American English (COCA). With this n-grams data (2, 3, 4, 5-word sequences, with their frequency), you can carry out powerful queries offline -- without needing to access the corpus via the web interface.
ngram  corpus  list  resources  research  linguistics  nlp  english 
november 2011
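The n-gram entry above describes word sequences stored together with their frequency. As a rough illustration of that data shape only (this is not the COCA data or its file format), a minimal sketch with a hypothetical `ngram_counts` helper and a made-up sentence could look like this:

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-word sequence in `tokens`.

    Returns a Counter mapping n-gram tuples to their frequency.
    """
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


# Hypothetical sample text, not taken from any corpus.
tokens = "to be or not to be".split()
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
```

With a table like this held locally, frequency lookups such as `bigrams[("to", "be")]` need no access to a web interface.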
Intelligence services: As long as it goes bang! - Debates - FAZ
“Today we can only note their complete failure […]. The services serve only themselves. It is therefore right to dissolve them.”
politics  artikel  germany  faz 
november 2011
RapidMiner Extensions | Data Mining Portal
RapidMiner is the open source data mining solution used within e-Lico for executing data mining operators and workflows. Within e-Lico, we have developed various extensions for RapidMiner.

Using the RapidMiner Community Extension, the user can share data mining workflows on the myexperiment.org portal.

The Image Mining Extension uses the image mining Web service provided by NHRF to execute image mining methods within RapidMiner.

The Market Basket Analysis Extension provides RapidMiner operators that build upon the association rule mining framework but offer additional analytic capabilities beyond simple associations.
rapidminer  extension  plugin  datamining  research  tools 
october 2011
Find out what is using your swap
Have you ever logged in to a server, run `free`, seen that a bit of swap is used and wondered what’s in there? Knowing what’s in there is usually not very indicative of anything, or even overly helpful; mostly it’s a curiosity thing.

Either way, starting from kernel 2.6.16, we can find out using smaps, which can be found in the proc filesystem. I’ve written a simple bash script which prints out all running processes and their swap usage.
It’s quick and dirty, but does the job and can easily be modified to work on any info exposed in /proc/$PID/smaps.
If I find the time and inspiration, I might tidy it up and extend it a bit to cover some more alternatives. The output is in kilobytes.
linux  unix  administration 
october 2011
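The swap entry above refers to a bash script that is not reproduced here. As a stand-in, the sketch below shows the same idea in Python under stated assumptions: it sums the `Swap:` lines of one process's smaps text, with values in kilobytes. The function names and the `SAMPLE` text are hypothetical and only mimic the general layout of /proc/$PID/smaps.

```python
def swap_kb(smaps_text):
    """Sum the 'Swap:' fields (in kB) found in the text of a smaps file."""
    total = 0
    for line in smaps_text.splitlines():
        # Match only 'Swap:' itself, not related fields such as 'SwapPss:'.
        if line.startswith("Swap:"):
            total += int(line.split()[1])
    return total


def process_swap_kb(pid):
    """Read /proc/<pid>/smaps (Linux only; may need privileges) and sum swap."""
    with open(f"/proc/{pid}/smaps") as fh:
        return swap_kb(fh.read())


# Hypothetical excerpt in the general shape of an smaps file.
SAMPLE = """\
00400000-0040b000 r-xp 00000000 08:01 1234 /bin/cat
Size:                 44 kB
Rss:                  20 kB
Swap:                  8 kB
7f0000000000-7f0000021000 rw-p 00000000 00:00 0
Size:                132 kB
Swap:                 12 kB
SwapPss:               4 kB
"""

sample_total = swap_kb(SAMPLE)
```

Looping `process_swap_kb` over the numeric directory names in /proc would give a per-process listing similar to the one the entry describes.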