Lucid Imagination » Accessing words around a positional match in Lucene
25 days ago
Given a term match in a document, what’s the best way to get a window of words around that match?
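One way to picture the question is a minimal sketch in plain Python over an already tokenized document. This is not Lucene API: the function name `windows_around` and the `width` parameter are made up for illustration, and in Lucene the match positions would come from the index rather than from a linear scan.

```python
def windows_around(tokens, term, width=2):
    """Return one (left, match, right) triple per occurrence of term.

    Hypothetical helper: scans a token list instead of using an index.
    """
    hits = []
    for pos, tok in enumerate(tokens):
        if tok.lower() == term.lower():
            # Clamp the left edge at 0; slicing clamps the right edge itself.
            left = tokens[max(0, pos - width):pos]
            right = tokens[pos + 1:pos + 1 + width]
            hits.append((left, tok, right))
    return hits


tokens = "the quick brown fox jumps over the lazy dog".split()
example = windows_around(tokens, "fox", width=2)
```

Matches near the start or end of the document simply get a shorter window on that side.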
lucene
programming
informationretrieval
ngram
The Intelius Nickname Collection: Quantitative Analyses from Billions of Public Records
4 weeks ago
Although first names and nicknames in the United States have been well documented, there has been almost no quantitative analysis on the usage and association of these names amongst themselves. In this paper we introduce the Intelius Nickname Collection, a quantitative compilation of millions of name-nickname associations based on information gathered from billions of public records. To the best of our knowledge, this is the largest collection of its kind, making it a natural resource for tasks such as coreference resolution, record linkage, named entity recognition, people and expert search, information extraction, demographic and sociological studies, etc. The collection will be made freely available.
names
people
research
nlp
informationextraction
paper
Christopia » Blog Archive » Installing and Running Opinion Finder for Sentiment Analysis
4 weeks ago
Describes all workarounds necessary to install OpinionFinder on a current system
opinionfinder
installation
opinionmining
sentimentanalysis
research
useful
Bill McDonald's Word Lists Page
7 weeks ago
Financial Sentiment Dictionaries & other word lists
sentimentanalysis
optimization
nlp
research
datasets
resources
wordlist
financial
UBY
8 weeks ago
UBY is a large-scale lexical-semantic resource for natural language processing (NLP) based on the ISO standard Lexical Markup Framework (LMF). UBY combines a wide range of information from expert-constructed and collaboratively constructed resources for English and German. Currently, UBY holds structurally and semantically interoperable versions of nine resources in two languages:
English WordNet, Wiktionary, Wikipedia, FrameNet and VerbNet,
German Wikipedia, Wiktionary and GermaNet, and multilingual OmegaWiki.
nlp
resources
research
wordnet
wikipedia
wiktionary
german
english
WikiTrust
10 weeks ago
WikiTrust is an open-source, on-line reputation system for Wikipedia authors and content. WikiTrust is hosted by the Institute for Scalable Scientific Data Management at the School of Engineering of the University of California, Santa Cruz.
To use WikiTrust, you need to install a Firefox add-on and then visit one of the Wikipedias on which it is active (currently the English, French, German, or Polish Wikipedias). You will see a WikiTrust tab. If you click on it, you will see the text of the Wikipedia, colored according to the degree to which it has been revised by high-reputation authors:
High-reputation text, revised by many high-reputation authors, appears over a white background.
Low-reputation text, which has not yet benefited from revision by multiple high-reputation users, is displayed over an orange background: the more intense the orange, the lower the reputation of the text.
In this way, WikiTrust will help you spot recent, unrevised changes to Wikipedia pages. Furthermore, if you ALT-click on a word, you will be taken to the diff where that word (in that context) was first introduced in the article: this enables you to trace the text back to its authors.
wikipedia
trust
authorship
interesting
brat rapid annotation tool
11 weeks ago
brat is a web-based tool for text annotation; that is, for adding notes to existing text documents.
brat is designed in particular for structured annotation, where the notes are not freeform text but have a fixed form that can be automatically processed and "interpreted" by a computer.
annotation
research
nlp
corpus
tools
Pattern | CLiPS
11 weeks ago
Pattern is a web mining module for the Python programming language.
It bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics), clustering and classification (k-means, KNN, SVM), and data visualization (graph networks).
The module is bundled with 30+ example scripts and 350+ unit tests.
datamining
nlp
python
library
webmining
textmining
kiama - A Scala library for language processing - Google Project Hosting
february 2012
Kiama is a Scala library for language processing. It enables convenient analysis and transformation of structured data. The programming styles supported by the library are based on well-known formal language processing paradigms, including attribute grammars, tree rewriting, abstract state machines, and pretty printing.
Kiama is a project of the Programming Languages Research Group in the Department of Computing at Macquarie University and is led by Tony Sloane (inkytonik on GMail and Twitter). Other participants at Macquarie are Dominic Verity and the PLRG group students.
Collaborators on the Kiama project include the Software Engineering Research Group at the Delft University of Technology in The Netherlands, notably Eelco Visser and his student Lennart Kats.
library
scala
research
opensource
parser
Sylvester UGC Tokenizer
january 2012
Sylvester UGC Tokenizer is a simple tool that splits noisy text into segments such as words, punctuation blocks, URLs, and smileys. Most tokenizers were built to handle clean text and can corrupt noisy messages (e.g., Twitter posts). We use a text classification approach, achieving significantly better results.
tokenizer
usergeneratedcontent
research
library
python
nlp
twitter
SVM-Light Support Vector Machine
january 2012
SVMlight is an implementation of Support Vector Machines (SVMs) in C. The main features of the program are the following:
fast optimization algorithm
working set selection based on steepest feasible descent
"shrinking" heuristic
caching of kernel evaluations
use of folding in the linear case
solves classification and regression problems. For multivariate and structured outputs use SVMstruct.
solves ranking problems (e.g., learning retrieval functions in the STRIVER search engine).
computes XiAlpha-estimates of the error rate, the precision, and the recall
efficiently computes Leave-One-Out estimates of the error rate, the precision, and the recall
includes algorithm for approximately training large transductive SVMs (TSVMs) (see also Spectral Graph Transducer)
can train SVMs with cost models and example dependent costs
allows restarts from specified vector of dual variables
handles many thousands of support vectors
handles several hundred-thousands of training examples
supports standard kernel functions and lets you define your own
uses sparse vector representation
machinelearning
svm
research
library
c
java
GATE.ac.uk - projects/neon/termraider.html
december 2011
The idea behind TermRaider is the automated domain-specific provision of term candidates. It is implemented as part of the GATE Web Services plugin in the NeOn toolkit.
nlp
software
tools
gate
research
term_detection
Comment #15 : Bug #432785 : Bugs : eCryptfs
december 2011
How to disable encrypted swap to re-enable resume from hibernate
ubuntu
linux
administration
encryption
hibernate
Publikative.org » Blog Archive » Hintergrund: Die Extremismustheorie
november 2011
In summary: the concept of extremism was introduced without any clearly identifiable justification; it is highly contested in academia but has its justification from the state's point of view. The term gives no indication of the content of the underlying ideologies; this is supposed to be supplied by qualifiers such as right-wing, left-wing, or foreigner extremism. The idea that right-wing extremism is a phenomenon of a political “fringe” does not do justice to the complex causes of right-wing extremism.
politics
germany
extremism
N-grams: corpus based (COCA, COHA, Spanish, Portuguese)
november 2011
These n-grams are based on the largest publicly available, genre-balanced corpus of English: the 425-million-word Corpus of Contemporary American English (COCA). With this n-gram data (2-, 3-, 4-, and 5-word sequences, with their frequencies), you can carry out powerful queries offline, without needing to access the corpus via the web interface.
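To make "word sequences with their frequency" concrete, here is a minimal Python sketch that counts n-grams in a token list. It is only an illustration of the data shape; the function name `ngram_counts` is hypothetical and says nothing about how the COCA lists were actually produced.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    # range() is empty when the text is shorter than n, so the result
    # is an empty Counter in that case.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


tokens = "to be or not to be".split()
bigrams = ngram_counts(tokens, 2)
```

Here `bigrams[("to", "be")]` is 2, and each of the other three bigrams occurs once.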
ngram
corpus
list
resources
research
linguistics
nlp
english
Geheimdienste: Hauptsache, es macht peng! - Debatten - FAZ
november 2011
“Today we can only note their complete failure […]. The services serve only themselves. It is therefore right to dissolve them.”
politics
artikel
germany
faz
RapidMiner Extensions | Data Mining Portal
october 2011
RapidMiner is the open source data mining solution used within e-Lico for executing data mining operators and workflows. Within e-Lico, we have developed various extensions for RapidMiner.
Using the RapidMiner Community Extension, the user can share data mining workflows on the myexperiment.org portal.
The Image Mining Extension uses the image mining Web service provided by NHRF to execute image mining methods within RapidMiner.
The Market Basket Analysis Extension provides RapidMiner operators that build upon the association rule mining framework but offer additional analytic capabilities beyond simple associations.
rapidminer
extension
plugin
datamining
research
tools
Find out what is using your swap
october 2011
Have you ever logged in to a server, run `free`, seen that a bit of swap is in use, and wondered what’s in there? Knowing the answer is usually neither very indicative of anything nor especially helpful; mostly it’s a curiosity thing.
Either way, starting from kernel 2.6.16, we can find out using smaps, which can be found in the proc filesystem. I’ve written a simple bash script that prints all running processes and their swap usage.
It’s quick and dirty, but it does the job and can easily be modified to work on any info exposed in /proc/$PID/smaps.
If I find the time and inspiration, I might tidy it up and extend it a bit to cover some more alternatives. The output is in kilobytes.
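The bash script itself is not reproduced in this bookmark. As a rough sketch of the same idea in Python (not the author's script), the code below sums the `Swap:` lines of each process's smaps file. The names `swap_kb` and `swap_by_process` are hypothetical, and it assumes a Linux system whose smaps files report swap as `Swap: <n> kB`.

```python
import os
import re

# Matches "Swap:   12 kB" but not "SwapPss:   12 kB".
SWAP_LINE = re.compile(r"^Swap:\s+(\d+)\s+kB", re.MULTILINE)


def swap_kb(smaps_text):
    """Sum all Swap: entries (in kilobytes) in the text of one smaps file."""
    return sum(int(kb) for kb in SWAP_LINE.findall(smaps_text))


def swap_by_process(proc_root="/proc"):
    """Map PID -> swapped kilobytes for every readable process."""
    usage = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "smaps")) as fh:
                usage[int(entry)] = swap_kb(fh.read())
        except OSError:
            continue  # process exited or permission denied
    return usage
```

Calling `swap_by_process()` as root covers all processes; as an ordinary user, unreadable processes are skipped.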
linux
unix
administration