rybesh + textmining   13

Knowledge Base Acceleration (KBA) -- a track in NIST's TREC 2012
The data for TREC KBA 2012 has two components: Target Entities (Filtering Queries) and Stream Corpus (Text Documents).
trec  IR  semweb  textmining 
9 weeks ago by rybesh
Twelve steps to running your Ruby code across five billion web pages | CommonCrawl
A starting point to write your own Ruby algorithms to analyse the wealth of information that’s buried in the Common Crawl web archive.
ec2  hadoop  web  datamining  textmining 
9 weeks ago by rybesh
Automating Quantitative Narrative Analysis of News Data
We present a working system for large scale quantitative narrative analysis (QNA) of news corpora, which includes various recent ideas from text mining and pattern analysis in order to solve a problem arising in computational social sciences. The task is that of identifying the key actors in a body of news, and the actions they perform, so that further analysis can be carried out. This step is normally performed by hand and is very labour intensive. We then characterise the actors by: studying their position in the overall network of actors and actions; studying the time series associated with some of their properties; generating scatter plots describing the subject/object bias of each actor; and investigating the types of actions each actor is most associated with. The system is demonstrated on a set of 100,000 articles about crime that appeared in the New York Times between 1987 and 2007. As an example, we find that Men were most commonly responsible for crimes against the person, while Women and Children were most often victims of those crimes.
textanalysis  textmining  events  sociology  news 
12 weeks ago by rybesh
Latent Dirichlet Allocation in C
This is a C implementation of variational EM for latent Dirichlet allocation (LDA), a topic model for text or other discrete data. LDA allows you to analyze a corpus and extract the topics that combined to form its documents. The project page includes example topics estimated from a small corpus of Associated Press documents. LDA is fully described in Blei et al. (2003).

This code contains:

an implementation of variational inference for the per-document topic proportions and per-word topic assignments
a variational EM procedure for estimating the topics and exchangeable Dirichlet hyperparameter
lda  c  linguistics  machinelearning  textanalysis  textmining 
12 weeks ago by rybesh
Conditional Random Fields
Conditional random fields (CRFs) are a probabilistic framework for labeling and segmenting structured data, such as sequences, trees and lattices. The underlying idea is that of defining a conditional probability distribution over label sequences given a particular observation sequence, rather than a joint distribution over both label and observation sequences. The primary advantage of CRFs over hidden Markov models is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs in order to ensure tractable inference. Additionally, CRFs avoid the label bias problem, a weakness exhibited by maximum entropy Markov models (MEMMs) and other conditional Markov models based on directed graphical models. CRFs outperform both MEMMs and HMMs on a number of real-world tasks in many fields, including bioinformatics, computational linguistics and speech recognition.
machinelearning  nlp  crf  textmining  metadata 
february 2012 by rybesh
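As a hedged illustration of the "conditional probability distribution over label sequences given a particular observation sequence" described in the entry above, the standard linear-chain form can be written as follows. The symbols (feature functions f_k, weights \lambda_k, sequence length T) are notation assumed here and do not appear in the bookmarked description.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

Because the normalizer Z(\mathbf{x}) sums over whole label sequences rather than over each state's outgoing transitions, this form avoids the per-state normalization that gives rise to the label bias problem in MEMMs.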
The Meaning and The Mining of Legal Texts
Positive law, inscribed in legal texts, entails an authority not inherent in literary texts, generating legal consequences that can have real effects on a person’s life and liberty. The interpretation of legal texts, necessarily a normative undertaking, resists the mechanical application of rules, though still requiring a measure of predictability, coherence with other relevant legal norms and compliance with constitutional safeguards. The present proliferation of legal texts on the internet (codes, statutes, judgments, treaties, doctrinal treatises) renders the selection of relevant texts and cases next to impossible. We may expect that systems to mine these texts to find arguments that support one’s case, as well as expert systems that support the decision-making process of courts, will end up doing much of the work.

This raises the question of the difference between human interpretation and computational pattern-recognition and the issue of whether this difference makes a difference for the meaning of law. Possibly, data mining will produce patterns that disclose habits of the minds of judges and legislators that would have otherwise gone unnoticed (reinforcing the argument of the ‘legal realists’ at the beginning of the 20th century). Also, after the data analysis it will still be up to the judge to decide how to interpret the results or up to the prosecution which patterns to engage in the construction of evidence (requiring a hermeneutics of computational patterns instead of texts). My focus in this paper regards the fact that the mining process necessarily disambiguates the legal texts in order to transform them into a machine-readable data set, while the algorithms used for the analysis embody a strategy that will co-determine the outcome of the patterns. There seems to be a major due process concern here to the extent that these patterns are invisible to the naked human eye and will not be contestable in a court of law, due to their hidden complexity and computational nature.

This position paper aims to explain what is at stake in the computational turn with regard to legal texts. This prepares for the question I want to put forward to those involved in distant reading and not-reading of texts: could a visualization of computational patterns constitute a new way of un-hiding the complexity involved, opening the results of computational ‘knowledge’ to citizens’ scrutiny?
textmining  machinelearning  visualization  digitalhumanities  law 
january 2012 by rybesh
A panlingual anomalous text detector
In a large-scale book scanning operation, material can vary widely in language, script, genre, domain, print quality, and other factors, giving rise to a corresponding variability in the OCRed text. It is often desirable to automatically detect errorful and otherwise anomalous text segments, so that they can be filtered out or appropriately flagged, for such applications as indexing, mining, analyzing, displaying, and selectively re-processing such data. Moreover, it is advantageous to require that the automated detector be independent of the underlying OCR engine (or engines), that it work over a broad range of languages, that it seamlessly handle mixed-language material, and that it accommodate documents that contain domain-specific and otherwise rare terminology. A technique is presented that satisfies these requirements, using an adaptive mixture of character-level N-gram language models. Its design, training, implementation, and evaluation are described within the context of high-volume book scanning.
ocr  textanalysis  textmining  evalulation 
october 2011 by rybesh
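A minimal sketch of the idea behind the anomalous text detector entry above: score text segments with a character-level N-gram language model and flag the low-scoring ones. This is not the paper's method. It swaps the adaptive mixture of models for a single fixed trigram model with add-one smoothing, and the class name, method names, and training text are all hypothetical.

```python
import math
from collections import Counter


class CharNgramModel:
    """Single character-level n-gram model with add-one smoothing (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.ngrams = Counter()    # counts of full n-grams
        self.contexts = Counter()  # counts of the (n-1)-character prefixes
        self.vocab = set()         # characters seen in training

    def train(self, text):
        # Pad with spaces so the first characters also have a full context.
        padded = " " * (self.n - 1) + text
        self.vocab.update(padded)
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            self.ngrams[gram] += 1
            self.contexts[gram[:-1]] += 1

    def avg_log_prob(self, text):
        """Average per-character log-probability; lower means more anomalous."""
        if not text:
            return 0.0
        padded = " " * (self.n - 1) + text
        v = len(self.vocab) + 1  # one extra slot for unseen characters
        total = 0.0
        count = 0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            p = (self.ngrams[gram] + 1) / (self.contexts[gram[:-1]] + v)
            total += math.log(p)
            count += 1
        return total / count


model = CharNgramModel(n=3)
model.train("the quick brown fox jumps over the lazy dog " * 20)
clean_score = model.avg_log_prob("the lazy dog")
noisy_score = model.avg_log_prob("xq#zz@@kk~~")
```

Here `clean_score` should come out higher than `noisy_score`, so a threshold on the average log-probability could serve as a crude flag for OCR debris. A multilingual detector like the one described would need one model per language or script and some way of combining them.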
Price-is-Right Binary Search (for Suffix Arrays of Documents) « LingPipe Blog
Suffix arrays are useful if you’re looking for anything from plagiarized passages in a pile of writing assignments, cut-and-paste code blocks in a large project, or just commonly repeated phrases on Twitter.
search  textanalysis  textmining 
june 2011 by rybesh
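To make the suffix array entry above concrete, here is a small sketch of finding a repeated passage with a suffix array. It uses a naive sort-based construction that is only suitable for short inputs, and it is not the LingPipe implementation or the binary search variant the post's title refers to. The function names are hypothetical.

```python
def build_suffix_array(text):
    """Return suffix start positions sorted by suffix (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def longest_repeated_substring(text):
    """Find the longest substring that occurs at least twice in text."""
    sa = build_suffix_array(text)
    best = ""
    # Repeated substrings show up as shared prefixes of adjacent sorted suffixes.
    for a, b in zip(sa, sa[1:]):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        if n > len(best):
            best = text[a:a + n]
    return best


print(longest_repeated_substring("banana"))  # ana
```

The same adjacency property is what makes suffix arrays useful for spotting copied passages or repeated phrases: any substring that occurs more than once appears as a common prefix of neighbouring entries in the sorted order.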
Wikipedia Miner - Home
Wikipedia Miner is a toolkit for navigating and making use of the structure and content of Wikipedia. It aims to make it easy for you to integrate Wikipedia's knowledge into your own applications, by:

providing simplified, object-oriented access to Wikipedia's structure and content.
measuring how terms and concepts in Wikipedia are connected to each other.
detecting and disambiguating Wikipedia topics when they are mentioned in documents.
wikipedia  textmining  nlp  webservices  tools  datamining 
may 2011 by rybesh
tm - Text Mining Package
tm (shorthand for Text Mining Infrastructure in R) provides a framework for text mining applications within R.

The tm package offers functionality for managing text documents, abstracts the process of document manipulation and eases the usage of heterogeneous text formats in R. The package has integrated database backend support to minimize memory demands. Advanced metadata management is implemented for collections of text documents to ease the use of large, metadata-enriched document sets.
R  textmining  datamining  nlp  tools  statistics 
october 2010 by rybesh
CRCnetBASE - Text Mining
Giving a broad perspective of the field from numerous vantage points, Text Mining: Classification, Clustering, and Applications focuses on statistical methods for text mining and analysis. It examines methods to automatically cluster and classify text documents and applies these methods in a variety of areas, including adaptive information filtering, information distillation, and text search.

The book begins with chapters on the classification of documents into predefined categories. It presents state-of-the-art algorithms and their use in practice. The next chapters describe novel methods for clustering documents into groups that are not predefined. These methods seek to automatically determine topical structures that may exist in a document corpus. The book concludes by discussing various text mining applications that have significant implications for future research and industrial use.
textmining  nlp 
september 2010 by rybesh
