rybesh + nlp   149

Coursera - Stanford NLP class
Jurafsky and Manning's online NLP course.
nlp  education 
yesterday by rybesh
gilesc/stanford-corenlp
Clojure wrapper for Stanford CoreNLP tools.
nlp  clojure 
10 days ago by rybesh
dakrone/clojure-opennlp
Clojure library interface to OpenNLP.
clojure  nlp 
10 days ago by rybesh
NEH Digital Humanities Startup Grants: Funding the Future « Early Modern Online Bibliography
The video “How Natural Language Processing is Changing Research” provides a more extended look at WordSeer’s usefulness for analyzing slave narratives, but its purpose is also to underscore how such a tool can benefit humanities scholars. In this video the discussion veers toward presenting reading as a chore from which humanities scholars seek relief. On that note, a student in Dr. Michael Ullyot’s undergraduate ENG 203 course, “Hamlet in the Humanities Lab” at the University of Calgary offers some pertinent comments. In her penultimate blog post for the course, Stephanie Vandework devotes a section to “The Pros and Cons of Exploratory Analysis” and examines more closely the claims in the WordSeer Shakespeare demo, finding some to suffer from overgeneralization. (For a view of the course from the instructor’s perspective, see Dr. Ullyot’s presentation, Teaching Hamlet in the Humanities Lab, for the Renaissance Society of America conference this past March 2012.)
nlp  digitalhumanities  textanalysis 
15 days ago by rybesh
Parsing Time: Learning to Interpret Time Expressions
We present a probabilistic approach for learning to interpret temporal phrases given only a corpus of utterances and the times they reference. While most approaches to the task have used regular expressions and similar linear pattern interpretation rules, the possibility of phrasal embedding and modification in time expressions motivates our use of a compositional grammar of time expressions. This grammar is used to construct a latent parse which evaluates to the time the phrase would represent, as a logical parse might evaluate to a concrete entity. In this way, we can employ a loosely supervised EM-style bootstrapping approach to learn these latent parses while capturing both syntactic uncertainty and pragmatic ambiguity in a probabilistic framework. We achieve an accuracy of 72% on an adapted TempEval-2 task – comparable to state of the art systems.
time  temporal  parsing  nlp 
16 days ago by rybesh
The Stanford NLP (Natural Language Processing) Group
Natural Language Understanding requires a large amount of background "common sense" knowledge about the situation under discussion. In many respects, using this knowledge is at the core of reasoning and acting in traditional Artificial Intelligence. When reading an article about a criminal conviction, the writer assumes the reader knows about trials, juries, and criminal activity. The Narrative Chain project aims to learn this knowledge by processing large amounts of text and learning which events tend to occur together. We are studying not just what can be learned, but also the best representation for this knowledge (graph, linear chain, frame?).

This project also includes research into ordering events in time. For instance, did the conviction or the sentencing happen first? We use modern machine learning techniques to find linguistic features that indicate this semantic ordering relation.

An example of a learned narrative event chain, with arrows indicating temporal ordering, is shown on the right. The bold words are the events, and the subj/obj terms indicate how the common actor in this narrative is involved in the event (the subject or object of the verb).
nlp  events  frames  narrative 
4 weeks ago by rybesh
Computational Linguistics for Literature
The amount of literary material available on-line keeps growing rapidly. Not only are there machine-readable texts in libraries, collections and e-book stores, but there is also more and more “live” literature – e-zines, blogs, self-published e-books and so on. There is a need for tools to help users navigate, visualize and appreciate the high volume of available literature.

Literary texts are quite different from technical and formal documents, which have been the focus of NLP research thus far. Most forms of statistical language processing rely on lexical information in one way or another. In literature, the primary mode is narrative rather than exposition. Stories may be cognitively easier to read than certain expository genres, such as scientific documents, but it is a challenging form of discourse for NLP tools and methods. For instance, literary prose lacks overt lexical clues and structural markers typically leveraged in the processing of more structured genres. Also, even conventional literary texts exhibit far less unity of time, space and topic than most formal discourse. Learning to handle these challenges in literary data may help move past heavy reliance on surface clues in general.

Literature also differs from other genres because of the needs of its typical audience. For instance, reading, searching or browsing literature online is a different task than searching for the latest news on a particular topic. Search criteria would be rather abstract: not a keyword, but a literary style, similarity to another work, point of view and so on. When looking for a summary or a digest, a reader may prefer to know or visualize a text's broad characteristics rather than facts which summarize the plot.

We invite papers that touch upon these areas, but also welcome other ideas which promote the processing of literary narrative or related forms of discourse.
literature  nlp  digitalhumanities  narrative 
4 weeks ago by rybesh
discourse structure reading group
Daniel Marcu's discourse structure reading group at ISI.
nlp  discourse 
5 weeks ago by rybesh
digital digs: the role of summary in composition
The obvious question is how one manages to distinguish among summary, analysis, argument, and interpretation. E.g.

With the aid of a rag tag crew of adventurers, a young man rescues a princess from an evil empire and discovers his destiny to become a member of a dying order of knights.

A young man helps a rebel leader escape from an imperial prison and participates in a pitched battle to save the rebels' military base.

I assume you recognize the story, and I think most people would say the first summary is more accurate. Why? The second one is certainly not inaccurate. It simply downplays the "hero's journey" aspect and portrays the film as depicting a political and collective activity.
narrative  language  events  perspective  frames  nlp 
8 weeks ago by rybesh
Index of /WordNet-Pairs
What are the N most similar words to X, according to WordNet?

This data seeks to answer that question, where similarity is based on
measures from WordNet::Similarity. http://wn-similarity.sourceforge.net
nlp  wordnet  opendata 
8 weeks ago by rybesh
Apache Stanbol - Welcome to Apache Stanbol (incubating)
Apache Stanbol (currently in incubation) is an open source modular software stack and reusable set of components for semantic content management.

Apache Stanbol components are meant to be accessed over RESTful interfaces to provide semantic services for content management. Thus, one application is to extend traditional content management systems with (internal or external) semantic services.
nlp  semweb  CMS  tools  editorsnotes 
8 weeks ago by rybesh
Digital Humanities 2011 tutorial
Chris Manning's tutorial at Digital Humanities 2011 at Stanford.
nlp  tutorial  digitalhumanities 
11 weeks ago by rybesh
Johansson & Nugues - LTH: Semantic Structure Extraction using Nonprojective Dependency Trees
We describe our contribution to the SemEval task on Frame-Semantic Structure Extraction. Unlike most previous systems described in literature, ours is based on dependency syntax. We also describe a fully automatic method to add words to the FrameNet lexical database, which gives an improvement in the recall of frame detection.
nlp  frames  framenet  parsing 
11 weeks ago by rybesh
Cross Validation vs. Inter-Annotator Agreement « LingPipe Blog
Our annotation tool follows the tag-a-little, train-a-little paradigm, in which an automatic system based on the already-annotated data is trained as you go to pre-annotate the data for a user to correct.
nlp  annotation 
11 weeks ago by rybesh
Stanford Topic Modeling Toolbox
Includes an implementation of PLDA.

Partially Labeled Dirichlet Allocation (PLDA) is a topic model that extends and generalizes both LDA and Labeled LDA. The model is analogous to Labeled LDA except that it allows more than one latent topic per label and a set of background labels. Learning and inference in the model is much like the example above for Labeled LDA, but you must additionally specify the number of topics associated with each label.
lda  plda  metadata  topicmodels  nlp  socialscience  scala 
12 weeks ago by rybesh
Natural Language Software Registry
The Natural Language Software Registry (NLSR) is a concise summary of the capabilities and sources of a large amount of natural language processing (NLP) software available to the NLP community. It comprises academic, commercial and proprietary software with specifications and terms on which it can be acquired clearly indicated.
nlp  linguistics  tools 
february 2012 by rybesh
Robert Young - Text Understanding: A Survey
The goal of the study is to examine work that has something to offer toward the construction of a computable model of text understanding. It focuses on those aspects of meaning that are conveyed only by groups of connected sentences—texts. Additionally, only work that attempts to deal with the semantics or understanding of texts, as opposed to statistical or syntactic analysis, is considered.
nlp  textanalysis  semantics 
february 2012 by rybesh
NLP Ecosystem
The iDASH NLP Ecosystem is a place to share and access tools, data, and educational resources for developing and applying NLP to clinical text. 2011-2012 talks on temporal reasoning.
health  nlp  temporality  time 
february 2012 by rybesh
N-grams: corpus based (COCA, COHA, Spanish, Portuguese)
These n-grams are based on the largest publicly-available, genre-balanced corpus of English -- the 425 million word Corpus of Contemporary American English (COCA). With this n-grams data (2, 3, 4, 5-word sequences, with their frequency), you can carry out powerful queries offline -- without needing to access the corpus via the web interface.
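As a rough illustration of what n-gram frequency data of this kind looks like, here is a minimal sketch that counts n-word sequences in a tokenized text. The function name `ngram_counts` and the toy sentence are made up for illustration; the actual COCA files are precomputed and far larger.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-token sequence in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Toy input (hypothetical); real n-gram data comes from a large corpus.
tokens = "the cat sat on the mat and the cat slept".split()
bigrams = ngram_counts(tokens, 2)

# The most frequent 2-word sequence and its frequency.
print(bigrams.most_common(1))  # [(('the', 'cat'), 2)]
```

Storing the counts keyed by word tuples makes offline lookups (for example, "what follows 'the'?") a simple filter over the table.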
english  corpus  linguistics  nlp  ngrams 
february 2012 by rybesh
Conditional Random Fields
Conditional random fields (CRFs) are a probabilistic framework for labeling and segmenting structured data, such as sequences, trees and lattices. The underlying idea is that of defining a conditional probability distribution over label sequences given a particular observation sequence, rather than a joint distribution over both label and observation sequences. The primary advantage of CRFs over hidden Markov models is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs in order to ensure tractable inference. Additionally, CRFs avoid the label bias problem, a weakness exhibited by maximum entropy Markov models (MEMMs) and other conditional Markov models based on directed graphical models. CRFs outperform both MEMMs and HMMs on a number of real-world tasks in many fields, including bioinformatics, computational linguistics and speech recognition.
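For reference, the conditional distribution over label sequences described above is commonly written in the following linear-chain form. The notation (feature functions f_k, weights λ_k, normalizer Z) follows the usual textbook convention and is an assumption here, not something taken from the bookmarked page.

```latex
% Linear-chain CRF: probability of label sequence y given observation sequence x
p(\mathbf{y} \mid \mathbf{x}) =
  \frac{1}{Z(\mathbf{x})}
  \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x}) =
  \sum_{\mathbf{y}'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

Because Z(x) normalizes over whole label sequences rather than per position, the model conditions on the entire observation sequence without modeling it, which is where the relaxed independence assumptions and the avoidance of label bias come from.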
machinelearning  nlp  crf  textmining  metadata 
february 2012 by rybesh
splitta - statistical sentence boundary detection
Sentence tokenizer written in Python. Includes proper tokenization and models for very high accuracy sentence boundary detection (English only for now). The models are trained from Wall Street Journal news combined with the Brown Corpus, which is intended to be widely representative of written English. Error rates on test news data are near 0.25%.
nlp  python 
january 2012 by rybesh
DDupe
Visualizing and analyzing social networks is a challenging problem that has been receiving growing attention. An important first step, before analysis can begin, is ensuring that the data is accurate. A common data quality problem is that the data may inadvertently contain several distinct references to the same underlying entity; the process of reconciling these references is called entity resolution. D-Dupe is an interactive tool that combines data mining algorithms for entity resolution with a task-specific network visualization. Users cope with complexity of cleaning large networks by focusing on a small subnetwork containing a potential duplicate pair. The subnetwork highlights relationships in the social network, making the common relationships easy to visually identify. D-Dupe users resolve ambiguities either by merging nodes or by marking them distinct. The entity resolution process is iterative: as pairs of nodes are resolved, additional duplicates may be revealed; therefore, resolution decisions are often chained together. We give examples of how users can flexibly apply sequences of actions to produce a high quality entity resolution result.
datamining  nlp  networks  visualization 
january 2012 by rybesh
A Unified Event Coreference Resolution by Integrating Multiple Resolvers
Event coreference is an important and complicated task in cascaded event template extraction and other natural language processing tasks. Despite its importance, it was merely discussed in previous studies. In this paper, we present a globally optimized coreference resolution system dedicated to various sophisticated event coreference phenomena. Seven resolvers for both event and object coreference cases are utilized, which include three new resolvers for event coreference resolution. Three enhancements are further proposed at both mention pair detection and chain formation levels. First, the object coreference resolvers are used to effectively reduce the false positive cases for event coreference. Second, a revised instance selection scheme is proposed to improve link level mention-pair model performances. Last but not least, an efficient and globally optimized graph partitioning model is employed for coreference chain formation using spectral partitioning which allows the incorporation of pronoun coreference information. The three techniques contribute to a significant improvement of 8.54% in B³ F-score for event coreference resolution on the OntoNotes 2.0 corpus.
events  nlp  coreference 
december 2011 by rybesh
Aravind K. Joshi - Towards Discourse Meaning
The overall goal is to discuss some issues concerning the dependencies at the discourse level and at the sentence level. However, first I will briefly describe the Penn Discourse Treebank (PDTB)*, a corpus in which we annotate the discourse connectives (explicit and implicit) and their arguments together with "attributions" of the arguments and the relations denoted by the connectives, and also the senses of the connectives. I will then focus on the complexity of dependencies in terms of (a) the elements that bear the dependency relations, (b) graph theoretic properties of these dependencies such as nested and crossed dependencies, dependencies with shared arguments, and (c) attributions and their relationship to the dependencies, among others. I will compare these dependencies with those at the sentence level and discuss some issues that relate to the transition from the sentence level to the level of "immediate discourse" and propose some conjectures.
discourse  meaning  linguistics  nlp 
november 2011 by rybesh
chromium-compact-language-detector - C++ library and Python bindings for detecting language from UTF8 text, extracted from the Chromium browser - Google Project Hosting
This is a straight port from the CLD (Compact Language Detector) library embedded in Google's Chromium browser. The library detects the language from provided UTF8 text (plain text or HTML). It's implemented in C++, with very basic Python bindings.
language  detection  nlp  python 
november 2011 by rybesh
Dan Jurafsky, Syntactic Variations Versus Semantic Roles
Dan Jurafsky, Syntactic Variations Versus Semantic Roles, Some Typical Semantic Roles, Two Solutions To The Difficulty Of Defining Semantic Roles, PropBank, FrameNet, Information Extraction Versus Semantic Role Labeling, Evaluation Measures, Parsing Algorithm, Combining Identification And Classification Models, Summary
nlp  frame  semantics 
october 2011 by rybesh
Inter-Event Dependencies support Event Extraction from Biomedical Literature
The description of events in biomedical literature often follows discourse patterns.
For example, authors may firstly mention the transcription of a
gene, and then go on to describe how this transcription is regulated by another
gene. Capturing such patterns can be beneficial when we want to extract event
mentions from literature. For instance, detecting the mention of a transcription of
gene A gives us a hint to actively look for mentions of regulations involving A.
With this hint we could find such mentions even if they follow unseen lexical or
syntactic patterns. To exploit such hints we need to perform event extraction in a
cross sentence manner.

It is shown that imperatively defined factor graphs (IDF) are an intuitive way to
build Markov Networks that model inter-dependencies between mentions of events
within sentences, and across sentence-boundaries. Small pieces of procedural code
define the graph structure, feature functions and hooks for efficient inference.
Empirically, this leads to an efficient cross-sentence event extractor with very
competitive results on the BioNLP shared task. One of our inter-event features
shows an impact of 1.94 points in F1 for the class of regulation events.
nlp  event  extraction 
october 2011 by rybesh
LingPipe: Competition
On this page, we break our competition down into academic toolkits and industrial toolkits. We only consider software that is available for linguistic processing, not companies that rely on linguistic processing in an application but do not sell that technology.
nlp  software 
september 2011 by rybesh
CCG: Software - Illinois Semantic Role Labeler (SRL)
Semantic Role Labeler is a machine-learning based tool that analyzes the shallow semantic information of a given sentence. The tool is capable of outputting verb-argument structure following the notation defined by the PropBank project.
frame  semantics  nlp 
september 2011 by rybesh
mate-tools - Tools for Natural Language Analysis, Generation and Machine Learning - Google Project Hosting
The tools provide a pipeline of modules that carry out lemmatization, part-of-speech tagging, dependency parsing, and semantic role labeling of a sentence. The system’s two main components draw on improved versions of a state-of-the-art dependency parser (Bohnet, 2010) and semantic role labeler (Björkelund et al., 2009) developed independently by the authors. The tools are language independent, provide very high accuracy and are fast. The dependency parser had the top score for German and English dependency parsing in the CoNLL shared task 2009.
nlp  frame  sematics 
september 2011 by rybesh
Detection, Representation, and Exploitation of Events in the Semantic Web: Workshop in conjunction with the 10th International Semantic Web Conference 2011, 23 October. Registration now open at: http://iswc2011.semanticweb.org/attending/registration
In recent years, researchers in several communities involved in aspects of the web have begun to realise the potential benefits of assigning an important role to events in the representation and organisation of knowledge and media. While a good deal of relevant research has been done in the semantic web community (for example on the modeling of events), a lot of complementary research has been done in other communities, such as multimedia processing and information retrieval. The goal of this workshop is to advance research on this general topic within the semantic web community, by both building on existing semantic web work and integrating results and methods from other areas, with a particular focus on issues that are central to the semantic web.
events  modeling  semweb  nlp 
september 2011 by rybesh
Using predicate-argument structures for information extraction
In this paper we present a novel, customizable IE paradigm that takes advantage of predicate-argument structures. We also introduce a new way of automatically identifying predicate argument structures, which is central to our IE paradigm. It is based on: (1) an extended set of features; and (2) inductive decision tree learning. The experimental results prove our claim that accurate predicate-argument structures enable high quality IE results.
frame  semantics  nlp  information  extraction 
august 2011 by rybesh
Semantic Role Labeler
Demo of mate-tools semantic role labeler.
frame  semantics  nlp 
august 2011 by rybesh
Martha Palmer | Projects | Verb Net
VerbNet (VN) (Kipper-Schuler 2006) is the largest on-line verb lexicon currently available for English. It is a hierarchical domain-independent, broad-coverage verb lexicon with mappings to other lexical resources such as WordNet (Miller, 1990; Fellbaum, 1998), Xtag (XTAG Research Group, 2001), and FrameNet (Baker et al., 1998). VerbNet is organized into verb classes extending Levin (1993) classes through refinement and addition of subclasses to achieve syntactic and semantic coherence among members of a class. Each verb class in VN is completely described by thematic roles, selectional restrictions on the arguments, and frames consisting of a syntactic description and semantic predicates with a temporal function, in a manner similar to the event decomposition of Moens and Steedman (1988).
corpus  linguistics  nlp  language  data  frame  semantics 
august 2011 by rybesh
LDC Catalog
Proposition Bank I was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004T14 and ISBN 1-58563-304-6.

This is a semantic annotation of the Wall Street Journal section of Treebank-2. More specifically, each verb occurring in the Treebank has been treated as a semantic predicate and the surrounding text has been annotated for arguments and adjuncts of the predicate. The verbs have also been tagged with coarse grained senses and with inflectional information. This work was done in the Computer and Information Sciences Department at the University of Pennsylvania.
frame  semantics  nlp  language  data 
august 2011 by rybesh
Martha Palmer | Projects | ACE
The original PropBank project, funded by ACE, created a corpus of text annotated with information about basic semantic propositions. Predicate-argument relations were added to the syntactic trees of the Penn Treebank. This resource is now available via LDC.
frame  semantics  nlp  language 
august 2011 by rybesh
SemLink
SemLink is a project whose aim is to link together different lexical resources via a set of mappings. These mappings will make it possible to combine the different information provided by these different lexical resources for tasks such as inferencing. We also plan to use the mappings to aid in semi-automatic extension of each resource's coverage, to increase the overall overlap in coverage. Currently, we are creating mappings between the following resources:

PropBank: A corpus of one million words of English text, annotated with argument role labels for verbs; and a lexicon defining those argument roles on a per-verb basis.
VerbNet: A lexicon that groups verbs based on their semantic/syntactic linking behavior.
FrameNet: A lexicon based on frame semantics.
WordNet: A lexicon that describes semantic relationships (such as synonymy and hyperonymy) between individual words.
frame  semantics  nlp  language 
august 2011 by rybesh
SEMAFOR: Semantic Analyzer of Frame Representations
SEMAFOR: Semantic Analysis of Frame Representations is a tool for automatic analysis of the frame-semantic structure of English text.
nlp  frames  semantic  parsing 
august 2011 by rybesh
ScalaNLP
ScalaNLP is a collection of libraries for Natural Language Processing, Machine Learning, and Statistics.
scala  nlp  linearalgebra  statistics 
august 2011 by rybesh
Corpus-Based Study of Scientific Methodology: Comparing the Historical and Experimental Sciences
This chapter studies the use of textual features based on systemic functional linguistics, for genre-based text categorization. We describe feature sets that represent different types of conjunctions and modal assessment, which together can partially indicate how different genres structure text and may prefer certain classes of attitudes towards propositions in the text. This enables analysis of large-scale rhetorical differences between genres by examining which features are important for classification. The specific domain we studied comprises scientific articles in historical and experimental sciences (paleontology and physical chemistry, respectively). We applied the SMO learning algorithm, which with our feature set achieved over 83% accuracy for classifying articles according to field, though no field-specific terms were used as features. The most highly-weighted features for each were consistent with hypothesized methodological differences between historical and experimental sciences, thus lending empirical evidence to the recent philosophical claim of multiple scientific methods.
nlp  rhetoric  science  history  language  genre  classification  linguistics 
july 2011 by rybesh
AKSW : Projects / FOX
FOX is a framework that integrates the Linked Data Cloud and makes use of the diversity of NLP algorithms to extract RDF triples of high accuracy out of NL. In its current version, it integrates and merges the results of Named Entity Recognition, Keyword Extraction and Relation Extraction tools.
semweb  extraction  nlp  tools  ner 
july 2011 by rybesh
School of Informatics: Advanced Natural Language Processing
The course will synthesize recent research in linguistics, computer science, and natural language processing with the aim of introducing students to theoretical and computational models of language. The course will familiarize students with a wide range of linguistic phenomena with the aim of appreciating the complexity, but also the systematic behaviour of natural languages like English, the pervasiveness of ambiguity, and how this presents challenges in natural language processing. In addition, the course introduces the most important algorithms and data structures that are commonly used to solve many NLP problems.
nlp  syllabus  discourse 
june 2011 by rybesh
6.892: Computational Models of Discourse
This course is a graduate level introduction to automatic discourse processing. The emphasis will be on methods and models that have applicability to natural language and speech processing.

The class will cover the following topics: discourse structure, models of coherence and cohesion, plan recognition algorithms, and text segmentation. We will study symbolic as well as machine learning methods for discourse analysis. We will also discuss the use of these methods in a variety of applications ranging from dialogue systems to automatic essay writing.
discourse  modeling  nlp 
june 2011 by rybesh
ACL Anthology » LaTeCH 2011
Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities.
nlp  culturalheritage  digitalhumanities 
june 2011 by rybesh
LaTeCH 2011: Language Technology for Cultural Heritage, Social Sciences, and Humanities
The LaTeCH workshop series aims to provide a forum for researchers who are working on developing novel information technology for improved information access to data from the Humanities, Social Sciences, and Cultural Heritage.

Recent developments in the Humanities, Social Sciences, and Cultural Heritage draw an increasing interest from researchers in NLP in developing methods for data cleaning, semantic annotation, intelligent querying, linking, discovery and visualisation of interesting trends. Language technology has an important role to play in these processes, even for collections which are primarily non-textual, since text is the pervasive medium used for metadata. These fairly novel domains of application entail new challenges to NLP research, such as noisy text (e.g., due to OCR problems), non-standard, or archaic language varieties (e.g., historic language, dialects, mixed use of languages, ellipsis, transcription errors), the necessity to link data of diverse formats (e.g., text, database, video, speech) and languages, and the lack of available resources, such as dictionaries. Furthermore, often neither annotated domain data is available, nor the required funds to manually create it, thus forcing researchers to investigate (semi-) automatic resource development and domain adaptation approaches involving the least possible manual effort.
nlp  culturalheritage  digitalhumanities 
june 2011 by rybesh
Template-Based Information Extraction without the Templates
Standard algorithms for template-based information extraction (IE) require predefined template schemas, and often labeled data, to learn to extract their slot fillers (e.g., an embassy is the Target of a Bombing template). This paper describes an approach to template-based IE that removes this requirement and performs extraction without knowing the template structure in advance. Our algorithm instead learns the template structure automatically from raw text, inducing template schemas as sets of linked events (e.g., bombings include detonate, set off, and destroy events) associated with semantic roles. We also solve the standard IE task, using the induced syntactic patterns to extract role fillers from specific documents. We evaluate on the MUC-4 terrorism dataset and show that we induce template structure very similar to hand-created gold structure, and we extract role fillers with an F1 score of .40, approaching the performance of algorithms that require full knowledge of the templates.
events  extraction  nlp 
june 2011 by rybesh
Event Extraction as Dependency Parsing
Nested event structures are a common occurrence in both open domain and domain specific extraction tasks, e.g., a “crime” event can cause an “investigation” event, which can lead to an “arrest” event. However, most current approaches address event extraction with highly local models that extract each event and argument independently. We propose a simple approach for the extraction of such structures by taking the tree of event-argument relations and using it directly as the representation in a reranking dependency parser. This provides a simple framework that captures global properties of both nested and flat event structures. We explore a rich feature space that models both the events to be parsed and context from the original supporting text. Our approach obtains competitive results in the extraction of biomedical events from the BioNLP’09 shared task with an F1 score of 53.5% in development and 48.6% in testing.
events  extraction  nlp 
june 2011 by rybesh
shravanmn/Yahoo_LDA at master - GitHub
Yahoo!'s topic modelling framework using Latent Dirichlet Allocation.
hadoop  nlp  topicmodeling 
june 2011 by rybesh
Kea
KEA is an algorithm for extracting keyphrases from text documents. It can be used either for free indexing or for indexing with a controlled vocabulary.
indexing  tools  nlp 
june 2011 by rybesh
maui-indexer - Maui - Multi-purpose automatic topic indexing - Google Project Hosting
Maui automatically identifies main topics in text documents. Depending on the task, topics are tags, keywords, keyphrases, vocabulary terms, descriptors, index terms or titles of Wikipedia articles.

Maui performs the following tasks:

term assignment with a controlled vocabulary (or thesaurus)
subject indexing
topic indexing with terms from Wikipedia
keyphrase extraction
terminology extraction
automatic tagging
It can also be used for terminology extraction and semi-automatic topic indexing.
indexing  vocabulary  tools  nlp  machinelearning  java 
june 2011 by rybesh
Wikipedia Miner - Home
Wikipedia Miner is a toolkit for navigating and making use of the structure and content of Wikipedia. It aims to make it easy for you to integrate Wikipedia's knowledge into your own applications, by:

providing simplified, object-oriented access to Wikipedia's structure and content.
measuring how terms and concepts in Wikipedia are connected to each other.
detecting and disambiguating Wikipedia topics when they are mentioned in documents.
wikipedia  textmining  nlp  webservices  tools  datamining 
may 2011 by rybesh
Penn Treebank P.O.S. Tags
Alphabetical list of part-of-speech tags used in the Penn Treebank Project.
linguistics  nlp  reference 
april 2011 by rybesh
TiMBL: Tilburg Memory-Based Learner
TiMBL is an open source software package implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification with feature weighting suitable for symbolic feature spaces, and IGTree, a decision-tree approximation of IB1-IG. All implemented algorithms have in common that they store some representation of the training set explicitly in memory. During testing, new cases are classified by extrapolation from the most similar stored cases.

For the past decade, TiMBL has been mostly used in natural language processing as a machine learning classifier component, but its use extends to virtually any supervised machine learning domain. Due to its particular decision-tree-based implementation, TiMBL is in many cases far more efficient in classification than a standard k-nearest neighbor algorithm would be.
nlp  machinelearning  tools 
april 2011 by rybesh
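A rough illustration of the IB1-IG idea in the TiMBL description above: nearest-neighbor classification over symbolic features, with each feature weighted by its information gain. This is a simplified sketch in plain Python, not TiMBL's API or implementation; the function names and the toy data are made up for the example, and k here counts neighbors rather than distances.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature_index):
    """Class entropy minus the expected entropy after splitting on one feature."""
    by_value = defaultdict(list)
    for row, label in zip(rows, labels):
        by_value[row[feature_index]].append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset) for subset in by_value.values())
    return entropy(labels) - remainder


def classify(rows, labels, query, k=1):
    """Label a query by majority vote among the k stored cases closest to it.

    Distance is the weighted overlap metric: the sum of information-gain
    weights of the features on which the stored case and the query differ.
    """
    weights = [information_gain(rows, labels, i) for i in range(len(query))]

    def distance(row):
        return sum(w for w, a, b in zip(weights, row, query) if a != b)

    nearest = sorted(range(len(rows)), key=lambda i: distance(rows[i]))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]


# Hypothetical toy data: (previous word, suffix) -> coarse part of speech.
rows = [("the", "ing"), ("the", "s"), ("to", "ing"), ("to", "s")]
labels = ["NOUN", "NOUN", "VERB", "VERB"]
print(classify(rows, labels, ("the", "ed")))
```

In the toy data the previous word fully determines the class and the suffix carries no information, so the unseen suffix "ed" does not affect the distance.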
Data Science Toolkit
A collection of the best open data sets and open-source tools for data science, wrapped in an easy-to-use REST/JSON API with command line, Python and Javascript interfaces. Available as a self-contained VM or EC2 AMI that you can deploy yourself.
data  tools  nlp  ec2  webservices 
march 2011 by rybesh
Lippmannian Device
The Lippmannian Device is named after Lippmann and provides a coarse means of showing actor partisanship.
research  tools  analysis  nlp  rhetoric 
march 2011 by rybesh
edu.stanford.nlp.ling (Stanford JavaNLP API)
This package contains the different data structures used by JavaNLP throughout the years for dealing with linguistic objects in general, of which words are the most generally used.
nlp  data  structures  models 
february 2011 by rybesh
Python Interface to Stanford Parser
A Python interface to the Stanford Parser. It uses JPype to create a Java virtual machine, instantiate the parser, and call methods on it. Most of the code is focused on getting the Stanford Dependencies, but it's easy to add an API to call any method on the parser.
java  python  nlp 
february 2011 by rybesh
ARCADE: Literature, the Humanities, and the World
...digital media and huge databases have enormous potential for supporting, preserving, and making available for study the kinds of underground knowledges and cultural productions outside the sphere of mainstream print that you're concerned about. This is the insurgent potential of the Internet and digital media--they can bypass established methods of fixation and legitimation of cultural products. But in academia these are subjects of interest to humanists--and sociologists and anthropologists. By contrast, when true disciplinary outsiders like Jean-Baptiste Michel and his team enter the arena of cultural history and cultural studies from the side of science and engineering, they must be looking to legitimate themselves by proving that their approach "works" for subjects that they imagine will be widely recognized as significant.
digitalhumanities  nlp  statistics  critique 
december 2010 by rybesh
edwired » Blog Archive » Visualizing Millions of Words
...the lesson that I would then focus on with my students is that what they are looking at in such a graph is nothing more or less than the frequency with which a word is used in books (and only books) published over the centuries. While such frequencies do reflect something, it is not clear from one graph just what that something is. So instead of an answer, a graph like this one is a doorway that leads to a room filled with questions, each of which must be answered by the historian before he or she knows something worth knowing.
digitalhumanities  nlp  statistics 
december 2010 by rybesh
Works Cited: Google Books Ngrams and the number of words for "snow"
There's a certain Words For Snowism in the online Google Books Ngrams tool, the suggestion that the more frequently a word is used, the more important it is in a collective unconscious of which the Google Books data set serves as a convenient index. This importance is not the same thing as significance, in the sense of significant digits or statistical significance; it's not the difference that makes a difference, but rather a psychologized importance--attachment, cathexis. Which is really kind of garbage.
nlp  digitalhumanities  statistics  critique 
december 2010 by rybesh
Stanford CoreNLP
Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, and indicate which noun phrases refer to the same entities. It provides the foundational building blocks for higher level text understanding applications.

Stanford CoreNLP integrates all our NLP tools for the English language, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, and the coreference resolution system. The goal of this project is to enable people to quickly and painlessly get complete linguistic annotations of natural language texts. It is designed to be highly flexible and extensible, i.e., with a single option you can change which tools should be enabled and which should be disabled.
nlp  research  tools  java  nlproc 
december 2010 by rybesh
tm - Text Mining Package
tm (shorthand for Text Mining Infrastructure in R) provides a framework for text mining applications within R.

The tm package offers functionality for managing text documents, abstracts the process of document manipulation and eases the usage of heterogeneous text formats in R. The package has integrated database backend support to minimize memory demands. Advanced metadata management is implemented for collections of text documents to ease the use of large document sets enriched with metadata.
R  textmining  datamining  nlp  tools  statistics 
october 2010 by rybesh
GEOLocate - Software for Georeferencing Natural History Data
The GEOLocate project is an effort to develop software and services for translating textual locality descriptions associated with biodiversity collections data into geographic coordinates.
locative  tools  georeferencing  nlp 
october 2010 by rybesh
MALLET homepage
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
datamining  java  machinelearning  nlp  tools 
october 2010 by rybesh
CRCnetBASE - Text Mining
Giving a broad perspective of the field from numerous vantage points, Text Mining: Classification, Clustering, and Applications focuses on statistical methods for text mining and analysis. It examines methods to automatically cluster and classify text documents and applies these methods in a variety of areas, including adaptive information filtering, information distillation, and text search.

The book begins with chapters on the classification of documents into predefined categories. It presents state-of-the-art algorithms and their use in practice. The next chapters describe novel methods for clustering documents into groups that are not predefined. These methods seek to automatically determine topical structures that may exist in a document corpus. The book concludes by discussing various text mining applications that have significant implications for future research and industrial use.
textmining  nlp 
september 2010 by rybesh
Text Processing APIs and Python NLTK Demos | Text Mining | Stemming | Tagging | Python NLTK Demo
The Text Processing API supports the following functionality:

Stemming & Lemmatization
Sentiment Analysis
Tagging and Chunk Extraction
nlp  api  webservices  python 
august 2010 by rybesh
LingPipe Book
We're writing a book about LingPipe. As it's written, we'll be putting up drafts here.
nlp  java 
august 2010 by rybesh
Extend Swift | SiLCC
SiLCC is a cloud based service for parsing text and extracting relevant keywords.
nlp  tools  tagging  metadata  api 
august 2010 by rybesh
Training Examples Q&A - machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization
Where data geeks ask and answer questions on machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization!
ai  machinelearning  nlp  textanalysis  ir  datamining  search  statistics  infoviz  reference 
june 2010 by rybesh
IBM Emerging Technologies - BigSheets
BigSheets is an extension of the mashup paradigm that:
1. Integrates gigabytes, terabytes, or petabytes of unstructured data from web-based repositories
2. Collects a wide range of unstructured web data stemming from user-defined seed URLs
3. Extracts and Enriches that data using the unstructured information management architecture you choose (LanguageWare, OpenCalais, etc.)
4. Lets you Explore and Visualize this data in specific, user-defined contexts (such as ManyEyes)
data  analytics  hadoop  spreadsheet  archives  nlp  infoviz 
march 2010 by rybesh
Blegging for Help: Web Scraping for Content? « LingPipe Blog
In search of a good general-purpose method of pulling the content out of arbitrary web pages and leaving the boilerplate, advertising, navigation, etc. behind. See also http://bit.ly/4SFOIH
web  nlp  html  parsing  textanalysis 
january 2010 by rybesh
Maximum Entropy (GA) Model Optimization Package
Maximum entropy (aka logistic regression) models are very popular, especially in natural language processing. The software here is an implementation of maximum likelihood and maximum a posteriori optimization of the parameters of these models. The algorithms used are much more efficient than the iterative scaling techniques used in almost every other maxent package out there.
research  tools  nlp  statistics  machinelearning  ocaml  logreg  maxent 
august 2009 by rybesh
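To make the maximum likelihood versus maximum a posteriori distinction in the entry above concrete, here is a minimal sketch of a binary logistic regression fitted by plain gradient ascent. It is not the package's code or its optimizer (the description says the package uses more efficient algorithms); the function names, the Gaussian prior parameter, and the toy data are assumptions made for the example. Setting prior_precision to 0 gives the maximum likelihood objective.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    """Probability of the positive class under the fitted model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def fit_map(examples, labels, prior_precision=0.1, learning_rate=0.5, steps=500):
    """Gradient ascent on the log-likelihood plus a zero-mean Gaussian log-prior.

    prior_precision is the inverse variance of the prior on each weight
    (an assumed parameterization); 0 disables the prior.
    """
    weights = [0.0] * len(examples[0])
    for _ in range(steps):
        # Gradient of the log-prior term: pulls every weight toward zero.
        gradient = [-prior_precision * w for w in weights]
        # Gradient of the log-likelihood: (observed - predicted) * feature.
        for x, y in zip(examples, labels):
            error = y - predict(weights, x)
            for j, xj in enumerate(x):
                gradient[j] += error * xj
        weights = [w + learning_rate * g / len(examples) for w, g in zip(weights, gradient)]
    return weights


# Hypothetical toy data: each example is [bias, feature value].
examples = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
labels = [0, 0, 1, 1]
weights = fit_map(examples, labels)
print(predict(weights, [1.0, 0.0]), predict(weights, [1.0, 3.0]))
```

The toy data is linearly separable, so without the prior the maximum likelihood weights would keep growing; the prior keeps them finite.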
Python Package Index : topia.termextract 1.1.0
This package determines important terms within a given piece of content. It uses linguistic tools such as Parts-Of-Speech (POS) and some simple statistical analysis to determine the terms and their strength.
python  nlp  extraction 
august 2009 by rybesh
LingPipe
LingPipe is a suite of Java libraries for the linguistic analysis of human language.
java  nlp  datamining  tools  entitydetection 
may 2009 by rybesh
nltk.collocations
Tools to identify collocations --- words that often appear consecutively --- within corpora. They may also be used to find other associations between word occurrences.
python  tools  nlp  nltk 
february 2009 by rybesh
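One common way to find collocations of the kind the nltk.collocations entry describes is to score adjacent word pairs by pointwise mutual information (PMI). The sketch below uses only the standard library and is not NLTK's implementation or API; the function name, the min_count threshold, and the sample sentence are made up for the example.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Rank adjacent word pairs by PMI, highest first.

    PMI compares how often a pair occurs with how often it would occur
    if the two words were independent. Pairs seen fewer than min_count
    times are dropped, since PMI is unreliable for rare pairs.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


tokens = "new york is big and new york is old but the city is new".split()
for pair, score in bigram_pmi(tokens):
    print(pair, round(score, 2))
```

On real corpora the frequency threshold matters a great deal; with min_count set to 1, PMI tends to rank one-off pairs of rare words at the top.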