Coursera - Stanford NLP class
yesterday by rybesh
Jurafsky and Manning's online NLP course.
nlp
education
yesterday by rybesh
gilesc/stanford-corenlp
10 days ago by rybesh
Clojure wrapper for Stanford CoreNLP tools.
nlp
clojure
10 days ago by rybesh
NEH Digital Humanities Startup Grants: Funding the Future « Early Modern Online Bibliography
15 days ago by rybesh
The video “How Natural Language Processing is Changing Research” provides a more extended look at WordSeer’s usefulness for analyzing slave narratives, but its purpose is also to underscore how such a tool can benefit humanities scholars. In this video the discussion veers toward presenting reading as a chore from which humanities scholars seek relief. On that note, a student in Dr. Michael Ullyot’s undergraduate ENG 203 course, “Hamlet in the Humanities Lab” at the University of Calgary offers some pertinent comments. In her penultimate blog post for the course, Stephanie Vandework devotes a section to “The Pros and Cons of Exploratory Analysis” and examines more closely the claims in the WordSeer Shakespeare demo, finding some to suffer from overgeneralization. (For a view of the course from the instructor’s perspective, see Dr. Ullyot’s presentation, Teaching Hamlet in the Humanities Lab, for the Renaissance Society of America conference this past March 2012.)
nlp
digitalhumanities
textanalysis
15 days ago by rybesh
NEH Digital Humanities Lightning Round 2011 Part 2 - YouTube
15 days ago by rybesh
NEH DH Lightning Round on Wordseer.
nlp
digitalhumanities
textanalysis
15 days ago by rybesh
Parsing Time: Learning to Interpret Time Expressions
16 days ago by rybesh
We present a probabilistic approach for learning to interpret temporal phrases given only a corpus of utterances and the times they reference. While most approaches to the task have used regular expressions and similar linear pattern interpretation rules, the possibility of phrasal embedding and modification in time expressions motivates our use of a compositional grammar of time expressions. This grammar is used to construct a latent parse which evaluates to the time the phrase would represent, as a logical parse might evaluate to a concrete entity. In this way, we can employ a loosely supervised EM-style bootstrapping approach to learn these latent parses while capturing both syntactic uncertainty and pragmatic ambiguity in a probabilistic framework. We achieve an accuracy of 72% on an adapted TempEval-2 task – comparable to state of the art systems.
time
temporal
parsing
nlp
16 days ago by rybesh
The Stanford NLP (Natural Language Processing) Group
4 weeks ago by rybesh
Natural Language Understanding requires a large amount of background "common sense" knowledge about the situation under discussion. In many respects, using this knowledge is at the core of reasoning and acting in traditional Artificial Intelligence. When reading an article about a criminal conviction, the writer assumes the reader knows about trials, juries, and criminal activity. The Narrative Chain project aims to learn this knowledge by processing large amounts of text and learning which events tend to occur together. We are studying not just what can be learned, but also the best representation for this knowledge (graph, linear chain, frame?).
This project also includes research into ordering events in time. For instance, did the conviction or the sentencing happen first? We use modern machine learning techniques to find linguistic features that indicate this semantic ordering relation.
An example of a learned narrative event chain, with arrows indicating temporal ordering, is shown on the right. The bold words are the events, and the subj/obj terms indicate how the common actor in this narrative is involved in the event (the subject or object of the verb).
nlp
events
frames
narrative
4 weeks ago by rybesh
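Note to self on the "which events tend to occur together" idea above: one plausible way to score co-occurrence is pointwise mutual information (PMI) over event pairs that share a protagonist. The sketch below is only an illustration under that assumption, not the project's actual method; the toy `docs` data and the `pmi` helper are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each inner list holds the (verb, role) events that
# share one protagonist within a single document.
docs = [
    [("arrest", "obj"), ("charge", "obj"), ("convict", "obj"), ("sentence", "obj")],
    [("arrest", "obj"), ("convict", "obj"), ("sentence", "obj")],
    [("charge", "obj"), ("acquit", "obj")],
    [("eat", "subj"), ("pay", "subj")],
]

event_counts = Counter()  # number of documents containing each event
pair_counts = Counter()   # number of documents containing each event pair
for events in docs:
    unique = sorted(set(events))
    event_counts.update(unique)
    pair_counts.update(combinations(unique, 2))

n_docs = len(docs)


def pmi(a, b):
    """Pointwise mutual information of two events co-occurring in a document."""
    pair = tuple(sorted((a, b)))
    joint = pair_counts[pair] / n_docs
    if joint == 0:
        return float("-inf")  # never seen together
    return math.log(joint / ((event_counts[a] / n_docs) * (event_counts[b] / n_docs)))


convict, sentence = ("convict", "obj"), ("sentence", "obj")
print(pmi(convict, sentence))
```

High-PMI pairs are candidates for the same chain; ordering them in time is a separate step that this sketch does not attempt.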
Computational Linguistics for Literature
4 weeks ago by rybesh
The amount of literary material available on-line keeps growing rapidly. Not only are there machine-readable texts in libraries, collections and e-book stores, but there is also more and more “live” literature – e-zines, blogs, self-published e-books and so on. There is a need for tools to help users navigate, visualize and appreciate the high volume of available literature.
Literary texts are quite different from technical and formal documents, which have been the focus of NLP research thus far. Most forms of statistical language processing rely on lexical information in one way or another. In literature, the primary mode is narrative rather than exposition. Stories may be cognitively easier to read than certain expository genres, such as scientific documents, but it is a challenging form of discourse for NLP tools and methods. For instance, literary prose lacks overt lexical clues and structural markers typically leveraged in the processing of more structured genres. Also, even conventional literary texts exhibit far less unity of time, space and topic than most formal discourse. Learning to handle these challenges in literary data may help move past heavy reliance on surface clues in general.
Literature also differs from other genres because of the needs of its typical audience. For instance, reading, searching or browsing literature online is a different task than searching for the latest news on a particular topic. Search criteria would be rather abstract: not a keyword, but a literary style, similarity to another work, point of view and so on. When looking for a summary or a digest, a reader may prefer to know or visualize a text's broad characteristics rather than facts which summarize the plot.
We invite papers that touch upon these areas, but also welcome other ideas which promote the processing of literary narrative or related forms of discourse.
literature
nlp
digitalhumanities
narrative
4 weeks ago by rybesh
discourse structure reading group
5 weeks ago by rybesh
Daniel Marcu's discourse structure reading group at ISI.
nlp
discourse
5 weeks ago by rybesh
digital digs: the role of summary in composition
8 weeks ago by rybesh
The obvious question is how one manages to distinguish among summary, analysis, argument, and interpretation. E.g.
With the aid of a ragtag crew of adventurers, a young man rescues a princess from an evil empire and discovers his destiny to become a member of a dying order of knights.
A young man helps a rebel leader escape from an imperial prison and participates in a pitched battle to save the rebels' military base.
I assume you recognize the story, and I think most people would say the first summary is more accurate. Why? The second one is certainly not inaccurate. It simply downplays the "hero's journey" aspect and portrays the film as depicting a political and collective activity.
narrative
language
events
perspective
frames
nlp
8 weeks ago by rybesh
Index of /WordNet-Pairs
8 weeks ago by rybesh
What are the N most similar words to X, according to WordNet?
This data seeks to answer that question, where similarity is based on
measures from WordNet::Similarity. http://wn-similarity.sourceforge.net
nlp
wordnet
opendata
8 weeks ago by rybesh
Apache Stanbol - Welcome to Apache Stanbol (incubating)
8 weeks ago by rybesh
Apache Stanbol (currently in incubation) is an open source modular software stack and reusable set of components for semantic content management.
Apache Stanbol components are meant to be accessed over RESTful interfaces to provide semantic services for content management. Thus, one application is to extend traditional content management systems with (internal or external) semantic services.
nlp
semweb
CMS
tools
editorsnotes
8 weeks ago by rybesh
Digital Humanities 2011 tutorial
11 weeks ago by rybesh
Chris Manning's tutorial at Digital Humanities 2011 at Stanford.
nlp
tutorial
digitalhumanities
11 weeks ago by rybesh
Johansson & Nugues - LTH: Semantic Structure Extraction using Nonprojective Dependency Trees
11 weeks ago by rybesh
We describe our contribution to the SemEval task on Frame-Semantic Structure Extraction. Unlike most previous systems described in literature, ours is based on dependency syntax. We also describe a fully automatic method to add words to the FrameNet lexical database, which gives an improvement in the recall of frame detection.
nlp
frames
framenet
parsing
11 weeks ago by rybesh
Cross Validation vs. Inter-Annotator Agreement « LingPipe Blog
11 weeks ago by rybesh
Our annotation tool follows the tag-a-little, train-a-little paradigm, in which an automatic system based on the already-annotated data is trained as you go to pre-annotate the data for a user to correct.
nlp
annotation
11 weeks ago by rybesh
Stanford Topic Modeling Toolbox
12 weeks ago by rybesh
Includes an implementation of PLDA.
Partially Labeled Dirichlet Allocation (PLDA) [paper] is a topic model that extends and generalizes both LDA and Labeled LDA. The model is analogous to Labeled LDA except that it allows more than one latent topic per label and a set of background labels. Learning and inference in the model is much like the example above for Labeled LDA, but you must additionally specify the number of topics associated with each label.
lda
plda
metadata
topicmodels
nlp
socialscience
scala
12 weeks ago by rybesh
Natural Language Software Registry
february 2012 by rybesh
The Natural Language Software Registry (NLSR) is a concise summary of the capabilities and sources of a large amount of natural language processing (NLP) software available to the NLP community. It comprises academic, commercial and proprietary software with specifications and terms on which it can be acquired clearly indicated.
nlp
linguistics
tools
february 2012 by rybesh
Robert Young - Text Understanding: A Survey
february 2012 by rybesh
The goal of the study is to examine work that has something to offer toward the construction of a computable model of text understanding. It focuses on those aspects of meaning that are conveyed only by groups of connected sentences—texts. Additionally, only work that attempts to deal with the semantics or understanding of texts, as opposed to statistical or syntactic analysis, is considered.
nlp
textanalysis
semantics
february 2012 by rybesh
NLP Ecosystem
february 2012 by rybesh
The iDASH NLP Ecosystem is a place to share and access tools, data, and educational resources for developing and applying NLP to clinical text. 2011-2012 talks on temporal reasoning.
health
nlp
temporality
time
february 2012 by rybesh
N-grams: corpus based (COCA, COHA, Spanish, Portuguese)
february 2012 by rybesh
These n-grams are based on the largest publicly-available, genre-balanced corpus of English -- the 425 million word Corpus of Contemporary American English (COCA). With this n-grams data (2, 3, 4, 5-word sequences, with their frequency), you can carry out powerful queries offline -- without needing to access the corpus via the web interface.
english
corpus
linguistics
nlp
ngrams
february 2012 by rybesh
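Note to self on what "n-grams with their frequency" means in the entry above: a minimal sketch of counting 2- and 3-word sequences and querying them offline. The toy sentence and the helper names `ngram_counts` and `continuations` are made up for illustration and have nothing to do with the actual COCA download format.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-token sequence in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def continuations(counts, first_word):
    """N-grams starting with first_word, most frequent first (ties alphabetical)."""
    return sorted(
        ((gram, c) for gram, c in counts.items() if gram[0] == first_word),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical toy corpus; a real run would tokenize a much larger text.
tokens = "the cat sat on the mat and the cat slept".split()
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

for gram, count in continuations(bigrams, "the"):
    print(" ".join(gram), count)
```

Once such a table exists on disk, lookups like "what follows X most often" need no web interface.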
Conditional Random Fields
february 2012 by rybesh
Conditional random fields (CRFs) are a probabilistic framework for labeling and segmenting structured data, such as sequences, trees and lattices. The underlying idea is that of defining a conditional probability distribution over label sequences given a particular observation sequence, rather than a joint distribution over both label and observation sequences. The primary advantage of CRFs over hidden Markov models is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs in order to ensure tractable inference. Additionally, CRFs avoid the label bias problem, a weakness exhibited by maximum entropy Markov models (MEMMs) and other conditional Markov models based on directed graphical models. CRFs outperform both MEMMs and HMMs on a number of real-world tasks in many fields, including bioinformatics, computational linguistics and speech recognition.
machinelearning
nlp
crf
textmining
metadata
february 2012 by rybesh
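Note to self on the "conditional probability distribution over label sequences given a particular observation sequence" in the entry above: a toy linear-chain CRF sketch. The label set, words, and all weights below are invented for illustration, and the model assumes a non-empty input sequence. It scores a whole label sequence, then normalizes by a partition function computed with the forward algorithm.

```python
import itertools
import math

LABELS = ["N", "V"]  # hypothetical label set

# Hypothetical log-potentials (feature weights already summed per word/label).
EMIT = {
    "fish": {"N": 1.0, "V": 0.5},
    "swim": {"N": 0.1, "V": 1.2},
}
# TRANS[prev][cur]: log-potential for moving from label prev to label cur.
TRANS = {
    "N": {"N": 0.2, "V": 0.9},
    "V": {"N": 0.6, "V": 0.1},
}


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def score(words, labels):
    """Unnormalized log-score of one label sequence for the observed words."""
    total = sum(EMIT[w][y] for w, y in zip(words, labels))
    total += sum(TRANS[a][b] for a, b in zip(labels, labels[1:]))
    return total


def log_partition(words):
    """log Z(x) via the forward algorithm; words must be non-empty."""
    alpha = {y: EMIT[words[0]][y] for y in LABELS}
    for w in words[1:]:
        alpha = {
            y: EMIT[w][y] + logsumexp([alpha[p] + TRANS[p][y] for p in LABELS])
            for y in LABELS
        }
    return logsumexp(list(alpha.values()))


def prob(words, labels):
    """p(labels | words): normalized over all label sequences for these words."""
    return math.exp(score(words, labels) - log_partition(words))


words = ["fish", "swim", "fish"]
dist = {ys: prob(words, ys) for ys in itertools.product(LABELS, repeat=len(words))}
best = max(dist, key=dist.get)
print(best, round(dist[best], 3))
```

The normalization is per observation sequence rather than per state, which is the property the description credits for avoiding label bias.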
splitta - statistical sentence boundary detection
january 2012 by rybesh
Sentence tokenizer written in Python. Includes proper tokenization and models for very high accuracy sentence boundary detection (English only for now). The models are trained from Wall Street Journal news combined with the Brown Corpus, which is intended to be widely representative of written English. Error rates on test news data are near 0.25%.
nlp
python
january 2012 by rybesh
DDupe
january 2012 by rybesh
Visualizing and analyzing social networks is a challenging problem that has been receiving growing attention. An important first step, before analysis can begin, is ensuring that the data is accurate. A common data quality problem is that the data may inadvertently contain several distinct references to the same underlying entity; the process of reconciling these references is called entity resolution. D-Dupe is an interactive tool that combines data mining algorithms for entity resolution with a task-specific network visualization. Users cope with complexity of cleaning large networks by focusing on a small subnetwork containing a potential duplicate pair. The subnetwork highlights relationships in the social network, making the common relationships easy to visually identify. D-Dupe users resolve ambiguities either by merging nodes or by marking them distinct. The entity resolution process is iterative: as pairs of nodes are resolved, additional duplicates may be revealed; therefore, resolution decisions are often chained together. We give examples of how users can flexibly apply sequences of actions to produce a high quality entity resolution result.
datamining
nlp
networks
visualization
january 2012 by rybesh
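Note to self on the merge-or-mark-distinct loop described above: a rough sketch of ranking candidate duplicate pairs by combining name similarity with shared neighbors, then merging the top pair. This is not D-Dupe's algorithm; the scoring rule, the toy co-author data, and the function names are all assumptions for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical author records: name -> set of co-author names.
neighbors = {
    "J. Smith": {"A. Lee", "B. Chen"},
    "John Smith": {"A. Lee", "B. Chen", "C. Diaz"},
    "Jane Smythe": {"D. Evans"},
}


def pair_score(a, b):
    """Average of name similarity and Jaccard overlap of the neighbor sets."""
    name_sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    union = neighbors[a] | neighbors[b]
    shared = len(neighbors[a] & neighbors[b]) / len(union) if union else 0.0
    return (name_sim + shared) / 2


def ranked_pairs():
    """Candidate duplicate pairs, most suspicious first."""
    return sorted(combinations(sorted(neighbors), 2), key=lambda p: -pair_score(*p))


def merge(keep, drop):
    """Resolve a pair as duplicates by folding drop's neighbors into keep."""
    neighbors[keep] |= neighbors.pop(drop)


best = ranked_pairs()[0]
merge(*best)
print(best, sorted(neighbors))
```

Because a merge changes the neighbor sets, the ranking should be recomputed after each decision, which matches the iterative, chained resolution the description mentions.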
“Beautiful” in Shakespeare « Text Mining and the Digital Humanities
december 2011 by rybesh
Great, clear example of text mining using Wordseer.
digitalhumanities
textmining
textanalysis
nlp
infoviz
examples
december 2011 by rybesh
A Unified Event Coreference Resolution by Integrating Multiple Resolvers
december 2011 by rybesh
Event coreference is an important and complicated task in cascaded event template extraction and other natural language processing tasks. Despite its importance, it was merely discussed in previous studies. In this paper, we present a globally optimized coreference resolution system dedicated to various sophisticated event coreference phenomena. Seven resolvers for both event and object coreference cases are utilized, which include three new resolvers for event coreference resolution. Three enhancements are further proposed at both mention pair detection and chain formation levels. First, the object coreference resolvers are used to effectively reduce the false positive cases for event coreference. Second, a revised instance selection scheme is proposed to improve link level mention-pair model performances. Last but not least, an efficient and globally optimized graph partitioning model is employed for coreference chain formation using spectral partitioning which allows the incorporation of pronoun coreference information. The three techniques contribute to a significant improvement of 8.54% in B³ F-score for event coreference resolution on OntoNotes 2.0 corpus.
events
nlp
coreference
december 2011 by rybesh
Aravind K. Joshi - Towards Discourse Meaning
november 2011 by rybesh
The overall goal is to discuss some issues concerning the dependencies at the discourse level and at the sentence level. However, first I will briefly describe the Penn Discourse Treebank (PDTB)*, a corpus in which we annotate the discourse connectives (explicit and implicit) and their arguments together with "attributions" of the arguments and the relations denoted by the connectives, and also the senses of the connectives. I will then focus on the complexity of dependencies in terms of (a) the elements that bear the dependency relations, (b) graph theoretic properties of these dependencies such as nested and crossed dependencies, dependencies with shared arguments, and (c) attributions and their relationship to the dependencies, among others. I will compare these dependencies with those at the sentence level and discuss some issues that relate to the transition from the sentence level to the level of "immediate discourse" and propose some conjectures.
discourse
meaning
linguistics
nlp
november 2011 by rybesh
chromium-compact-language-detector - C++ library and Python bindings for detecting language from UTF8 text, extracted from the Chromium browser - Google Project Hosting
november 2011 by rybesh
This is a straight port from the CLD (Compact Language Detector) library embedded in Google's Chromium browser. The library detects the language from provided UTF8 text (plain text or HTML). It's implemented in C++, with very basic Python bindings.
language
detection
nlp
python
november 2011 by rybesh
Dan Jurafsky, Syntactic Variations Versus Semantic Roles
october 2011 by rybesh
Dan Jurafsky, Syntactic Variations Versus Semantic Roles, Some Typical Semantic Roles, Two Solutions To The Difficulty Of Defining Semantic Roles, PropBank, FrameNet, Information Extraction Versus Semantic Role Labeling, Evaluation Measures, Parsing Algorithm, Combining Identification And Classification Models, Summary
nlp
frame
semantics
october 2011 by rybesh
Inter-Event Dependencies support Event Extraction from Biomedical Literature
october 2011 by rybesh
The description of events in biomedical literature often follows discourse patterns. For example, authors may firstly mention the transcription of a gene, and then go on to describe how this transcription is regulated by another gene. Capturing such patterns can be beneficial when we want to extract event mentions from literature. For instance, detecting the mention of a transcription of gene A gives us a hint to actively look for mentions of regulations involving A. With this hint we could find such mentions even if they follow unseen lexical or syntactic patterns. To exploit such hints we need to perform event extraction in a cross sentence manner.
It is shown that imperatively defined factor graphs (IDF) are an intuitive way to build Markov Networks that model inter-dependencies between mentions of events within sentences, and across sentence-boundaries. Small pieces of procedural code define the graph structure, feature functions and hooks for efficient inference. Empirically, this leads to an efficient cross-sentence event extractor with very competitive results on the BioNLP shared task. One of our inter-event features shows an impact of 1.94 points in F1 for the class of regulation events.
nlp
event
extraction
october 2011 by rybesh
LingPipe: Competition
september 2011 by rybesh
On this page, we break our competition down into academic toolkits and industrial toolkits. We only consider software that is available for linguistic processing, not companies that rely on linguistic processing in an application but do not sell that technology.
nlp
software
september 2011 by rybesh
CCG: Software - Illinois Semantic Role Labeler (SRL)
september 2011 by rybesh
Semantic Role Labeler is a machine-learning based tool that analyzes the shallow semantic information of a given sentence. The tool is capable of outputting verb-argument structure following the notation defined by the PropBank project.
frame
semantics
nlp
september 2011 by rybesh
mate-tools - Tools for Natural Language Analysis, Generation and Machine Learning - Google Project Hosting
september 2011 by rybesh
The tools provide a pipeline of modules that carry out lemmatization, part-of-speech tagging, dependency parsing, and semantic role labeling of a sentence. The system’s two main components draw on improved versions of a state-of-the-art dependency parser (Bohnet, 2010) and semantic role labeler (Björkelund et al., 2009) developed independently by the authors. The tools are language independent, provide very high accuracy and are fast. The dependency parser had the top score for German and English dependency parsing in the CoNLL shared task 2009.
nlp
frame
semantics
september 2011 by rybesh
Detection, Representation, and Exploitation of Events in the Semantic Web Workshop in conjunction with the 10th International Semantic Web Conference 2011 23 October Registration now open at: http://iswc2011.semanticweb.org/attending/registration
september 2011 by rybesh
In recent years, researchers in several communities involved in aspects of the web have begun to realise the potential benefits of assigning an important role to events in the representation and organisation of knowledge and media. While a good deal of relevant research has been done in the semantic web community (for example on the modeling of events), a lot of complementary research has been done in other communities, such as multimedia processing and information retrieval. The goal of this workshop is to advance research on this general topic within the semantic web community, by both building on existing semantic web work and integrating results and methods from other areas, with a particular focus on issues that are central to the semantic web.
events
modeling
semweb
nlp
september 2011 by rybesh
Using predicate-argument structures for information extraction
august 2011 by rybesh
In this paper we present a novel, customizable IE paradigm that takes advantage of predicate-argument structures. We also introduce a new way of automatically identifying predicate argument structures, which is central to our IE paradigm. It is based on: (1) an extended set of features; and (2) inductive decision tree learning. The experimental results prove our claim that accurate predicate-argument structures enable high quality IE results.
frame
semantics
nlp
information
extraction
august 2011 by rybesh
Semantic Role Labeler
august 2011 by rybesh
Demo of mate-tools semantic role labeler.
frame
semantics
nlp
august 2011 by rybesh
Martha Palmer | Projects | Verb Net
august 2011 by rybesh
VerbNet (VN) (Kipper-Schuler 2006) is the largest on-line verb lexicon currently available for English. It is a hierarchical domain-independent, broad-coverage verb lexicon with mappings to other lexical resources such as WordNet (Miller, 1990; Fellbaum, 1998), Xtag (XTAG Research Group, 2001), and FrameNet (Baker et al., 1998). VerbNet is organized into verb classes extending Levin (1993) classes through refinement and addition of subclasses to achieve syntactic and semantic coherence among members of a class. Each verb class in VN is completely described by thematic roles, selectional restrictions on the arguments, and frames consisting of a syntactic description and semantic predicates with a temporal function, in a manner similar to the event decomposition of Moens and Steedman (1988).
corpus
linguistics
nlp
language
data
frame
semantics
august 2011 by rybesh
LDC Catalog
august 2011 by rybesh
Proposition Bank I was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004T14, ISBN 1-58563-304-6.
This is a semantic annotation of the Wall Street Journal section of Treebank-2. More specifically, each verb occurring in the Treebank has been treated as a semantic predicate and the surrounding text has been annotated for arguments and adjuncts of the predicate. The verbs have also been tagged with coarse grained senses and with inflectional information. This work was done in the Computer and Information Sciences Department at the University of Pennsylvania.
frame
semantics
nlp
language
data
august 2011 by rybesh
Martha Palmer | Projects | ACE
august 2011 by rybesh
The original PropBank project, funded by ACE, created a corpus of text annotated with information about basic semantic propositions. Predicate-argument relations were added to the syntactic trees of the Penn Treebank. This resource is now available via LDC.
frame
semantics
nlp
language
august 2011 by rybesh
SemLink
august 2011 by rybesh
SemLink is a project whose aim is to link together different lexical resources via a set of mappings. These mappings will make it possible to combine the different information provided by these different lexical resources for tasks such as inferencing. We also plan to use the mappings to aid in semi-automatic extension of each resource's coverage, to increase the overall overlap in coverage. Currently, we are creating mappings between the following resources:
PropBank: A corpus of one million words of English text, annotated with argument role labels for verbs; and a lexicon defining those argument roles on a per-verb basis.
VerbNet: A lexicon that groups verbs based on their semantic/syntactic linking behavior.
FrameNet: A lexicon based on frame semantics.
WordNet: A lexicon that describes semantic relationships (such as synonymy and hyperonymy) between individual words.
frame
semantics
nlp
language
august 2011 by rybesh
SEMAFOR: Semantic Analyzer of Frame Representations
august 2011 by rybesh
SEMAFOR: Semantic Analysis of Frame Representations is a tool for automatic analysis of the frame-semantic structure of English text.
nlp
frames
semantic
parsing
august 2011 by rybesh
ScalaNLP
august 2011 by rybesh
ScalaNLP is a collection of libraries for Natural Language Processing, Machine Learning, and Statistics.
scala
nlp
linearalgebra
statistics
august 2011 by rybesh
Corpus-Based Study of Scientific Methodology: Comparing the Historical and Experimental Sciences
july 2011 by rybesh
This chapter studies the use of textual features based on systemic functional linguistics, for genre-based text categorization. We describe feature sets that represent different types of conjunctions and modal assessment, which together can partially indicate how different genres structure text and may prefer certain classes of attitudes towards propositions in the text. This enables analysis of large-scale rhetorical differences between genres by examining which features are important for classification. The specific domain we studied comprises scientific articles in historical and experimental sciences (paleontology and physical chemistry, respectively). We applied the SMO learning algorithm, which with our feature set achieved over 83% accuracy for classifying articles according to field, though no field-specific terms were used as features. The most highly-weighted features for each were consistent with hypothesized methodological differences between historical and experimental sciences, thus lending empirical evidence to the recent philosophical claim of multiple scientific methods.
nlp
rhetoric
science
history
language
genre
classification
linguistics
july 2011 by rybesh
AKSW : Projects / FOX
july 2011 by rybesh
FOX is a framework that integrates the Linked Data Cloud and makes use of the diversity of NLP algorithms to extract RDF triples of high accuracy out of NL. In its current version, it integrates and merges the results of Named Entity Recognition, Keyword Extraction and Relation Extraction tools.
semweb
extraction
nlp
tools
ner
july 2011 by rybesh
School of Informatics: Advanced Natural Language Processing
june 2011 by rybesh
The course will synthesize recent research in linguistics, computer science, and natural language processing with the aim of introducing students to theoretical and computational models of language. The course will familiarize students with a wide range of linguistic phenomena with the aim of appreciating the complexity, but also the systematic behaviour of natural languages like English, the pervasiveness of ambiguity, and how this presents challenges in natural language processing. In addition, the course introduces the most important algorithms and data structures that are commonly used to solve many NLP problems.
nlp
syllabus
discourse
june 2011 by rybesh
6.892: Computational Models of Discourse
june 2011 by rybesh
This course is a graduate level introduction to automatic discourse processing. The emphasis will be on methods and models that have applicability to natural language and speech processing.
The class will cover the following topics: discourse structure, models of coherence and cohesion, plan recognition algorithms, and text segmentation. We will study symbolic as well as machine learning methods for discourse analysis. We will also discuss the use of these methods in a variety of applications ranging from dialogue systems to automatic essay writing.
discourse
modeling
nlp
june 2011 by rybesh
ACL Anthology » LaTeCH 2011
june 2011 by rybesh
Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities.
nlp
culturalheritage
digitalhumanities
june 2011 by rybesh
LaTeCH 2011: Language Technology for Cultural Heritage, Social Sciences, and Humanities
june 2011 by rybesh
The LaTeCH workshop series aims to provide a forum for researchers who are working on developing novel information technology for improved information access to data from the Humanities, Social Sciences, and Cultural Heritage.
Recent developments in the Humanities, Social Sciences, and Cultural Heritage draw an increasing interest from researchers in NLP in developing methods for data cleaning, semantic annotation, intelligent querying, linking, discovery and visualisation of interesting trends. Language technology has an important role to play in these processes, even for collections which are primarily non-textual, since text is the pervasive medium used for metadata. These fairly novel domains of application entail new challenges to NLP research, such as noisy text (e.g., due to OCR problems), non-standard, or archaic language varieties (e.g., historic language, dialects, mixed use of languages, ellipsis, transcription errors), the necessity to link data of diverse formats (e.g., text, database, video, speech) and languages, and the lack of available resources, such as dictionaries. Furthermore, often neither annotated domain data is available, nor the required funds to manually create it, thus forcing researchers to investigate (semi-) automatic resource development and domain adaptation approaches involving the least possible manual effort.
nlp
culturalheritage
digitalhumanities
june 2011 by rybesh
Template-Based Information Extraction without the Templates
june 2011 by rybesh
Standard algorithms for template-based information extraction (IE) require predefined template schemas, and often labeled data, to learn to extract their slot fillers (e.g., an embassy is the Target of a Bombing template). This paper describes an approach to template-based IE that removes this requirement and performs extraction without knowing the template structure in advance. Our algorithm instead learns the template structure automatically from raw text, inducing template schemas as sets of linked events (e.g., bombings include detonate, set off, and destroy events) associated with semantic roles. We also solve the standard IE task, using the induced syntactic patterns to extract role fillers from specific documents. We evaluate on the MUC-4 terrorism dataset and show that we induce template structure very similar to hand-created gold structure, and we extract role fillers with an F1 score of .40, approaching the performance of algorithms that require full knowledge of the templates.
events
extraction
nlp
june 2011 by rybesh
Event Extraction as Dependency Parsing
june 2011 by rybesh
Nested event structures are a common occurrence in both open domain and domain specific extraction tasks, e.g., a “crime” event can cause an “investigation” event, which can lead to an “arrest” event. However, most current approaches address event extraction with highly local models that extract each event and argument independently. We propose a simple approach for the extraction of such structures by taking the tree of event-argument relations and using it directly as the representation in a reranking dependency parser. This provides a simple framework that captures global properties of both nested and flat event structures. We explore a rich feature space that models both the events to be parsed and context from the original supporting text. Our approach obtains competitive results in the extraction of biomedical events from the BioNLP’09 shared task with an F1 score of 53.5% in development and 48.6% in testing.
events
extraction
nlp
june 2011 by rybesh
shravanmn/Yahoo_LDA at master - GitHub
june 2011 by rybesh
Yahoo!'s topic modelling framework using Latent Dirichlet Allocation.
hadoop
nlp
topicmodeling
june 2011 by rybesh
maui-indexer - Maui - Multi-purpose automatic topic indexing - Google Project Hosting
june 2011 by rybesh
Maui automatically identifies main topics in text documents. Depending on the task, topics are tags, keywords, keyphrases, vocabulary terms, descriptors, index terms or titles of Wikipedia articles.
Maui performs the following tasks:
term assignment with a controlled vocabulary (or thesaurus)
subject indexing
topic indexing with terms from Wikipedia
keyphrase extraction
terminology extraction
automatic tagging
It can also be used for terminology extraction and semi-automatic topic indexing.
indexing
vocabulary
tools
nlp
machinelearning
java
june 2011 by rybesh
Wikipedia Miner - Home
may 2011 by rybesh
Wikipedia Miner is a toolkit for navigating and making use of the structure and content of Wikipedia. It aims to make it easy for you to integrate Wikipedia's knowledge into your own applications, by:
providing simplified, object-oriented access to Wikipedia's structure and content.
measuring how terms and concepts in Wikipedia are connected to each other.
detecting and disambiguating Wikipedia topics when they are mentioned in documents.
wikipedia
textmining
nlp
webservices
tools
datamining
may 2011 by rybesh
Penn Treebank P.O.S. Tags
april 2011 by rybesh
Alphabetical list of part-of-speech tags used in the Penn Treebank Project.
linguistics
nlp
reference
april 2011 by rybesh
lisp - How do I manipulate parse trees? - Stack Overflow
april 2011 by rybesh
Example of using Tregex and Tsurgeon.
nlp
trees
java
regex
april 2011 by rybesh
TiMBL: Tilburg Memory-Based Learner
april 2011 by rybesh
TiMBL is an open source software package implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification with feature weighting suitable for symbolic feature spaces, and IGTree, a decision-tree approximation of IB1-IG. All implemented algorithms have in common that they store some representation of the training set explicitly in memory. During testing, new cases are classified by extrapolation from the most similar stored cases.
For the past decade, TiMBL has been mostly used in natural language processing as a machine learning classifier component, but its use extends to virtually any supervised machine learning domain. Due to its particular decision-tree-based implementation, TiMBL is in many cases far more efficient in classification than a standard k-nearest neighbor algorithm would be.
nlp
machinelearning
tools
april 2011 by rybesh
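The TiMBL description above mentions IB1-IG, k-nearest neighbor classification with feature weighting over symbolic features. A minimal sketch of that idea follows, assuming information-gain weights and a weighted overlap distance. It is not TiMBL's code: the function names and the toy tagging data are hypothetical, and k here counts neighbors, which may differ from how TiMBL itself interprets k.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain_weights(instances, labels):
    """One weight per feature: how much knowing its value reduces label entropy."""
    base = entropy(labels)
    weights = []
    for i in range(len(instances[0])):
        by_value = defaultdict(list)
        for features, label in zip(instances, labels):
            by_value[features[i]].append(label)
        remainder = sum(
            len(subset) / len(labels) * entropy(subset) for subset in by_value.values()
        )
        weights.append(base - remainder)
    return weights


def classify(query, instances, labels, weights, k=1):
    """Majority label among the k stored instances closest to the query.

    Distance is a weighted overlap metric: a feature contributes its weight
    whenever the stored value and the query value differ.
    """
    def distance(stored):
        return sum(w for w, a, b in zip(weights, stored, query) if a != b)

    ranked = sorted(zip(instances, labels), key=lambda pair: distance(pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (previous tag, last letter, filler feature) -> tag.
X = [("DT", "s", "x"), ("DT", "g", "x"), ("PRP", "s", "x"), ("PRP", "g", "x")]
y = ["NN", "NN", "VB", "VB"]

weights = information_gain_weights(X, y)
print(weights)
print(classify(("DT", "g", "z"), X, y, weights))
```

The whole training set stays in memory and all work happens at classification time, which is the memory-based property the description refers to. The decision-tree approximation (IGTree) is not sketched here.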
Data Science Toolkit
march 2011 by rybesh
A collection of the best open data sets and open-source tools for data science, wrapped in an easy-to-use REST/JSON API with command line, Python and Javascript interfaces. Available as a self-contained VM or EC2 AMI that you can deploy yourself.
data
tools
nlp
ec2
webservices
march 2011 by rybesh
Lippmannian Device
march 2011 by rybesh
Lippmannian device is named after Lippmann, and provides a coarse means of showing actor partisanship.
research
tools
analysis
nlp
rhetoric
march 2011 by rybesh
edu.stanford.nlp.ling (Stanford JavaNLP API)
february 2011 by rybesh
This package contains the different data structures used by JavaNLP throughout the years for dealing with linguistic objects in general, of which words are the most generally used.
nlp
data
structures
models
february 2011 by rybesh
Python Interface to Stanford Parser
february 2011 by rybesh
A python interface to the Stanford Parser. It uses JPype to create a Java virtual machine, instantiate the parser, and call methods on it. Most of the code is focused on getting the Stanford Dependencies, but it's easy to add API to call any method on the parser.
java
python
nlp
february 2011 by rybesh
ARCADE: Literature, the Humanities, and the World
december 2010 by rybesh
...digital media and huge databases have enormous potential for supporting, preserving, and making available for study the kinds of underground knowledges and cultural productions outside the sphere of mainstream print that you're concerned about. This is the insurgent potential of the Internet and digital media--they can bypass established methods of fixation and legitimation of cultural products. But in academia these are subjects of interest to humanists--and sociologists and anthropologists. By contrast, when true disciplinary outsiders like Jean-Baptiste Michel and his team enter the arena of cultural history and cultural studies from the side of science and engineering, they must be looking to legitimate themselves by proving that their approach "works" for subjects that they imagine will be widely recognized as significant.
digitalhumanities
nlp
statistics
critique
december 2010 by rybesh
edwired » Blog Archive » Visualizing Millions of Words
december 2010 by rybesh
...the lesson that I would then focus on with my students is that what they are looking at in such a graph is nothing more or less than the frequency with which a word is used in books (and only books) published over the centuries. While such frequencies do reflect something, it is not clear from one graph just what that something is. So instead of an answer, a graph like this one is a doorway that leads to a room filled with questions, each of which must be answered by the historian before he or she knows something worth knowing.
digitalhumanities
nlp
statistics
december 2010 by rybesh
Works Cited: Google Books Ngrams and the number of words for "snow"
december 2010 by rybesh
There's a certain Words For Snowism in the online Google Books Ngrams tool, the suggestion that the more frequently a word is used, the more important it is in a collective unconscious of which the Google Books data set serves as a convenient index. This importance is not the same thing as significance, in the sense of significant digits or statistical significance; it's not the difference that makes a difference, but rather a psychologized importance--attachment, cathexis. Which is really kind of garbage.
nlp
digitalhumanities
statistics
critique
december 2010 by rybesh
Stanford CoreNLP
december 2010 by rybesh
Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, and indicate which noun phrases refer to the same entities. It provides the foundational building blocks for higher level text understanding applications.
Stanford CoreNLP integrates all our NLP tools for the English language, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, and the coreference resolution system. The goal of this project is to enable people to quickly and painlessly get complete linguistic annotations of natural language texts. It is designed to be highly flexible and extensible, i.e., with a single option you can change which tools should be enabled and which should be disabled.
nlp
research
tools
java
nlproc
december 2010 by rybesh
tm - Text Mining Package
october 2010 by rybesh
tm (shorthand for Text Mining Infrastructure in R) provides a framework for text mining applications within R.
The tm package offers functionality for managing text documents, abstracts the process of document manipulation, and eases the use of heterogeneous text formats in R. The package has integrated database backend support to minimize memory demands. Advanced metadata management is implemented for collections of text documents, easing the use of large document sets enriched with metadata.
R
textmining
datamining
nlp
tools
statistics
october 2010 by rybesh
GEOLocate - Software for Georeferencing Natural History Data
october 2010 by rybesh
The GEOLocate project is an effort to develop software and services for translating textual locality descriptions associated with biodiversity collections data into geographic coordinates.
locative
tools
georeferencing
nlp
october 2010 by rybesh
MALLET homepage
october 2010 by rybesh
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
datamining
java
machinelearning
nlp
tools
october 2010 by rybesh
CRCnetBASE - Text Mining
september 2010 by rybesh
Giving a broad perspective of the field from numerous vantage points, Text Mining: Classification, Clustering, and Applications focuses on statistical methods for text mining and analysis. It examines methods to automatically cluster and classify text documents and applies these methods in a variety of areas, including adaptive information filtering, information distillation, and text search.
The book begins with chapters on the classification of documents into predefined categories. It presents state-of-the-art algorithms and their use in practice. The next chapters describe novel methods for clustering documents into groups that are not predefined. These methods seek to automatically determine topical structures that may exist in a document corpus. The book concludes by discussing various text mining applications that have significant implications for future research and industrial use.
textmining
nlp
september 2010 by rybesh
Text Processing APIs and Python NLTK Demos | Text Mining | Stemming | Tagging | Python NLTK Demo
august 2010 by rybesh
The Text Processing API supports the following functionality:
Stemming & Lemmatization
Sentiment Analysis
Tagging and Chunk Extraction
nlp
api
webservices
python
august 2010 by rybesh
LingPipe Book
august 2010 by rybesh
We're writing a book about LingPipe. As it's written, we'll be putting up drafts here.
nlp
java
august 2010 by rybesh
Extend Swift | SiLCC
august 2010 by rybesh
SiLCC is a cloud based service for parsing text and extracting relevant keywords.
nlp
tools
tagging
metadata
api
august 2010 by rybesh
Training Examples Q&A - machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization
june 2010 by rybesh
Where data geeks ask and answer questions on machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization!
ai
machinelearning
nlp
textanalysis
ir
datamining
search
statistics
infoviz
reference
june 2010 by rybesh
IBM Emerging Technologies - BigSheets
march 2010 by rybesh
BigSheets is an extension of the mashup paradigm that:
1. Integrates gigabytes, terabytes, or petabytes of unstructured data from web-based repositories
2. Collects a wide range of unstructured web data stemming from user-defined seed URLs
3. Extracts and Enriches that data using the unstructured information management architecture you choose (LanguageWare, OpenCalais, etc.)
4. Lets you Explore and Visualize this data in specific, user-defined contexts (such as ManyEyes).
data
analytics
hadoop
spreadsheet
archives
nlp
infoviz
march 2010 by rybesh
Blegging for Help: Web Scraping for Content? « LingPipe Blog
january 2010 by rybesh
In search of a good general-purpose method of pulling the content out of arbitrary web pages and leaving the boilerplate, advertising, navigation, etc. behind. See also http://bit.ly/4SFOIH
web
nlp
html
parsing
textanalysis
january 2010 by rybesh
Maximum Entropy (GA) Model Optimization Package
august 2009 by rybesh
Maximum entropy (aka logistic regression) models are very popular, especially in natural language processing. The software here is an implementation of maximum likelihood and maximum a posteriori optimization of the parameters of these models. The algorithms used are much more efficient than the iterative scaling techniques used in almost every other maxent package out there.
research
tools
nlp
statistics
machinelearning
ocaml
logreg
maxent
august 2009 by rybesh
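The maxent entry above distinguishes maximum likelihood from maximum a posteriori fitting. A minimal sketch of the difference for a binary model, assuming a zero-mean Gaussian prior on the weights and plain gradient ascent. This is not the package's optimizer, which the description says is far more efficient, and the function names, parameters, and toy data are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_map(rows, labels, prior_precision=0.1, step=0.1, epochs=2000):
    """Gradient ascent on the log-posterior of a binary maxent model.

    Objective: sum_i log p(y_i | x_i, w) - (prior_precision / 2) * ||w||^2.
    Setting prior_precision to 0 gives the maximum-likelihood fit.
    """
    weights = [0.0] * len(rows[0])
    for _ in range(epochs):
        # Gradient of the Gaussian log-prior pulls every weight toward zero.
        gradient = [-prior_precision * w for w in weights]
        for x, y in zip(rows, labels):
            error = y - sigmoid(sum(w * v for w, v in zip(weights, x)))
            for j, v in enumerate(x):
                gradient[j] += error * v
        weights = [w + step * g for w, g in zip(weights, gradient)]
    return weights


def predict(weights, x):
    """Probability of the positive class for feature vector x."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)))


# Hypothetical features: [bias, contains "great", contains "awful"].
ROWS = [[1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 1], [1, 0, 0]]
LABELS = [1, 1, 0, 0, 1]

w = fit_map(ROWS, LABELS)
print(predict(w, [1, 1, 0]), predict(w, [1, 0, 1]))
```

On separable toy data like this the pure maximum-likelihood weights grow without bound, so the prior term is what keeps the fit finite. A larger prior_precision shrinks the weights further toward zero.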
Python Package Index : topia.termextract 1.1.0
august 2009 by rybesh
This package determines important terms within a given piece of content. It uses linguistic tools such as Parts-Of-Speech (POS) and some simple statistical analysis to determine the terms and their strength.
python
nlp
extraction
august 2009 by rybesh
LingPipe
may 2009 by rybesh
LingPipe is a suite of Java libraries for the linguistic analysis of human language.
java
nlp
datamining
tools
entitydetection
may 2009 by rybesh
nltk.chunk.named_entity
february 2009 by rybesh
Named entity chunker for NLTK.
python
tools
entitydetection
nlp
nltk
february 2009 by rybesh
nltk.collocations
february 2009 by rybesh
Tools to identify collocations --- words that often appear consecutively --- within corpora. They may also be used to find other associations between word occurrences.
python
tools
nlp
nltk
february 2009 by rybesh
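The nltk.collocations entry above describes finding words that often appear consecutively. One common way to score such pairs is pointwise mutual information (PMI). The sketch below is a standard-library illustration of that scoring only. It does not use NLTK's own classes, and the function name, the min_count parameter, and the sample sentence are hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI(a, b) = log2( p(a, b) / (p(a) * p(b)) ). Pairs seen fewer than
    min_count times are dropped because PMI overrates rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bigrams
        p_a = unigrams[a] / n_tokens
        p_b = unigrams[b] / n_tokens
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    # Highest-scoring (most strongly associated) pairs first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


TEXT = (
    "new york is big and new york is busy but "
    "the city is old and the river is wide"
)
ranked = bigram_pmi(TEXT.split())
for pair, score in ranked:
    print(pair, round(score, 2))
```

The frequency cutoff matters: without it, any pair of words that each occur exactly once and happen to be adjacent would outrank genuinely recurring pairs.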