rybesh + textanalysis   31

About Campaign 2012 in the Media | Project for Excellence in Journalism (PEJ)
To arrive at the results regarding the tone of coverage, PEJ employed computer coding software developed by Crimson Hexagon along with PEJ's traditional media research methods.

The technology for Crimson Hexagon is rooted in an algorithm created by Gary King, a professor at Harvard University's Institute for Quantitative Social Science. (Click here to view the study explaining the algorithm.)

According to Crimson Hexagon, the purpose of computer coding is to "take as data a potentially large set of text documents, of which a small subset is hand coded into an investigator-chosen set of mutually exclusive and exhaustive categories. As output, the methods give approximately unbiased and statistically consistent estimates of the proportion of all documents in each category."
news  textanalysis  sentiment  machinelearning  classification 
17 hours ago by rybesh
NEH Digital Humanities Startup Grants: Funding the Future « Early Modern Online Bibliography
The video “How Natural Language Processing is Changing Research” provides a more extended look at WordSeer’s usefulness for analyzing slave narratives, but its purpose is also to underscore how such a tool can benefit humanities scholars. In this video the discussion veers toward presenting reading as a chore from which humanities scholars seek relief. On that note, a student in Dr. Michael Ullyot’s undergraduate ENG 203 course, “Hamlet in the Humanities Lab” at the University of Calgary offers some pertinent comments. In her penultimate blog post for the course, Stephanie Vandework devotes a section to “The Pros and Cons of Exploratory Analysis” and examines more closely the claims in the WordSeer Shakespeare demo, finding some to suffer from overgeneralization. (For a view of the course from the instructor’s perspective, see Dr. Ullyot’s presentation, Teaching Hamlet in the Humanities Lab, for the Renaissance Society of America conference this past March 2012.)
nlp  digitalhumanities  textanalysis 
15 days ago by rybesh
It’s the data: a plan of action. | The Stone and the Shell
What we need are collections in the 5,000 – 500,000 volume range, cleaned up to at least (say) 95% recall and 99% precision. Precision is more important than recall, because false negatives drop out of many kinds of analysis — as long as they’re randomly distributed (i.e. you can’t just ignore the f/s problem in the 18c). Collections of that kind are going to generate insights that we can’t glimpse as individual readers. They’ll be especially valuable once we enrich the metadata with information about (for instance) genre, gender, and nationality. I’m not confident that we can crowdsource OCR correction (it’s an awful lot of work), but I am confident that we could crowdsource some light enrichment of metadata.
digitalhumanities  ocr  digitization  textanalysis 
16 days ago by rybesh
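A note on the precision and recall targets quoted in the entry above: the sketch below shows how the two figures are computed from raw counts. It assumes precision means the share of tokens kept in the cleaned collection that are correct, and recall means the share of correct source tokens that survive cleaning. The function name and the counts are invented for illustration and do not come from the post.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) computed from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# Hypothetical counts for one cleaned volume:
# 950 correct tokens kept, 10 wrong tokens kept, 50 correct tokens lost.
precision, recall = precision_recall(950, 10, 50)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

With these made-up counts recall sits at the 95% target while precision falls just short of 99%, which is the kind of gap the post says matters more.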
JSTOR: The Journal of Modern History, Vol. 84, No. 1 (March 2012), pp. 116-144
by using multiple databases and keyword variants, the historian may gain confidence in a particular chronological intervention. Large databases, the result of scanned microfilm collections or mass digitization initiatives across multiple libraries, provide enough texts to bridge generation and genre, incorporating authors from a variety of backgrounds. Sheer number of texts is important here: ECCO indexes 200,000 works from eighteenth- and nineteenth-century Britain with 33 million pages of text; Google Books Search has 42 million books from all periods. If the historian’s goal is to show a shift in common word usage, the size of a database is more important than its genre specificity; in the case examined in the present article, for instance, Google Book Search and ECCO were superior to the available poetry databases. Iterative visitation of multiple databases provided another potential source of richness for extracting meaning from these tools.
textanalysis  search  digitalhumanities 
21 days ago by rybesh
The Myth of Text Analytics and Unobtrusive Measurement » the scottbot irregular
Text analytics are often used in the social sciences as a way of unobtrusively observing people and their interactions. Humanists tend to approach the supporting algorithms with skepticism, and with good reason. This post is about the difficulties of using words or counts as a proxy for some secondary or deeper meaning.
digitalhumanities  textanalysis 
25 days ago by rybesh
Visualizing Topic Models
Managing large collections of documents is an important problem for many areas of science, industry, and culture. Probabilistic topic modeling offers a promising solution. Topic modeling is an unsupervised machine learning method that learns the underlying themes in a large collection of otherwise unorganized documents. This discovered structure summarizes and organizes the documents. However, topic models are high-level statistical tools--a user must scrutinize numerical distributions to understand and explore their results. In this paper, we present a method for visualizing topic models. Our method creates a navigator of the documents, allowing users to explore the hidden structure that a topic model discovers. These browsing interfaces reveal meaningful patterns in a collection, helping end-users explore and understand its contents in new ways. We provide open source software of our method.

Understanding and navigating large collections of documents has become an important activity in many spheres. However, many document collections are not coherently organized and organizing them by hand is impractical. We need automated ways to discover and visualize the structure of a collection in order to more easily explore its contents. Probabilistic topic modeling is a set of machine learning tools that may provide a solution (Blei and Lafferty 2009). Topic modeling algorithms discover a hidden thematic structure in a collection of documents; they find salient themes and represent each document as a combination of themes. However, topic models are high-level statistical tools. A user must scrutinize numerical distributions to understand and explore their results; the raw output of the model is not enough to create an easily explored corpus. We propose a method for using a fitted topic model to organize, summarize, visualize, and interact with a corpus.
With our method, users can explore the corpus, moving between high level discovered summaries (the "topics") and the documents themselves, as Figure 1 illustrates.
topicmodels  textanalysis  infoviz  visualization 
9 weeks ago by rybesh
10 MILLION INTERNATIONAL DYADIC EVENTS
When the Palestinians launch a mortar attack into Israel, the Israeli army does not wait until the end of the calendar year to react. Yet, most modern data collections are aggregated to the month or year. The data available here include almost 10 million individual events, each coded to the exact day they occur or become known. Each event is summarized in the data as "Actor A does something to Actor B", with Actors A and B recording about 450 countries and other (within-country) actors and "does something to" coded in an ontology of about 200 types of actions. The data are coded by computer from millions of Reuters news reports. The software system (produced by VRA) that performs this task has been independently evaluated by King and Lowe (2003). This article found that for the numbers of events it was possible to convince humans (trained Harvard undergraduates) to code by hand, the machine did as well as the humans. For much larger numbers of events for which no expert coder could keep up, the machine dominates.
events  politicalscience  data  machinelearning  textanalysis 
10 weeks ago by rybesh
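The "Actor A does something to Actor B" records described in the entry above could be represented roughly as follows. This is a minimal sketch: the `Event` class, the actor codes, and the action labels are hypothetical and are not the dataset's actual schema or ontology. It also shows what monthly aggregation throws away.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    """One dyadic event, coded to the exact day (hypothetical layout)."""
    day: date
    actor_a: str
    action: str
    actor_b: str


# Invented sample records; the codes are placeholders.
events = [
    Event(date(2001, 3, 4), "A", "attack", "B"),
    Event(date(2001, 3, 5), "B", "retaliate", "A"),
    Event(date(2001, 4, 1), "A", "negotiate", "B"),
]

# Aggregating to the month keeps the counts but loses the day-level ordering.
monthly = Counter((e.day.year, e.day.month) for e in events)

# The daily records still show who acted first within March.
first_in_march = min((e for e in events if e.day.month == 3), key=lambda e: e.day)
```

The monthly counter only says that two events happened in March; the daily records show that the attack preceded the retaliation by one day.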
timjurka/RTextTools
RTextTools is a free, open source machine learning package for automatic text classification that makes it simple for both novice and advanced users to get started with supervised learning. The package includes nine algorithms for ensemble classification (svm, slda, boosting, bagging, random forests, glmnet, decision trees, neural networks, maximum entropy), comprehensive analytics, and thorough documentation.
textanalysis  classification  tools  research 
11 weeks ago by rybesh
Sp12-ENGLISH-162-01 : Critical Methods: Introduction to Digital Humanities
Digital texts and digital libraries offer us new opportunities for searching and accessing literary material. But more interesting and exciting than the mere searching of digital texts is the ability to leverage computation in order to process and analyze textual data, to provide new methods for reading, analyzing, and understanding literature.

This course provides an introduction to the field of humanities computing with a special emphasis on literary text-analysis. Students learn about the preparation and processing of digital texts while exploring literary methods which help us explain and interpret literary texts, genres, and movements. The course includes units dealing with "stylometry" (computer based stylistic analysis), authorship attribution, gender detection, text encoding, and the visualization of literary information using such open source tools as R and Gephi.

Throughout the course we consider the theoretical issues associated with employing quantitative methodologies in a traditionally qualitative discipline; we read and discuss landmark essays in the field; and we end with an informed discussion of how digital libraries and computation are taking literary scholarship "beyond the book." Students will develop basic coding skills in an environment in which understanding literature is the only prerequisite. No programming experience is required; students will develop fluency in XML and R through exercises and work on a collaborative text-analysis project.
digitalhumanities  syllabus  textanalysis 
11 weeks ago by rybesh
Automating Quantitative Narrative Analysis of News Data
We present a working system for large scale quantitative narrative analysis (QNA) of news corpora, which includes various recent ideas from text mining and pattern analysis in order to solve a problem arising in computational social sciences. The task is that of identifying the key actors in a body of news, and the actions they perform, so that further analysis can be carried out. This step is normally performed by hand and is very labour intensive. We then characterise the actors by: studying their position in the overall network of actors and actions; studying the time series associated with some of their properties; generating scatter plots describing the subject/object bias of each actor; and investigating the types of actions each actor is most associated with. The system is demonstrated on a set of 100,000 articles about crime that appeared in the New York Times between 1987 and 2007. As an example, we find that Men were most commonly responsible for crimes against the person, while Women and Children were most often victims of those crimes.
textanalysis  textmining  events  sociology  news 
12 weeks ago by rybesh
[1003.0783] Supervised Topic Models
We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive an approximate maximum-likelihood procedure for parameter estimation, which relies on variational methods to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and the political tone of amendments in the U.S. Senate based on the amendment text. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression.
slda  classification  lda  topicmodels  textanalysis  machinelearning 
12 weeks ago by rybesh
Supervised latent Dirichlet allocation for classification
This is a C++ implementation of supervised latent Dirichlet allocation (sLDA) for classification.
c++  slda  classification  topicmodels  lda  machinelearning  textanalysis 
12 weeks ago by rybesh
Latent Dirichlet Allocation in C
This is a C implementation of variational EM for latent Dirichlet allocation (LDA), a topic model for text or other discrete data. LDA allows you to analyze a corpus and extract the topics that combined to form its documents. For example, click here to see the topics estimated from a small corpus of Associated Press documents. LDA is fully described in Blei et al. (2003).

This code contains:

· an implementation of variational inference for the per-document topic proportions and per-word topic assignments

· a variational EM procedure for estimating the topics and exchangeable Dirichlet hyperparameter
lda  c  linguistics  machinelearning  textanalysis  textmining 
12 weeks ago by rybesh
Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts
Politics and political conflict often occur in the written and spoken word. Scholars have long recognized this, but the massive costs of analyzing even moderately sized collections of texts have hindered their use in political science research. Here lies the promise of automated text analysis: it substantially reduces the costs of analyzing large collections of text. We provide a guide to this exciting new area of research and show how, in many instances, the methods have already obtained part of their promise. But there are pitfalls to using automated methods: they are no substitute for careful thought and close reading and require extensive and problem specific validation. We survey a wide range of new methods, provide guidance on how to validate the output of the models, and clarify misconceptions and errors in the literature. To conclude, we argue that for automated text methods to become a standard tool for political scientists, methodologists must contribute new methods and new methods of validation.
textanalysis  politicalscience  socialscience  digitalhumanities 
12 weeks ago by rybesh
Stanford Vis Group | Interpretation and Trust: Designing Model-Driven Visualizations for Text Analysis
Statistical topic models can help analysts discover patterns in large text corpora by identifying recurring sets of words and enabling exploration by topical concepts. However, understanding and validating the output of these models can itself be a challenging analysis task. In this paper, we offer two design considerations - interpretation and trust - for designing visualizations based on data-driven models. Interpretation refers to the facility with which an analyst makes inferences about the data through the lens of a model abstraction. Trust refers to the actual and perceived accuracy of an analyst's inferences. These considerations derive from our experiences developing the Stanford Dissertation Browser, a tool for exploring over 9,000 Ph.D. theses by topical similarity, and a subsequent review of existing literature. We contribute a novel similarity measure for text collections based on a notion of "word-borrowing" that arose from an iterative design process. Based on our experiences and a literature review, we distill a set of design recommendations and describe how they promote interpretable and trustworthy visual analysis tools.
infoviz  textanalysis  topicmodels 
february 2012 by rybesh
Robert Young - Text Understanding: A Survey
The goal of the study is to examine work that has something to offer toward the construction of a computable model of text understanding. It focuses on those aspects of meaning that are conveyed only by groups of connected sentences—texts. Additionally, only work that attempts to deal with the semantics or understanding of texts, as opposed to statistical or syntactic analysis, is considered.
nlp  textanalysis  semantics 
february 2012 by rybesh
Automatic text analytics using DBpedia and PoolParty – A Live Demo |The Semantic Puzzle
Let me show you which steps have to be taken to generate a high-quality text mining application, ready to be used to annotate and to categorize any kind of text or documents covering nearly any domain. With our approach of thesaurus based text mining your documents can also be linked to the world of linked (open) data; enrich your documents with data from the LOD cloud!
webinfo  inls520  semweb  textanalysis  classification  skos  tools 
february 2012 by rybesh
Diction Software - Home
Diction 6.0 uses dictionaries (word-lists) to search a text for these qualities:

· Certainty - Language indicating resoluteness, inflexibility, and completeness and a tendency to speak ex cathedra.

· Activity - Language featuring movement, change, the implementation of ideas and the avoidance of inertia.

· Optimism - Language endorsing some person, group, concept or event, or highlighting their positive entailments.

· Realism - Language describing tangible, immediate, recognizable matters that affect people's everyday lives.

· Commonality - Language highlighting the agreed-upon values of a group and rejecting idiosyncratic modes of engagement.
textanalysis  sentiment  digitalhumanities 
january 2012 by rybesh
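The dictionary (word-list) approach described in the Diction entry above can be sketched in a few lines. The word lists here are tiny invented stand-ins, not Diction's dictionaries, and the scoring (matches divided by total tokens) is an assumption made for illustration, not the software's actual formula.

```python
import re
from collections import Counter

# Hypothetical word lists; Diction's real dictionaries are not reproduced here.
WORD_LISTS = {
    "certainty": {"always", "never", "must", "all"},
    "optimism": {"good", "great", "praise", "hope"},
}


def score_text(text, word_lists=WORD_LISTS):
    """Return, per quality, the share of tokens that appear in its word list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return {
        name: sum(counts[word] for word in words) / total
        for name, words in word_lists.items()
    }


scores = score_text("We must always hope for good news.")
```

In this sample sentence two of the seven tokens fall in each list, so both scores are 2/7. Counting of this kind is exactly what the scottbot post bookmarked earlier on this page warns about treating as a proxy for deeper meaning.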
Scale and Method: A Reply to Jeremy Rosen « Post45
The piece had two aims, namely to advocate for the addition of computational methods to our critical repertoire and to give a sample of recent computational work of the sort I find useful. I mention these goals up front because I think some of Rosen’s criticisms follow from the failure (mine, to be sure) to specify exactly what my essay was and was not doing and arguing. So to be clear: it was an argument for methodological expansion, especially for those of us working with contemporary sources, and a high-level synopsis of the results of that expansion.
literarystudies  textanalysis  digitalhumanities 
january 2012 by rybesh
Combining Close and Distant, or, the Utility of Genre Analysis: A Response to Matthew Wilkens’s “Contemporary Fiction by the Numbers” « Post45
Wilkens neglects other equally pressing problems with the computational practices he advocates—limitations that reveal themselves in the very analysis he proffers as a sample of the kind of scholarship such practices might enable. Two problems with Wilkens’s method strike me as most urgent. First and most glaringly, he inadvertently demonstrates how easily data may be misinterpreted to serve conclusions that are sought by the analyst. And second, though he and others doing similar work purport to offer analysis of neutral data sets—say, all the fiction published in a given year—by working with existing bibliographies they perpetuate the selection criteria that governed the initial compilation. Doing so artificially reifies bodies of texts that might in fact be far more heterogeneous and unruly.
literarystudies  digitalhumanities  textanalysis 
january 2012 by rybesh
Contemporary Fiction by the Numbers « Post45
A short illustration of the underlying problem of literary and cultural abundance, a quick tour of several techniques that we might use to expand our analytical repertoire so as to deal with that problem more effectively, and, finally, a consideration of the substantial challenges these methods face in the short-to-medium term.
literarystudies  textanalysis  digitalhumanities 
january 2012 by rybesh
A panlingual anomalous text detector
In a large-scale book scanning operation, material can vary widely in language, script, genre, domain, print quality, and other factors, giving rise to a corresponding variability in the OCRed text. It is often desirable to automatically detect errorful and otherwise anomalous text segments, so that they can be filtered out or appropriately flagged, for such applications as indexing, mining, analyzing, displaying, and selectively re-processing such data. Moreover, it is advantageous to require that the automated detector be independent of the underlying OCR engine (or engines), that it work over a broad range of languages, that it seamlessly handle mixed-language material, and that it accommodate documents that contain domain-specific and otherwise rare terminology. A technique is presented that satisfies these requirements, using an adaptive mixture of character-level N-gram language models. Its design, training, implementation, and evaluation are described within the context of high-volume book scanning.
ocr  textanalysis  textmining  evalulation 
october 2011 by rybesh
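As a rough illustration of the character-level N-gram idea in the entry above, the sketch below trains a single character-bigram model with add-one smoothing and scores segments by average log-probability. This is a simplification and not the paper's adaptive mixture of models; the function names and the training string are invented for the example.

```python
import math
from collections import Counter


def train_bigrams(text):
    """Count character bigrams and the characters that start them."""
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    vocab = set(text)
    return pairs, firsts, vocab


def avg_log_prob(segment, model):
    """Average add-one-smoothed log-probability per bigram in the segment."""
    pairs, firsts, vocab = model
    v = len(vocab) + 1  # one extra slot for characters never seen in training
    total, n = 0.0, 0
    for x, y in zip(segment, segment[1:]):
        total += math.log((pairs[(x, y)] + 1) / (firsts[x] + v))
        n += 1
    return total / n if n else 0.0


# Tiny invented training text; a real detector would train on far more data.
model = train_bigrams("the quick brown fox jumps over the lazy dog " * 20)
clean = avg_log_prob("the lazy fox", model)
noisy = avg_log_prob("tlc 1azy f0x", model)  # simulated OCR errors
```

The OCR-like segment scores lower than the clean one, so a threshold on this score could flag it for filtering or re-processing. Choosing that threshold is left open here.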
Price-is-Right Binary Search (for Suffix Arrays of Documents) « LingPipe Blog
Suffix arrays are useful if you’re looking for anything from plagiarized passages in a pile of writing assignments, cut-and-paste code blocks in a large project, or just commonly repeated phrases on Twitter.
search  textanalysis  textmining 
june 2011 by rybesh
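To make the suffix-array use case in the entry above concrete, here is a naive sketch that finds repeated phrases in a token sequence. It builds the array by sorting suffixes directly, which is fine for small inputs but is not the efficient construction or the binary search the blog post discusses. The function names are made up for this example.

```python
def suffix_array(tokens):
    """Naive construction: sort suffix start positions by the suffix itself."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def repeated_phrases(tokens, n):
    """Return the set of n-token phrases that occur at least twice."""
    order = suffix_array(tokens)
    found = set()
    # Suffixes sharing a prefix are adjacent in sorted order,
    # so comparing neighbours is enough to find every repeat.
    for a, b in zip(order, order[1:]):
        phrase = tokens[a:a + n]
        if len(phrase) == n and phrase == tokens[b:b + n]:
            found.add(tuple(phrase))
    return found


tokens = "the cat sat on the mat and the cat sat down".split()
phrases = repeated_phrases(tokens, 3)
```

In the sample sentence the only three-token phrase that repeats is "the cat sat".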
difflib — SequenceMatcher
SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable.
python  textanalysis 
may 2011 by rybesh
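A short usage sketch for the SequenceMatcher entry above, using only the standard library. It compares two strings and then two token lists, since the class accepts any sequence of hashable elements. The sample inputs are invented.

```python
from difflib import SequenceMatcher

# Compare two strings character by character.
a = "text analysis"
b = "text analytics"
matcher = SequenceMatcher(None, a, b)
ratio = matcher.ratio()          # similarity between 0.0 and 1.0
opcodes = matcher.get_opcodes()  # list of (tag, i1, i2, j1, j2) edit steps

# Compare two token lists; list elements only need to be hashable.
token_ratio = SequenceMatcher(None, ["topic", "model"], ["topic", "map"]).ratio()

for tag, i1, i2, j1, j2 in opcodes:
    print(tag, repr(a[i1:i2]), "->", repr(b[j1:j2]))
```

The first argument is an optional "junk" predicate; passing `None` means no elements are ignored.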
Python Package Index : python-Levenshtein 0.10.2
Python extension computing string distances and similarities.
python  textanalysis  search 
may 2011 by rybesh
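For readers unfamiliar with the string distance named in the entry above, this is a plain-Python sketch of Levenshtein (edit) distance using the usual dynamic-programming table, kept to two rows. It is an illustration of the measure, not the extension's code or API; the function name is chosen for this example.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")
```

A compiled extension is worth using when many comparisons are needed, since this pure-Python loop takes time proportional to the product of the two string lengths.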
Training Examples Q&A - machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization
Where data geeks ask and answer questions on machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization!
ai  machinelearning  nlp  textanalysis  ir  datamining  search  statistics  infoviz  reference 
june 2010 by rybesh
Blegging for Help: Web Scraping for Content? « LingPipe Blog
In search of a good general-purpose method of pulling the content out of arbitrary web pages and leaving the boilerplate, advertising, navigation, etc. behind. See also http://bit.ly/4SFOIH
web  nlp  html  parsing  textanalysis 
january 2010 by rybesh
lda: Collapsed Gibbs sampling methods for topic models
This package implements latent Dirichlet allocation (LDA) and related models. This includes (but is not limited to) sLDA, corrLDA, and the mixed-membership stochastic blockmodel.
clustering  textanalysis  datamining  R  topicmodels 
november 2009 by rybesh
