cshalizi + text_mining   38

Quantitative patterns of stylistic influence in the evolution of literature
"Literature is a form of expression whose temporal structure, both in content and style, provides a historical record of the evolution of culture. In this work we take on a quantitative analysis of literary style and conduct the first large-scale temporal stylometric study of literature by using the vast holdings in the Project Gutenberg Digital Library corpus. We find temporal stylistic localization among authors through the analysis of the similarity structure in feature vectors derived from content-free word usage, nonhomogeneous decay rates of stylistic influence, and an accelerating rate of decay of influence among modern authors. Within a given time period we also find evidence for stylistic coherence with a given literary topic, such that writers in different fields adopt different literary styles. This study gives quantitative support to the notion of a literary “style of a time” with a strong trend toward increasingly contemporaneous stylistic influence."

It'll be interesting to see how they handle the bias induced by selective retention.
to:NB  to_read  literary_history  text_mining  kith_and_kin  rockmore.dan  krakuer.david 
13 days ago by cshalizi
[1204.6703] Two SVDs Suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation
"Topic models can be seen as a generalization of the clustering problem, in that they posit that observations are generated due to multiple latent factors (e.g. the words in each document are generated as a mixture of several active topics, as opposed to just one). This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic probability vectors (the distributions over words for each topic), when only the words are observed and the corresponding topics are hidden.
"We provide a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of mixture models, including the popular latent Dirichlet allocation (LDA) model. For LDA, the procedure correctly recovers both the topic probability vectors and the prior over the topics, using only trigram statistics (i.e. third order moments, which may be estimated with documents containing just three words). The method, termed Excess Correlation Analysis (ECA), is based on a spectral decomposition of low order moments (third and fourth order) via two singular value decompositions (SVDs). Moreover, the algorithm is scalable since the SVD operations are carried out on k by k matrices, where k is the number of latent factors (e.g. the number of topics), rather than in the d-dimensional observed space (typically d >> k)."

That's a really remarkable claim, and I'd tag it to_be_shot_after_a_fair_trial if it weren't being made by genuinely serious people.
in_NB  to_read  latent_variables  topic_models  text_mining  mixture_models  statistics  machine_learning  cool_if_true  spectral_clustering 
27 days ago by cshalizi
Why DH has no future. | The Stone and the Shell
Let me just say that any area of scholarship where, in 20-fucking-12, the idea of moving to open-access, online distribution of writing counts as some kind of radicalism deserves everything that's going to happen to it.
digital_humanities  intellectual_movements  humanities  academia  text_mining 
5 weeks ago by cshalizi
[1204.2523] Concept Modeling with Superwords
"In information retrieval, a fundamental goal is to transform a document into concepts that are representative of its content. The term "representative" is in itself challenging to define, and various tasks require different granularities of concepts. In this paper, we aim to model concepts that are sparse over the vocabulary, and that flexibly adapt their content based on other relevant semantic information such as textual structure or associated image features. We explore a Bayesian nonparametric model based on nested beta processes that allows for inferring an unknown number of strictly sparse concepts. The resulting model provides an inherently different representation of concepts than a standard LDA (or HDP) based topic model, and allows for direct incorporation of semantic features. We demonstrate the utility of this representation on multilingual blog data and the Congressional Record."
in_NB  to_read  text_mining  topic_models  fox.emily  guestrin.carlos  kith_and_kin 
6 weeks ago by cshalizi
[0805.2490] Using statistical smoothing to date medieval manuscripts
"We discuss the use of multivariate kernel smoothing methods to date manuscripts dating from the 11th to the 15th centuries, in the English county of Essex. The dataset consists of some 3300 dated and 5000 undated manuscripts, and the former are used as a training sample for imputing dates for the latter. It is assumed that two manuscripts that are "close", in a sense that may be defined by a vector of measures of distance for documents, will have close dates. Using this approach, statistical ideas are used to assess "similarity", by smoothing among distance measures, and thus to estimate dates for the 5000 undated manuscripts by reference to the dated ones."

Can we get data?
to:NB  statistics  smoothing  kernel_estimators  medieval_european_history  text_mining  to_teach:undergrad-ADA 
12 weeks ago by cshalizi
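A minimal sketch of the idea in the manuscript-dating abstract above, not the paper's actual estimator. It assumes a single scalar distance between each pair of documents and a Gaussian kernel, whereas the abstract speaks of a vector of distance measures. The function names, toy distances, and bandwidth are made up for illustration. The date of an undated manuscript is estimated as a kernel-weighted average of the dates of the dated ones:

```python
import math


def gaussian_kernel(distance, bandwidth):
    """Weight that shrinks smoothly as the document distance grows."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def estimate_date(distances_to_dated, dated_years, bandwidth):
    """Kernel-weighted average of known dates (Nadaraya-Watson style).

    distances_to_dated[i] is the (hypothetical) distance between the
    undated manuscript and the i-th dated manuscript.
    """
    weights = [gaussian_kernel(d, bandwidth) for d in distances_to_dated]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; increase the bandwidth")
    return sum(w * y for w, y in zip(weights, dated_years)) / total


# Toy data: four dated manuscripts and their distances to one undated one.
dated_years = [1150, 1200, 1250, 1300]
distances = [0.9, 0.2, 0.3, 1.5]
estimate = estimate_date(distances, dated_years, bandwidth=0.5)
print(round(estimate))
```

The bandwidth controls how local the estimate is: a small value lets the nearest dated manuscripts dominate, a large one pulls the estimate toward the overall mean date. In practice it would have to be chosen from the dated sample, for example by cross-validation.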
Do humanists get their ideas from anything at all? | The Stone and the Shell
"The basic mistake that Fish is making is this: he pretends that humanists have no discovery process at all. For Fish, the interpretive act is always fully contained in an encounter with a single piece of evidence. How your “interpretive proposition” got framed in the first place is a matter of no consequence; some readers are just fortunate to have propositions that turn out to be correct. Fish is not alone in this idealized model of interpretation; it’s widespread among humanists."
literary_criticism  humanities  discovery_vs_justification  text_mining 
january 2012 by cshalizi
UI Press | Stephen Ramsay | Reading Machines: Toward an Algorithmic Criticism
"Besides familiar and now-commonplace tasks that computers do all the time, what else are they capable of? Stephen Ramsay's intriguing study of computational text analysis examines how computers can be used as "reading machines" to open up entirely new possibilities for literary critics. Computer-based text analysis has been employed for the past several decades as a way of searching, collating, and indexing texts. Despite this, the digital revolution has not penetrated the core activity of literary studies: interpretive analysis of written texts."
in_NB  books:noted  literary_criticism  text_mining  via:timothy-burke 
january 2012 by cshalizi
Graph-based Natural Language Processing and Information Retrieval - Mihalcea and Radev
"Graph theory and the fields of natural language processing and information retrieval are well-studied disciplines. Traditionally, these areas have been perceived as distinct, with different algorithms, different applications, and different potential end-users. However, recent research has shown that these disciplines are intimately connected, with a large variety of natural language processing and information retrieval applications finding efficient solutions within graph-theoretical frameworks. This book extensively covers the use of graph-based algorithms for natural language processing and information retrieval. It brings together topics as diverse as lexical semantics, text summarization, text mining, ontology construction, text classification, and information retrieval, which are connected by the common underlying theme of the use of graph-theoretical methods for text and information processing tasks. Readers will come away with a firm understanding of the major methods and applications in natural language processing and information retrieval that rely on graph-based representations and algorithms."
in_NB  books:noted  natural_language_processing  graph_theory  data_mining  text_mining  radev.dragomir 
december 2011 by cshalizi
The structure of science information (Harris, 2002)
"The organization of information within science can be investigated in a principled way through analysis of science language. The restricted use of language in science enables description of the informational structure of science and of particular subfields, with strong similarities to structures in mathematics and programming languages. This result rests on decades of research into the relation between form and content in language, based on an information-theoretic approach to the structure of information. Examples are provided from immunology and the social sciences. Practical applications include storage of science information in databases, indexing the literature, and identification and resolution of controversy."
to:NB  linguistics  text_mining  natural_language_processing  harris.zellig  information_retrieval 
december 2011 by cshalizi
[1112.1115] Social-Topical Affiliations: The Interplay between Structure and Popularity
"Information popularity and social relationships are intimately connected. However, measuring the extent to which they affect each other has remained an open question. Because we now have access to rich and large data sets from online social networks, we can begin to quantitatively understand the interplay between them. We examine the interface of two decisive structures forming the backbone of online social media: the graph structure of social networks - who is friends with whom - and the set structure of topical affiliations - who talks about what. In studying this interface, we identify key relationships whereby each of these structures can be understood in terms of the other. The context for our study is Twitter, where we look at the social network of both follower relationships and communication relationships, alongside the affiliations outlined by the hashtags used by people to label their communications. On Twitter, we demonstrate how the hashtags that a user adopts can be used to predict their social relationships, and also how the social relationships between the adopters of a hashtag can be used to predict the future popularity of that hashtag. Importantly, we find that both relationships are driven by highly computationally simple structural determinants. While our analysis focuses on Twitter, we view our analysis of social-topical affiliations as broadly applicable to a host of diverse affiliations, including the movies people watch, the brands people like, or the locations people frequent."
in_NB  network_data_analysis  social_media  text_mining  community_discovery 
december 2011 by cshalizi
[0809.2792] Predicting Abnormal Returns From News Using Text Classification
"We show how text from news articles can be used to predict intraday price movements of financial assets using support vector machines. Multiple kernel learning is used to combine equity returns with text as predictive features to increase classification performance and we develop an analytic center cutting plane method to solve the kernel learning problem efficiently. We observe that while the direction of returns is not predictable using either text or returns, their size is, with text features producing significantly better performance than historical returns alone."
to:NB  have_read  financial_speculation  text_mining 
december 2011 by cshalizi
[1110.4713] Kernel Topic Models
"Latent Dirichlet Allocation models discrete data as a mixture of discrete distributions, using Dirichlet beliefs over the mixture weights. We study a variation of this concept, in which the documents' mixture weight beliefs are replaced with squashed Gaussian distributions. This allows documents to be associated with elements of a Hilbert space, admitting kernel topic models (KTM), modelling temporal, spatial, hierarchical, social and other structure between documents. The main challenge is efficient approximate inference on the latent Gaussian. We present an approximate algorithm cast around a Laplace approximation in a transformed basis. The KTM can also be interpreted as a type of Gaussian process latent variable model, or as a topic model conditional on document features, uncovering links between earlier work in these areas."
to:NB  machine_learning  topic_models  hilbert_space  text_mining  kernel_methods 
october 2011 by cshalizi
The Uses of Analogies in 17th and 18th Century Science
"The object of this paper is to look at the extent and nature of the uses of analogy during the first century following the so-called scientific revolution. Using the research tool provided by JSTOR we systematically analyze the uses of “analog” and its cognates (analogies, analogous, etc.) in the Philosophical Transactions of the Royal Society of London for the period 1665–1780. In addition to giving the possibility of evaluating quantitatively the proportion of papers explicitly using analogies, this approach makes it possible to go beyond the maybe idiosyncratic cases of Descartes, Kepler, Galileo, and other much studied giants of the so-called Scientific Revolution..." --- But you could make all kinds of analogies without using the word "analogy"!
scientific_revolution  text_mining  history_of_science  analogy  to:NB 
april 2011 by cshalizi
Building a Better Word Cloud « Zero Intelligence Agents
I like the point that the axes in a plot should _mean_ something.  Not sure that these are the best choices however --- what if I want to just deal with one document, or for that matter with three?
visual_display_of_quantitative_information  text_mining 
february 2011 by cshalizi
Robin Valenza - People: Department of English, UW–Madison
"A distinctive feature of this project's methodology will be its use of large-scale full-text digital archives and tools for analyzing and classifying large amounts of "dirty" data (from over 100,000 books) alongside more traditional modes of close reading. I refer to this data as dirty because the scans of the pages have not been checked for accuracy. Researchers working with this sort of data need to use statistical methods to allow for the inevitable machine-generated error in such a process. Using such databases alongside more traditional modes of reading will give the project a broader range of texts to analyze and from which to draw conclusions. ... " Moretti's student?
literary_history  text_mining  via:jse 
july 2010 by cshalizi
[1002.4665] Syntactic Topic Models
Including parse trees from another model feels like cheating.
text_mining  topic_models  blei.david  to:NB 
march 2010 by cshalizi
Measuring Differentiability: Unmasking Pseudonymous Authors
Could you use this to distinguish genres or styles rather than authors? "In the authorship verification problem, we are given examples of the writing of a single author and are asked to determine if given long texts were or were not written by this author. We present a new learning-based method for adducing the "depth of difference" between two example sets and offer evidence that this method solves the authorship verification problem with very high accuracy. The underlying idea is to test the rate of degradation of the accuracy of learned models as the best features are iteratively dropped from the learning process."
text_mining  author-identification  machine_learning  to:NB 
february 2010 by cshalizi
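A rough sketch of the "iteratively drop the best features and watch accuracy degrade" idea from the abstract above. Everything specific here is an assumption for illustration: the abstract does not say which learner or feature ranking is used, so this swaps in a leave-one-out nearest-centroid classifier and ranks features by the gap between the two sets' mean values. Documents are hypothetical dictionaries of feature values, and each set needs at least two of them.

```python
def centroid(vectors, features):
    """Mean value of each active feature over a set of documents."""
    return {f: sum(v[f] for v in vectors) / len(vectors) for f in features}


def loo_accuracy(set_a, set_b, features):
    """Leave-one-out accuracy of a nearest-centroid classifier (a stand-in learner)."""
    correct = 0
    for own, other in ((set_a, set_b), (set_b, set_a)):
        c_other = centroid(other, features)
        for i, held_out in enumerate(own):
            c_own = centroid(own[:i] + own[i + 1:], features)
            d_own = sum((held_out[f] - c_own[f]) ** 2 for f in features)
            d_other = sum((held_out[f] - c_other[f]) ** 2 for f in features)
            if d_own < d_other:
                correct += 1
    return correct / (len(set_a) + len(set_b))


def degradation_curve(set_a, set_b, features, drop_per_round, rounds):
    """Record accuracy, drop the most discriminating features, repeat."""
    active = list(features)
    curve = []
    for _ in range(rounds):
        if not active:
            break
        curve.append(loo_accuracy(set_a, set_b, active))
        c_a, c_b = centroid(set_a, active), centroid(set_b, active)
        ranked = sorted(active, key=lambda f: abs(c_a[f] - c_b[f]), reverse=True)
        active = ranked[drop_per_round:]
    return curve


# Toy data: feature "x" separates the two sets, feature "y" does not.
set_a = [{"x": 1.0, "y": 0.5}, {"x": 0.9, "y": 0.4}, {"x": 1.1, "y": 0.6}]
set_b = [{"x": 0.0, "y": 0.5}, {"x": 0.1, "y": 0.4}, {"x": -0.1, "y": 0.6}]
curve = degradation_curve(set_a, set_b, ["x", "y"], drop_per_round=1, rounds=2)
print(curve)
```

The shape of the curve is the quantity of interest. In the toy data the two sets differ on a single feature, so accuracy collapses as soon as that feature is dropped. Sets that differ on many features would be expected to degrade more slowly.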
Language Log » Embuggerance & Feisty
"Perhaps we should continue the tradition of metonymic names for new linguistic natural kinds (e.g. eggcorn and crash blossom), and use embuggerance for cases where the automatic tagging of entities and relations goes astray." (I think however that actually using this example in my classroom might strain the bounds of good taste, even more than the image search results for "kitten".)
natural_language_processing  funny:geeky  funny:malicious  text_mining  to_teach:data-mining 
october 2009 by cshalizi
ReadMe: Software for Automated Text Analysis
"The ReadMe software package for R takes as input a set of text documents (such as speeches, blog posts, newspaper articles, judicial opinions, movie reviews, etc.), a categorization scheme chosen by the user (e.g., ordered positive to negative sentiment ratings, unordered policy topics, or any other mutually exclusive and exhaustive set of categories), and a small subset of text documents hand classified into the given categories. If used properly, ReadMe will report, normally within sampling error of the truth, the proportion of documents within each of the given categories among those not hand coded. ReadMe computes quantities of interest to the scientific community based on the distribution within categories but does so by skipping the more error prone intermediate step of classifying individual documents. Other procedures are also included to make processing text easy."
to_teach:data-mining  text_mining  content_analysis  R  software  linguistics  statistics  via:chl  king.gary 
june 2009 by cshalizi
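One way to estimate category proportions without classifying any individual document, sketched here for just two categories. This is an illustration of the general idea, not a description of ReadMe's internals or its R interface. It assumes that the rate at which each word feature appears within a category is the same in the hand-coded and the uncoded documents. All names and numbers are hypothetical. If feature j appears at rate a_j in category 1 and b_j in category 0, its rate in the uncoded set is a_j * share + b_j * (1 - share), and the share can be recovered by least squares:

```python
def estimate_share(rates_uncoded, rates_cat1, rates_cat0):
    """Least-squares estimate of the category-1 share among uncoded documents.

    Solves rates_uncoded[j] = rates_cat1[j] * share + rates_cat0[j] * (1 - share)
    across all features j, then clips the answer to [0, 1].
    """
    numerator = 0.0
    denominator = 0.0
    for p, a, b in zip(rates_uncoded, rates_cat1, rates_cat0):
        numerator += (p - b) * (a - b)
        denominator += (a - b) ** 2
    if denominator == 0.0:
        raise ValueError("features do not distinguish the two categories")
    return min(1.0, max(0.0, numerator / denominator))


# Feature rates measured on the hand-coded documents (made-up values).
rates_cat1 = [0.8, 0.3]
rates_cat0 = [0.2, 0.5]

# Simulate an uncoded set in which a quarter of the documents are category 1.
true_share = 0.25
rates_uncoded = [a * true_share + b * (1 - true_share)
                 for a, b in zip(rates_cat1, rates_cat0)]

estimate = estimate_share(rates_uncoded, rates_cat1, rates_cat0)
print(estimate)
```

Only aggregate feature rates enter the calculation, so errors from labeling single documents never accumulate. The price is the assumption about within-category feature rates, which the hand-coded sample has to satisfy.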
Allen Riddell : Quantitative Stylistics Resources
I actually found this looking for more stuff by Moretti, but some of the references look like they might make teaching fodder.
text_mining  to_teach:data-mining  track_down_references  literary_criticism  stylistics 
march 2009 by cshalizi
Beyond Proportional Analogy « Apperceptual
Latent semantic indexing applied to learning analogies, and systems of analogies. Conceptually simple, psychologically implausible, but extremely impressive.
latent_semantic_analysis  analogy  AI  turney.peter  track_down_references  text_mining  to_teach:data-mining 
december 2008 by cshalizi
An Inquiry into the Nature and Causes of the Wealth of Internet Miscreants
Some measurements of an on-line marketplace (an IRC channel) relating to electronic crimes, with SVMs to label "semantic" features.
content_analysis  fraud  text_mining  data_mining  spam  economics  via:schneier  carnegie_mellon 
october 2007 by cshalizi