cshalizi + text_mining 38
Quantitative patterns of stylistic influence in the evolution of literature
13 days ago by cshalizi
"Literature is a form of expression whose temporal structure, both in content and style, provides a historical record of the evolution of culture. In this work we take on a quantitative analysis of literary style and conduct the first large-scale temporal stylometric study of literature by using the vast holdings in the Project Gutenberg Digital Library corpus. We find temporal stylistic localization among authors through the analysis of the similarity structure in feature vectors derived from content-free word usage, nonhomogeneous decay rates of stylistic influence, and an accelerating rate of decay of influence among modern authors. Within a given time period we also find evidence for stylistic coherence with a given literary topic, such that writers in different fields adopt different literary styles. This study gives quantitative support to the notion of a literary “style of a time” with a strong trend toward increasingly contemporaneous stylistic influence."
It'll be interesting to see how they handle the bias induced by selective retention.
to:NB
to_read
literary_history
text_mining
kith_and_kin
rockmore.dan
krakuer.david
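A minimal sketch of what "feature vectors derived from content-free word usage" and their "similarity structure" could look like in practice. The ten-word list, the tokenizer, and the choice of cosine similarity are all assumptions made for illustration; the abstract does not say which words or which similarity measure the authors used.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny list of "content-free" function words; the paper's
# actual feature set is not given in the abstract.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "but", "with"]

def style_vector(text):
    """Relative frequency of each function word among all tokens in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Comparing `style_vector` outputs for every pair of books, and then asking how similarity falls off with the gap between publication dates, would be one way to look for the temporal localization the abstract describes. It would do nothing about selective retention.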
[1204.6703] Two SVDs Suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation
27 days ago by cshalizi
"Topic models can be seen as a generalization of the clustering problem, in that they posit that observations are generated due to multiple latent factors (e.g. the words in each document are generated as a mixture of several active topics, as opposed to just one). This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic probability vectors (the distributions over words for each topic), when only the words are observed and the corresponding topics are hidden.
"We provide a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of mixture models, including the popular latent Dirichlet allocation (LDA) model. For LDA, the procedure correctly recovers both the topic probability vectors and the prior over the topics, using only trigram statistics (i.e. third order moments, which may be estimated with documents containing just three words). The method, termed Excess Correlation Analysis (ECA), is based on a spectral decomposition of low order moments (third and fourth order) via two singular value decompositions (SVDs). Moreover, the algorithm is scalable since the SVD operations are carried out on k by k matrices, where k is the number of latent factors (e.g. the number of topics), rather than in the d-dimensional observed space (typically d >> k)."
That's a really remarkable claim, and I'd tag it to_be_shot_after_a_fair_trial if it weren't being made by genuinely serious people.
in_NB
to_read
latent_variables
topic_models
text_mining
mixture_models
statistics
machine_learning
cool_if_true
spectral_clustering
"We provide a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of mixture models, including the popular latent Dirichlet allocation (LDA) model. For LDA, the procedure correctly recovers both the topic probability vectors and the prior over the topics, using only trigram statistics (i.e. third order moments, which may be estimated with documents containing just three words). The method, termed Excess Correlation Analysis (ECA), is based on a spectral decomposition of low order moments (third and fourth order) via two singular value decompositions (SVDs). Moreover, the algorithm is scalable since the SVD operations are carried out on k by k matrices, where k is the number of latent factors (e.g. the number of topics), rather than in the d-dimensional observed space (typically d >> k)."
That's a really remarkable claim, and I'd tag it to_be_shot_after_a_fair_trial if it weren't being made by genuinely serious people.
27 days ago by cshalizi
Why DH has no future. | The Stone and the Shell
5 weeks ago by cshalizi
Let me just say that any area of scholarship where, in 20-fucking-12, the idea of moving to open-access, online distribution of writing counts as some kind of radicalism deserves everything that's going to happen to it.
digital_humanities
intellectual_movements
humanities
academia
text_mining
[1204.2523] Concept Modeling with Superwords
6 weeks ago by cshalizi
"In information retrieval, a fundamental goal is to transform a document into concepts that are representative of its content. The term "representative" is in itself challenging to define, and various tasks require different granularities of concepts. In this paper, we aim to model concepts that are sparse over the vocabulary, and that flexibly adapt their content based on other relevant semantic information such as textual structure or associated image features. We explore a Bayesian nonparametric model based on nested beta processes that allows for inferring an unknown number of strictly sparse concepts. The resulting model provides an inherently different representation of concepts than a standard LDA (or HDP) based topic model, and allows for direct incorporation of semantic features. We demonstrate the utility of this representation on multilingual blog data and the Congressional Record."
in_NB
to_read
text_mining
topic_models
fox.emily
guestrin.carlos
kith_and_kin
[0805.2490] Using statistical smoothing to date medieval manuscripts
12 weeks ago by cshalizi
"We discuss the use of multivariate kernel smoothing methods to date manuscripts dating from the 11th to the 15th centuries, in the English county of Essex. The dataset consists of some 3300 dated and 5000 undated manuscripts, and the former are used as a training sample for imputing dates for the latter. It is assumed that two manuscripts that are ``close'', in a sense that may be defined by a vector of measures of distance for documents, will have close dates. Using this approach, statistical ideas are used to assess ``similarity'', by smoothing among distance measures, and thus to estimate dates for the 5000 undated manuscripts by reference to the dated ones."
Can we get data?
to:NB
statistics
smoothing
kernel_estimators
medieval_european_history
text_mining
to_teach:undergrad-ADA
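The core idea in the abstract (documents that are close should have close dates) can be sketched as a kernel-weighted average of the known dates. This is a simplified, one-dimensional Nadaraya-Watson estimator, not the authors' procedure: the abstract describes a vector of distance measures, whereas this sketch assumes a single scalar distance, a Gaussian kernel, and a hand-picked bandwidth.

```python
import math

def gaussian_kernel(u):
    """Unnormalized Gaussian kernel; the constant cancels in the ratio below."""
    return math.exp(-0.5 * u * u)

def estimate_date(distances, dates, bandwidth):
    """Nadaraya-Watson estimate: kernel-weighted mean of the known dates.

    distances[i] is the distance from the undated document to dated
    document i, and dates[i] is that document's known date.
    """
    weights = [gaussian_kernel(d / bandwidth) for d in distances]
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights underflowed; try a larger bandwidth")
    return sum(w * y for w, y in zip(weights, dates)) / total
```

With the 3300 dated manuscripts as the training sample, the bandwidth could be chosen by leave-one-out cross-validation on those same documents, which is part of what would make this usable as a teaching example if the data can be had.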
Do humanists get their ideas from anything at all? | The Stone and the Shell
january 2012 by cshalizi
"The basic mistake that Fish is making is this: he pretends that humanists have no discovery process at all. For Fish, the interpretive act is always fully contained in an encounter with a single piece of evidence. How your “interpretive proposition” got framed in the first place is a matter of no consequence; some readers are just fortunate to have propositions that turn out to be correct. Fish is not alone in this idealized model of interpretation; it’s widespread among humanists."
literary_criticism
humanities
discovery_vs_justification
text_mining
UI Press | Stephen Ramsay | Reading Machines: Toward an Algorithmic Criticism
january 2012 by cshalizi
"Besides familiar and now-commonplace tasks that computers do all the time, what else are they capable of? Stephen Ramsay's intriguing study of computational text analysis examines how computers can be used as "reading machines" to open up entirely new possibilities for literary critics. Computer-based text analysis has been employed for the past several decades as a way of searching, collating, and indexing texts. Despite this, the digital revolution has not penetrated the core activity of literary studies: interpretive analysis of written texts."
in_NB
books:noted
literary_criticism
text_mining
via:timothy-burke
Graph-based Natural Language Processing and Information Retrieval - Mihalcea and Radev
december 2011 by cshalizi
"Graph theory and the fields of natural language processing and information retrieval are well-studied disciplines. Traditionally, these areas have been perceived as distinct, with different algorithms, different applications, and different potential end-users. However, recent research has shown that these disciplines are intimately connected, with a large variety of natural language processing and information retrieval applications finding efficient solutions within graph-theoretical frameworks. This book extensively covers the use of graph-based algorithms for natural language processing and information retrieval. It brings together topics as diverse as lexical semantics, text summarization, text mining, ontology construction, text classification, and information retrieval, which are connected by the common underlying theme of the use of graph-theoretical methods for text and information processing tasks. Readers will come away with a firm understanding of the major methods and applications in natural language processing and information retrieval that rely on graph-based representations and algorithms."
in_NB
books:noted
natural_language_processing
graph_theory
data_mining
text_mining
radev.dragomir
The structure of science information (Harris, 2002)
december 2011 by cshalizi
"The organization of information within science can be investigated in a principled way through analysis of science language. The restricted use of language in science enables description of the informational structure of science and of particular subfields, with strong similarities to structures in mathematics and programming languages. This result rests on decades of research into the relation between form and content in language, based on an information-theoretic approach to the structure of information. Examples are provided from immunology and the social sciences. Practical applications include storage of science information in databases, indexing the literature, and identification and resolution of controversy."
to:NB
linguistics
text_mining
natural_language_processing
harris.zellig
information_retrieval
[1112.1115] Social-Topical Affiliations: The Interplay between Structure and Popularity
december 2011 by cshalizi
"Information popularity and social relationships are intimately connected. However, measuring the extent to which they affect each other has remained an open question. Because we now have access to rich and large data sets from online social networks, we can begin to quantitatively understand the interplay between them. We examine the interface of two decisive structures forming the backbone of online social media: the graph structure of social networks - who is friends with whom - and the set structure of topical affiliations - who talks about what. In studying this interface, we identify key relationships whereby each of these structures can be understood in terms of the other. The context for our study is Twitter, where we look at the social network of both follower relationships and communication relationships, alongside the affiliations outlined by the hashtags used by people to label their communications. On Twitter, we demonstrate how the hashtags that a user adopts can be used to predict their social relationships, and also how the social relationships between the adopters of a hashtag can be used to predict the future popularity of that hashtag. Importantly, we find that both relationships are driven by highly computationally simple structural determinants. While our analysis focuses on Twitter, we view our analysis of social-topical affiliations as broadly applicable to a host of diverse affiliations, including the movies people watch, the brands people like, or the locations people frequent."
in_NB
network_data_analysis
social_media
text_mining
community_discovery
[0809.2792] Predicting Abnormal Returns From News Using Text Classification
december 2011 by cshalizi
"We show how text from news articles can be used to predict intraday price movements of financial assets using support vector machines. Multiple kernel learning is used to combine equity returns with text as predictive features to increase classification performance and we develop an analytic center cutting plane method to solve the kernel learning problem efficiently. We observe that while the direction of returns is not predictable using either text or returns, their size is, with text features producing significantly better performance than historical returns alone."
to:NB
have_read
financial_speculation
text_mining
[1110.4713] Kernel Topic Models
october 2011 by cshalizi
"Latent Dirichlet Allocation models discrete data as a mixture of discrete distributions, using Dirichlet beliefs over the mixture weights. We study a variation of this concept, in which the documents' mixture weight beliefs are replaced with squashed Gaussian distributions. This allows documents to be associated with elements of a Hilbert space, admitting kernel topic models (KTM), modelling temporal, spatial, hierarchical, social and other structure between documents. The main challenge is efficient approximate inference on the latent Gaussian. We present an approximate algorithm cast around a Laplace approximation in a transformed basis. The KTM can also be interpreted as a type of Gaussian process latent variable model, or as a topic model conditional on document features, uncovering links between earlier work in these areas."
to:NB
machine_learning
topic_models
hilbert_space
text_mining
kernel_methods
The Uses of Analogies in 17th and 18th Century Science
april 2011 by cshalizi
"The object of this paper is to look at the extent and nature of the uses of analogy during the first century following the so-called scientific revolution. Using the research tool provided by JSTOR we systematically analyze the uses of “analog” and its cognates (analogies, analogous, etc.) in the Philosophical Transactions of the Royal Society of London for the period 1665–1780. In addition to giving the possibility of evaluating quantitatively the proportion of papers explicitly using analogies, this approach makes it possible to go beyond the maybe idiosyncratic cases of Descartes, Kepler, Galileo, and other much studied giants of the so-called Scientific Revolution..." --- But you could make all kinds of analogies without using the word "analogy"!
scientific_revolution
text_mining
history_of_science
analogy
to:NB
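A rough sketch of the kind of keyword count the abstract describes: the share of papers containing "analog" or one of its cognates. The regular expression and the function name are illustrative assumptions, not the authors' JSTOR query.

```python
import re

# Matches "analogy", "analogies", "analogous", "analogical", "analogue", ...
ANALOG = re.compile(r"\banalog\w*", re.IGNORECASE)

def share_using_analogy(papers):
    """Fraction of papers (given as strings) with at least one 'analog*' word."""
    if not papers:
        return 0.0
    hits = sum(1 for text in papers if ANALOG.search(text))
    return hits / len(papers)
```

A sentence like "the heart is like a pump" counts as a miss here, which is exactly the objection above: explicit uses of the word give at best a lower bound on how often analogies were actually made.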
Adventures in Data Land, Graphical Models for the Internet
march 2011 by cshalizi
Look at this later and re-consider the to_teach tags.
clustering
graphical_models
tutorials
expectation-maximization
internet
text_mining
to_teach:data-mining
to_teach:undergrad-ADA
smola.alex
ahmed.amr
heard_the_talk
Building a Better Word Cloud « Zero Intelligence Agents
february 2011 by cshalizi
I like the point that the axes in a plot should _mean_ something. Not sure that these are the best choices however --- what if I want to just deal with one document, or for that matter with three?
visual_display_of_quantitative_information
text_mining
Robin Valenza - People: Department of English, UW–Madison
july 2010 by cshalizi
"A distinctive feature of this project's methodology will be its use of large-scale full-text digital archives and tools for analyzing and classifying large amounts of "dirty" data (from over 100,000 books) alongside more traditional modes of close reading. I refer to this data as dirty because the scans of the pages have not been checked for accuracy. Researchers working with this sort of data need to use statistical methods to allow for the inevitable machine-generated error in such a process. Using such databases alongside more traditional modes of reading will give the project a broader range of texts to analyze and from which to draw conclusions. ... " Moretti's student?
literary_history
text_mining
via:jse
[1002.4665] Syntactic Topic Models
march 2010 by cshalizi
Including parse trees from another model feels like cheating.
text_mining
topic_models
blei.david
to:NB
[1003.0783] Supervised Topic Models
march 2010 by cshalizi
What a coincidence, some of the kids in 490 have labeled documents...
latent_dirichlet_allocation
text_mining
classifiers
machine_learning
statistics
to_teach:data-mining
to_teach:undergrad-research
topic_models
blei.david
Measuring Differentiability: Unmasking Pseudonymous Authors
february 2010 by cshalizi
Could you use this to distinguish genres or styles rather than authors? "In the authorship verification problem, we are given examples of the writing of a single author and are asked to determine if given long texts were or were not written by this author. We present a new learning-based method for adducing the "depth of difference" between two example sets and offer evidence that this method solves the authorship verification problem with very high accuracy. The underlying idea is to test the rate of degradation of the accuracy of learned models as the best features are iteratively dropped from the learning process."
text_mining
author-identification
machine_learning
to:NB
Language Log » Embuggerance & Feisty
october 2009 by cshalizi
"Perhaps we should continue the tradition of metonymic names for new linguistic natural kinds (e.g. eggcorn and crash blossom), and use embuggerance for cases where the automatic tagging of entities and relations goes astray." (I think however that actually using this example in my classroom might strain the bounds of good taste, even more than the image search results for "kitten".)
natural_language_processing
funny:geeky
funny:malicious
text_mining
to_teach:data-mining
UC Berkeley Enron Email Analysis
august 2009 by cshalizi
With hand-labeled categories.
enron
email
text_mining
information_retrieval
fraud
to_teach:data-mining
LDC Catalog: New York Times Annotated Corpus
august 2009 by cshalizi
Sounds like it would be perfect for 350. Now how the **** do I get access?
information_retrieval
text_mining
newspapers
data_sets
to_teach:data-mining
via:myl
ReadMe: Software for Automated Text Analysis
june 2009 by cshalizi
"The ReadMe software package for R takes as input a set of text documents (such as speeches, blog posts, newspaper articles, judicial opinions, movie reviews, etc.), a categorization scheme chosen by the user (e.g., ordered positive to negative sentiment ratings, unordered policy topics, or any other mutually exclusive and exhaustive set of categories), and a small subset of text documents hand classified into the given categories. If used properly, ReadMe will report, normally within sampling error of the truth, the proportion of documents within each of the given categories among those not hand coded. ReadMe computes quantities of interest to the scientific community based on the distribution within categories but does so by skipping the more error prone intermediate step of classifing individual documents. Other procedures are also included to make processing text easy."
to_teach:data-mining
text_mining
content_analysis
R
software
linguistics
statistics
via:chl
king.gary
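One way to estimate category proportions without classifying individual documents is to treat the word frequencies in the uncoded set as a mixture of the frequencies within each hand-coded category, and solve for the mixing weight. The sketch below does this for two categories by least squares. It is a toy illustration under stated assumptions, not ReadMe's interface or its actual algorithm: the function name, the set-of-words document representation, and the use of single-word presence rates as features are all hypothetical.

```python
def estimate_proportion(labeled, unlabeled, vocabulary):
    """Estimate the share of category 1 among `unlabeled` documents.

    labeled: list of (set_of_words, category) pairs, category 0 or 1,
             with at least one document in each category.
    unlabeled: non-empty list of sets of words.
    vocabulary: words whose presence rates are used as features.
    """
    def rate(docs, word):
        # Fraction of documents in `docs` that contain `word`.
        return sum(word in d for d in docs) / len(docs)

    docs0 = [d for d, c in labeled if c == 0]
    docs1 = [d for d, c in labeled if c == 1]
    num = den = 0.0
    for w in vocabulary:
        r0, r1 = rate(docs0, w), rate(docs1, w)
        # Model: rate(unlabeled, w) = p * r1 + (1 - p) * r0; solve for p
        # by least squares across all words in the vocabulary.
        num += (rate(unlabeled, w) - r0) * (r1 - r0)
        den += (r1 - r0) ** 2
    if den == 0:
        raise ValueError("features do not distinguish the categories")
    return min(1.0, max(0.0, num / den))  # clip to a valid proportion
```

The estimate depends on word rates within each category being the same in the hand-coded and uncoded sets, which is presumably a large part of what "if used properly" is covering.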
Allen Riddell : Quantitative Stylistics Resources
march 2009 by cshalizi
I actually found this looking for more stuff by Moretti, but some of the references look like they might make teaching fodder.
text_mining
to_teach:data-mining
track_down_references
literary_criticism
stylistics
Beyond Proportional Analogy « Apperceptual
december 2008 by cshalizi
Latent semantic indexing applied to learning analogies, and systems of analogies. Conceptually simple, psychologically implausible, but extremely impressive.
latent_semantic_analysis
analogy
AI
turney.peter
track_down_references
text_mining
to_teach:data-mining
An Inquiry into the Nature and Causes of the Wealth of Internet Miscreants
october 2007 by cshalizi
Some measurements of an on-line marketplace (an IRC channel) for electronic crimes, with SVMs used to label "semantic" features.
content_analysis
fraud
text_mining
data_mining
spam
economics
via:schneier
carnegie_mellon