Vaguery + natural-language-processing   16

A Picture of Language - NYTimes.com
"The book was enormously popular, and Mr. Reed and Mr. Brainerd’s diagramming swept through American schools like a refreshing breeze. By the latter half of the 19th century, chalkboards had become increasingly common in classrooms; for students, the impact of watching a sentence take shape on that large surface as a comprehensible, often elegant, and sometimes downright ingenious drawing must have been significant. It’s hard to believe anyone but the most dedicated pedant could have actually enjoyed parsing, but plenty of students — including me — loved diagramming.

A century and a half later, diagramming sentences is even more out of date than writing lessons on a piece of slate. When the book I wrote about it was published in 2006, a couple of hundred people sent me e-mails. One writer accused me of succumbing to Stockholm syndrome because I wrote so benignly about the nun who brainwashed me into thinking diagramming was fun. Another asked me for a date. Two objected to my political attitudes, as they deduced them between the lines. A dozen or so either faulted some of the diagrams or challenged me with a particularly tricky sentence."
grammar  pedagogy  styles-of-thinking  sentence-diagrams  mathematical-recreations  natural-language-processing  it-was-fun 
8 weeks ago by Vaguery
[1112.6045] Comparing intermittency and network measurements of words and their dependency on authorship
Many features from texts and languages can now be inferred from statistical analyses using concepts from complex networks and dynamical systems. In this paper we quantify how topological properties of word co-occurrence networks and intermittency (or burstiness) in word distribution depend on the style of authors. Our database contains 40 books from 8 authors who lived in the 19th and 20th centuries, for which the following network measurements were obtained: clustering coefficient, average shortest path lengths, and betweenness. We found that the two factors with stronger dependency on the authors were the skewness in the distribution of word intermittency and the average shortest paths. Other factors such as the betweenness and the Zipf's law exponent show only weak dependency on authorship. Also assessed was the contribution from each measurement to authorship recognition using three machine learning methods. The best performance was a ca. 65 % accuracy upon combining complex network and intermittency features with the nearest neighbor algorithm. From a detailed analysis of the interdependence of the various metrics it is concluded that the methods used here are complementary for providing short- and long-scale perspectives of texts, which are useful for applications such as identification of topical words and information retrieval.
natural-language-processing  document-clustering  clustering  feature-selection  algorithms  nudge-targets 
january 2012 by Vaguery
[1110.1391] A Comparison of Different Machine Transliteration Models
"Machine transliteration is a method for automatically converting words in one language into phonetically equivalent ones in another language. Machine transliteration plays an important role in natural language applications such as information retrieval and machine translation, especially for handling proper nouns and technical terms. Four machine transliteration models -- grapheme-based transliteration model, phoneme-based transliteration model, hybrid transliteration model, and correspondence-based transliteration model -- have been proposed by several researchers. To date, however, there has been little research on a framework in which multiple transliteration models can operate simultaneously. Furthermore, there has been no comparison of the four models within the same framework and using the same data. We addressed these problems by 1) modeling the four models within the same framework, 2) comparing them under the same conditions, and 3) developing a way to improve machine transliteration through this comparison. Our comparison showed that the hybrid and correspondence-based models were the most effective and that the four models can be used in a complementary manner to improve machine transliteration performance."
natural-language-processing  machine-learning  review  nudge-targets 
october 2011 by Vaguery
[1106.5264] Acquiring Correct Knowledge for Natural Language Generation
"Natural language generation (NLG) systems are computer software systems that produce texts in English and other human languages, often from non-linguistic input data. NLG systems, like most AI systems, need substantial amounts of knowledge. However, our experience in two NLG projects suggests that it is difficult to acquire correct knowledge for NLG systems; indeed, every knowledge acquisition (KA) technique we tried had significant problems. In general terms, these problems were due to the complexity, novelty, and poorly understood nature of the tasks our systems attempted, and were worsened by the fact that people write so differently. This meant in particular that corpus-based KA approaches suffered because it was impossible to assemble a sizable corpus of high-quality consistent manually written texts in our domains; and structured expert-oriented KA techniques suffered because experts disagreed and because we could not get enough information about special and unusual cases to build robust systems. We believe that such problems are likely to affect many other NLG systems as well. In the long term, we hope that new KA techniques may emerge to help NLG system builders. In the shorter term, we believe that understanding how individual KA techniques can fail, and using a mixture of different KA techniques with different strengths and weaknesses, can help developers acquire NLG knowledge that is mostly correct."
natural-language-processing  artificial-intelligence  interesting-problems  high-hanging-fruit  machine-learning  nudge-targets 
october 2011 by Vaguery
[1107.1322] Text Classification: A Sequential Reading Approach
"We propose to model the text classification process as a sequential decision process. In this process, an agent learns to classify documents into topics while reading the document sentences sequentially and learns to stop as soon as enough information was read for deciding. The proposed algorithm is based on a modelisation of Text Classification as a Markov Decision Process and learns by using Reinforcement Learning. Experiments on four different classical mono-label corpora show that the proposed approach performs comparably to classical SVM approaches for large training sets, and better for small training sets. In addition, the model automatically adapts its reading process to the quantity of training information provided."
text-classification  natural-language-processing  machine-learning  nudge-targets 
august 2011 by Vaguery
ashleyw/phrasie - GitHub
Determines important terms within a given piece of content. It uses linguistic tools such as Parts-Of-Speech (POS) and some simple statistical analysis to determine the terms and their strength.
Ruby  library  tagging  natural-language-processing  NLP  statistics  text-mining 
may 2011 by Vaguery
[1007.3254] Distinguishing Fact from Fiction: Pattern Recognition in Texts Using Complex Networks
"We establish concrete mathematical criteria to distinguish between different kinds of written storytelling, fictional and non-fictional. Specifically, we constructed a semantic network from both novels and news stories, with $N$ independent words as vertices or nodes, and edges or links allotted to words occurring within $m$ places of a given vertex; we call $m$ the word distance. We then used measures from complex network theory to distinguish between news and fiction, studying the minimal text length needed as well as the optimized word distance $m$. The literature samples were found to be most effectively represented by their corresponding power laws over degree distribution $P(k)$ and clustering coefficient $C(k)$; we also studied the mean geodesic distance, and found all our texts were small-world networks.…"
nudge-targets  computational-linguistics  linguistics  classification  machine-learning  statistics  natural-language-processing 
august 2010 by Vaguery
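A minimal sketch of the construction described in the [1007.3254] abstract above, assuming whitespace-tokenized text: distinct words become vertices, and two words are linked when they occur within $m$ places of each other. The function names `cooccurrence_graph` and `clustering` are hypothetical, and the paper's actual preprocessing and fitting of $P(k)$ and $C(k)$ are not reproduced here; this only illustrates how degree and the local clustering coefficient could be read off such a network.

```python
from itertools import combinations


def cooccurrence_graph(tokens, m):
    """Build an undirected graph as {word: set of neighbouring words}.

    Two distinct words are linked if they appear within m positions
    of each other anywhere in the token sequence (m = word distance).
    """
    adj = {}
    for i, word in enumerate(tokens):
        adj.setdefault(word, set())  # every word is a vertex, even if isolated
        for other in tokens[i + 1 : i + 1 + m]:
            if other != word:  # no self-loops
                adj[word].add(other)
                adj.setdefault(other, set()).add(word)
    return adj


def clustering(adj, v):
    """Local clustering coefficient: fraction of v's neighbour pairs that are linked."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


if __name__ == "__main__":
    tokens = "the cat sat on the mat".split()
    graph = cooccurrence_graph(tokens, m=2)
    for word in sorted(graph):
        print(word, len(graph[word]), clustering(graph, word))
```

Grouping the per-word values by degree would give an empirical $C(k)$, and a histogram of the degrees an empirical $P(k)$, for comparison across texts.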
CASS
"In the social sciences, it is useful to understand the relative similarities of concepts that are embedded in a particular text (from a particular group or a particular person). For example, in trying to estimate conservative bias in FoxNews, one might estimate its tendency to associate conservative concepts (conservative, republican) and good concepts (good, positive, etc.), compared to conservative and bad concepts. The output would indicate conservative favoritism. This comparison could be further refined by taking into account important "baseline" information about the valences associated with liberal, namely liberal and good in comparison to liberal and bad.…"
text-mining  natural-language-processing  data-mining  machine-learning  Ruby  library 
june 2010 by Vaguery
USPTO Bulk Downloads
"Google and the USPTO have entered into an agreement to make the following USPTO products available to the public at no charge:

Patents (grants, applications, assignments, classification information, and maintenance fee events)
Trademarks (grants, applications, assignments, and TTAB proceedings)

All data originated from the USPTO. Google is hosting this data unchanged, except for repackaging into zip files."
patents  intellectual-property  open-access  raw-data-now  government2.0  social-networks  law  datasets  nudge-targets  natural-language-processing  manfred-macx-approves 
june 2010 by Vaguery
[1005.5516] On the Fly Query Entity Decomposition Using Snippets
"One of the most important issues in Information Retrieval is inferring the intents underlying users' queries. Thus, any tool to enrich or to better contextualize queries can prove extremely valuable. Entity extraction, provided it is done fast, can be one of such tools. Such techniques usually rely on a prior training phase involving large datasets. That training is costly, especially in environments which are increasingly moving towards real time scenarios where latency to retrieve fresh information should be minimal. In this paper an 'on-the-fly' query decomposition method is proposed. It uses snippets which are mined by means of a naïve statistical algorithm. An initial evaluation of such a method is provided, in addition to a discussion on its applicability to different scenarios."
search-engines  natural-language-processing  algorithms  nudge-targets  text-mining 
june 2010 by Vaguery
Thingology (LibraryThing's ideas blog): Google goes after the Library of Congress for "mature content"
"I have accordingly been consulting with Casey on how to remove all the butt-shots from the Yale University MARC records."
Google  censorship  LibraryThing  filtering  natural-language-processing  FAIL 
august 2008 by Vaguery
