cshalizi + information_retrieval   39

Non-Parametric Modeling of Partially Ranked Data
"Statistical models on full and partial rankings of n items are often of limited practical use for large n due to computational consideration. We explore the use of non-parametric models for partially ranked data and derive computationally efficient procedures for their use for large n. The derivations are largely possible through combinatorial and algebraic manipulations based on the lattice of partial rankings. A bias-variance analysis and an experimental study demonstrate the applicability of the proposed method."
to:NB  statistics  machine_learning  categorical_data  ordinal_data  information_retrieval  nonparametrics  lebanon.guy 
february 2012 by cshalizi
The structure of science information (Harris, 2002)
"The organization of information within science can be investigated in a principled way through analysis of science language. The restricted use of language in science enables description of the informational structure of science and of particular subfields, with strong similarities to structures in mathematics and programming languages. This result rests on decades of research into the relation between form and content in language, based on an information-theoretic approach to the structure of information. Examples are provided from immunology and the social sciences. Practical applications include storage of science information in databases, indexing the literature, and identification and resolution of controversy."
to:NB  linguistics  text_mining  natural_language_processing  harris.zellig  information_retrieval 
december 2011 by cshalizi
The Fans Are All Right (Pinboard Blog)
"I learned a lot about fandom couple of years ago in conversations with my friend Britta, who was working at the time as community manager for Delicious. She taught me that fans were among the heaviest users of the bookmarking site, and had constructed an edifice of incredibly elaborate tagging conventions, plugins, and scripts to organize their output along a bewildering number of dimensions. If you wanted to read a 3000 word fic where Picard forces Gandalf into sexual bondage, and it seems unconsensual but secretly both want it, and it's R-explicit but not NC-17 explicit, all you had to do was search along the appropriate combination of tags (and if you couldn't find it, someone would probably write it for you). By 2008 a whole suite of theoretical ideas about folksonomy, crowdsourcing, faceted information retrieval, collaborative editing and emergent ontology had been implemented by a bunch of friendly people so that they could read about Kirk drilling Spock." --- See also the very last link.
fandom  social_life_of_the_mind  social_media  information_retrieval  tagging  pinboard  delicious.com  via:arsyed  to_teach:data-mining  ok_maybe_not_really_to_teach 
october 2011 by cshalizi
Draw - Google Correlate
So cool: draw a curve free-hand, get the keywords whose time series correlate best with it.  I can't go below a correlation of 0.70.
google  information_retrieval  spurious_correlations  to_teach:undergrad-ADA  to_teach:data-mining  to:blog  via:vqv  rademacher_complexity 
october 2011 by cshalizi
Bayesian Checking for Topic Models
"Real document collections do not fit the independence assumptions asserted by most statistical topic models, but how badly do they violate them? We present a Bayesian method for measuring how well a topic model fits a corpus. Our approach is based on posterior predictive checking, a method for diagnosing Bayesian models in user-defined ways. Our method can identify where a topic model fits the data, where it falls short, and in which directions it might be improved."
topic_models  model-checking  blei.david  in_NB  via:ariddell  statistics  machine_learning  information_retrieval  clustering  have_read 
july 2011 by cshalizi
Predicting consumer behavior with Web search — PNAS
What search can and cannot predict. They mention, but I think could have stressed even more, that the search data is generated _automatically_ as a by-product of now-ordinary social life, rather than a deliberate construction on the part of public or private data-collecting agencies, so it is very, very, very cheap.
internet  data_mining  to_teach:data-mining  kith_and_kin  watts.duncan  hofman.jake  sociology  information_retrieval  networked_life  have_read 
october 2010 by cshalizi
[1010.0499] Statistical analysis of $k$-nearest neighbor collaborative recommendation
"Collaborative recommendation is an information-filtering technique that attempts to present information items that are likely of interest to an Internet user. Traditionally, collaborative systems deal with situations with two types of variables, users and items. In its most common form, the problem is framed as trying to estimate ratings for items that have not yet been consumed by a user. Despite wide-ranging literature, little is known about the statistical properties of recommendation systems. In fact, no clear probabilistic model even exists which would allow us to precisely describe the mathematical forces driving collaborative filtering. ... [We] set out a general sequential stochastic model for collaborative recommendation. ... in-depth analysis of the so-called cosine-type nearest neighbor ... method ... asymptotic performance as the number of users grows. We establish consistency ... under mild assumptions ... Rates of convergence and examples ..."
collaborative_filtering  information_retrieval  stochastic_models  nearest_neighbors  to_teach:data-mining 
october 2010 by cshalizi
ILI 2009 Presentation – "Self-plagiarism is style"
Cool effects achieved by applying basic data mining to libraries. To be used as teaching fodder, but honestly I should also find the time to suggest it to our librarians.
libraries  data_mining  information_retrieval  collaborative_filtering  via:magistra_et_mater  to_teach:data-mining 
june 2010 by cshalizi
[0910.2340] A Stochastic Model for Collaborative Recommendation
"Collaborative recommendation is an information-filtering technique that attempts to present ... movies, music, books, news, images, Web pages, etc. that are likely of interest to [users]. ... In its most common form, the problem is framed as trying to estimate ratings for items that have not yet been consumed by a user. Despite wide-ranging literature, little is known about the statistical properties of recommendation systems. In fact, no clear probabilistic model even exists allowing us to precisely describe the mathematical forces driving collaborative filtering. To provide an initial contribution to this, we propose to set out a general sequential stochastic model for collaborative recommendation and analyze its asymptotic performance as the number of users grows. ... analysis of the so-called cosine-type nearest neighbor collaborative method ... consistency of the procedure under mild assumptions on the model. Rates of convergence and examples ..."
collaborative_filtering  information_retrieval  data_mining  to_read  to:NB  to_teach:data-mining 
october 2009 by cshalizi
Ton's Interdependent Thoughts: WolframAlpha, Getting Less Impressed Upon Closer Look
Nice: "For all its coolness on the front of WolframAlpha, on the back end this sounds like it's the mechanical turk of the semantic web."
information_retrieval  wolfram.stephen  wolfram_alpha  via:arthegall 
may 2009 by cshalizi
About XStructure
Interface to arxiv via some kind of hierarchical clustering of the citation graph. (Can't find details.) Interesting but doesn't look all that useful (yet).
community_discovery  hierarchical_structure  information_retrieval  arxiv 
july 2008 by cshalizi
Desperately seeking the consumer: Personalized search engines and the commercial exploitation of user data: Rohle
" Essentially, search engines now fulfill the task of translating information needs into consumption needs."
information_retrieval 
march 2008 by cshalizi
Workshop I: Dynamic Searches and Knowledge Building
IPAM workshop on the mathematics of search and knowledge discovery, with links to slides and/or audio for some talks
information_retrieval  machine_learning  data_mining  linguistics  natural_language_processing  via:klk  semantics_from_syntax 
november 2007 by cshalizi
Michael Nielsen » Information Aggregators
"Where are the programming languages that have Bayesian filters, PageRank, and other types of collective intelligence as a central, core part of the language? I don’t mean libraries or plugins, I mean integrated into the core of the language in the sam
collaborative_filtering  information_retrieval  social_media  the_web  cognitive_triage 
november 2007 by cshalizi

