cshalizi + data_sets   27

Using Internet Data for Economic Research
"The data used by economists can be broadly divided into two categories. First, structured datasets arise when a government agency, trade association, or company can justify the expense of assembling records. The Internet has transformed how economists interact with these datasets by lowering the cost of storing, updating, distributing, finding, and retrieving this information. Second, some economic researchers affirmatively collect data of interest. For researcher-collected data, the Internet opens exceptional possibilities both by increasing the amount of information available for researchers to gather and by lowering researchers' costs of collecting information. In this paper, I explore the Internet's new datasets, present methods for harnessing their wealth, and survey a sampling of the research questions these data help to answer. The first section of this paper discusses "scraping" the Internet for data—that is, collecting data on prices, quantities, and key characteristics that are already available on websites but not yet organized in a form useful for economic research. A second part of the paper considers online experiments, including experiments that the economic researcher observes but does not control (for example, when Amazon or eBay alters site design or bidding rules); and experiments in which a researcher participates in design, including those conducted in partnership with a company or website, and online versions of laboratory experiments. Finally, I discuss certain limits to this type of data collection, including both "terms of use" restrictions on websites and concerns about privacy and confidentiality."
to:NB  economics  data_sets  web  re:your_favorite_dsge_sucks 
20 days ago by cshalizi
The Electronic Text Corpus of Sumerian Literature
"Sumerian is the first language for which we have written evidence and its literature the earliest known. The Electronic Text Corpus of Sumerian Literature (ETCSL), a project of the University of Oxford, comprises a selection of nearly 400 literary compositions recorded on sources which come from ancient Mesopotamia (modern Iraq) and date to the late third and early second millennia BCE.
"The corpus contains Sumerian texts in transliteration, English prose translations and bibliographical information for each composition. The transliterations and the translations can be searched, browsed and read online using the tools of the website."

(Re the to_teach:data-mining tag: here are some bags of words for classification, principal components, topic models, maybe even manifold learning...)
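A minimal sketch of what "bags of words" would mean in practice, assuming the English prose translations have already been saved as plain strings. The function names and the toy strings are made up for illustration and are not taken from the ETCSL. The simple letters-only tokenizer would not suit the transliterations, which use hyphens and sign indices.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, keep runs of ASCII letters, and count each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


def document_term_matrix(texts):
    """Return (vocabulary, rows): one row of counts per text,
    with columns in sorted vocabulary order."""
    bags = [bag_of_words(t) for t in texts]
    vocabulary = sorted(set().union(*bags))
    rows = [[bag[word] for word in vocabulary] for bag in bags]
    return vocabulary, rows


# Toy strings for illustration only; NOT text from the corpus.
toy_texts = ["the king built the temple", "the temple of the city"]
vocabulary, rows = document_term_matrix(toy_texts)
```

The resulting count matrix is the kind of input a classifier, principal components, or a topic model would start from.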
sumer  mesopotamia  archaeology  history_of_ideas  data_sets  to_teach:data-mining  via:? 
6 weeks ago by cshalizi
The World Top Incomes Database - G-MonD, PSE-Paris School of Economics
Possible computational project: code up estimating a Pareto tail for income (all sources) from these statistics, and tracking evolution over time (and perhaps across countries).
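One possible starting point for that project, as a sketch only: if the tail is exactly Pareto with exponent alpha > 1, the income share of the top fraction p is proportional to p^(1 - 1/alpha), so any two top shares imply a value of alpha. The function names are hypothetical, and the check uses synthetic shares generated from a known alpha, not figures from the database.

```python
import math


def pareto_alpha_from_shares(p, share_p, q, share_q):
    """Pareto exponent implied by the income shares of the top fractions p < q.

    Assumes an exact Pareto tail, under which share(p) is proportional
    to p ** (1 - 1/alpha).
    """
    if not (0 < p < q < 1 and 0 < share_p < share_q < 1):
        raise ValueError("need 0 < p < q < 1 and 0 < share_p < share_q < 1")
    slope = math.log(share_p / share_q) / math.log(p / q)  # equals 1 - 1/alpha
    if slope >= 1:
        raise ValueError("shares are not consistent with a Pareto tail")
    return 1.0 / (1.0 - slope)


def inverted_pareto_coefficient(alpha):
    """Mean income above a threshold divided by that threshold: alpha / (alpha - 1)."""
    return alpha / (alpha - 1.0)


def synthetic_share(p, alpha):
    """Top-p income share when the whole distribution is Pareto(alpha)."""
    return p ** (1.0 - 1.0 / alpha)


# Synthetic check with alpha = 2 (made-up value, not real data).
alpha_hat = pareto_alpha_from_shares(
    0.01, synthetic_share(0.01, 2.0), 0.10, synthetic_share(0.10, 2.0)
)
```

Applying the first function year by year to a pair of published top shares would give a rough time series of the tail exponent; a proper treatment would also need standard errors and a check that the Pareto form actually fits.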

Or, an ADA project, suggested by conversation with John B.: look for correlation between (lack of) progressive taxation and job creation, as predicted by the usual right-wing suspects.
inequality  economics  data_sets  to_teach:undergrad-ADA  to_teach:statcomp 
october 2011 by cshalizi
The Meta-Activism Project | A Non-Traditional Digital Activism Think Tank
Flagged "to_teach:data-mining" if I can think of a good project for students with this.
networked_life  politics  data_sets  to_teach:data-mining 
september 2011 by cshalizi
Western on Strikes
Missing the union density variable.  Wrote to ask about it.  Referenced paper is http://www.jstor.org/stable/271022, which seems to me exactly the kind of thing Andy and I should mention in "Philosophy and Practice".  --- ETA: Prof. Western wrote back within hours with the union density data, but I'm not sure I can make it public...
to_teach:undergrad-ADA  strikes  data_sets 
april 2011 by cshalizi
BEA : Gross Domestic Product by Metropolitan Area
For the "urban scaling? what urban scaling" post.  Thought: make this into a data analysis exercise in 402?
data_sets  cities  economics  urban_economics  to_teach:undergrad-ADA 
december 2010 by cshalizi
US Census Spatial and Demographic Data in R: The UScensus2000 Suite of Packages
"The US Decennial Census is arguably the most important data set for social science research in the United States. The UScensus2000 suite of packages allows for convenient handling of the 2000 US Census spatial and demographic data. The goal of this article is to showcase the UScensus2000 suite of packages for R, to describe the data contained within these packages, and to demonstrate the helper functions provided for handling this data. The UScensus2000 suite is comprised of spatial and demographic data for the 50 states and Washington DC at four different geographic levels (block, block group, tract, and census designated place). The UScensus2000 suite also contains a number of functions for selecting and aggregating specific geographies or demographic information such as metropolitan statistical areas, counties, etc. ... This article will provide the necessary background for working with this data set, helper functions, and finish with an applied spatial statistics example."
data_sets  census  R  to_teach:undergrad-ADA 
december 2010 by cshalizi
Night & Day
"Urban remote sensing", in part to estimate urban population aggregations w/o reference to administrative districts
cities  urbanism  data_sets  via:aaron_clauset 
august 2010 by cshalizi
http://lib.stat.cmu.edu/datasets/sleep
"Correlates of sleep in mammals" data set; to use in 490 for illustrating factor analysis.
data_sets  sleep  to_teach:undergrad-research  to_teach:data-mining  to_teach:undergrad-ADA 
february 2010 by cshalizi
Using R for Cross-Cultural Research (Dow)
Describes working with the standard cross-cultural sample in R. TODO: track down the actual file! TODO: think about devising suitable examples/problems for data mining.
anthropology  R  data_sets  via:nikete  to_teach:data-mining  track_down_references  to_teach:undergrad-ADA 
november 2009 by cshalizi
Christmas Bird Count
Use this as an example of mixture modeling? Unfortunately there doesn't seem to be a good option to just download a set of all the counts from all years. Perhaps write to them to see if they'd make such a thing available?
birds  to_teach:data-mining  data_sets  via:myl 
april 2009 by cshalizi
Rich Puchalsky's blog: eGRID
Notes on EPA's eGRID data source, on electrical power-plant emissions
data_sets  egrid  pollution  electric_power_grid  to_teach:data-mining  puchalsky.rich 
november 2008 by cshalizi
Pace & Barry's California Housing Data - StatLib
Data set on median California housing prices by census block (1990 census but possibly not the year of the housing prices --- 1997 paper), with eight continuous covariates. Some obviously weird records (e.g. block number 19007 has 19 total rooms, 5 bedrooms, 6 households, and a population of 7460) but no missing values.
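A sketch of one way to flag records like that automatically, by checking persons per household. The field names and the cutoff of 20 are assumptions for illustration, not the column names of the StatLib file. The first record below is the one quoted in the note; the second is made up for contrast.

```python
def implausible_records(records, max_persons_per_household=20.0):
    """Return the records whose population per household exceeds the cutoff
    (or whose household count is not positive)."""
    flagged = []
    for rec in records:
        households = rec["households"]
        if households <= 0 or rec["population"] / households > max_persons_per_household:
            flagged.append(rec)
    return flagged


records = [
    # The weird record described in the note above.
    {"id": 19007, "total_rooms": 19, "bedrooms": 5, "households": 6, "population": 7460},
    # Made-up ordinary record, for contrast only.
    {"id": "made-up", "total_rooms": 2000, "bedrooms": 400, "households": 380, "population": 1000},
]
flagged = implausible_records(records)
```

Whether flagged records should be dropped, corrected, or kept as a teaching example of dirty data is a separate decision.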
to_teach:data-mining  spatial_statistics  data_sets  to_teach:undergrad-ADA 
october 2008 by cshalizi
Statistics Data Sets
Compilation listing at UMass Amherst, organized by relevant method.  OK but not outstanding.
statistics  to_teach  data_sets  to_teach:undergrad-ADA 
november 2007 by cshalizi
