Using Internet Data for Economic Research
20 days ago by cshalizi
"The data used by economists can be broadly divided into two categories. First, structured datasets arise when a government agency, trade association, or company can justify the expense of assembling records. The Internet has transformed how economists interact with these datasets by lowering the cost of storing, updating, distributing, finding, and retrieving this information. Second, some economic researchers affirmatively collect data of interest. For researcher-collected data, the Internet opens exceptional possibilities both by increasing the amount of information available for researchers to gather and by lowering researchers' costs of collecting information. In this paper, I explore the Internet's new datasets, present methods for harnessing their wealth, and survey a sampling of the research questions these data help to answer. The first section of this paper discusses "scraping" the Internet for data—that is, collecting data on prices, quantities, and key characteristics that are already available on websites but not yet organized in a form useful for economic research. A second part of the paper considers online experiments, including experiments that the economic researcher observes but does not control (for example, when Amazon or eBay alters site design or bidding rules); and experiments in which a researcher participates in design, including those conducted in partnership with a company or website, and online versions of laboratory experiments. Finally, I discuss certain limits to this type of data collection, including both "terms of use" restrictions on websites and concerns about privacy and confidentiality."
to:NB
economics
data_sets
web
re:your_favorite_dsge_sucks
The Electronic Text Corpus of Sumerian Literature
6 weeks ago by cshalizi
"Sumerian is the first language for which we have written evidence and its literature the earliest known. The Electronic Text Corpus of Sumerian Literature (ETCSL), a project of the University of Oxford, comprises a selection of nearly 400 literary compositions recorded on sources which come from ancient Mesopotamia (modern Iraq) and date to the late third and early second millennia BCE.
"The corpus contains Sumerian texts in transliteration, English prose translations and bibliographical information for each composition. The transliterations and the translations can be searched, browsed and read online using the tools of the website."
(Re the to_teach:data-mining tag: here are some bags of words for classification, principal components, topic models, maybe even manifold learning...)
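A minimal sketch of what "bags of words" would mean here, assuming the English translations have already been saved as plain strings. The `bag_of_words` helper and the two sample strings are hypothetical placeholders, not ETCSL text or tooling; the resulting count matrix is the sort of input classification, principal components, or topic models would start from.

```python
from collections import Counter
import re


def bag_of_words(texts):
    """Return (vocabulary, matrix), where matrix[i][j] counts word j in text i."""
    # Lowercase and keep runs of ASCII letters: a crude tokenizer, fine for a sketch.
    counts = [Counter(re.findall(r"[a-z]+", t.lower())) for t in texts]
    # Shared vocabulary across all texts, sorted so columns are reproducible.
    vocab = sorted(set().union(*counts))
    # One row per text, one column per vocabulary word.
    matrix = [[c[w] for w in vocab] for c in counts]
    return vocab, matrix


# Placeholder strings (assumption), standing in for translated compositions.
docs = ["the king built the temple", "the temple of the city"]
vocab, X = bag_of_words(docs)
```

The tokenizer would need rethinking for the transliterations, which use hyphens and diacritics that this regular expression throws away.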
sumer
mesopotamia
archaeology
history_of_ideas
data_sets
to_teach:data-mining
via:?
The World Top Incomes Database - G-MonD, PSE-Paris School of Economics
october 2011 by cshalizi
Possible computational project: code up estimating a Pareto tail for income (all sources) from these statistics, and tracking evolution over time (and perhaps across countries).
Or, an ADA project, suggested by conversation with John B.: look for correlation between (lack of) progressive taxation and job creation, as predicted by the usual right-wing suspects.
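A rough sketch of the first (Pareto tail) idea, assuming the statistics in hand are top income shares at two fractions, such as the top 1% and the top 10%. Under an exact Pareto tail with exponent alpha, the share going to the top fraction p is proportional to p^(1 - 1/alpha), so two shares pin down alpha. The function name and the numbers below are made up for illustration and are not taken from the database.

```python
import math


def pareto_alpha_from_shares(p_small, share_small, p_large, share_large):
    """Tail exponent implied by two top-income shares, assuming an exact Pareto tail."""
    # log(S(p1)/S(p2)) / log(p1/p2) equals 1 - 1/alpha under the Pareto assumption.
    slope = math.log(share_small / share_large) / math.log(p_small / p_large)
    return 1.0 / (1.0 - slope)


# Hypothetical shares by year (assumption): (top 1% share, top 10% share).
shares_by_year = {
    1980: (0.08, 0.32),
    2000: (0.16, 0.42),
}

# Implied exponent for each year; a smaller alpha means a heavier tail.
alpha_by_year = {
    year: pareto_alpha_from_shares(0.01, s1, 0.10, s10)
    for year, (s1, s10) in shares_by_year.items()
}
```

Running this per year (and per country) gives the time series the project would track; a real version should also check whether the Pareto form fits at all, for instance by comparing exponents implied by different pairs of fractions.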
inequality
economics
data_sets
to_teach:undergrad-ADA
to_teach:statcomp
The Meta-Activism Project | A Non-Traditional Digital Activism Think Tank
september 2011 by cshalizi
Flagged "to_teach:data-mining" if I can think of a good project for students with this.
networked_life
politics
data_sets
to_teach:data-mining
Western on Strikes
april 2011 by cshalizi
Missing the union density variable. Wrote to ask about it. Referenced paper is http://www.jstor.org/stable/271022, which seems to me exactly the kind of thing Andy and I should mention in "Philosophy and Practice". --- ETA: Prof. Western wrote back within hours with the union density data, but I'm not sure I can make it public...
to_teach:undergrad-ADA
strikes
data_sets
BEA : Gross Domestic Product by Metropolitan Area
december 2010 by cshalizi
For the "urban scaling? what urban scaling" post. Thought: make this into a data analysis exercise in 402?
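One possible shape for such an exercise, assuming "urban scaling" is read as a power law Y = c N^b linking gross metropolitan product Y to population N: regress log Y on log N and look at the slope. The helper name and the synthetic numbers are assumptions for illustration, not BEA figures.

```python
import math


def scaling_exponent(populations, outputs):
    """OLS slope and intercept from regressing log(output) on log(population)."""
    xs = [math.log(n) for n in populations]
    ys = [math.log(y) for y in outputs]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Ordinary least squares for a single predictor.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Synthetic cities generated from an exact power law (assumption, not real data).
pops = [1e5, 1e6, 1e7]
gmp = [2.0 * n ** 1.1 for n in pops]
b, log_c = scaling_exponent(pops, gmp)
```

With real data the interesting part of the exercise is less the fitted slope than the residuals, and whether per-capita output depends on population at all.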
data_sets
cities
economics
urban_economics
to_teach:undergrad-ADA
US Census Spatial and Demographic Data in R: The UScensus2000 Suite of Packages
december 2010 by cshalizi
"The US Decennial Census is arguably the most important data set for social science research in the United States. The UScensus2000 suite of packages allows for convenient handling of the 2000 US Census spatial and demographic data. The goal of this article is to showcase the UScensus2000 suite of packages for R, to describe the data contained within these packages, and to demonstrate the helper functions provided for handling this data. The UScensus2000 suite is comprised of spatial and demographic data for the 50 states and Washington DC at four different geographic levels (block, block group, tract, and census designated place). The UScensus2000 suite also contains a number of functions for selecting and aggregating specific geographies or demographic information such as metropolitan statistical areas, counties, etc. ... This article will provide the necessary background for working with this data set, helper functions, and finish with an applied spatial statistics example."
data_sets
census
R
to_teach:undergrad-ADA
Night & Day
august 2010 by cshalizi
"Urban remote sensing", in part to estimate urban population aggregations w/o reference to administrative districts
cities
urbanism
data_sets
via:aaron_clauset
Make Research Data Public?—Not Always so Simple: A Dialogue for Statisticians and Science Editors
august 2010 by cshalizi
Nothing very profound or surprising, sadly.
statistics
social_life_of_the_mind
data_sets
http://lib.stat.cmu.edu/datasets/sleep
february 2010 by cshalizi
"Correlates of sleep in mammals" data set; to use in 490 for illustrating factor analysis.
data_sets
sleep
to_teach:undergrad-research
to_teach:data-mining
to_teach:undergrad-ADA
Using R for Cross-Cultural Research (Dow)
november 2009 by cshalizi
Describes working with the standard cross-cultural sample in R. TODO: track down the actual file! TODO: think about devising suitable examples/problems for data mining.
anthropology
R
data_sets
via:nikete
to_teach:data-mining
track_down_references
to_teach:undergrad-ADA
http://www.amstat.org/publications/jse/datasets/04cars.txt
september 2009 by cshalizi
2004 cars and trucks data.
data_sets
to_teach:data-mining
LDC Catalog: New York Times Annotated Corpus
august 2009 by cshalizi
Sounds like it would be perfect for 350. Now how the **** do I get access?
information_retrieval
text_mining
newspapers
data_sets
to_teach:data-mining
via:myl
Christmas Bird Count
april 2009 by cshalizi
Use this as an example of mixture modeling? Unfortunately there doesn't seem to be a good option to just download a set of all the counts from all years. Perhaps write to them to see if they'd make such a thing available?
birds
to_teach:data-mining
data_sets
via:myl
Rich Puchalsky's blog: eGRID
november 2008 by cshalizi
Notes on EPA's eGRID data source, on electrical power-plant emissions
data_sets
egrid
pollution
electric_power_grid
to_teach:data-mining
puchalsky.rich
FEC Campaign Contribution Data, 1980--2006
november 2008 by cshalizi
Re-parsed from the FEC files by Mary McGlohon.
campaign_finance
data_sets
mcglohon.mary
Pace & Berry's California Housing Data - StatLib
october 2008 by cshalizi
Data set on median California housing prices by census block (1990 census but possibly not the year of the housing prices --- 1997 paper), with eight continuous covariates. Some obviously weird records (e.g. block number 19007 has 19 total rooms, 5 bedrooms, 6 households, and a population of 7460) but no missing values.
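A quick way to surface records like that one, sketched under the assumption that each block has been loaded as a dictionary. The field names and the cutoff of 20 people per household are hypothetical choices, not part of the StatLib file's documentation.

```python
def implausible_blocks(records, max_people_per_household=20.0):
    """Return records whose population per household exceeds the cutoff."""
    flagged = []
    for r in records:
        households = r["households"]
        # Zero households, or far more people than households could plausibly hold.
        if households <= 0 or r["population"] / households > max_people_per_household:
            flagged.append(r)
    return flagged


records = [
    # The odd record described in the note above.
    {"total_rooms": 19, "total_bedrooms": 5, "households": 6, "population": 7460},
    # A made-up, unremarkable block for comparison (assumption).
    {"total_rooms": 900, "total_bedrooms": 130, "households": 125, "population": 320},
]
flagged = implausible_blocks(records)
```

Blocks flagged this way may be group quarters such as prisons or dormitories rather than data-entry errors, so whether to drop them is a modeling decision and not a cleaning step.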
to_teach:data-mining
spatial_statistics
data_sets
to_teach:undergrad-ADA
Statistics Data Sets
november 2007 by cshalizi
Compilation listing at UMass Amherst, organized by relevant method. OK but not outstanding.
statistics
to_teach
data_sets
to_teach:undergrad-ADA