Data-Intensive Text Processing with MapReduce
6 weeks ago by donturn
free ebook w/github supported edits - Data-Intensive Text Processing with MapReduce #datascience #text
datascience
github
text
research
mapreduce
data
Welcome to Hive!
november 2011 by donturn
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
data
database
hadoop
opensource
nosql
Testing Benford's Law
november 2011 by donturn
examining large datasets for Benford's Law
statistics
math
data
from twitter
Welcome to Apache Pig!
september 2011 by donturn
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
apache
data
hadoop
mapreduce
opensource
sql
database
datascience
Apache Flume
september 2011 by donturn
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications.
opensource
apache
logs
datascience
kdd
data
Inside the Recommendation Engines of StumbleUpon, YouTube, Pandora and Hotpot | Liz Gannes | NetworkEffect | AllThingsD
march 2011 by donturn
Inside the Recommendation Engines of StumbleUpon, YouTube, Pandora and Hotpot -by @lizgannes
cf
recommender
data
from twitter_favs
Data mining, forecasting and bioinformatics competitions on Kaggle
february 2011 by donturn
contests where winners are paid (not very much) to solve data-oriented problems
datamining
data
outsourcing
stats
ggaughan/pipe2py - GitHub
february 2011 by donturn
convert Yahoo Pipes to Python
opensource
python
yahoo
data
rss
atom
Needlebase
february 2011 by donturn
merge data, crawl web data, then chart and explore it
api
data
database
datascience
kdd
etl
stats
charts
map
Chart Chooser – Juice Analytics
february 2011 by donturn
Chart Chooser lets you pick what you're trying to present & get PowerPoint and Excel templates #data
analytics
viz
charts
data
datascience
stats
#data
The 70 Online Databases that Define Our Planet - Technology Review
december 2010 by donturn
The 70 Online Databases That Define Our Planet - Technology Review
database
data
research
datascience
content
from twitter
google-refine - Project Hosting on Google Code
november 2010 by donturn
Google Refine looks like a great tool for cleaning & transforming messy data for use w/web services
analysis
data
datamining
google
tools
dev
datascience
from twitter
comScore, Inc.
april 2010 by donturn
data collection, but murky methodology
quant
quantia
behavior
analytics
ratings
data
City Forward
april 2010 by donturn
looks like more fun than SimCity.
data
ibm
metroia
ia
information_architecture
design
cities
planet
location
kdd
datascience
data_science
cityofsound: The street as platform
march 2008 by donturn
The way the street feels may soon be defined by what cannot be seen with the naked eye.
mobile
design
kdd
data_mining
privacy
ubicomp
wireless
wifi
web
research
scifi
urbanism
networks
data
information