donturn + datascience 43
Data-Intensive Text Processing with MapReduce
6 weeks ago by donturn
Free ebook with GitHub-supported edits: Data-Intensive Text Processing with MapReduce #datascience #text
datascience
github
text
research
mapreduce
data
Welcome to Apache Pig!
september 2011 by donturn
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
apache
data
hadoop
mapreduce
opensource
sql
database
datascience
Scribe - GitHub
september 2011 by donturn
Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and be robust to network and node failures. There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups. If the central scribe server isn’t available, the local scribe server writes the messages to a file on local disk and sends them when the central server recovers. The central scribe server(s) can write the messages to the files that are their final destination, typically on an NFS filer or a distributed filesystem, or send them to another layer of scribe servers.
Scribe is unique in that clients log entries consisting of two strings, a category and a message. The category is a high-level description of the intended destination of the message and can have a specific configuration in the scribe server, which allows data stores to be moved by changing the scribe configuration instead of client code. The server also allows for configurations based on category prefix, and a default configuration that can insert the category name in the file path. Flexibility and extensibility are provided through the “store” abstraction. Stores are loaded dynamically based on a configuration file, and can be changed at runtime without stopping the server. Stores are implemented as a class hierarchy, and stores can contain other stores. This allows a user to chain features together in different orders and combinations by changing only the configuration.
Scribe is implemented as a Thrift service using the non-blocking C++ server. The installation at Facebook runs on thousands of machines and reliably delivers tens of billions of messages a day.
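The category routing and nested stores described above can be illustrated with a small Python sketch. This is not Scribe's code or its configuration format: `ListStore`, `BufferStore`, `Router`, and the "longest prefix wins" rule are all hypothetical names and assumptions chosen for illustration. It only shows the idea of picking a store by exact category, then by category prefix, then by a default, and of one store containing two others so that messages are held locally while the central destination is unavailable.

```python
class ListStore:
    """Hypothetical leaf store: keeps messages in memory instead of a file."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available  # False simulates an unreachable server
        self.messages = []

    def write(self, category, message):
        if not self.available:
            return False
        self.messages.append((category, message))
        return True


class BufferStore:
    """Hypothetical store that contains two other stores."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def write(self, category, message):
        # Try the primary first; fall back to the secondary on failure.
        return (self.primary.write(category, message)
                or self.secondary.write(category, message))

    def flush(self):
        # Replay held messages; anything the primary rejects stays buffered.
        pending, self.secondary.messages = self.secondary.messages, []
        for category, message in pending:
            if not self.primary.write(category, message):
                self.secondary.messages.append((category, message))


class Router:
    """Hypothetical lookup: exact category, then prefix, then default."""

    def __init__(self, exact, prefixes, default):
        self.exact = exact
        self.prefixes = prefixes
        self.default = default

    def store_for(self, category):
        if category in self.exact:
            return self.exact[category]
        matches = [p for p in self.prefixes if category.startswith(p)]
        if matches:
            # Assumption: the longest matching prefix takes precedence.
            return self.prefixes[max(matches, key=len)]
        return self.default

    def log(self, category, message):
        return self.store_for(category).write(category, message)
```

In this sketch, moving a category to a different destination means changing the mappings passed to `Router`, not the code that calls `log`, which mirrors the point made in the description about changing configuration instead of client code.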
opensource
logs
datascience
Apache Flume
september 2011 by donturn
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications.
opensource
apache
logs
datascience
kdd
data
Needlebase
february 2011 by donturn
merge data, crawl web data, then chart and explore it
api
data
database
datascience
kdd
etl
stats
charts
map
Chart Chooser – Juice Analytics
february 2011 by donturn
Chart Chooser lets you pick what you're trying to present & get PowerPoint and Excel templates #data
analytics
viz
charts
data
datascience
stats
#data
The 70 Online Databases that Define Our Planet - Technology Review
december 2010 by donturn
database
data
research
datascience
content
from twitter
google-refine - Project Hosting on Google Code
november 2010 by donturn
Google Refine looks like a great tool for cleaning & transforming messy data for use with web services
analysis
data
datamining
google
tools
dev
datascience
from twitter
How do I become a data scientist? - Quora
october 2010 by donturn
great collection of links about stats and other tech
datascience
statistics
stats
research
rstats
from delicious
Data Mining and Applications Graduate Certificate | Stanford University Online
june 2010 by donturn
Stanford has a certificate program focusing on stats and analytics.
syllabi
datamining
kdd
analytics
strategy
datascience
search
sem
stats
City Forward
april 2010 by donturn
looks like more fun than SimCity.
data
ibm
metroia
ia
information_architecture
design
cities
planet
location
kdd
datascience
data_science