mcroydon + hadoop   128

Hortonworks | Architecting the future of big data
Lots of good content coming from this Yahoo! spinoff.
yahoo  hadoop 
september 2011 by mcroydon
AccumuloProposal - Incubator Wiki
A NoSQL store with some properties similar to HBase, plus interesting per-cell ACLs. Born at the NSA.
apache  hadoop  nosql  nsa 
september 2011 by mcroydon
NextGen MapReduce Hits Apache Hadoop Mainline | Hortonworks
Favorite bulletpoint: "NextGen MapReduce has nearly 100,000 lines of code (roughly – just the *.java files). That’s nearly 1/3 of current Apache Hadoop codebase we’ve added in the last 12 months!" All SLOC jokes aside, it sounds like an awesome development.
hadoop  java  sloc 
august 2011 by mcroydon
Spark Cluster Computing Framework
"Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write."
analytics  hadoop  data  scala 
july 2011 by mcroydon
HPCC Systems | Open-source. Fast. Scalable. Simple.
"...a massive parallel-processing computing platform that solves Big Data problems." From Lexis-Nexis.
bigdata  hadoop  opensource  tools 
june 2011 by mcroydon
riptano/brisk - GitHub
A Cassandra-backed HDFS implementation and Hive driver.
cassandra  hadoop  hive  datastax 
may 2011 by mcroydon
The dark side of Hadoop - BackType Technology
These are the kinds of things that you don't find out until you've been knee-deep in something for a while.
hadoop  apache  java  mapreduce  map-reduce 
april 2011 by mcroydon
Brisk – Apache Hadoop™ powered by Cassandra | DataStax
HDFS-like storage layer for Hadoop/Hive using Cassandra.
hadoop  cassandra  hive 
march 2011 by mcroydon
OpenTSDB - A Distributed, Scalable Monitoring System
OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems (network gear, operating systems, applications) at a large scale, and make this data easily accessible and graphable.
analysis  architecture  bigdata  cloud  data  database  db  java  lgpl  hbase  hadoop  development  graph  distributed  monitoring  nosql  opensource  operations  scalability  scale  time  sysadmin  software  storage  series  opentsdb  rrd  stumbleupon  time-series  timeseries 
november 2010 by mcroydon
s4: distributed stream computing platform
"S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data."
apache  bigdata  cloud  cloudcomputing  cluster  computing  mapreduce  map  java  hadoop  framework  distributed  data  opensource  processing  platform  programming  real-time  streaming  stream  software  scalability  reduce  realtime  streamprocessing  yahoo  tool  s4  streams 
november 2010 by mcroydon
SHARD: Storing and Querying Large-Scale SemWeb Data
An excellent slide deck presenting SHARD at HadoopWorld.
rdf  triplestore  hadoop  hdfs  mapreduce  lubm 
november 2010 by mcroydon
SHARD Triple-Store
"SHARD is a proof-of-concept use of high-performance, low-cost distributed computing technology to develop a highly scalable triple-store. SHARD is released as an open-source project on the BSD license."
database  db  cloud  distributed  hadoop  lubm  mapreduce  rdf  store  sparql  storage  shard  semweb  semanticweb  scalability  triple-store 
october 2010 by mcroydon
Lineland
Scroll through for lots and lots of HBase internals.
blog  distributed  hadoop  hbase  nosql  mapreduce  programming  systems  storage  reference 
march 2010 by mcroydon
Why Europe’s Largest Ad Targeting Platform Uses Hadoop « Cloudera » Apache Hadoop for the Enterprise
Moving from Postgres to HDFS + Pig and MapReduce for large data storage, analysis, and aggregation.
clojure  data  cloud  database  development  hadoop  mapreduce  web  nosql 
march 2010 by mcroydon
Hw09 Counting And Clustering And Other Data Tricks
"Large scale computing is transformative for NYTimes.com."
hadoop  nytimes  data  analysis 
november 2009 by mcroydon
Lineland: Hive vs. Pig
Different tools for different jobs, but it's hard choosing sometimes when you're in the Hadoop ecosystem.
database  hadoop  mapreduce  comparison  hive  pig 
november 2009 by mcroydon
SourceForge.net: pydoop
Python C++ wrappers for HDFS and MapReduce. It's probably quicker than Dumbo.
python  code  library  hadoop  c++  analytics  project  examples  hdfs  via:pskomoroch 
november 2009 by mcroydon
Hbase/Stargate - Hadoop Wiki
Alpha-quality RESTful interface for HBase. Includes plain text, JSON, XML, and ProtocolBuffer serializers.
rest  hadoop  hbase  xml  json  web-services 
november 2009 by mcroydon
Avro: a Format for Big Data » Cloudera Hadoop & Big Data Blog
Another data interchange format, I think, along the lines of ProtocolBuffers and Thrift. One of the bigger problems in the Hadoop/big data community, to my mind, is parallel internal implementations of building blocks that are later open-sourced.
data  database  storage  distributed  hadoop  apache  cloud  json  messaging  encoding  protocol  portable  cloudera  bigdata  data-structures  serialization  format  foss  thrift  buffers  introduction  avro 
november 2009 by mcroydon
Journal of Eivind Uggedal: NoSQL East 2009 - Summary of Day 1
Some interesting bits and more of the same, but I really like the dark-launch approach that Scribe allows.
data  database  toread  blog  scalability  internet  distributed  article  hadoop  scaling  db  cloud  couchdb  conference  papers  keyvalue  nosql  links  cassandra  2009  mongodb  dynomite  riak 
november 2009 by mcroydon
GoodDoop
A nice set of recipes for Hadoop that probably translate well to other Map/Reduce architectures.
wiki  algorithms  algorithm  hadoop  mapreduce  examples  recipe 
october 2009 by mcroydon
Analyzing Human Genomes with Hadoop » Cloudera Hadoop & Big Data Blog
A fantastic writeup of absurdly fast sequencing software that can analyze a human genome in about 3 hours for less than $100 of AWS resources. Pretty darned impressive.
data  opensource  computer  amazon  algorithms  aws  hadoop  ec2  mapreduce  dna  bioinformatics  cloudera  trend  genetics  genome  foss  genomics 
october 2009 by mcroydon
Training to Climb an Everest of Digital Data
Big data is big and almost always requires a completely different mindset than the one that is taught in computer science programs.
data  database  processing  google  news  toread  ibm  energy  datasets  mining  search  research  science  internet  algorithms  storage  scaling  education  hadoop  analysis  computer-science  datacuration 
october 2009 by mcroydon
NAACL/HLT 2009 Tutorial: Data-Intensive Text Processing with MapReduce
"This half-day tutorial introduces participants to data-intensive text processing with the MapReduce programming model (Dean and Ghemawat, 2004), using the open-source Hadoop implementation."
tutorial  hadoop  graph  slides  mapreduce  nlp  machine_learning  textmining  via:pskomoroch 
august 2009 by mcroydon
Debugging MapReduce Programs With MRUnit
Testing Java Hadoop just got a little easier.
java  hadoop  testing 
july 2009 by mcroydon
Coding Horror: Scaling Up vs. Scaling Out: Hidden Costs
Food for thought, with the caveat that scaling out is a lot easier if you don't have any per-server software costs. Big iron costs less to operate, though.
programming  hardware  business  server  scalability  coding  networking  architecture  performance  distributed  scaling  web-development  cluster  hadoop  hosting  clustering  comparison  servers  distribution  it  codinghorror  2009  stackoverflow 
june 2009 by mcroydon
HBase Goes Realtime
"We improved our performance by more than an order of magnitude in most cases"
slides  pdf  hadoop  hbase  performance 
june 2009 by mcroydon
Steve: Developing on the Edge - the Yahoo! Hadoop distro
I see Yahoo and Cloudera's distributions of Hadoop as a lot like Ubuntu vs. Debian: mainline Hadoop offers Debian-style stability, and these distributions are the Ubuntu compromise for new features.
yahoo  cloudera  hadoop  map  reduce  map-reduce 
june 2009 by mcroydon
Neo4j - a Graph Database that Kicks Buttox | High Scalability
The most common complaint about existing graph databases is performance. Hopefully a stable of good, performant graph databases will change that.
data  database  toread  visualization  java  opensource  network  scalability  cool  architecture  performance  graph  hadoop  databases  db  graphs  2009  arch  socialnetworking  socialmedia  dataviz  neo4j  graph_database  graph-database  relationship 
june 2009 by mcroydon