Waterbear: Welcome
2 days ago
The JavaScript playground for Waterbear lets you create Waterbear scripts, see the JavaScript they will generate, and run them right in the browser!
javascript
programming
tools
2 days ago
Data-Intensive Text Processing with MapReduce
6 weeks ago
Free ebook with GitHub-supported edits - Data-Intensive Text Processing with MapReduce #datascience #text
datascience
github
text
research
mapreduce
data
6 weeks ago
User Experience Design Guidelines for Windows Phone
february 2012
User Experience Design Guidelines for Windows Phone
design
metro
microsoft
mobile
ui
ia
february 2012
Free DOS games for Boxer, and tips on where to find more of them.
january 2012
would enjoy Zork again for sure.
mac
dos
games
january 2012
Official Google Blog: Search, plus Your World
january 2012
Finally, some new (or is it? http://www.google.com/press/pressrel/outride.html ) functionality in Google's personalized search options.
google
search
personalization
patent
january 2012
missingmanuals.com - Mac OS X Lion: The Missing Manual CD
december 2011
links to the apps and sites in the Missing Manual book for Lion
mac
lion
december 2011
Coverage of ApacheCon North America 2011, 6th-11th November 2011 | Lanyrd
november 2011
audio & presentations from ApacheCon North America 2011
apache
opensource
lucene
solr
hadoop
november 2011
Apache OpenNLP - Welcome to Apache OpenNLP
november 2011
OpenNLP is an organizational center for open source projects related to natural language processing. Its primary role is to encourage and facilitate the collaboration of researchers and developers on such projects.
OpenNLP also hosts a variety of Java-based NLP tools which perform sentence detection, tokenization, POS tagging, chunking and parsing, named-entity detection, and coreference using the OpenNLP Maxent machine learning package.
apache
java
nlp
opensource
november 2011
Apache Tika - Apache Tika
november 2011
The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.
apache
java
lucene
metadata
parser
november 2011
Apache UIMA - Apache UIMA
november 2011
Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.
UIMA enables applications to be decomposed into components, for example "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)". Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.
apache
framework
java
nlp
opensource
november 2011
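A rough sketch of the component-pipeline idea from the UIMA description, in plain Python. This is not the UIMA API: the `Document` and `Component` classes, the toy heuristics, and the sample sentence are all hypothetical, chosen only to show components sharing one interface while a small driver manages the data flow between them.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Shared structure passed between components (hypothetical, not UIMA's own data model)."""
    text: str
    annotations: dict = field(default_factory=dict)


class Component:
    """Minimal interface that every pipeline stage implements."""
    name = "component"

    def process(self, doc: Document) -> None:
        raise NotImplementedError


class LanguageIdentifier(Component):
    name = "language identification"

    def process(self, doc):
        # Toy heuristic: call it English if a few common English words appear.
        words = set(re.findall(r"[a-z]+", doc.text.lower()))
        doc.annotations["language"] = "en" if words & {"the", "and", "in"} else "unknown"


class SentenceDetector(Component):
    name = "sentence boundary detection"

    def process(self, doc):
        # Split after sentence-ending punctuation followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", doc.text)
        doc.annotations["sentences"] = [p.strip() for p in parts if p.strip()]


class EntityDetector(Component):
    name = "entity detection"

    def process(self, doc):
        # Toy rule: runs of capitalized words, ignoring each sentence's first word.
        entities = []
        for sentence in doc.annotations.get("sentences", []):
            run = []
            for token in sentence.split()[1:]:
                word = token.strip(".,;:!?")
                if word[:1].isupper():
                    run.append(word)
                else:
                    if run:
                        entities.append(" ".join(run))
                    run = []
            if run:
                entities.append(" ".join(run))
        doc.annotations["entities"] = entities


def run_pipeline(components, doc):
    """The 'framework': runs each component in order over the shared document."""
    for component in components:
        component.process(doc)
    return doc


pipeline = [LanguageIdentifier(), SentenceDetector(), EntityDetector()]
result = run_pipeline(pipeline, Document("The analyst works for Acme Corp. She is located in Austin."))
print(result.annotations)
```

Each stage only reads and writes the shared annotations, so stages can be reordered or swapped without touching the others, which is the decomposition the description is getting at.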
Welcome to Chukwa!
november 2011
Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
hadoop
monitor
logs
opensource
november 2011
Welcome to Hive!
november 2011
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
data
database
hadoop
opensource
nosql
november 2011
Apache Rave
november 2011
Apache Rave is a new web and social mashup engine. It will provide an out-of-the-box as well as an extendible lightweight Java platform to host, serve and aggregate (Open)Social Gadgets and services through a highly customizable and Web 2.0 friendly front-end. Rave is targeted as an engine for internet and intranet portals and as a building block to provide context-aware personalization and collaboration features for multi-site/multi-channel (mobile) oriented and content driven websites and (social) network oriented services and platforms. For the OpenSocial container and services the (Java) Apache Shindig will be integrated. At a later stage further generalization is envisioned to also transparently support W3C Widgets using Apache Wookie.
apache
opensource
mashup
november 2011
Testing Benford's Law
november 2011
examining large datasets for Benford's Law
statistics
math
data
from twitter
november 2011
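A minimal sketch of what testing a dataset against Benford's Law might look like. The law's expected first-digit probability, log10(1 + 1/d), is standard; the helper names and the sample data (powers of 2) are my own illustration and are not taken from the linked article.

```python
import math
from collections import Counter


def benford_expected():
    """Expected first-digit probabilities under Benford's Law: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """Leading nonzero digit of a nonzero number, read off scientific notation."""
    return int(f"{abs(x):.15e}"[0])


def observed_distribution(values):
    """Fraction of nonzero values starting with each digit 1..9."""
    digits = [first_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


# Hypothetical sample: powers of 2, a sequence known to follow Benford's Law closely.
sample = [2 ** k for k in range(1, 201)]
expected = benford_expected()
observed = observed_distribution(sample)

for d in range(1, 10):
    print(d, round(expected[d], 3), round(observed[d], 3))
```

For a real dataset, the per-digit comparison would normally feed a goodness-of-fit test such as chi-squared rather than an eyeball check.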
Yes, Computer Scientists Are Hypercritical | blog@CACM | Communications of the ACM
october 2011
We need to boost ratings when examining recommendation items rated by CS people.
cs
research
cf
recommender
october 2011
Welcome to Apache Pig!
september 2011
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
apache
data
hadoop
mapreduce
opensource
sql
database
datascience
september 2011
Scribe - GitHub
september 2011
Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and be robust to network and node failures. There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups. If the central scribe server isn’t available the local scribe server writes the messages to a file on local disk and sends them when the central server recovers. The central scribe server(s) can write the messages to the files that are their final destination, typically on an nfs filer or a distributed filesystem, or send them to another layer of scribe servers.
Scribe is unique in that clients log entries consisting of two strings, a category and a message. The category is a high level description of the intended destination of the message and can have a specific configuration in the scribe server, which allows data stores to be moved by changing the scribe configuration instead of client code. The server also allows for configurations based on category prefix, and a default configuration that can insert the category name in the file path. Flexibility and extensibility are provided through the “store” abstraction. Stores are loaded dynamically based on a configuration file, and can be changed at runtime without stopping the server. Stores are implemented as a class hierarchy, and stores can contain other stores. This allows a user to chain features together in different orders and combinations by changing only the configuration.
Scribe is implemented as a Thrift service using the non-blocking C++ server. The installation at Facebook runs on thousands of machines and reliably delivers tens of billions of messages a day.
opensource
logs
datascience
september 2011
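To make the Scribe description concrete, here is a small Python sketch of category-based routing and a store that contains other stores. It is not Scribe's code or configuration format: `MemoryStore`, `BufferStore`, `Router`, and the category names are hypothetical stand-ins for the ideas in the description (exact category, category prefix, default, and local buffering while the central server is down).

```python
class MemoryStore:
    """Stand-in for a destination; 'available' simulates an unreachable server."""

    def __init__(self, available=True):
        self.available = available
        self.messages = []

    def write(self, category, message):
        if not self.available:
            return False
        self.messages.append((category, message))
        return True


class BufferStore:
    """A store containing two other stores: try primary, fall back to secondary."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def write(self, category, message):
        if self.primary.write(category, message):
            return True
        return self.secondary.write(category, message)

    def flush(self):
        """Replay buffered messages to the primary once it has recovered."""
        pending, self.secondary.messages = self.secondary.messages, []
        for i, (category, message) in enumerate(pending):
            if not self.primary.write(category, message):
                # Primary failed again: keep the unsent remainder buffered.
                self.secondary.messages = pending[i:]
                break


class Router:
    """Chooses a store by exact category, then category prefix, then default."""

    def __init__(self, exact, prefixes, default):
        self.exact = exact
        self.prefixes = prefixes
        self.default = default

    def store_for(self, category):
        if category in self.exact:
            return self.exact[category]
        for prefix, store in self.prefixes.items():
            if category.startswith(prefix):
                return store
        return self.default

    def log(self, category, message):
        return self.store_for(category).write(category, message)


central = MemoryStore(available=False)  # central server is down at first
local = MemoryStore()                   # stands in for the local disk file
buffered = BufferStore(central, local)
web_store = MemoryStore()
default_store = MemoryStore()
router = Router(exact={"ads": buffered}, prefixes={"web_": web_store}, default=default_store)

router.log("ads", "click")        # buffered locally because central is down
router.log("web_access", "GET /")
router.log("misc", "hello")

central.available = True          # central server recovers
buffered.flush()
print(central.messages, local.messages)
```

Because clients only supply a category and a message, moving a category to a different destination means changing the router's mapping and leaving client code alone.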
Apache Flume
september 2011
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications.
opensource
apache
logs
datascience
kdd
data
september 2011
Apache Solr 3.1 Cookbook Book & eBook | Packt Publishing Technical & IT Book and eBook Store
september 2011
Apache Solr 3.1 Cookbook looks great
search
apache
opensource
lucene
solr
september 2011