Welcome to Chukwa!
november 2011 by donturn
Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
hadoop
monitor
logs
opensource
Scribe - GitHub
september 2011 by donturn
Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and be robust to network and node failures. There is a Scribe server running on every node in the system, configured to aggregate messages and send them to a central Scribe server (or servers) in larger groups. If the central Scribe server isn’t available, the local Scribe server writes the messages to a file on local disk and sends them when the central server recovers. The central Scribe server(s) can write the messages to the files that are their final destination, typically on an NFS filer or a distributed filesystem, or send them to another layer of Scribe servers.
Scribe is unique in that clients log entries consisting of two strings, a category and a message. The category is a high-level description of the intended destination of the message and can have a specific configuration in the Scribe server, which allows data stores to be moved by changing the Scribe configuration instead of client code. The server also allows for configurations based on category prefix, and a default configuration that can insert the category name in the file path. Flexibility and extensibility are provided through the “store” abstraction. Stores are loaded dynamically based on a configuration file, and can be changed at runtime without stopping the server. Stores are implemented as a class hierarchy, and stores can contain other stores. This allows a user to chain features together in different orders and combinations by changing only the configuration.
Scribe is implemented as a Thrift service using the non-blocking C++ server. The installation at Facebook runs on thousands of machines and reliably delivers tens of billions of messages a day.
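The category lookup described above (a specific configuration per category, then a category prefix, then a default that puts the category name in the file path) can be sketched roughly as follows. This is only an illustration and not Scribe's actual code or configuration format: `CategoryRouter`, its fields, the paths, and the "longest prefix wins" tie-break are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CategoryRouter:
    """Hypothetical resolver mapping a log category to a destination path."""

    # Exact category name -> destination path (illustrative, not Scribe's format).
    exact: dict = field(default_factory=dict)
    # Category prefix -> destination path.
    prefixes: dict = field(default_factory=dict)
    # Default destination; the category name is inserted into the path.
    default_template: str = "/var/log/scribe/{category}/current.log"

    def resolve(self, category: str) -> str:
        # 1. A category with its own configuration wins.
        if category in self.exact:
            return self.exact[category]
        # 2. Otherwise try prefix configurations; longest prefix wins (assumption).
        matches = [p for p in self.prefixes if category.startswith(p)]
        if matches:
            return self.prefixes[max(matches, key=len)]
        # 3. Fall back to the default, which embeds the category name.
        return self.default_template.format(category=category)


router = CategoryRouter(
    exact={"ads": "/data/ads.log"},
    prefixes={"web_": "/data/web.log"},
)
```

Because clients only ever send a (category, message) pair, moving a data store in this sketch means editing the `exact` or `prefixes` tables, not the client code.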
opensource
logs
datascience
Apache Flume
september 2011 by donturn
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications.
opensource
apache
logs
datascience
kdd
data
riivo/pwum - GitHub
may 2011 by donturn
pwum is a set of Python scripts for working with web log files, extracting frequent patterns, and clustering sessions.
Two main functions:
Finding frequent patterns. Extracts frequently co-accessed pages in web sessions, using the traditional frequent pattern mining algorithm Apriori.
Finding similar sessions based on behaviour, i.e., visited pages, by clustering. The available methods either build a Markov-chain-like transition matrix out of each session and cluster these, or represent sessions as simple feature vectors. Clustering is currently done with the k-means algorithm.
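As a rough illustration of the first function, the sketch below runs a textbook Apriori pass over web sessions to find sets of pages that are frequently accessed together. It is not taken from pwum itself: the function name `apriori`, the `min_support` parameter (an absolute session count), and the sample sessions are assumptions for the example.

```python
from itertools import combinations


def apriori(sessions, min_support):
    """Return {frozenset(pages): count} for page sets seen in >= min_support sessions."""
    # Treat each session as a set of pages (order and repeats are ignored).
    transactions = [frozenset(s) for s in sessions]

    # Level 1: count individual pages.
    counts = {}
    for t in transactions:
        for page in t:
            key = frozenset([page])
            counts[key] = counts.get(key, 0) + 1
    current = {k: c for k, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Build size-k candidates from frequent (k-1)-itemsets and prune any
        # candidate with an infrequent (k-1)-subset (the Apriori property).
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Count how many sessions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical sessions: each is the list of pages one visitor requested.
sessions = [["/", "/a", "/b"], ["/", "/a"], ["/", "/b"], ["/a", "/b"]]
result = apriori(sessions, min_support=2)
```

With these sample sessions every pair of pages is frequent, but the three-page set is not, since only one session contains all three.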
python
logs
analysis
web
analytics
Analytics - the Data Liberation Front
april 2011 by donturn
Extract Google Analytics data for further analysis on your own (in Excel, Python, etc.)
google
analytics
logs
Google Search History Expands, Becomes Web History
april 2007 by donturn
needs a MUCH better interface.
google
history
web
search
identity
gui
ui
logs
logging
toolbar
information_seeking
Dejal - Simon
december 2006 by donturn
GUI for monitoring web servers
mac
server
web
monitor
logs
analytics