earth2marsh + hadoop 4
Pig and Hive at Yahoo! · Yahoo! Hadoop Blog
january 2012 by earth2marsh
"The data preparation phase is often known as ETL (Extract Transform Load) or the data factory. 'Factory' is a good analogy because it captures the essence of what is being done in this stage: Just as a physical factory brings in raw materials and outputs products ready for consumers, so a data factory brings in raw data and produces data sets ready for data users to consume. Raw data is loaded in, cleaned up, conformed to the selected data model, joined with other data sources, and so on. Users in this phase are generally engineers, data specialists, or researchers.
The data presentation phase is usually referred to as the data warehouse. A warehouse stores products ready for consumers; they need only come and select the proper products off the shelves. In this phase, users may be engineers using the data for their systems, analysts, or decision makers.
Given the different workloads and different users for each phase, we have found that different tools work best in each phase. Pig (combined with a workflow system such as Oozie) is best suited for the data factory, and Hive for the data warehouse."
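A rough way to picture the factory/warehouse split described in the quote is the following Python sketch. It is not Pig or Hive code, and every name in it (`clean`, `join_with`, `clicks_by_region`, the sample rows) is hypothetical. The first two functions stand in for the data factory steps of cleaning and joining; the last one stands in for a consumer querying the prepared data.

```python
from collections import defaultdict

def clean(rows):
    """Factory step: drop rows without a user and normalize the rest."""
    for row in rows:
        if not row.get("user"):
            continue  # discard unusable raw records
        yield {"user": row["user"].strip().lower(), "clicks": int(row["clicks"])}

def join_with(rows, regions):
    """Factory step: join each row with a second data source keyed by user."""
    for row in rows:
        yield {**row, "region": regions.get(row["user"], "unknown")}

def clicks_by_region(rows):
    """Warehouse-style query: aggregate the prepared data for a consumer."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["region"]] += row["clicks"]
    return dict(totals)

# Hypothetical raw input and lookup table, for illustration only.
raw = [
    {"user": " Alice ", "clicks": "3"},
    {"user": "", "clicks": "7"},
    {"user": "bob", "clicks": "2"},
    {"user": "ALICE", "clicks": "1"},
]
regions = {"alice": "us", "bob": "eu"}

prepared = list(join_with(clean(raw), regions))  # output of the data factory
report = clicks_by_region(prepared)              # what a warehouse user selects
```

In the terms of the quote, `prepared` is the product placed on the warehouse shelf, and `report` is one consumer picking from it.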
comparison
hadoop
pig
hive
data
factory
Comparing Pig Latin and SQL for Constructing Data Processing Pipelines · Yahoo! Hadoop Blog
january 2012 by earth2marsh
"SQL's ubiquity is convenient. However, I believe that Pig Latin is a more natural choice for constructing data pipelines, for several reasons:
Pig Latin is procedural, where SQL is declarative.
Pig Latin allows pipeline developers to decide where to checkpoint data in the pipeline.
Pig Latin allows the developer to select specific operator implementations directly rather than relying on the optimizer.
Pig Latin supports splits in the pipeline.
Pig Latin allows developers to insert their own code almost anywhere in the data pipeline."
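Several of the points in the list above (procedural steps, explicit checkpoints, splits, and user code in the pipeline) can be illustrated with a small Python sketch. This is an analogy only, not Pig Latin, and the helper names `checkpoint`, `split`, and `score` are hypothetical. Choosing operator implementations is not covered here.

```python
import json
import os
import tempfile

def checkpoint(rows, path):
    """Persist intermediate rows and reload them, so a rerun could start here."""
    with open(path, "w") as f:
        json.dump(rows, f)
    with open(path) as f:
        return json.load(f)

def split(rows, predicate):
    """Send each row down one of two branches of the pipeline."""
    matched, rest = [], []
    for row in rows:
        (matched if predicate(row) else rest).append(row)
    return matched, rest

def score(row):
    """User-supplied code inserted in the middle of the pipeline."""
    return {**row, "score": row["clicks"] * 10}

# Hypothetical input, for illustration only.
events = [
    {"user": "alice", "clicks": 3},
    {"user": "bob", "clicks": 0},
    {"user": "carol", "clicks": 5},
]

with tempfile.TemporaryDirectory() as tmp:
    # Each statement is one explicit step, written in execution order.
    scored = [score(e) for e in events]
    # The developer, not an optimizer, decides to checkpoint at this point.
    saved = checkpoint(scored, os.path.join(tmp, "scored.json"))
    # One input feeds two downstream branches.
    high, low = split(saved, lambda r: r["score"] >= 30)
```

The steps are stated in the order they run, which is the procedural style the quote contrasts with a single declarative SQL statement.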
hadoop
mapreduce
sql
pig
comparison