Pig Introduction


Pig is an open-source, high-level dataflow system.
It provides a simple language called Pig Latin for querying and manipulating data. Pig Latin scripts are compiled into MapReduce jobs that run on Hadoop.
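To give a feel for the language, here is a small Pig Latin script (a hypothetical example; the file name `users.txt` and its schema are assumptions) that filters and groups a data set in a handful of lines, where the equivalent Java MapReduce job would run to dozens of lines:

```pig
-- Load a tab-delimited file (hypothetical path and schema)
users = LOAD 'users.txt' AS (name:chararray, age:int, city:chararray);

-- Keep only adult users
adults = FILTER users BY age >= 18;

-- Count adults per city
by_city = GROUP adults BY city;
counts  = FOREACH by_city GENERATE group AS city, COUNT(adults) AS total;

DUMP counts;
```

Pig compiles this script into one or more MapReduce jobs behind the scenes; the developer never writes a mapper or reducer by hand.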

Why Was Pig Created?
Pig offers an ad-hoc way of creating and executing MapReduce jobs on very large data sets. It is useful for rapid development, and the developer does not need to know Java programming.
Pig was developed at Yahoo.

Why Should I Go for Pig When There Is MapReduce?
- MapReduce is a powerful model for parallelism and gives good opportunities to parallelize an algorithm, but it is based on a rigid procedural structure.
- Pig does not require the developer to have any Java skills.
- A higher-level, declarative language is often desirable.
- As in an SQL query, the user specifies the "what" and leaves the "how" to the underlying processing engine.

Who Is Using Pig?
At Yahoo, 70% of MapReduce jobs are written in Pig, for tasks such as:
- Web log processing
- Building user behavior models
- Processing images
- Data mining
Twitter, LinkedIn, eBay, AOL, and others also use Pig.

Where Should I Use Pig?

Pig is a data flow language. It sits on top of Hadoop and makes it possible to create complex jobs to process large volumes of data quickly and efficiently.

Case 1 – Time-Sensitive Data Loads
Case 2 – Processing Many Data Sources
Case 3 – Analytic Insight Through Sampling
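For Case 3, Pig's built-in SAMPLE operator makes exploratory analysis on a subset of a large data set concise (a hypothetical script; the file name, schema, and sampling fraction are assumptions):

```pig
-- Load a large raw web log (hypothetical path and schema)
logs = LOAD 'weblogs.txt' AS (user:chararray, url:chararray, time:long);

-- Keep roughly 1% of the records for a quick, cheap look at the data
sample_logs = SAMPLE logs 0.01;

DUMP sample_logs;
```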

Where Not to Use Pig?
- Really nasty data formats or completely unstructured data (video, audio, raw human-readable text).
- Pig is noticeably slower than well-written MapReduce jobs.
- When you want more control to optimize your code.

Are there problems that can only be solved with MapReduce and not with Pig? In which scenarios are MapReduce jobs more useful than Pig?

Consider a scenario where we want to count the population of two cities. We have a data set of sensor records from different cities, and we want to count the population of two of them, say Bangalore and Noida, in a single MapReduce job. To do this, the records keyed by Bangalore and the records keyed by Noida must all reach the same reducer. The idea is to instruct the MapReduce program: whenever you find a record for the city 'Bangalore' or the city 'Noida', emit a common alias as the key, so that both cities share one key and are routed to the same reducer. For this, we have to write a custom partitioner.

In MapReduce, when you use 'city' as the key, the framework treats each distinct city as a separate key, so Bangalore and Noida would normally land on different reducers. MapReduce provides a way to write a custom partitioner that checks whether the city is Bangalore or Noida and returns the same partition for both. In Pig, however, we cannot create a custom partitioner: since Pig is a language on top of the execution engine rather than the framework itself, we cannot direct the engine to customize partitioning. In such scenarios, MapReduce works better than Pig.
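The aliasing idea above can be sketched in plain Java (a minimal sketch of the partitioning logic only; in a real job this logic would live in the getPartition() method of a class extending org.apache.hadoop.mapreduce.Partitioner, and the city names come from the example):

```java
public class CityPartitioner {
    // Map the two cities to one common alias so their records share a key
    // and are therefore routed to the same reducer.
    static String alias(String city) {
        if (city.equalsIgnoreCase("Bangalore") || city.equalsIgnoreCase("Noida")) {
            return "bangalore-noida"; // common alias for both cities
        }
        return city.toLowerCase();
    }

    // Same computation Hadoop's default HashPartitioner performs,
    // but applied to the alias instead of the raw city name.
    static int getPartition(String city, int numReducers) {
        return (alias(city).hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int reducers = 4; // assume four reducers for the demo
        // Both cities map to the same partition; a different city need not.
        System.out.println(getPartition("Bangalore", reducers)
                == getPartition("Noida", reducers)); // true
    }
}
```

Because both city names collapse to one alias before hashing, their records always land in the same partition regardless of the number of reducers.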