Breaking News
Loading...

Pig Usecases





How Yahoo uses PIG?
Pig is best suited for the data factory.
Data Factory contains:
Pipelines:
Pipelines bring logs from Yahoo!'s web servers. These logs undergo a cleaning step where bots, company internal views, and clicks are removed.
Research:
Researchers want to quickly write a script to test a theory.
Pig integration with streaming makes it easy for researchers to take a Perl or Python script and run it against a huge data set.

Use Case in Healthcare:
Personal health information of a person in healthcare data  is confidential and is not supposed to expose to others. There is a need to mask this information.The data associated with healthcare is huge,so identifying the personal health information and removing it is crucial.
Problem Statement:
De-identify personal health information.
Challenges:
Huge amount of data flows into the systems daily and there are multiple data sources that we need to aggregate data from.
Crunching this huge data and de-identifying it in a traditional way had problems.


To de-identify health information,pig can be used.
Sqoop helps to export/import from relational database to HDFS and vice versa.
Take a database dump into HDFS using sqoop and then de-identify columns using pig script.Then store the de-identified data back into HDFS.
 

- See more at: http://labstrikes.blogspot.in/2012/08/adsense-middle-blog-post.html#sthash.gQgSkqx8.dpuf
 
Toggle Footer