How
Yahoo uses PIG?
Pig
is best suited for the data factory.
Data Factory contains:
Pipelines:
Pipelines
bring logs from Yahoo!'s web servers. These logs undergo a cleaning step where
bots, company internal views, and clicks are removed.
Research:
Researchers
want to quickly write a script to test a theory.
Pig
integration with streaming makes it easy for researchers to take a Perl or
Python script and run it against a huge data set.
Use
Case in Healthcare:
Personal
health information of a person in healthcare data is confidential and is not supposed to expose
to others. There is a need to mask this information.The data associated with
healthcare is huge,so identifying the personal health information and removing
it is crucial.
Problem Statement:
De-identify
personal health information.
Challenges:
Huge
amount of data flows into the systems daily and there are multiple data sources
that we need to aggregate data from.
Crunching
this huge data and de-identifying it in a traditional way had problems.
To
de-identify health information,pig can be used.
Sqoop
helps to export/import from relational database to HDFS and vice versa.
Take
a database dump into HDFS using sqoop and then de-identify columns using pig
script.Then store the de-identified data back into HDFS.