How Yahoo uses PIG?
Pig is best suited for the data factory.
Data Factory contains:
Pipelines bring logs from Yahoo!'s web servers. These logs undergo a cleaning step where bots, company internal views, and clicks are removed.
Researchers want to quickly write a script to test a theory.
Pig integration with streaming makes it easy for researchers to take a Perl or Python script and run it against a huge data set.
Use Case in Healthcare:
Personal health information of a person in healthcare data is confidential and is not supposed to expose to others. There is a need to mask this information.The data associated with healthcare is huge,so identifying the personal health information and removing it is crucial.
De-identify personal health information.
Huge amount of data flows into the systems daily and there are multiple data sources that we need to aggregate data from.
Crunching this huge data and de-identifying it in a traditional way had problems.
To de-identify health information,pig can be used.
Sqoop helps to export/import from relational database to HDFS and vice versa.
Take a database dump into HDFS using sqoop and then de-identify columns using pig script.Then store the de-identified data back into HDFS.