Breaking News
Loading...

Bucketing in Hive





Buckets(or Clusters):
Tables/Partitions can be further subdivided into Clusters or Buckets.
Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table.
 For example the page_views table may be bucketed by userid, which is one of the columns, other than the partitions columns, of the page_view table.
Bucketing is helpful for 2 reasons
•Enables more efficient queries
•Makes sampling more efficient
•Hash(column) MOD(number of buckets) –evenly distributed

Code:
CREATE TABLE students_bucket(name STRING,id INT,college STRING) PARTITIONED BY(country STRING) CLUSTERED BY (college) INTO 4 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\t’ ;

set hive.enforce.bucketing=true;

INSERTING
INSERT OVERWRITE TABLE employees_bucketPARTITION(country=‘USA’) SELECT name,id,college from college.students_external;

SAMPLING
SELECT * FROM students_bucket TABLESAMPLE(BUCKET 2 OUT OF 4);
- See more at: http://labstrikes.blogspot.in/2012/08/adsense-middle-blog-post.html#sthash.gQgSkqx8.dpuf
 
Toggle Footer