Tables/Partitions can be further subdivided into Clusters or Buckets.
Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table.
For example the page_views table may be bucketed by userid, which is one of the columns, other than the partitions columns, of the page_view table.
Bucketing is helpful for 2 reasons
•Enables more efficient queries
•Makes sampling more efficient
•Hash(column) MOD(number of buckets) –evenly distributed
CREATE TABLE students_bucket(name STRING,id INT,college STRING) PARTITIONED BY(country STRING) CLUSTERED BY (college) INTO 4 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\t’ ;
INSERT OVERWRITE TABLE employees_bucketPARTITION(country=‘USA’) SELECT name,id,college from college.students_external;
SELECT * FROM students_bucket TABLESAMPLE(BUCKET 2 OUT OF 4);