Apache Pig Group Operator

What is the role of the GROUP operator in Apache Pig?

The GROUP operator is used to group the data in one or more relations. It collects the tuples that share the same key into a single bag.


In its basic form, the GROUP operator takes a relation and the key to group it by.
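As a rough sketch of the general form: the names grouped, some_relation and some_field below are placeholders chosen for illustration, not names from this tutorial.

```pig
-- General shape of a GROUP statement:
--   alias = GROUP relation BY expression;
-- grouped, some_relation and some_field are hypothetical placeholders.
grouped = GROUP some_relation BY some_field;
```

The result is a new relation with one tuple per distinct key value.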


Assume that we have a file named student_details.txt in the HDFS directory /pig_data/.
We have loaded this file into Apache Pig under the relation name student_details.
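A load statement for this file might look like the following sketch. The comma delimiter and the field list are assumptions: the text only names the age and city columns, so the remaining fields (id, firstname, lastname, phone) are hypothetical and should be adjusted to match the actual file.

```pig
-- Load the file into the relation student_details.
-- Assumption: comma-separated fields; only age and city are named in the text,
-- the other field names and types are illustrative.
student_details = LOAD '/pig_data/student_details.txt'
    USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray,
        age:int, phone:chararray, city:chararray);
```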
Now, let us group the tuples in the relation by age and store the result in a relation named group_data.
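Assuming student_details has been loaded with an age field, the grouping step can be written as:

```pig
-- Group the tuples of student_details by the age field.
group_data = GROUP student_details BY age;
```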


Verify the relation group_data using the DUMP operator.
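From the Grunt shell or a script, the check is a single statement. It triggers execution of the statements that define group_data and prints the result to the console.

```pig
-- Execute the pipeline and print every tuple of group_data.
DUMP group_data;
```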


The output displays the contents of the relation group_data. The resulting schema has two fields:
  • The first, named group, holds the age by which we have grouped the relation.
  • The second is a bag, named student_details after the original relation, which contains the tuples (student records) with that age.
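To illustrate how the two fields are used together, here is a small sketch that counts the students in each age group. The relation name age_counts and the alias num_students are hypothetical names introduced for this example.

```pig
-- For each group, emit the key (the age) and the size of its bag.
-- age_counts and num_students are illustrative names.
age_counts = FOREACH group_data
    GENERATE group AS age, COUNT(student_details) AS num_students;
DUMP age_counts;
```

Note that the key is always referenced as group, regardless of which field the relation was grouped by.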
To see the schema of the relation after grouping, use the DESCRIBE operator.
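For the relation group_data this is:

```pig
-- Print the schema of group_data: a field named group (the key)
-- followed by a bag named student_details.
DESCRIBE group_data;
```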
To see, on a small sample of the data, how each statement transforms the tuples on the way to group_data, use the ILLUSTRATE operator.
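A minimal invocation, assuming group_data is defined as above:

```pig
-- Show sample tuples at each step that leads to group_data.
ILLUSTRATE group_data;
```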

Grouping by Multiple Columns

To group by more than one column, list the keys in parentheses. Let us group the relation student_details by age and city.
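Assuming student_details has both an age and a city field, the statement can be sketched as:

```pig
-- Group by the composite key (age, city).
group_multiple = GROUP student_details BY (age, city);
```

With a composite key, the group field of each result tuple is itself a tuple of the form (age, city).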
You can verify the content of the relation named group_multiple using the DUMP operator.
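The check follows the same pattern as before. The second statement is an optional sketch, using the hypothetical names city_age_counts and num_students, that unpacks the composite key into separate columns.

```pig
-- Print the grouped relation.
DUMP group_multiple;

-- Optional: flatten the (age, city) key and count the tuples in each group.
-- city_age_counts and num_students are illustrative names.
city_age_counts = FOREACH group_multiple
    GENERATE FLATTEN(group) AS (age, city),
             COUNT(student_details) AS num_students;
```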

Group All

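You can collect all the tuples of a relation into a single group with the ALL keyword. This is not the same as grouping by every column: no key is specified, and every tuple ends up in one bag.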
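For the relation student_details, this can be written as:

```pig
-- Put every tuple of student_details into one group.
-- Note: ALL is used without the BY keyword.
group_all = GROUP student_details ALL;
```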
Now, verify the content of the relation group_all using the DUMP operator.
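The result is a single tuple whose group field is the string all and whose bag holds every student record. A common use of this form is computing an aggregate over the whole relation; the names total and total_students in the sketch below are hypothetical.

```pig
-- Print the single group containing all tuples.
DUMP group_all;

-- Example use: count all records in the relation.
-- total and total_students are illustrative names.
total = FOREACH group_all
    GENERATE COUNT(student_details) AS total_students;
DUMP total;
```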

All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd
