Spark SQL - Inferring the Schema using Reflection - Spark SQL Programming

What is Spark SQL - Inferring the Schema using Reflection?

This method uses reflection to generate the schema of an RDD that contains specific types of objects. The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection and become the names of the columns.

Case classes can also be nested or contain complex types such as Sequences or Arrays. Such an RDD can be implicitly converted to a DataFrame and then registered as a table. Tables can be used in subsequent SQL statements.


Let us consider an example of employee records in a text file named employee.txt. Create an RDD by reading the data from the text file and convert it into a DataFrame using the default SQL functions.
Given data: the file employee.txt holds one employee record per line, with comma-separated id, name, and age fields. Place it in the current directory, the one from which the Spark shell is started.
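The contents of employee.txt are not reproduced here, so the sketch below writes a small sample file to work with. The five records are invented for illustration; only the comma-separated id, name, age layout comes from the text. It uses plain Scala and the JDK, so it can be pasted into the Spark shell or any Scala REPL.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Hypothetical sample records in "id, name, age" form. The values are
// invented for illustration and are not the original data set.
val sampleRecords = Seq(
  "1201, alice, 25",
  "1202, bob, 28",
  "1203, carol, 39",
  "1204, dave, 23",
  "1205, erin, 17"
)

// Write one record per line into the current working directory.
val employeeFile = Paths.get("employee.txt")
Files.write(employeeFile, sampleRecords.mkString("\n").getBytes(StandardCharsets.UTF_8))
```

Two of the sample ages fall outside the 20 to 35 range so that a filtering query has something to exclude.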
The following steps show how to generate a schema using reflection.

Start the Spark Shell

Start the Spark shell by running the spark-shell command from a terminal.

Create SQLContext

Create an SQLContext from the SparkContext object, which the Spark shell exposes as sc.
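A minimal sketch of this step, assuming a spark-shell session in which sc is already defined. The variable name sqlContext is an assumption. In Spark 2.0 and later, SQLContext is kept for backward compatibility and SparkSession is the preferred entry point.

```scala
// Assumes a spark-shell session, where `sc` (the SparkContext) is predefined.
// `sqlContext` is a name chosen here for illustration.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
```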

Import SQL Functions

Import all the SQL functions that are used to implicitly convert an RDD to a DataFrame.
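In the Scala API these conversions live in the implicits member of the SQLContext instance. The sketch below assumes the context is held in a value named sqlContext, a name chosen for illustration.

```scala
// Assumes `sqlContext` is an SQLContext value in the current spark-shell session.
// The import needs a stable identifier, so `sqlContext` must be a val, not a var.
import sqlContext.implicits._
```

After this import, RDDs and local sequences of case classes or tuples gain a toDF() method.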

Create Case Class

Next, define a schema for the employee record data by means of a case class. The case class is declared to match the fields of the given data (id, name, age).
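A case class matching those three fields might look like the following. The class name Employee and the field types (Int, String, Int) are assumptions based on the described data.

```scala
// One constructor argument per field; reflection turns the argument
// names (id, name, age) into the DataFrame column names.
case class Employee(id: Int, name: String, age: Int)
```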

Create RDD and Apply Transformations

Generate an RDD named empl by reading the data from employee.txt, and convert it into a DataFrame using map functions.
Two map functions are involved. The first splits each text record into fields (.map(_.split(","))). The second converts the individual fields (id, name, age) into one case class object (.map(e => Employee(e(0).trim.toInt, e(1), e(2).trim.toInt)), where Employee stands for the case class declared in the previous step).
Finally, the toDF() method converts the RDD of case class objects, together with its schema, into a DataFrame.
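Put together, the chain described above can be sketched as follows. It assumes a spark-shell session in which sc, a case class named Employee(id: Int, name: String, age: Int), and the SQLContext implicits import are already in scope. Trimming the name field is a small addition here so that names do not keep the space that follows each comma.

```scala
// Assumes: `sc` from spark-shell, the Employee case class, and
// `import sqlContext.implicits._` are already in scope.
val empl = sc.textFile("employee.txt")
  .map(_.split(","))                                                // split each line into fields
  .map(e => Employee(e(0).trim.toInt, e(1).trim, e(2).trim.toInt))  // build one Employee per record
  .toDF()                                                           // schema is inferred by reflection

// Print the inferred column names and types.
empl.printSchema()
```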


Store the DataFrame Data in a Table

Register the DataFrame as a table named employee. Once it is registered, all types of SQL statements can be applied to it.
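With the SQLContext-era API used in this walkthrough, registration is a single call. The sketch assumes empl is the DataFrame built in the previous step.

```scala
// Assumes `empl` is the employee DataFrame from the previous step.
// registerTempTable is the Spark 1.x call; since Spark 2.0 it is deprecated
// in favor of createOrReplaceTempView, which takes the same table name.
empl.registerTempTable("employee")
```

The table is temporary: it lives only as long as the session that registered it.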
The employee table is ready. Let us now run some SQL queries on the table using the SQLContext.sql() method.

Select Query on DataFrame

Select all the records from the employee table. Here, the variable allrecords holds all the records.
To see the result data of the allrecords DataFrame, call the show() method on it.
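Both steps can be sketched as below, assuming sqlContext is the SQLContext in use and a table named employee has been registered in it.

```scala
// Assumes `sqlContext` exists and the "employee" table is registered.
val allrecords = sqlContext.sql("SELECT * FROM employee")

// Print the rows as a formatted text table.
allrecords.show()
```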


Where Clause SQL Query on DataFrame

Apply a where clause to the table. Here, the variable agefilter stores the records of employees whose age is between 20 and 35.
To see the result data of the agefilter DataFrame, call the show() method on it.
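A sketch of the filtered query, under the same assumptions (sqlContext in scope, employee table registered). Treating both bounds as inclusive is an assumption; the text only says "between 20 and 35".

```scala
// Assumes `sqlContext` exists and the "employee" table is registered.
// Both bounds are treated as inclusive here (an assumption).
val agefilter = sqlContext.sql("SELECT * FROM employee WHERE age >= 20 AND age <= 35")

// Print only the rows that passed the filter.
agefilter.show()
```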


The previous two queries were run against the whole table DataFrame. Now let us fetch data from the result DataFrame by applying transformations to it.

Fetch ID values from agefilter DataFrame using column index

The ID values can be fetched from the agefilter result using the field index: each row exposes its fields by position, so index 0 refers to the id column.
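One way to write this, assuming agefilter is the DataFrame returned by the where-clause query. Going through .rdd is a choice made here so that the same line works both on Spark 1.x DataFrames and on later Dataset-based versions; the intermediate name ids is also an assumption.

```scala
// Assumes `agefilter` is the DataFrame from the where-clause query.
// row(0) reads the first field of each Row, which is the id column.
val ids = agefilter.rdd.map(row => "ID: " + row(0)).collect()

// Print one "ID: <value>" line per matching employee.
ids.foreach(println)
```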


This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.

All rights reserved © 2020 Wisdom IT Services India Pvt. Ltd
