In Spark, DataFrames are distributed collections of data organized into rows and columns. Each column in a DataFrame has a name and an associated type. DataFrames are similar to traditional database tables in that they are structured and concise. We can think of DataFrames as relational tables with better optimization techniques working under the hood.

Spark DataFrames can be created from various sources, such as Hive tables, log tables, external databases, or existing RDDs, and they allow the processing of huge amounts of data.

Why DataFrames?

When Apache Spark 1.3 was launched, it came with a new API called DataFrames, which addressed the performance and scaling limitations that arise when using RDDs.

When there is not enough storage space in memory or on disk, RDDs do not function well, as the available space gets exhausted. Besides, Spark RDDs do not have the concept of a schema (the structure that defines how the data is organized). RDDs hold structured and unstructured data together, which is not very efficient.

Spark cannot look inside RDD computations, so it cannot optimize them to run more efficiently. RDDs also make it hard to debug errors, since problems surface only at runtime. Finally, they store data as a collection of Java objects.

RDDs rely on serialization (converting an object into a stream of bytes so that it can be stored or transmitted) and garbage collection (an automatic memory-management technique that detects unused objects and frees them from memory). Both add considerable overhead to the system's memory because every record is kept as a full Java object.

Spark DataFrames were introduced to overcome these limitations of RDDs. Now, what makes Spark DataFrames so unique? Let’s check out the features of Spark DataFrames that make them so popular.


Features of DataFrames

Some of the unique features of DataFrames are:

  • Use of an Input Optimization Engine: DataFrames make use of an input optimization engine, the Catalyst optimizer, to process data efficiently. The same engine serves the Python, Java, Scala, and R DataFrame APIs (see the sketch after this list).
  • Handling of Structured Data: DataFrames provide a schematic view of data, so the data has a defined structure and meaning when it is stored.
  • Custom Memory Management: RDDs store data as objects on the Java heap, whereas DataFrames store data off-heap (outside the main Java heap space, but still inside RAM) in a compact binary format, which in turn reduces the garbage collection overhead.
  • Flexibility: DataFrames, like RDDs, can support various formats and sources of data, such as CSV files and Cassandra tables.
  • Scalability: DataFrames can be integrated with various other Big Data tools, and they allow processing anywhere from megabytes to petabytes of data at once.
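A minimal sketch of the Catalyst optimizer at work, in Scala (all names here are illustrative, not from the original article):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("CatalystSketch")
      .master("local[*]")   // local mode, for illustration only
      .getOrCreate()

    // Catalyst can see the column and the filter condition, so it can
    // optimize the query plan before any data is processed.
    val df = spark.range(1000).toDF("id")
    val even = df.filter("id % 2 = 0")
    even.explain()          // prints the physical plan chosen by Catalyst

The same plan would be produced if the equivalent query were written in Python, Java, or R, because all the DataFrame APIs funnel into this one engine.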

Wish to learn Apache Spark in detail? Read this extensive Spark Tutorial!

Creating DataFrames

There are many ways to create DataFrames. Here are two of the most commonly used methods to create DataFrames:

  • Creating DataFrames from JSON Files

Now, what are JSON files?

JSON, or JavaScript Object Notation, is a file format that stores simple data-structure objects with the .json extension. It is mainly used to transmit data between web servers and web applications.

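Here is what a simple .json file might look like (the employee data below is hypothetical):

    [
      {"id": 1, "name": "Alice", "dept": "Sales"},
      {"id": 2, "name": "Bob", "dept": "HR"}
    ]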
The above JSON is a simple employee database file that contains two records/rows.

When it comes to Spark, the .json files being loaded are not the typical .json files. By default, we cannot load a pretty-printed, multi-line JSON document into a DataFrame. The JSON file that we want to load should instead contain one complete JSON object per line (the JSON Lines format), as shown below:

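The same hypothetical employee data, rewritten with one JSON object per line:

    {"id": 1, "name": "Alice", "dept": "Sales"}
    {"id": 2, "name": "Bob", "dept": "HR"}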
JSON files can be loaded into DataFrames using the read.json function, passing in the path of the file we want to load.
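A minimal sketch in Scala, assuming an active SparkSession named spark and the hypothetical file from above:

    // employees.json is a hypothetical JSON Lines file like the one shown above
    val employees = spark.read.json("employees.json")
    employees.show()   // displays the rows; the schema is inferred automatically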

  • Example:

Here, we are loading an Olympic medal-count sheet into a DataFrame. There are 10 fields in total. The function printSchema() prints the schema of the DataFrame.

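A sketch of that example (the file name is hypothetical, and the original's ten field names are not reproduced here):

    // olympic_medals.json stands in for the medal-count sheet described above
    val olympics = spark.read.json("olympic_medals.json")
    olympics.printSchema()   // prints the inferred schema as a tree, one field per line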


  • Creating DataFrames from Existing RDDs

DataFrames can also be created from existing RDDs. First, we create an RDD and then convert it into a DataFrame using the createDataFrame(name_of_the_rdd) function.
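A minimal sketch in Scala, assuming an active SparkSession named spark (the Employee case class and its data are hypothetical):

    // Any RDD of case-class instances or tuples can be converted this way.
    case class Employee(id: Int, name: String)

    val rdd = spark.sparkContext.parallelize(
      Seq(Employee(1, "Alice"), Employee(2, "Bob"))
    )

    val df = spark.createDataFrame(rdd)   // column names come from the case-class fields
    df.show()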
