PySpark DataFrame | Python Spark DataFrame with Examples

  1. What is a PySpark DataFrame?
  2. How is a PySpark DataFrame evaluated?
  3. What is the entry point for PySpark?
  4. How are DataFrames created?
  5. Can we specify the schema of the data while creating the DataFrame?
  6. Practical examples of creating a DataFrame from different data sources.
    1. Example: converting a list of numbers into a PySpark DataFrame.
    2. Example: converting tuples holding employee details into a PySpark DataFrame.
    3. Example: converting a dictionary of countries and their capitals into a PySpark DataFrame.
    4. Example: creating ‘pyspark.sql.Row’ objects of numbers and their squares and converting them into a PySpark DataFrame.
    5. Example: creating a pandas DataFrame and then converting it into a PySpark DataFrame.
    6. Example: converting a list of numbers into a PySpark DataFrame by using a PySpark RDD.

1. What is a PySpark DataFrame?

A PySpark DataFrame is equivalent to a table in a relational database and to a data frame in Python (pandas) or R, but it comes with greater capabilities, such as distributed processing. We can also say that a DataFrame is a dataset organized into named columns. You can create a PySpark DataFrame from different sources: a Python list, tuple, or dictionary; a relational database table; or an existing RDD.
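
As a minimal sketch of what "named columns" means here (the column names and data below are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#Each row is a tuple; the second argument names the columns
df = spark.createDataFrame([(1,'India'),(2,'Japan')],['id','country'])
df.printSchema()   #prints the named columns: id (long) and country (string)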

2. How is a PySpark DataFrame evaluated?

A PySpark DataFrame is lazily evaluated. Spark's execution model splits operations into transformations, which describe how a dataset should be derived, and actions, which actually produce a result from that series of transformations. Lazy evaluation means that none of the transformations are executed until you call an action that needs their results.
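
As a minimal sketch of this behaviour (the column name and filter condition are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(i,) for i in range(10)],['number'])

#Transformation: returns immediately, nothing is executed yet
filtered = df.filter(df.number > 5)

#Action: only now does Spark run the filter and produce a result
print(filtered.count())   #prints 4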

3. What is the entry point for PySpark?

PySpark applications start by initializing a SparkSession, which is the entry point of PySpark. When you run the PySpark shell, the shell automatically creates a session for you in the variable spark.
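
Outside the shell, for example in a standalone script, you create the session yourself. A minimal sketch (the application name is just an example):

from pyspark.sql import SparkSession

#getOrCreate() returns the existing session if one is already running,
#otherwise it creates a new one
spark = SparkSession.builder \
    .appName('example-app') \
    .getOrCreate()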

4. How are DataFrames created?

A PySpark DataFrame can be created with the ‘createDataFrame’ method (full name: ‘pyspark.sql.SparkSession.createDataFrame’). This method accepts any of the following objects as its data argument:

  • Lists
  • Tuples
  • Dictionaries
  • pyspark.sql.Row objects
  • A pandas DataFrame
  • A PySpark RDD

5. Can we specify the schema of the data while creating the DataFrame?

Yes, we can pass the schema argument during DataFrame creation. If we skip it, PySpark automatically infers the schema by sampling the data.
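
A minimal sketch of both options (the column names and data are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()
data = [(1,'one'),(2,'two')]

#Schema inferred by sampling the data (column names default to _1, _2)
df_inferred = spark.createDataFrame(data)

#Schema specified explicitly: names, types, and nullability are under our control
schema = StructType([StructField('id',IntegerType(),False),
                     StructField('word',StringType(),True)])
df_explicit = spark.createDataFrame(data,schema)
df_explicit.printSchema()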

6. Practical examples of creating a DataFrame from different data sources.

  1. Example of creating a PySpark DataFrame from a Python list.
  2. Example of creating a PySpark DataFrame from Python tuples.
  3. Example of creating a PySpark DataFrame from a Python dictionary.
  4. Example of creating a PySpark DataFrame from ‘pyspark.sql.Row’ objects.
  5. Example of creating a PySpark DataFrame from a pandas DataFrame.
  6. Example of creating a PySpark DataFrame from a PySpark RDD.

6.1. Example of creating a PySpark DataFrame from a Python list.

In the example below, we convert a list of numbers into a PySpark DataFrame.

#Import the packages
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

#Below we are creating the spark session
spark = SparkSession.builder.getOrCreate()

#Defining the python list to be converted into pyspark dataframe
var_list = [[1],[2],[3],[4],[5],[6]]
print("list value are : ",var_list)

#Below we are converting the list into pyspark rdd using parallelize method 
rdd = sc.parallelize(var_list)

#Defining the schema of data
schema = StructType([StructField('number',IntegerType(),True)])
rdd.collect()

#Finally creating the dataframe using python list and the data schema
df = spark.createDataFrame(rdd,schema)
print("Below are the values from dataframe")
print(df.head(10)

6.1. OUTPUT:

list values are :  [[1], [2], [3], [4], [5], [6]]
Below are the values from dataframe
[Row(number=1), Row(number=2), Row(number=3), Row(number=4), Row(number=5), Row(number=6)]

6.2. Example of creating a PySpark DataFrame from Python tuples.

In the example below, we convert tuples holding employee details into a PySpark DataFrame.

#Import the package
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

#Below we are creating the spark session
spark = SparkSession.builder.getOrCreate()

#Defining the python tuples to be converted into pyspark dataframe
var_tuple = [(101,'Prem','delhi'),(102,'Anshuman','Noida'),(103,'Sarvesh','New AN')]
print("Tuple value")
for t in var_tuple:
    print(t)

#Below we are converting the tuples into a pyspark rdd using the parallelize method
rdd = spark.sparkContext.parallelize(var_tuple)

#Defining the schema of the data
data_schema = StructType([StructField('emp_id',IntegerType(),False),
                          StructField('emp_name',StringType(),True),
                          StructField('emp_city',StringType(),True)])

print("\n")
#Finally creating the dataframe using the tuples and the data schema
df = spark.createDataFrame(rdd,data_schema)
print("Below are the values from dataframe")
print(df.head(10))

6.2. OUTPUT:

Tuple value
(101, 'Prem', 'delhi')
(102, 'Anshuman', 'Noida')
(103, 'Sarvesh', 'New AN')

Below are the values from dataframe
[Row(emp_id=101, emp_name='Prem', emp_city='delhi'), Row(emp_id=102, emp_name='Anshuman', emp_city='Noida'), Row(emp_id=103, emp_name='Sarvesh', emp_city='New AN')]

6.3. Example of creating a PySpark DataFrame from a Python dictionary.

In the example below, we convert a list of Python dictionaries containing countries and their capitals into a PySpark DataFrame.

#Import the package
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

#Below we are creating the spark session
spark = SparkSession.builder.getOrCreate()

#Defining the list of python dictionaries to be converted into pyspark dataframe
var_dictionary = [{"Country":"India","Capital":"New Delhi"},
                  {"Country":"Japan","Capital":"Tokyo"},
                  {"Country":"Russia","Capital":"Moscow"},
                  {"Country":"United States","Capital":"Washington D.C."},
                  {"Country":"United Kingdom","Capital":"London"}]
print("Dictionary value")
for d in var_dictionary:
    print(d)

#Below we are converting the dictionaries into a pyspark dataframe;
#the schema (column names) is inferred from the dictionary keys
df = spark.createDataFrame(var_dictionary)
print("\n")
print("Below are the values from dataframe")
print(df.head(10))

6.3. OUTPUT:

Dictionary value
{'Country': 'India', 'Capital': 'New Delhi'}
{'Country': 'Japan', 'Capital': 'Tokyo'}
{'Country': 'Russia', 'Capital': 'Moscow'}
{'Country': 'United States', 'Capital': 'Washington D.C.'}
{'Country': 'United Kingdom', 'Capital': 'London'}

Below are the values from dataframe
[Row(Capital='New Delhi', Country='India'), Row(Capital='Tokyo', Country='Japan'), Row(Capital='Moscow', Country='Russia'), Row(Capital='Washington D.C.', Country='United States'), Row(Capital='London', Country='United Kingdom')]

6.4. Example of creating a PySpark DataFrame from ‘pyspark.sql.Row’ objects.

In the example below, we create ‘pyspark.sql.Row’ objects holding numbers and their squares and then convert them into a PySpark DataFrame.

#Import the package
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import Row

#Building a list of Row objects, one per number and its square
list_of_row = []
for i in range(2,21):
    list_of_row.append(Row(num=i,square=i*i))
    print("num:",i,"square:",i*i)

#Below we are creating the spark session
spark = SparkSession.builder.getOrCreate()

#Below we are creating the pyspark dataframe
df = spark.createDataFrame(list_of_row)
print("\n")
print("Below are the values from dataframe")
print(df.head(10))

6.4. OUTPUT:

num: 2 square: 4
num: 3 square: 9
num: 4 square: 16
num: 5 square: 25
num: 6 square: 36
num: 7 square: 49
num: 8 square: 64
num: 9 square: 81
num: 10 square: 100
num: 11 square: 121
num: 12 square: 144
num: 13 square: 169
num: 14 square: 196
num: 15 square: 225
num: 16 square: 256
num: 17 square: 289
num: 18 square: 324
num: 19 square: 361
num: 20 square: 400

Below are the values from dataframe
[Row(num=2, square=4), Row(num=3, square=9), Row(num=4, square=16), Row(num=5, square=25), Row(num=6, square=36), Row(num=7, square=49), Row(num=8, square=64), Row(num=9, square=81), Row(num=10, square=100), Row(num=11, square=121)]

6.5. Example of creating a PySpark DataFrame from a pandas DataFrame.

In the example below, we create a pandas DataFrame and then convert it into a PySpark DataFrame.

#Import the package
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import Row

import pandas as pd    
list_emp = [['Prem', 30], ['Akhilesh', 35], ['Murari', 32],['Sanjay',36]] 
# Create the pandas DataFrame 
pandas_df = pd.DataFrame(list_emp, columns = ['Name', 'Age']) 

print(pandas_df)

#Below we are creating the spark session
spark = SparkSession.builder.getOrCreate()

#Below we are creating the pyspark dataframe
df = spark.createDataFrame(pandas_df)
print("\n")
print("Below are the values from dataframe")
print(df.head(10))

6.5. OUTPUT:

       Name  Age
0      Prem   30
1  Akhilesh   35
2    Murari   32
3    Sanjay   36

Below are the values from dataframe
[Row(Name='Prem', Age=30), Row(Name='Akhilesh', Age=35), Row(Name='Murari', Age=32), Row(Name='Sanjay', Age=36)]

6.6. Example of creating a PySpark DataFrame from a PySpark RDD.

In the example below, we convert a list of numbers into a PySpark DataFrame by first creating a PySpark RDD.

#Import the package
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

#Below we are creating the spark session
spark = SparkSession.builder.getOrCreate()

#Defining the python list to be converted into pyspark dataframe
var_list = [[1],[2],[3],[4],[5],[6]]
print("list value are : ",var_list)

#Below we are converting the list into pyspark rdd using parallelize method 
rdd = sc.parallelize(var_list)

#Defining the schema of data
schema = StructType([StructField('number',IntegerType(),True)])
rdd.collect()

#Finally creating the dataframe using python list and the data schema via RDD
df = spark.createDataFrame(rdd,schema)
print("Below are the values from dataframe")
print(df.head(6))

6.6. OUTPUT:

list values are :  [[1], [2], [3], [4], [5], [6]]
Below are the values from dataframe
[Row(number=1), Row(number=2), Row(number=3), Row(number=4), Row(number=5), Row(number=6)]

Please see the official PySpark user guide for more details.
