- What is a PySpark DataFrame?
- How is a PySpark DataFrame evaluated?
- What is the entry point for PySpark?
- How are DataFrames created?
- Can we specify the schema of the data while creating a DataFrame?
- Practical examples of how to create a DataFrame from different data sources.
- Example: converting a list of numbers into a PySpark DataFrame.
- Example: converting tuples holding employee details into a PySpark DataFrame.
- Example: converting a dictionary of countries and their capitals into a PySpark DataFrame.
- Example: creating ‘pyspark.sql.Row’ objects of numbers and their squares and converting them into a PySpark DataFrame.
- Example: creating a pandas DataFrame and then converting it into a PySpark DataFrame.
- Example: converting a list of numbers into a PySpark DataFrame by using a PySpark RDD.
1. What is a PySpark DataFrame?
A PySpark DataFrame is equivalent to a table in a relational database and to a data frame in Python (pandas) or R, but it comes with richer capabilities. We can also say that a DataFrame is a dataset organized into named columns. You can create a PySpark DataFrame from different sources, such as a Python list, tuple, or dictionary, a relational database table, or an existing RDD.
2. How is a PySpark DataFrame evaluated?
A PySpark DataFrame is lazily evaluated. Lazy evaluation means that the execution of transformations isn't triggered until you call an action that produces a result. In Spark, operations are broken up into transformations, which are applied to datasets, and actions, which derive and produce a result from that series of transformations.
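A minimal sketch of lazy evaluation, assuming a SparkSession is already available in the variable spark (as in the examples later in this article):

#The filter below is a transformation, so Spark only records it in the execution plan
df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["number"])
even_df = df.filter(df.number % 2 == 0)   #nothing is executed yet

#count() is an action, so only this line triggers the actual computation
print(even_df.count())   #prints 2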
3. What is the entry point for PySpark?
PySpark applications start by initializing a SparkSession, which is the entry point of PySpark. When you run the PySpark shell, the shell automatically creates the session in the variable spark.
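Outside the shell, for example in a standalone script, you create the session yourself. A minimal sketch (the application name is only an illustrative choice):

#Import the SparkSession entry point
from pyspark.sql import SparkSession

#Create a new session or reuse an existing one
spark = SparkSession.builder.appName("example-app").getOrCreate()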
4. How are DataFrames created?
A PySpark DataFrame can be created with the ‘createDataFrame’ method, whose full name is ‘pyspark.sql.SparkSession.createDataFrame‘. This method accepts the following objects as an argument to create the DataFrame:
- Lists
- Tuples
- Dictionaries
- pyspark.sql.Row objects
- A pandas DataFrame
- A PySpark RDD
5. Can we specify the schema of the data while creating a DataFrame?
Yes, we can pass the schema argument during DataFrame creation. If we skip the schema argument, PySpark automatically infers the schema from a sample of the data.
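A minimal sketch of both options, assuming a SparkSession in the variable spark (the column names and sample data are only illustrative):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

data = [(1, 'Prem'), (2, 'Anshuman')]

#Option 1: explicit schema, we fix the column names, types and nullability
schema = StructType([
    StructField('id', IntegerType(), False),
    StructField('name', StringType(), True),
])
spark.createDataFrame(data, schema).printSchema()

#Option 2: only column names are given, PySpark infers the types from the sample data
spark.createDataFrame(data, ['id', 'name']).printSchema()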
6. Practical examples of how to create a DataFrame from different data sources.
- Example of creating a PySpark DataFrame from a Python list.
- Example of creating a PySpark DataFrame from Python tuples.
- Example of creating a PySpark DataFrame from a Python dictionary.
- Example of creating a PySpark DataFrame from ‘pyspark.sql.Row‘ objects.
- Example of creating a PySpark DataFrame from a pandas DataFrame.
- Example of creating a PySpark DataFrame from a PySpark RDD.
6.1. Example of creating a PySpark DataFrame from a Python list.
In the example below, we convert a list of numbers into a PySpark DataFrame.
#Import the packages
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

#Below we are creating the spark session
spark = SparkSession.builder.getOrCreate()

#Defining the python list to be converted into a pyspark dataframe
var_list = [[1],[2],[3],[4],[5],[6]]
print("list value are : ", var_list)

#Below we are converting the list into a pyspark rdd using the parallelize method
rdd = spark.sparkContext.parallelize(var_list)

#Defining the schema of the data
schema = StructType([StructField('number', IntegerType(), True)])

#Finally creating the dataframe using the python list and the data schema
df = spark.createDataFrame(rdd, schema)
print("Below are the values from dataframe")
print(df.head(10))
6.1. OUTPUT:
list value are : [[1], [2], [3], [4], [5], [6]]
Below are the values from dataframe
[Row(number=1), Row(number=2), Row(number=3), Row(number=4), Row(number=5), Row(number=6)]
6.2. Example of creating a PySpark DataFrame from Python tuples.
In the example below, we convert tuples holding employee details into a PySpark DataFrame.
#Import the packages
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

#Below we are creating the spark session
spark = SparkSession.builder.getOrCreate()

#Defining the python tuples to be converted into a pyspark dataframe
var_tuple = [(101,'Prem','delhi'),(102,'Anshuman','Noida'),(103,'Sarvesh','New AN')]
print("Tuple value")
for t in var_tuple:
    print(t)

#Below we are converting the tuples into a pyspark rdd using the parallelize method
rdd = spark.sparkContext.parallelize(var_tuple)

#Defining the schema of the data
data_schema = StructType([
    StructField('emp_id', IntegerType(), False),
    StructField('emp_name', StringType(), True),
    StructField('emp_city', StringType(), True),
])

print("\n")

#Finally creating the dataframe using the python tuples and the data schema
df = spark.createDataFrame(rdd, data_schema)
print("Below are the values from dataframe")
print(df.head(10))
6.2. OUTPUT:
Tuple value
(101, 'Prem', 'delhi')
(102, 'Anshuman', 'Noida')
(103, 'Sarvesh', 'New AN')

Below are the values from dataframe
[Row(emp_id=101, emp_name='Prem', emp_city='delhi'), Row(emp_id=102, emp_name='Anshuman', emp_city='Noida'), Row(emp_id=103, emp_name='Sarvesh', emp_city='New AN')]
6.3. Example of creating a PySpark DataFrame from a Python dictionary.
In the example below, we convert a list of Python dictionaries containing countries and their capitals into a PySpark DataFrame.
#Import the packages
from pyspark.sql import SparkSession

#Below we are creating the spark session
spark = SparkSession.builder.getOrCreate()

#Defining the python dictionaries to be converted into a pyspark dataframe
var_dictionary = [
    {"Country":"India","Capital":"New Delhi"},
    {"Country":"Japan","Capital":"Tokyo"},
    {"Country":"Russia","Capital":"Moscow"},
    {"Country":"United States","Capital":"Washington D.C."},
    {"Country":"United Kingdom","Capital":"London"},
]
print("Dictionary value")
for d in var_dictionary:
    print(d)

#Below we are converting the dictionaries into a pyspark dataframe
df = spark.sparkContext.parallelize(var_dictionary).toDF()
print("\n")
print("Below are the values from dataframe")
print(df.head(10))
6.3. OUTPUT:
Dictionary value
{'Country': 'India', 'Capital': 'New Delhi'}
{'Country': 'Japan', 'Capital': 'Tokyo'}
{'Country': 'Russia', 'Capital': 'Moscow'}
{'Country': 'United States', 'Capital': 'Washington D.C.'}
{'Country': 'United Kingdom', 'Capital': 'London'}

Below are the values from dataframe
[Row(Capital='New Delhi', Country='India'), Row(Capital='Tokyo', Country='Japan'), Row(Capital='Moscow', Country='Russia'), Row(Capital='Washington D.C.', Country='United States'), Row(Capital='London', Country='United Kingdom')]
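Note that, depending on the Spark version, inferring the schema from dictionaries inside an RDD may emit a deprecation warning. A minimal alternative sketch, assuming the same session and the var_dictionary list from the example above, is to pass the list of dictionaries directly to createDataFrame:

#Assumes the spark session and var_dictionary defined in example 6.3
df = spark.createDataFrame(var_dictionary)
print(df.head(10))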
6.4. Example of creating a PySpark DataFrame from ‘pyspark.sql.Row’ objects.
In the example below, we create ‘pyspark.sql.Row’ objects holding numbers and their squares and then convert them into a PySpark DataFrame.
#Import the packages
from pyspark.sql import SparkSession
from pyspark.sql import Row

#Building the list of Row objects, each holding a number and its square
list_of_row = []
for i in range(2, 21):
    list_of_row = list_of_row + [Row(num=i, square=i*i)]
    print("num:", i, "square:", i*i)

#Below we are creating the spark session
spark = SparkSession.builder.getOrCreate()

#Below we are creating the pyspark dataframe from the Row objects
df = spark.createDataFrame(list_of_row)
print("\n")
print("Below are the values from dataframe")
print(df.head(10))
6.4. OUTPUT:
num: 2 square: 4
num: 3 square: 9
num: 4 square: 16
num: 5 square: 25
num: 6 square: 36
num: 7 square: 49
num: 8 square: 64
num: 9 square: 81
num: 10 square: 100
num: 11 square: 121
num: 12 square: 144
num: 13 square: 169
num: 14 square: 196
num: 15 square: 225
num: 16 square: 256
num: 17 square: 289
num: 18 square: 324
num: 19 square: 361
num: 20 square: 400

Below are the values from dataframe
[Row(num=2, square=4), Row(num=3, square=9), Row(num=4, square=16), Row(num=5, square=25), Row(num=6, square=36), Row(num=7, square=49), Row(num=8, square=64), Row(num=9, square=81), Row(num=10, square=100), Row(num=11, square=121)]
6.5. Example of creating a PySpark DataFrame from a pandas DataFrame.
In the example below, we create a pandas DataFrame and then convert it into a PySpark DataFrame.
#Import the packages
from pyspark.sql import SparkSession
import pandas as pd

#Employee data to load into the pandas dataframe
list_emp = [['Prem', 30], ['Akhilesh', 35], ['Murari', 32], ['Sanjay', 36]]

#Create the pandas dataframe
pandas_df = pd.DataFrame(list_emp, columns=['Name', 'Age'])
print(pandas_df)

#Below we are creating the spark session
spark = SparkSession.builder.getOrCreate()

#Below we are converting the pandas dataframe into a pyspark dataframe
df = spark.createDataFrame(pandas_df)
print("\n")
print("Below are the values from dataframe")
print(df.head(10))
6.5. OUTPUT:
       Name  Age
0      Prem   30
1  Akhilesh   35
2    Murari   32
3    Sanjay   36

Below are the values from dataframe
[Row(Name='Prem', Age=30), Row(Name='Akhilesh', Age=35), Row(Name='Murari', Age=32), Row(Name='Sanjay', Age=36)]
6.6. Example of creating a PySpark DataFrame from a PySpark RDD.
In the example below, we convert a list of numbers into a PySpark DataFrame by first loading it into a PySpark RDD.
#Import the packages
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

#Below we are creating the spark session
spark = SparkSession.builder.getOrCreate()

#Defining the python list to be converted into a pyspark dataframe
var_list = [[1],[2],[3],[4],[5],[6]]
print("list value are : ", var_list)

#Below we are converting the list into a pyspark rdd using the parallelize method
rdd = spark.sparkContext.parallelize(var_list)

#Defining the schema of the data
schema = StructType([StructField('number', IntegerType(), True)])

#Finally creating the dataframe from the rdd and the data schema
df = spark.createDataFrame(rdd, schema)
print("Below are the values from dataframe")
print(df.head(6))
6.6. OUTPUT:
list value are : [[1], [2], [3], [4], [5], [6]]
Below are the values from dataframe
[Row(number=1), Row(number=2), Row(number=3), Row(number=4), Row(number=5), Row(number=6)]
Please refer to the official PySpark user guide for further details.
Below is our previous article on Spark:
PySpark DataFrame vs pandas DataFrame?