
How do I add a new column to a Spark DataFrame (using PySpark)?

Hello guys, how are you all? Hope you're all doing fine. Today we are going to learn how to add a new column to a Spark DataFrame (using PySpark) in Python, and I'll walk you through all the possible methods here.

Without wasting your time, let's start this article.

Table of Contents

  1. Method 1: Add a column using a UDF
  2. Method 2: Use select with a column expression (For Spark 2.0)
  3. Conclusion

Method 1

To add a column using a UDF:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# In Spark 2.0+ the SparkSession replaces the old sqlContext entry point
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))

def valueToCategory(value):
    if value == 1:
        return 'cat1'
    elif value == 2:
        return 'cat2'
    # ... extend with more categories as needed
    else:
        return 'n/a'

# NOTE: udf() must be called after the SparkSession/SparkContext exists
udfValueToCategory = udf(valueToCategory, StringType())
df_with_cat = df.withColumn("category", udfValueToCategory("x1"))
df_with_cat.show()

## +---+---+-----+---------+
## | x1| x2|   x3| category|
## +---+---+-----+---------+
## |  1|  a| 23.0|     cat1|
## |  3|  B|-23.0|      n/a|
## +---+---+-----+---------+
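For a simple value mapping like this, a Python UDF isn't strictly necessary. Here is a minimal sketch (assuming the same df as above) that builds the category column with the built-in when/otherwise column expressions instead, which avoids the serialization overhead of calling into Python for every row:

from pyspark.sql.functions import when, col

# same mapping as the UDF above, expressed with built-in column functions
df_with_cat = df.withColumn(
    "category",
    when(col("x1") == 1, "cat1")
    .when(col("x1") == 2, "cat2")
    .otherwise("n/a"))
df_with_cat.show()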

Method 2

For Spark 2.0

# assumes the schema has an 'age' column
df.select('*', (df.age + 10).alias('agePlusTen'))
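The same result can also be written with withColumn. A small sketch, assuming df has an 'age' column; note that a constant value has to be wrapped in lit(), because withColumn expects a Column expression rather than a plain Python value:

from pyspark.sql.functions import col, lit

# equivalent to the select above
df.withColumn('agePlusTen', col('age') + 10)

# a literal value must be wrapped in lit() to become a Column
df.withColumn('ten', lit(10))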

Conclusion

That's all about this issue. We hope one of these methods helped you. Comment below with your thoughts and questions, and let us know which method worked for you. Thank you.
