
[Solved] Pyspark: explode json in column to multiple columns

Hello everyone, how are you all? Hope you are all fine. Today I ran into the problem "Pyspark: explode json in column to multiple columns" in Python, so here I explain all the possible solutions.

Without wasting your time, let's start this article and solve this error.

How Does the Pyspark: explode json in column to multiple columns Error Occur?

While working in Python, I needed to explode a JSON string stored in a DataFrame column into multiple columns.

How To Solve the Pyspark: explode json in column to multiple columns Error?


Solution 1

As long as you are using Spark version 2.1 or higher, pyspark.sql.functions.from_json should get you your desired result, but you first need to define the required schema:

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# Schema describing the JSON object stored in the 'data' column
schema = StructType(
    [
        StructField('key1', StringType(), True),
        StructField('key2', StringType(), True)
    ]
)

# Parse the JSON string, then flatten its fields into top-level columns
df.withColumn("data", from_json("data", schema))\
    .select(col('id'), col('point'), col('data.*'))\
    .show()

which should give you:

+---+-----+----+----+
| id|point|key1|key2|
+---+-----+----+----+
|abc|    6| 124| 345|
|df1|    7| 777| 888|
|4bd|    6| 111| 788|
+---+-----+----+----+
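To see what from_json does row by row, the same transformation can be sketched in plain Python with the standard json module. The sample rows below are hypothetical, constructed to match the table above:

```python
import json

# Hypothetical sample rows mirroring the table above:
# each 'data' value is a JSON string, as it would be in the DataFrame.
rows = [
    {"id": "abc", "point": 6, "data": '{"key1": "124", "key2": "345"}'},
    {"id": "df1", "point": 7, "data": '{"key1": "777", "key2": "888"}'},
    {"id": "4bd", "point": 6, "data": '{"key1": "111", "key2": "788"}'},
]

def flatten(row):
    # Parse the JSON string and promote its keys to top-level columns,
    # which is conceptually what from_json + select('data.*') does.
    parsed = json.loads(row["data"])
    return {"id": row["id"], "point": row["point"],
            "key1": parsed.get("key1"), "key2": parsed.get("key2")}

flat = [flatten(r) for r in rows]
print(flat[0])  # {'id': 'abc', 'point': 6, 'key1': '124', 'key2': '345'}
```

This is only a conceptual sketch; in Spark the parsing happens on the executors, and rows whose JSON does not match the schema come back as null.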

Solution 2

The data field is a string. Since the keys are the same (i.e. 'key1', 'key2') in the JSON string across rows, you might also use json_tuple() (this function is new in version 1.6, per the documentation):

from pyspark.sql import functions as F

df.select('id', 'point', F.json_tuple('data', 'key1', 'key2').alias('key1', 'key2')).show()
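Conceptually, json_tuple pulls one value per requested key out of each JSON string, yielding null for keys that are absent. A plain-Python sketch of that behaviour (not Spark's actual implementation) might look like:

```python
import json

def json_tuple(json_str, *keys):
    # Return one value per requested key, None when the key is absent --
    # mirroring how Spark's json_tuple yields null for missing keys.
    obj = json.loads(json_str)
    return tuple(obj.get(k) for k in keys)

print(json_tuple('{"key1": "124", "key2": "345"}', "key1", "key2"))
print(json_tuple('{"key1": "124"}', "key1", "key2"))
```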

Below is my original post, which is most likely WRONG if the original table came from df.show(truncate=False), in which case the data field is a string and NOT a Python data structure.

Since you had exploded the data into rows, I assumed the column data was a parsed structure (for example a MapType or StructType) instead of a string:

from pyspark.sql import functions as F

df.select('id', 'point', F.col('data').getItem('key1').alias('key1'), F.col('data')['key2'].alias('key2')).show()
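If data really were an already-parsed structure, getItem('key1') and ['key2'] are just key lookups, analogous to dictionary access in plain Python (the value below is a hypothetical example):

```python
# Hypothetical already-parsed value, as a MapType column would hold per row
data = {"key1": "124", "key2": "345"}

key1 = data["key1"]     # analogous to F.col('data').getItem('key1')
key2 = data["key2"]     # analogous to F.col('data')['key2']
print(key1, key2)
```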

Summary

That's all about this issue. I hope these solutions helped you. Comment below with your thoughts and questions, and let us know which solution worked for you. Thank you.
