
[Solved] Pyspark: Parse a column of json strings

Hello Guys, How are you all? Hope You all Are Fine. Today I ran into the following problem in Python: Pyspark: Parse a column of json strings. So here I am explaining all the possible solutions.

Without wasting your time, let's start this article and solve this error.

How Does the Pyspark: Parse a column of json strings Error Occur?

Today I got the following error in Python: Pyspark: Parse a column of json strings.

How To Solve the Pyspark: Parse a column of json strings Error?

  1. Solution 1

    Convert the DataFrame to an RDD of JSON strings and read it back with read.json; converting a DataFrame with JSON strings to a structured DataFrame is actually quite simple in Spark once you do this.

  2. Solution 2

    For Spark 2.1+, use from_json, which parses the JSON column while preserving the other non-JSON columns of the DataFrame.

Solution 1

Converting a DataFrame with JSON strings to a structured DataFrame is actually quite simple in Spark if you convert the DataFrame to an RDD of strings first.

For example:

>>> new_df = sql_context.read.json(df.rdd.map(lambda r: r.json))
>>> new_df.printSchema()
root
 |-- body: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- sub_json: struct (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- sub_sub_json: struct (nullable = true)
 |    |    |    |-- col1: long (nullable = true)
 |    |    |    |-- col2: string (nullable = true)
 |-- header: struct (nullable = true)
 |    |-- foo: string (nullable = true)
 |    |-- id: long (nullable = true)
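
If you want to try this end-to-end, here is a minimal, self-contained sketch. The SparkSession setup, the sample JSON strings, and the column name json are illustrative assumptions, not part of the original question; spark here plays the role of sql_context above:

# Minimal sketch: build a small DataFrame with a 'json' string column
# and parse it with the RDD-of-strings approach shown above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("parse-json-column").getOrCreate()

# Hypothetical sample rows; in practice the data comes from your own source.
raw = [
    ('{"header": {"id": 1, "foo": "bar"}, "body": {"id": 10, "name": "alice"}}',),
    ('{"header": {"id": 2, "foo": "baz"}, "body": {"id": 20, "name": "bob"}}',),
]
df = spark.createDataFrame(raw, ["json"])

# Re-read the JSON strings so Spark infers a structured schema.
new_df = spark.read.json(df.rdd.map(lambda r: r.json))
new_df.select("header.id", "body.name").show()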

Solution 2

For Spark 2.1+, you can use from_json, which preserves the other non-JSON columns of the DataFrame, as follows:

from pyspark.sql.functions import from_json, col
# Infer the schema of the JSON column by reading the strings once.
json_schema = spark.read.json(df.rdd.map(lambda row: row.json)).schema
# withColumn returns a new DataFrame, so assign the result.
df = df.withColumn('json', from_json(col('json'), json_schema))

You let Spark derive the schema of the JSON string column. The df.json column is then no longer a StringType but the correctly decoded JSON structure, i.e. a nested StructType, and all the other columns of df are preserved as-is.

You can access the json content as follows:

df.select(col('json.header').alias('header'))
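
If you want to pull several nested fields up into top-level columns, a select like the following sketch works. The field names json.header.id and json.body.name are taken from the example schema above and are purely illustrative:

# df is assumed to already hold the parsed 'json' struct column from Solution 2.
flat_df = df.select(
    col('json.header.id').alias('header_id'),
    col('json.body.name').alias('name'),
)
flat_df.show()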

Summary

That's all about this issue. I hope one of the solutions helped you. Comment below with your thoughts and queries, and let us know which solution worked for you. Thank you.
