
[Solved] Spark: error reading DateType columns in partitioned parquet data

Hello guys, how are you all? Hope you all are fine. Today I got the following error: Spark: error reading DateType columns in partitioned parquet data in Python. So here I explain all the possible solutions.

Without wasting your time, let's start this article to solve this error.

How Does Spark: error reading DateType columns in partitioned parquet data Error Occur?

Today I got the following error while reading DateType columns from partitioned Parquet data in Python.

How To Solve Spark: error reading DateType columns in partitioned parquet data Error?

  1. How To Solve Spark: error reading DateType columns in partitioned parquet data Error?

    To solve the Spark: error reading DateType columns in partitioned parquet data error, I just used StringType instead of DateType when writing Parquet. I don't have the issue anymore.

  2. Spark: error reading DateType columns in partitioned parquet data

    To solve the Spark: error reading DateType columns in partitioned parquet data error, re-write the input Parquet with the expected schema forcefully applied when generating it (see Solution 2 below).

Solution 1

I just used StringType instead of DateType when writing Parquet. I don't have the issue anymore.
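
For illustration, here is a minimal sketch of that workaround in Scala; the DataFrame df and the column name event_date are hypothetical stand-ins:

import org.apache.spark.sql.functions.col

df
  .withColumn("event_date", col("event_date").cast("string")) // store the date as a plain string instead of DateType
  .write
  .partitionBy("event_date") // keep the data partitioned by the (now string) date
  .parquet("<path-to-output-directory>")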

Solution 2

I had this exception when Spark was reading a Parquet file generated from a JSON file.

TL;DR: If possible, re-write the input Parquet with the expected schema forcefully applied.

Scala code below. Python won’t be too different.

This is pretty much what my Parquet generation looked like at first:

spark.read
  .format("json")
  .load("<path-to-json-file>.json")
  .write
  .parquet("<path-to-output-directory>")
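
As a quick diagnostic (not part of the original pipeline), you can print the schema Spark inferred from the JSON before writing, which makes the type mismatch visible:

spark.read
  .format("json")
  .load("<path-to-json-file>.json")
  .printSchema() // numeric fields typically show up as long, dates as string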

But the Spark job that read the above Parquet was enforcing a schema on the input, roughly like this:

import org.apache.spark.sql.types.StructType

val structType: StructType = StructType(fields = Seq(...))
spark.read.schema(structType).parquet("<path-to-output-directory>") // reads the Parquet generated above

The read above is where the exception occurs.

FIX: In order to fix the exception, I had to forcefully apply the schema to the data I generated:

spark.read
  .schema(structType) // <===
  .format("json")
  .load("<path-to-json-file>.json")
  .write
  .parquet("<path-to-output-directory>")

To my understanding, the reason for the exception in my case was not (only) the StringType-to-DateType conversion, but also the fact that when reading JSON, Spark assigns LongType to all numeric values. My Parquet was therefore saved with LongType fields, and the Spark job reading that Parquet presumably struggled to convert LongType to IntegerType.
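
If re-writing the Parquet is not an option, a possible alternative (just a sketch, with hypothetical column names id and event_date) is to read it without the strict schema and cast the columns afterwards:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DateType, IntegerType}

spark.read
  .parquet("<path-to-output-directory>")
  .withColumn("id", col("id").cast(IntegerType))              // LongType -> IntegerType
  .withColumn("event_date", col("event_date").cast(DateType)) // StringType -> DateType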

Summary

That's all about this issue. I hope one of the solutions helped you. Comment below with your thoughts and queries, and let us know which solution worked for you. Thank you.
