
How to overwrite a parquet file from where DataFrame is being read in Spark

Hello guys, how are you all? Hope you are all fine. Today we are going to learn how to overwrite a Parquet file from where a DataFrame is being read in Spark, in Python. Here I explain all the possible methods.

Without wasting your time, let’s start this article.

Table of Contents

How to overwrite a parquet file from where DataFrame is being read in Spark?

  1. Method 1: cache the DataFrame, trigger an action, then save in “overwrite” mode
  2. Method 2: cache on read, force computation with count(), then overwrite

Method 1

One solution for this error is to cache the DataFrame, trigger an action on it (for example, df.show()), and then save the Parquet file in “overwrite” mode.

In Python:

save_mode = "overwrite"
df = spark.read.parquet("path_to_parquet")

....... make your transformation to the df which is new_df

new_df.cache()
new_df.show()

new_df.write.format("parquet")\
                .mode(save_mode)\
                .save("path_to_parquet")

Method 2

When the data is read back out of a cache, this seems to work fine.

val df = spark.read.format("parquet").load("temp").cache()

cache is a lazy operation and doesn’t trigger any computation by itself, so we have to add a dummy action:

println(df.count()) //count over parquet files should be very fast  

Now it should work:

import org.apache.spark.sql.SaveMode

df.repartition(1).write.mode(SaveMode.Overwrite).parquet("temp")
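
Since the rest of this article is in Python, here is a rough PySpark equivalent of the Scala snippet above, assuming the same “temp” path:

df = spark.read.format("parquet").load("temp").cache()
print(df.count())  # dummy action: forces the cached read to actually happen

df.repartition(1).write.mode("overwrite").parquet("temp")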

Summary

That’s all for this issue. I hope these methods helped you. Comment below with your thoughts and questions, and let me know which method worked for you. Thank you!
