How can I write a parquet file using Spark (pyspark)?

Hello everyone, hope you are all doing well. Today we are going to learn how to write a Parquet file using Spark (PySpark) in Python. Below I explain the possible methods.

Without wasting your time, let’s start this article.

Table of Contents

  1. Method 1: SparkSession and the DataFrameReader
  2. Method 2: Koalas

Method 1

The error occurred because the textFile method from SparkContext returns an RDD, and what I needed was a DataFrame.

SparkSession has a SQLContext under the hood. So I needed to use the DataFrameReader to read the CSV file correctly before converting it to a parquet file.

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Read the CSV file into a DataFrame
df = spark.read.csv("/temp/proto_temp.csv")

# Display the contents of the DataFrame on stdout
df.show()

# Write the DataFrame out as a Parquet file
df.write.parquet("output/proto.parquet")
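As a quick sanity check, you can read the Parquet output back into a DataFrame. This is just a minimal sketch, assuming the same output/proto.parquet path used above:

# Read the Parquet file back in to verify the write (assumes the path above)
parquet_df = spark.read.parquet("output/proto.parquet")
parquet_df.show()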

Method 2

You can also write out Parquet files from Spark with Koalas. This library is great for folks who prefer pandas syntax. Koalas is PySpark under the hood.

Here’s the Koalas code:

import databricks.koalas as ks

# Read the CSV into a Koalas DataFrame, then write it out as Parquet
df = ks.read_csv('/temp/proto_temp.csv')
df.to_parquet('output/proto.parquet')
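Note that Koalas has since been merged into PySpark itself as the pandas API on Spark (pyspark.pandas), starting with Spark 3.2. On newer Spark versions, an equivalent sketch of the same snippet looks like this:

import pyspark.pandas as ps

# pandas-on-Spark equivalent of the Koalas snippet above (Spark 3.2+)
df = ps.read_csv('/temp/proto_temp.csv')
df.to_parquet('output/proto.parquet')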

Summary

That’s all for this issue. I hope one of these methods helped you. Comment below with your thoughts and questions, and let us know which method worked for you. Thank you.
