
How to filter based on array value in PySpark?

Hello everyone. This article shows how to filter a PySpark DataFrame based on the values in an array column, and walks through the available methods.

Let's get started.


Method 1

For equality-based queries, you can use array_contains:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array_contains

# Reuse the active session, or create one when running outside the shell
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, [1, 2, 3]), (2, [4, 5, 6])], ["k", "v"])
df.createOrReplaceTempView("df")

# With SQL
spark.sql("SELECT * FROM df WHERE array_contains(v, 1)")

# With DSL
df.where(array_contains("v", 1))

If you want to use more complex predicates, you'll have to either explode the array or use a UDF, for example something like this:

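from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf

def exists(f):
    # Build a UDF that is true when any array element satisfies f.
    # A null array yields False instead of raising a TypeError.
    return udf(lambda xs: xs is not None and any(f(x) for x in xs), BooleanType())

# Keep rows where at least one element of v is greater than 3
df.where(exists(lambda x: x > 3)("v"))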
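
The explode route mentioned above can be sketched as follows. This is only a minimal illustration, not the one right way to do it: the helper name rows_with_any_above is hypothetical, and it assumes that the column k uniquely identifies each row, because the matching keys are joined back to the original DataFrame.

```python
def rows_with_any_above(df, threshold, key="k", array_col="v"):
    """Keep rows of df whose array column has at least one element > threshold.

    Hypothetical helper; assumes `key` uniquely identifies each row.
    """
    # Imported here so the module still loads where PySpark is not installed.
    from pyspark.sql.functions import col, explode

    # One output row per array element; null or empty arrays produce no rows.
    exploded = df.select(key, explode(array_col).alias("x"))

    # Keys that have at least one matching element.
    matching_keys = exploded.where(col("x") > threshold).select(key).distinct()

    # A left semi join keeps the original rows (array intact) for those keys.
    return df.join(matching_keys, on=key, how="left_semi")
```

With the df from Method 1, rows_with_any_above(df, 3) keeps only the row with k = 2. Compared with the UDF, this stays within Spark's built-in functions, at the cost of an extra join.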

In Spark 2.4 or later, it is also possible to use higher-order functions:

from pyspark.sql.functions import expr

df.where(expr("""aggregate(
    transform(v, x -> x > 3),
    false, 
    (x, y) -> x or y
)"""))

or

df.where(expr("""
    exists(v, x -> x > 3)
"""))

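Python wrappers for these higher-order functions (such as exists, filter, transform and aggregate in pyspark.sql.functions) are available in Spark 3.1 and later.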
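
With PySpark 3.1 or later, the same predicate can be written without a SQL string by using pyspark.sql.functions.exists. The sketch below is a small illustration under that version assumption; the helper name where_any_above is hypothetical and not part of PySpark.

```python
def where_any_above(df, threshold, array_col="v"):
    """Keep rows whose array column has at least one element > threshold.

    Hypothetical helper; requires PySpark 3.1 or later.
    """
    # Imported here so the module still loads where PySpark is not installed.
    from pyspark.sql.functions import exists

    # The lambda receives a Column for each element and must return a Column.
    return df.where(exists(array_col, lambda x: x > threshold))
```

Called as where_any_above(df, 3) on the df from Method 1, this keeps only the row with k = 2, the same result as the expr("exists(v, x -> x > 3)") version.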

Method 2

In Spark 2.4 and later, you can filter the elements of an array with the filter higher-order function in the SQL API.

Here's an example in PySpark. It assumes df has an array-of-strings column named ArrayColumn, and it removes every empty string from each array:

df = df.withColumn("ArrayColumn", expr("filter(ArrayColumn, x -> x != '')"))
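
Note that filter changes the contents of each array and keeps every row, whereas the Method 1 examples drop whole rows. On PySpark 3.1 or later, the same element-level filtering can likely be written with the Python wrapper instead of a SQL string. The sketch below assumes that version; the helper name drop_empty_strings is hypothetical.

```python
def drop_empty_strings(df, array_col="ArrayColumn"):
    """Remove empty strings from every array in array_col.

    Hypothetical helper; requires PySpark 3.1 or later.
    """
    # Imported here so the module still loads where PySpark is not installed.
    # Aliased to avoid shadowing Python's built-in filter.
    from pyspark.sql.functions import filter as array_filter

    # The row count is unchanged; only the array contents are filtered.
    return df.withColumn(array_col, array_filter(array_col, lambda x: x != ""))
```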

Conclusion

That covers filtering on array values in PySpark. Hopefully one of these methods helped. Leave your thoughts and questions in the comments, and let us know which method worked for you. Thank you.
