# How to calculate mean and standard deviation given a PySpark DataFrame?

In this post, we will look at how to calculate the mean and standard deviation of a column in a PySpark DataFrame, and walk through the possible methods.

## How to calculate mean and standard deviation given a PySpark DataFrame?

## Method 1

You can use the built-in functions to get aggregate statistics. Here's how to get the mean and standard deviation.

```python
from pyspark.sql.functions import mean as _mean, stddev as _stddev, col

df_stats = df.select(
    _mean(col('columnName')).alias('mean'),
    _stddev(col('columnName')).alias('std')
).collect()

mean = df_stats[0]['mean']
std = df_stats[0]['std']
```

Note that there are three different standard deviation functions. From the docs, the one I used (`stddev`) returns the following:

> Aggregate function: returns the unbiased sample standard deviation of the expression in a group

You could use the `describe()` method as well:

```python
df.describe().show()
```

UPDATE: This is how you can work through the nested data.

Use `explode` to extract the values into separate rows, then call `mean` and `stddev` as shown above.

Here’s a MWE:

```python
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import explode, col, udf, mean as _mean, stddev as _stddev

# mock up a sample dataframe
df = sqlCtx.createDataFrame(
    [(680, [[691, 1], [692, 5]]), (685, [[691, 2], [692, 2]]), (684, [[691, 1], [692, 3]])],
    ["product_PK", "products"]
)

# udf to get the "score" value - returns the item at index 1
get_score = udf(lambda x: x[1], IntegerType())

# explode the column and compute the stats
df_stats = df.withColumn('exploded', explode(col('products')))\
    .withColumn('score', get_score(col('exploded')))\
    .select(
        _mean(col('score')).alias('mean'),
        _stddev(col('score')).alias('std')
    )\
    .collect()

mean = df_stats[0]['mean']
std = df_stats[0]['std']

print([mean, std])
```

Which outputs:

```python
[2.3333333333333335, 1.505545305418162]
```

You can verify that these values are correct using `numpy`:

```python
import numpy as np

vals = [1, 5, 2, 2, 1, 3]
print([np.mean(vals), np.std(vals, ddof=1)])
```

Explanation: Your `"products"` column is a `list` of `list`s. Calling `explode` will make a new row for each element of the outer `list`. Then grab the `"score"` value from each of the exploded rows, which you have defined as the second element in a 2-element `list`. Finally, call the aggregate functions on this new column.

## Method 2

For the standard deviation, a better way of writing it is shown below. We can format the result (to 2 decimal places) and use a column alias:

```python
data_agg = SparkSession.builder.appName('Sales_fun').getOrCreate()