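PySpark
# In a notebook `spark` usually already exists; getOrCreate() then returns that session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Sample data: None becomes null in the DataFrame.
df = spark.createDataFrame([
    ("p001", 1020, None),
    ("p002", 560, "delivered"),
    ("p003", None, "delivered"),
    ("p004", None, None)],
    ["productID", "unit", "status"])
df.show()
df.count()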
+---------+----+---------+
|productID|unit|   status|
+---------+----+---------+
|     p001|1020|     null|
|     p002| 560|delivered|
|     p003|null|delivered|
|     p004|null|     null|
+---------+----+---------+
Out[1]: 4
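1) Counting missing values
Use isNull and isNotNull to filter rows and count them: first the rows with no missing values, then the rows with at least one.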
PySpark
df2 = df.filter(df["productID"].isNotNull() & df["unit"].isNotNull() & df["status"].isNotNull())
df2.show()
df2.count()
+---------+----+---------+
|productID|unit|   status|
+---------+----+---------+
|     p002| 560|delivered|
+---------+----+---------+
Out[4]: 1
PySpark
df2 = df.filter(df["productID"].isNull() | df["unit"].isNull() | df["status"].isNull())
df2.show()
df2.count()
+---------+----+---------+
|productID|unit|   status|
+---------+----+---------+
|     p001|1020|     null|
|     p003|null|delivered|
|     p004|null|     null|
+---------+----+---------+
Out[3]: 3
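The two filters are complements of each other (all columns non-null versus at least one column null), so their counts add up to the total row count. A minimal pure-Python sketch of that logic, which does not call Spark; the `rows` list mirrors the sample data, and the per-column tally at the end is an extra illustration that is not part of the original walkthrough:

```python
# Pure-Python model of the two filters; None stands in for Spark's null.
rows = [
    {"productID": "p001", "unit": 1020, "status": None},
    {"productID": "p002", "unit": 560, "status": "delivered"},
    {"productID": "p003", "unit": None, "status": "delivered"},
    {"productID": "p004", "unit": None, "status": None},
]

# Rows where every column is non-null (the isNotNull & ... filter).
complete = [r for r in rows if all(v is not None for v in r.values())]

# Rows where at least one column is null (the isNull | ... filter).
incomplete = [r for r in rows if any(v is None for v in r.values())]

print(len(complete), len(incomplete))  # 1 3

# Number of nulls in each column.
nulls_per_column = {col: sum(r[col] is None for r in rows) for col in rows[0]}
print(nulls_per_column)  # {'productID': 0, 'unit': 2, 'status': 2}
```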
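2) Dropping missing values
dropna removes rows according to a missing-value condition.
any - drop a row if at least one column is null
all - drop a row only if every column is null
thresh - drop a row if it has fewer than thresh non-null values (when given, thresh overrides any/all)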
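The three rules can be sketched in plain Python to make the row-level decision explicit. This is a simplified model under the assumption that every column is considered (no subset); it does not call Spark, and `keep_row` and `kept_ids` are hypothetical helper names:

```python
# Pure-Python model of the dropna rules; None stands in for Spark's null.
rows = [
    {"productID": "p001", "unit": 1020, "status": None},
    {"productID": "p002", "unit": 560, "status": "delivered"},
    {"productID": "p003", "unit": None, "status": "delivered"},
    {"productID": "p004", "unit": None, "status": None},
]

def keep_row(row, how="any", thresh=None):
    """Return True if the row survives the given rule."""
    non_null = sum(value is not None for value in row.values())
    if thresh is not None:
        # thresh wins over how: keep rows with at least thresh non-null values.
        return non_null >= thresh
    if how == "any":
        # Dropped if any column is null, so kept only when all are non-null.
        return non_null == len(row)
    if how == "all":
        # Dropped only if every column is null.
        return non_null > 0
    raise ValueError("how must be 'any' or 'all'")

def kept_ids(how="any", thresh=None):
    return [r["productID"] for r in rows if keep_row(r, how, thresh)]

print(kept_ids("any"))     # ['p002']
print(kept_ids("all"))     # ['p001', 'p002', 'p003', 'p004']
print(kept_ids(thresh=2))  # ['p001', 'p002', 'p003']
```

With this data "all" drops nothing, because productID is never null, and thresh=2 drops only p004, which has a single non-null value.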
PySpark
df2 = df.dropna("any")
df2.show()
+---------+----+---------+
|productID|unit|   status|
+---------+----+---------+
|     p002| 560|delivered|
+---------+----+---------+
PySpark
df3 = df.dropna("all")
df3.show()
+---------+----+---------+
|productID|unit|   status|
+---------+----+---------+
|     p001|1020|     null|
|     p002| 560|delivered|
|     p003|null|delivered|
|     p004|null|     null|
+---------+----+---------+
PySpark
df4 = df.dropna(thresh=2)
df4.show()
+---------+----+---------+
|productID|unit|   status|
+---------+----+---------+
|     p001|1020|     null|
|     p002| 560|delivered|
|     p003|null|delivered|
+---------+----+---------+
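3) Replacing missing values
fillna replaces missing values with a given value (df.na.fill is an alias). A single value such as 0 is applied only to columns of a matching type, so fillna(0) leaves the string column status unchanged; a dict sets a replacement per column.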
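The difference between a single fill value and a per-column dict can be sketched in plain Python. This is a simplified model that does not call Spark; `schema` and `fill_missing` are hypothetical names, and the type matching here is a rough stand-in for how a scalar is applied only to columns of a compatible type:

```python
# Pure-Python model of fillna; None stands in for Spark's null.
schema = {"productID": str, "unit": int, "status": str}
rows = [
    {"productID": "p001", "unit": 1020, "status": None},
    {"productID": "p002", "unit": 560, "status": "delivered"},
    {"productID": "p003", "unit": None, "status": "delivered"},
    {"productID": "p004", "unit": None, "status": None},
]

def fill_missing(rows, value):
    """Fill None values with a scalar (type-matched columns) or a per-column dict."""
    if isinstance(value, dict):
        replacements = value
    else:
        # A scalar only applies to columns whose type matches the value.
        replacements = {
            col: value for col, typ in schema.items() if isinstance(value, typ)
        }
    return [
        {
            col: (replacements.get(col, val) if val is None else val)
            for col, val in row.items()
        }
        for row in rows
    ]

filled_scalar = fill_missing(rows, 0)
filled_dict = fill_missing(rows, {"unit": 0, "status": "In progress"})

print([r["unit"] for r in filled_scalar])    # [1020, 560, 0, 0]
print([r["status"] for r in filled_scalar])  # [None, 'delivered', 'delivered', None]
print([r["status"] for r in filled_dict])    # ['In progress', 'delivered', 'delivered', 'In progress']
```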
PySpark
df5 = df.fillna(0)
df6 = df.na.fill({"unit": 0, "status": "In progress"})
df5.show()
df6.show()
+---------+----+---------+
|productID|unit|   status|
+---------+----+---------+
|     p001|1020|     null|
|     p002| 560|delivered|
|     p003|   0|delivered|
|     p004|   0|     null|
+---------+----+---------+

+---------+----+-----------+
|productID|unit|     status|
+---------+----+-----------+
|     p001|1020|In progress|
|     p002| 560|  delivered|
|     p003|   0|  delivered|
|     p004|   0|In progress|
+---------+----+-----------+