I am new to Python/PySpark and I am having trouble cleansing the data before using it on my Mac's terminal. I want to delete any row that contains null values, as well as any repeated rows. I used .distinct() and tried with:
rw_data3 = rw_data.filter(rw_data.isNotNull())
I also tried...
from functools import reduce
rw_data.filter(~reduce(lambda x, y: x & y, [rw_data[c].isNull() for c in rw_data.columns])).show()
but I get
"AttributeError: 'RDD' object has no attribute 'isNotNull'"
or
"AttributeError: 'RDD' object has no attribute 'columns'"
This clearly shows I do not really understand the syntax for cleaning up the DataFrame.
It looks like you have an rdd, and not a DataFrame. You can easily convert the rdd to a DataFrame and then use pyspark.sql.DataFrame.dropna() and pyspark.sql.DataFrame.dropDuplicates() to "clean" it.
clean_df = rw_data3.toDF().dropna().dropDuplicates()
Both of these functions accept an optional parameter subset, which you can use to specify a subset of columns to search for nulls and duplicates.
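For example, assuming your data had columns named "id" and "value" (hypothetical names, substitute your own), you could restrict the cleanup to just those columns:
clean_df = rw_data3.toDF().dropna(subset=["id", "value"]).dropDuplicates(subset=["id", "value"])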
If you wanted to "clean" your data as an rdd, you can use filter() and distinct() as follows:
clean_rdd = rw_data2.filter(lambda row: all(x is not None for x in row)).distinct()
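Here is a minimal end-to-end sketch of both approaches, assuming an active SparkSession named spark and a small RDD of Rows with a null and a duplicate (the data here is invented purely for illustration):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy RDD containing one row with a null and one duplicate row
rw_data = spark.sparkContext.parallelize([
    Row(id=1, value="a"),
    Row(id=2, value=None),
    Row(id=1, value="a"),
])

# DataFrame route: convert, drop rows with nulls, then drop duplicates
clean_df = rw_data.toDF().dropna().dropDuplicates()
clean_df.show()

# RDD route: keep only rows with no None values, then deduplicate
clean_rdd = rw_data.filter(lambda row: all(x is not None for x in row)).distinct()
print(clean_rdd.collect())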