Removing duplicates from a DataFrame in PySpark

Question

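I'm working locally with DataFrames in pyspark 1.4 and am having trouble getting the dropDuplicates method to work. It keeps returning the error:

"AttributeError: 'list' object has no attribute 'dropDuplicates'"

I'm not quite sure why, as I seem to be following the syntax in the latest documentation.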

#loading the CSV file into an RDD in order to start working with the data
rdd1 = sc.textFile("C:\myfilename.csv").map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2], line.split(",")[3])).collect()

#loading the RDD object into a dataframe and assigning column names
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4']).collect()

#dropping duplicates from the dataframe
df1.dropDuplicates().show()

Answer 1

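This is not an import problem. You are simply calling .dropDuplicates() on the wrong object. While sqlContext.createDataFrame(rdd1, ...) returns a pyspark.sql.dataframe.DataFrame, applying .collect() to it gives a plain Python list, and lists do not provide a dropDuplicates method. What you want is something like this: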

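# build the DataFrame and drop duplicates while it is still a DataFrame
df1 = (sqlContext
    .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
    .dropDuplicates())

# collect only at the end, once the duplicates are gone
df1.collect()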
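
The error itself can be reproduced without Spark, which may help confirm the diagnosis. In this minimal sketch, `rows` is a hypothetical stand-in for the plain Python list that .collect() returns; it is not a real Spark result.

```python
# Hypothetical stand-in for the list that DataFrame.collect() returns.
rows = [("a", "b", "c", "d"), ("a", "b", "c", "d")]

try:
    # Lists have no dropDuplicates method, so this raises AttributeError.
    rows.dropDuplicates()
except AttributeError as exc:
    message = str(exc)

print(message)
```

The message matches the one in the question, which suggests the object had already been turned into a list before dropDuplicates was called.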

Answer 2

If you have a DataFrame and want to remove all duplicates, where a duplicate is defined by a specific column (called 'colName'):

Count before de-duplication:

df.count()

Do the de-duplication (converting the column you are de-duplicating on to string type):

from pyspark.sql.functions import col
df = df.withColumn('colName',col('colName').cast('string'))

# reassign so that later checks run on the de-duplicated DataFrame
df = df.drop_duplicates(subset=['colName'])
df.count()

You can use a sorted groupby to check that the duplicates have been removed:

df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)
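
The same check can be sketched in plain Python to show what the grouped count is testing. This is only a model of the idea, not Spark code: `values` is a made-up list standing in for the 'colName' column, and `Counter` plays the role of groupBy('colName').count().

```python
from collections import Counter

# Made-up column values; "7" appears twice, so there is one duplicate.
values = ["7", "3", "7", "5"]

# Keep the first occurrence of each value (dict preserves insertion order).
deduped = list(dict.fromkeys(values))

# Count rows per value, largest count first, as the sorted groupby does.
counts_before = Counter(values).most_common()
counts_after = Counter(deduped).most_common()

# After de-duplication every value should appear exactly once.
has_duplicates = any(n > 1 for _, n in counts_after)
print(counts_before[0], has_duplicates)
```

If the largest count after de-duplication is 1, no duplicates remain in that column.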

Answer 3

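In summary, distinct() and dropDuplicates() both remove duplicates, with one essential difference: distinct() always compares entire rows, while dropDuplicates() can be limited to selected columns.

dropDuplicates() is the better fit when only a subset of the columns should be considered: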

# assumes an existing SparkSession named spark
from pyspark.sql.functions import col, count

data = [("James","","Smith","36636","M",60000),
        ("James","Rose","","40288","M",70000),
        ("Robert","","Williams","42114","",400000),
        ("Maria","Anne","Jones","39192","F",500000),
        ("Maria","Mary","Brown","","F",0)]

columns = ["first_name","middle_name","last_name","dob","gender","salary"]
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)

df.groupBy('first_name').agg(count(
  'first_name').alias("count_duplicates")).filter(
  col('count_duplicates') >= 2).show()

df.dropDuplicates(['first_name']).show()

# output

+----------+-----------+---------+-----+------+------+
|first_name|middle_name|last_name|dob  |gender|salary|
+----------+-----------+---------+-----+------+------+
|James     |           |Smith    |36636|M     |60000 |
|James     |Rose       |         |40288|M     |70000 |
|Robert    |           |Williams |42114|      |400000|
|Maria     |Anne       |Jones    |39192|F     |500000|
|Maria     |Mary       |Brown    |     |F     |0     |
+----------+-----------+---------+-----+------+------+

+----------+----------------+
|first_name|count_duplicates|
+----------+----------------+
|     James|               2|
|     Maria|               2|
+----------+----------------+

+----------+-----------+---------+-----+------+------+
|first_name|middle_name|last_name|  dob|gender|salary|
+----------+-----------+---------+-----+------+------+
|     James|           |    Smith|36636|     M| 60000|
|     Maria|       Anne|    Jones|39192|     F|500000|
|    Robert|           | Williams|42114|      |400000|
+----------+-----------+---------+-----+------+------+
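
To make the difference concrete, here is a plain-Python sketch of the two behaviours, using the sample rows from this answer. It only models the semantics and is not how Spark implements them: `distinct_rows` and `drop_duplicates_subset` are hypothetical helpers, and keeping the first row seen is an assumption of this sketch (Spark may keep a different row from each duplicate group).

```python
def distinct_rows(rows):
    """Model of distinct(): rows are duplicates only if every column matches."""
    return list(dict.fromkeys(rows))


def drop_duplicates_subset(rows, columns, subset):
    """Model of dropDuplicates(subset): compare only the named columns."""
    idx = [columns.index(name) for name in subset]
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[i] for i in idx)
        if key not in seen:  # first row with this key wins in this sketch
            seen.add(key)
            kept.append(row)
    return kept


columns = ["first_name", "middle_name", "last_name", "dob", "gender", "salary"]
data = [("James", "", "Smith", "36636", "M", 60000),
        ("James", "Rose", "", "40288", "M", 70000),
        ("Robert", "", "Williams", "42114", "", 400000),
        ("Maria", "Anne", "Jones", "39192", "F", 500000),
        ("Maria", "Mary", "Brown", "", "F", 0)]

all_columns = distinct_rows(data)
by_first_name = drop_duplicates_subset(data, columns, ["first_name"])
print(len(all_columns), len(by_first_name))
```

No two sample rows match in every column, so the whole-row version keeps all five, while comparing on first_name alone keeps three.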


3 answers

Answer 1
44, accepted, 2015-06-26 03:22:43

Answer 2
21, 2018-01-02 14:40:33

Answer 3
0, 2022-06-07 17:32:01
