Remove duplicates from a dataframe in PySpark
I'm working with dataframes in pyspark 1.4 locally, and I'm having trouble getting the dropDuplicates method to work. It keeps returning the error:
"AttributeError: 'list' object has no attribute 'dropDuplicates'"
Not quite sure why, as I seem to be following the syntax in the latest documentation.
#loading the CSV file into an RDD in order to start working with the data
rdd1 = sc.textFile("C:\myfilename.csv").map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2], line.split(",")[3])).collect()
#loading the RDD object into a dataframe and assigning column names
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4']).collect()
#dropping duplicates from the dataframe
df1.dropDuplicates().show()
This is not an import problem. You are simply calling .dropDuplicates() on the wrong object. While the result of sqlContext.createDataFrame(rdd1, ...) is of class pyspark.sql.dataframe.DataFrame, after you apply .collect() it is a plain Python list, and lists do not provide a dropDuplicates method. What you want is something like this:
df1 = (sqlContext
       .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
       .dropDuplicates())
df1.collect()
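To make the failure mode concrete, here is a minimal plain-Python sketch (no Spark needed; the sample rows are made up): once .collect() has turned the result into a list, DataFrame methods are gone, and any further deduplication has to use ordinary Python tools.

```python
# What .collect() hands back: a plain Python list of row tuples (sample data).
rows = [("a", 1), ("a", 1), ("b", 2)]

# Calling a DataFrame method on the list reproduces the question's error.
try:
    rows.dropDuplicates()
except AttributeError as exc:
    print(exc)  # 'list' object has no attribute 'dropDuplicates'

# If you really must deduplicate after collecting, use plain Python instead:
# dict.fromkeys keeps the first occurrence of each hashable row, in order.
deduped = list(dict.fromkeys(rows))
print(deduped)
```

The cleaner fix, as shown above, is to call .dropDuplicates() on the DataFrame before collecting.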
If you have a dataframe and want to remove all duplicates with respect to a specific column (called 'colName'):
Count before de-duplication:
df.count()
Do the de-duplication (converting the column you de-duplicate on to string type):
from pyspark.sql.functions import col
df = df.withColumn('colName',col('colName').cast('string'))
df.drop_duplicates(subset=['colName']).count()
You can use a sorted groupby to check that the duplicates have been removed:
df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)
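For intuition, the subset-based de-duplication above can be sketched in plain Python over hypothetical rows. One caveat: Spark's drop_duplicates keeps an arbitrary row per key, while this sketch deterministically keeps the first one seen.

```python
# Hypothetical rows standing in for the dataframe; 'colName' is the key column.
rows = [
    {"colName": "a", "val": 1},
    {"colName": "a", "val": 2},  # duplicate key: dropped below
    {"colName": "b", "val": 3},
]

# Keep one row per distinct value of colName.
seen = set()
deduped = []
for row in rows:
    if row["colName"] not in seen:
        seen.add(row["colName"])
        deduped.append(row)

print(len(rows), "->", len(deduped))  # 3 -> 2
```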
To summarize, there is an essential difference between how the distinct() and dropDuplicates() methods remove duplicates: dropDuplicates() is the better fit when you only want to consider a subset of the columns.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count

spark = SparkSession.builder.getOrCreate()

data = [("James","","Smith","36636","M",60000),
("James","Rose","","40288","M",70000),
("Robert","","Williams","42114","",400000),
("Maria","Anne","Jones","39192","F",500000),
("Maria","Mary","Brown","","F",0)]
columns = ["first_name","middle_name","last_name","dob","gender","salary"]
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)
df.groupBy('first_name').agg(count(
'first_name').alias("count_duplicates")).filter(
col('count_duplicates') >= 2).show()
df.dropDuplicates(['first_name']).show()
# output
+----------+-----------+---------+-----+------+------+
|first_name|middle_name|last_name|dob |gender|salary|
+----------+-----------+---------+-----+------+------+
|James | |Smith |36636|M |60000 |
|James |Rose | |40288|M |70000 |
|Robert | |Williams |42114| |400000|
|Maria |Anne |Jones |39192|F |500000|
|Maria |Mary |Brown | |F |0 |
+----------+-----------+---------+-----+------+------+
+----------+----------------+
|first_name|count_duplicates|
+----------+----------------+
| James| 2|
| Maria| 2|
+----------+----------------+
+----------+-----------+---------+-----+------+------+
|first_name|middle_name|last_name| dob|gender|salary|
+----------+-----------+---------+-----+------+------+
| James| | Smith|36636| M| 60000|
| Maria| Anne| Jones|39192| F|500000|
| Robert| | Williams|42114| |400000|
+----------+-----------+---------+-----+------+------+
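The same contrast can be checked without Spark: a plain-Python sketch over the sample rows above shows that a full-row comparison (what distinct() does) removes nothing here, since all five rows differ, while subsetting on first_name keeps one row per name. (Again, Spark keeps an arbitrary row per key; this sketch keeps the first.)

```python
# Same sample rows as the Spark example above.
data = [("James","","Smith","36636","M",60000),
        ("James","Rose","","40288","M",70000),
        ("Robert","","Williams","42114","",400000),
        ("Maria","Anne","Jones","39192","F",500000),
        ("Maria","Mary","Brown","","F",0)]

# distinct(): whole rows are compared, and all five rows are unique.
distinct_rows = list(dict.fromkeys(data))

# dropDuplicates(["first_name"]): one row per first_name (index 0).
seen, by_first_name = set(), []
for row in data:
    if row[0] not in seen:
        seen.add(row[0])
        by_first_name.append(row)

print(len(distinct_rows))  # 5
print(len(by_first_name))  # 3
```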
Statement: The technical posts on this site follow the CC BY-SA 4.0 license. If you repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.