Remove duplicates from a dataframe in PySpark
I'm working with dataframes locally in PySpark 1.4 and I'm having trouble getting the dropDuplicates method to work. It keeps returning the error:

"AttributeError: 'list' object has no attribute 'dropDuplicates'"

I'm not quite sure why, as I seem to be following the syntax in the latest documentation.
#loading the CSV file into an RDD in order to start working with the data
rdd1 = sc.textFile("C:\myfilename.csv").map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2], line.split(",")[3])).collect()
#loading the RDD object into a dataframe and assigning column names
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4']).collect()
#dropping duplicates from the dataframe
df1.dropDuplicates().show()
This is not an import problem. You are simply calling .dropDuplicates() on the wrong object. While the class of sqlContext.createDataFrame(rdd1, ...) is pyspark.sql.dataframe.DataFrame, once you apply .collect() you get back a plain Python list, and lists provide no dropDuplicates method. What you want is something like this:
df1 = (sqlContext
    .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
    .dropDuplicates())
df1.collect()
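The failure mode can be reproduced without Spark at all. As a minimal sketch (the row data here is made up for illustration): .collect() always materializes rows into a plain Python list, which carries none of the DataFrame API, so calling dropDuplicates on it raises exactly the error from the question.

```python
# What collect() hands back is an ordinary Python list of rows.
rows = [("a", 1), ("b", 2), ("a", 1)]

try:
    rows.dropDuplicates()  # same mistake as in the question
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'dropDuplicates'

# If you are already holding a plain list, dedupe with Python itself;
# dict.fromkeys preserves the original order of first occurrences.
deduped = list(dict.fromkeys(rows))
print(deduped)  # [('a', 1), ('b', 2)]
```

The general rule this illustrates: keep the value a DataFrame for as long as you need DataFrame methods, and only call .collect() at the very end.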
If you have a dataframe and want to remove all rows that are duplicated with respect to a specific column (call it "colName"):

Count before de-duplication:
df.count()
Perform the de-duplication (casting the column you de-duplicate on to string type):
from pyspark.sql.functions import col
df = df.withColumn('colName',col('colName').cast('string'))
df.drop_duplicates(subset=['colName']).count()
You can use a sorted groupBy to check whether the duplicates have been removed:
df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)
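The same sanity check can be run locally without Spark. Here is a rough plain-Python analogue of the groupBy('colName').count() step using collections.Counter (the row data and the choice of field 0 as "colName" are illustrative assumptions):

```python
from collections import Counter

# Rows as plain tuples; treat the first field as 'colName'.
rows = [("x", 1), ("y", 2), ("x", 3), ("z", 4), ("x", 5)]

counts = Counter(r[0] for r in rows)          # ~ groupBy('colName').count()
dupes = {k: n for k, n in counts.items() if n >= 2}
print(dupes)  # {'x': 3}
```

After a successful de-duplication, the equivalent check should report no key with a count of 2 or more.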
To sum up, there is an essential difference between the distinct() and dropDuplicates() methods for removing duplicates: dropDuplicates() is the better choice when you only want to consider a subset of the columns.
from pyspark.sql.functions import col, count

data = [("James","","Smith","36636","M",60000),
("James","Rose","","40288","M",70000),
("Robert","","Williams","42114","",400000),
("Maria","Anne","Jones","39192","F",500000),
("Maria","Mary","Brown","","F",0)]
columns = ["first_name","middle_name","last_name","dob","gender","salary"]
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)
df.groupBy('first_name').agg(count(
'first_name').alias("count_duplicates")).filter(
col('count_duplicates') >= 2).show()
df.dropDuplicates(['first_name']).show()
# output
+----------+-----------+---------+-----+------+------+
|first_name|middle_name|last_name|dob |gender|salary|
+----------+-----------+---------+-----+------+------+
|James | |Smith |36636|M |60000 |
|James |Rose | |40288|M |70000 |
|Robert | |Williams |42114| |400000|
|Maria |Anne |Jones |39192|F |500000|
|Maria |Mary |Brown | |F |0 |
+----------+-----------+---------+-----+------+------+
+----------+----------------+
|first_name|count_duplicates|
+----------+----------------+
| James| 2|
| Maria| 2|
+----------+----------------+
+----------+-----------+---------+-----+------+------+
|first_name|middle_name|last_name| dob|gender|salary|
+----------+-----------+---------+-----+------+------+
| James| | Smith|36636| M| 60000|
| Maria| Anne| Jones|39192| F|500000|
| Robert| | Williams|42114| |400000|
+----------+-----------+---------+-----+------+------+
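What dropDuplicates(['first_name']) does above can be mimicked in plain Python with a keyed dict; this is only a sketch of the semantics. Note one caveat it makes visible: Spark does not guarantee which of the duplicate rows survives, whereas this version deterministically keeps the first row seen per key.

```python
data = [("James","","Smith","36636","M",60000),
        ("James","Rose","","40288","M",70000),
        ("Robert","","Williams","42114","",400000),
        ("Maria","Anne","Jones","39192","F",500000),
        ("Maria","Mary","Brown","","F",0)]

# Keep one row per first_name (field 0); first occurrence wins here.
seen = {}
for row in data:
    seen.setdefault(row[0], row)

deduped = list(seen.values())
print([r[0] for r in deduped])  # ['James', 'Robert', 'Maria']
```

As with the Spark output, five input rows collapse to three, one per distinct first_name.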
Note: the technical posts on this site are licensed under CC BY-SA 4.0; if you republish, please credit this site or the original source. For any questions contact: yoyou2525@163.com.