How to filter all dataframe columns to a condition in PySpark?
I have a dataframe with the following values:
+--------+-------+--------------+-----+
|tag_html|tag_css|tag_javascript|count|
+--------+-------+--------------+-----+
|     0.0|    0.0|           0.0| 8655|
|     1.0|    0.0|           0.0|  141|
|     0.0|    0.0|           1.0|  782|
|     1.0|    0.0|           1.0|  107|
|     0.0|    1.0|           0.0|   96|
|     0.0|    1.0|           1.0|   20|
|     1.0|    1.0|           1.0|   46|
|     1.0|    1.0|           0.0|  153|
+--------+-------+--------------+-----+
I want only the rows where a 1 appears in just one of the tag columns, with no 1 repeated in the other columns:
+--------+-------+--------------+-----+
|tag_html|tag_css|tag_javascript|count|
+--------+-------+--------------+-----+
|     1.0|    0.0|           0.0|  141|
|     0.0|    0.0|           1.0|  782|
|     0.0|    1.0|           0.0|   96|
+--------+-------+--------------+-----+
What I did was to use the where() function:
df['count'].where(((df['tag_html'] == 1) | (df['tag_css'] == 0) | (df['tag_javascript'] == 0)) &
                   ((df['tag_html'] == 0) | (df['tag_css'] == 1) | (df['tag_javascript'] == 0)) &
                   ((df['tag_html'] == 0) | (df['tag_css'] == 0) | (df['tag_javascript'] == 1)))
This is the result:
0    8655.0
1     141.0
2     782.0
3       NaN
4      96.0
5       NaN
6      46.0
7       NaN
Is there a better way to do this in pandas or pyspark?
Use mask together with boolean indexing:
df=df.assign(count=df['count'].mask(df.iloc[:,:3].eq(1).sum(1).gt(1)))
df
Out[513]:
   tag_html  tag_css  tag_javascript   count
0       0.0      0.0             0.0  8655.0
1       1.0      0.0             0.0   141.0
2       0.0      0.0             1.0   782.0
3       1.0      0.0             1.0     NaN
4       0.0      1.0             0.0    96.0
5       0.0      1.0             1.0     NaN
6       1.0      1.0             1.0     NaN
7       1.0      1.0             0.0     NaN
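Note that the mask above only blanks out the count where more than one tag is 1, so the all-zero row (count 8655.0) survives. If the goal is to keep only the rows where exactly one tag column equals 1, as in the desired output table, a minimal boolean-indexing sketch (reusing the sample data from the question) could look like this:

```python
import pandas as pd

df = pd.DataFrame({
    'tag_html':       [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0],
    'tag_css':        [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
    'tag_javascript': [0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0],
    'count':          [8655, 141, 782, 107, 96, 20, 46, 153],
})

# Keep only rows where exactly one of the first three tag columns is 1
mask = df.iloc[:, :3].eq(1).sum(axis=1).eq(1)
result = df[mask]
```

This drops the unwanted rows entirely instead of leaving NaN behind. In PySpark the equivalent idea should be a filter on the sum of the tag columns, e.g. `df.filter(col('tag_html') + col('tag_css') + col('tag_javascript') == 1)` (column names assumed from the question).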