How to filter all dataframe columns to a condition in PySpark?
I have a dataframe with the following values:
+--------+-------+--------------+-----+
|tag_html|tag_css|tag_javascript|count|
+--------+-------+--------------+-----+
|     0.0|    0.0|           0.0| 8655|
|     1.0|    0.0|           0.0|  141|
|     0.0|    0.0|           1.0|  782|
|     1.0|    0.0|           1.0|  107|
|     0.0|    1.0|           0.0|   96|
|     0.0|    1.0|           1.0|   20|
|     1.0|    1.0|           1.0|   46|
|     1.0|    1.0|           0.0|  153|
+--------+-------+--------------+-----+
I want only the rows where a 1 appears in just one of the tag columns, with no 1 repeated in the other columns:
+--------+-------+--------------+-----+
|tag_html|tag_css|tag_javascript|count|
+--------+-------+--------------+-----+
|     1.0|    0.0|           0.0|  141|
|     0.0|    0.0|           1.0|  782|
|     0.0|    1.0|           0.0|   96|
+--------+-------+--------------+-----+
What I did was to use the where() function:
df['count'].where(((df['tag_html'] == 1) | (df['tag_css'] == 0) | (df['tag_javascript'] == 0)) &
                   ((df['tag_html'] == 0) | (df['tag_css'] == 1) | (df['tag_javascript'] == 0)) &
                   ((df['tag_html'] == 0) | (df['tag_css'] == 0) | (df['tag_javascript'] == 1)))
This is the result:
0    8655.0
1     141.0
2     782.0
3       NaN
4      96.0
5       NaN
6      46.0
7       NaN
Is there a better way to do this in pandas or pyspark?
Use mask together with boolean indexing:
df=df.assign(count=df['count'].mask(df.iloc[:,:3].eq(1).sum(1).gt(1)))
df
Out[513]:
   tag_html  tag_css  tag_javascript   count
0       0.0      0.0             0.0  8655.0
1       1.0      0.0             0.0   141.0
2       0.0      0.0             1.0   782.0
3       1.0      0.0             1.0     NaN
4       0.0      1.0             0.0    96.0
5       0.0      1.0             1.0     NaN
6       1.0      1.0             1.0     NaN
7       1.0      1.0             0.0     NaN
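Note that the mask above only blanks out the count where more than one tag is 1, so the all-zero row (count 8655.0) survives. If the goal is to keep only the rows where exactly one tag column equals 1, as in the desired output table, a minimal boolean-indexing sketch (reusing the sample data from the question) could look like this:

```python
import pandas as pd

df = pd.DataFrame({
    'tag_html':       [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0],
    'tag_css':        [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
    'tag_javascript': [0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0],
    'count':          [8655, 141, 782, 107, 96, 20, 46, 153],
})

# Keep only rows where exactly one of the first three tag columns is 1
mask = df.iloc[:, :3].eq(1).sum(axis=1).eq(1)
result = df[mask]
```

This drops the unwanted rows entirely instead of leaving NaN behind. In PySpark the equivalent idea should be a filter on the sum of the tag columns, e.g. `df.filter(col('tag_html') + col('tag_css') + col('tag_javascript') == 1)` (column names assumed from the question).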