
How to filter pandas or pyspark dataframe values in columns?

I have a dataframe with the following values:

+--------+-------+--------------+-----+
|tag_html|tag_css|tag_javascript|count|
+--------+-------+--------------+-----+
|     0.0|    0.0|           0.0| 8655|
|     1.0|    0.0|           0.0|  141|
|     0.0|    0.0|           1.0|  782|
|     1.0|    0.0|           1.0|  107|
|     0.0|    1.0|           0.0|   96|
|     0.0|    1.0|           1.0|   20|
|     1.0|    1.0|           1.0|   46|
|     1.0|    1.0|           0.0|  153|
+--------+-------+--------------+-----+

I want to keep rows like the following, where a 1 does not appear in more than one of the tag columns:

+--------+-------+--------------+-----+
|tag_html|tag_css|tag_javascript|count|
+--------+-------+--------------+-----+
|     1.0|    0.0|           0.0|  141|
|     0.0|    0.0|           1.0|  782|
|     0.0|    1.0|           0.0|   96|
+--------+-------+--------------+-----+

What I did was to use the where() function:

df['count'].where(((df['tag_html'] == 1) | (df['tag_css'] == 0) | (df['tag_javascript'] == 0)) &
               ((df['tag_html'] == 0) | (df['tag_css'] == 1) | (df['tag_javascript'] == 0)) &
               ((df['tag_html'] == 0) | (df['tag_css'] == 0) | (df['tag_javascript'] == 1)))

This is the result (note that the row with all three tags equal to 1, count 46, still passes the filter):

0    8655.0
1     141.0
2     782.0
3       NaN
4      96.0
5       NaN
6      46.0
7       NaN

Is there a better way to do this in pandas or pyspark?

By using mask with boolean indexing:

df=df.assign(count=df['count'].mask(df.iloc[:,:3].eq(1).sum(1).gt(1)))
df
Out[513]: 
   tag_html  tag_css  tag_javascript   count
0       0.0      0.0             0.0  8655.0
1       1.0      0.0             0.0   141.0
2       0.0      0.0             1.0   782.0
3       1.0      0.0             1.0     NaN
4       0.0      1.0             0.0    96.0
5       0.0      1.0             1.0     NaN
6       1.0      1.0             1.0     NaN
7       1.0      1.0             0.0     NaN
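If the goal is the filtered frame itself rather than a count column with NaN in it (the desired output above keeps only the rows where exactly one tag column is 1), plain boolean indexing works. A minimal sketch, assuming the same column layout as the table above:

```python
import pandas as pd

df = pd.DataFrame({
    "tag_html":       [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0],
    "tag_css":        [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
    "tag_javascript": [0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0],
    "count":          [8655, 141, 782, 107, 96, 20, 46, 153],
})

# Keep only rows where exactly one of the first three tag columns equals 1
exactly_one = df.iloc[:, :3].eq(1).sum(axis=1).eq(1)
filtered = df[exactly_one]
print(filtered)
```

Change `.eq(1)` at the end to `.le(1)` if the all-zero row (count 8655) should also be kept, as in the mask-based result above.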

