How to filter a Pandas dataframe in Python based on column value comparison?
If you have a Pandas dataframe like this, filtering this way works:
import pandas as pd

df = pd.DataFrame({'name1': ['apple', 'pear', 'applepie', 'APPLE'],
                   'name2': ['APPLE', 'PEAR', 'apple', 'APPLE']})
df[df['name1'] != df['name2']] # works
But how do you filter rows when you want to compare the uppercased values of the columns?
df[df['name1'].upper() != df['name2'].upper()] # does not work: a Series has no .upper() method (AttributeError)
You need to use pandas.Series.str.upper(). df['name1'] is a Series of strings, so the .str string accessor is what gives you vectorized string manipulation.
df[df['name1'].str.upper() != df['name2'].str.upper()]
Output:
      name1  name2
2  applepie  apple
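To make the difference between the two attempts concrete, here is a minimal, self-contained sketch. The names `naive_failed`, `mask` and `result` are only illustrative and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'name1': ['apple', 'pear', 'applepie', 'APPLE'],
    'name2': ['APPLE', 'PEAR', 'apple', 'APPLE'],
})

# A Series has no .upper() method, so the naive attempt raises AttributeError.
try:
    df['name1'].upper()
    naive_failed = False
except AttributeError:
    naive_failed = True

# The .str accessor applies upper() to each element and returns a new Series.
mask = df['name1'].str.upper() != df['name2'].str.upper()
result = df[mask]
print(result)
```

One thing to keep in mind: if a column contains missing values, .str.upper() leaves them as NaN, and NaN != NaN evaluates to True, so such rows would be kept by this filter.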
It can often be faster to use list comprehensions when dealing with strings in pandas. Note that this builds a new dataframe, so the original index is not preserved:
pd.DataFrame(
    [[i, j] for i, j in zip(df.name1, df.name2) if i.upper() != j.upper()],
    columns=df.columns
)
      name1  name2
0  applepie  apple
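If the original index matters, one possible variation (not part of the answer above) is to use the list comprehension only to build a boolean mask and then select rows with it. The names `mask` and `filtered` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name1': ['apple', 'pear', 'applepie', 'APPLE'],
    'name2': ['APPLE', 'PEAR', 'apple', 'APPLE'],
})

# Build a plain Python list of booleans, one per row.
mask = [a.upper() != b.upper() for a, b in zip(df['name1'], df['name2'])]

# Selecting with a boolean list keeps the original index labels and any other columns.
filtered = df.loc[mask]
print(filtered)
```

This keeps the row label 2 instead of renumbering from 0. Like the comprehension above, it assumes every value is a string; a missing value would raise an AttributeError on .upper().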
Some timings:
In [159]: df = pd.concat([df]*10000)
In [160]: %%timeit
...: pd.DataFrame(
...: [[i, j] for i, j in zip(df.name1, df.name2) if i.upper() != j.upper()]
...: ,
...: columns=df.columns
...: )
...:
14.2 ms ± 68.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [161]: %timeit df[df['name1'].str.upper() != df['name2'].str.upper()]
35.6 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
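The numbers above depend on the machine, the pandas version and the data, so it is worth measuring on your own setup. Outside IPython, a rough equivalent can be sketched with the standard `timeit` module. The function names and the choice of 10 repetitions are assumptions made for illustration.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    'name1': ['apple', 'pear', 'applepie', 'APPLE'],
    'name2': ['APPLE', 'PEAR', 'apple', 'APPLE'],
})
# Repeat the four rows 10000 times, as in the timings above.
big = pd.concat([df] * 10000, ignore_index=True)


def with_comprehension():
    # Pure-Python loop over both columns, building a new dataframe.
    return pd.DataFrame(
        [[i, j] for i, j in zip(big['name1'], big['name2']) if i.upper() != j.upper()],
        columns=big.columns,
    )


def with_str_accessor():
    # Vectorized-style filtering through the .str accessor.
    return big[big['name1'].str.upper() != big['name2'].str.upper()]


runs = 10
t_comp = timeit.timeit(with_comprehension, number=runs) / runs
t_str = timeit.timeit(with_str_accessor, number=runs) / runs
print(f"list comprehension: {t_comp * 1000:.1f} ms per run")
print(f".str accessor:      {t_str * 1000:.1f} ms per run")
```

Both functions select the same rows; only the index of the result differs, because the comprehension builds a fresh dataframe.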
For ASCII-only data, see the answers above :)
Just as an observation, following this very good answer from @Veedrac: if you want case-insensitive comparison for lots of rows in many languages, you might want to normalize and casefold the values first:
df.col.str.normalize('NFKD').transform(str.casefold)
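Applied to the two-column filter from the question, this could look like the sketch below. The helper `fold` and the sample values are made up for illustration, and `Series.str.casefold()` is used in place of `.transform(str.casefold)`, which assumes pandas 0.25 or later. The strings are written with explicit escapes so the composed and decomposed forms of the accented letter survive copy and paste.

```python
import pandas as pd

df = pd.DataFrame({
    # '\u00ea' is a precomposed e-circumflex, 'E\u0302' is E plus a combining circumflex.
    'name1': ['\u00ea', 'Stra\u00dfe', 'apple'],
    'name2': ['E\u0302', 'STRASSE', 'pear'],
})


def fold(s):
    """Hypothetical helper: Unicode-normalize, then casefold each string."""
    return s.str.normalize('NFKD').str.casefold()


# Plain upper() treats the composed and decomposed forms as different strings.
upper_mask = df['name1'].str.upper() != df['name2'].str.upper()

# After normalizing and casefolding, only the genuinely different row remains.
fold_mask = fold(df['name1']) != fold(df['name2'])

print(df[upper_mask])
print(df[fold_mask])
```

With upper() the first row is reported as a mismatch even though both cells render as the same letter; with normalization and casefolding only the apple/pear row is kept.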
Example:
df=pd.DataFrame({'t':['a','b','A', 'ê', 'ê', 'Ê', 'ß', 'ss']})
df.t.duplicated()
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
and
df.t.str.lower().duplicated()
0 False
1 False
2 True
3 False
4 False
5 True
6 False
7 False
But
df.t.str.normalize('NFKD').transform(str.casefold).duplicated(keep=False)
0 True
1 False
2 True
3 True
4 True
5 True
6 True
7 True
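The results above only make sense if the two lowercase accented letters are stored differently, which is easy to lose when copying the example. The sketch below assumes that one is the precomposed character and the other is a plain "e" followed by a combining circumflex, and writes them with explicit escapes. That assumption reproduces the three outputs shown.

```python
import pandas as pd

t = pd.Series([
    'a', 'b', 'A',
    '\u00ea',    # precomposed e-circumflex (assumed)
    'e\u0302',   # e + combining circumflex (assumed)
    '\u00ca',    # precomposed E-circumflex
    '\u00df',    # sharp s
    'ss',
])

# Exact comparison: every string is distinct.
exact = t.duplicated()

# Lowercasing catches 'A' and the uppercase circumflex, but not the decomposed form or the sharp s.
lowered = t.str.lower().duplicated()

# Normalizing and casefolding groups every variant together; keep=False marks all members of a group.
folded = t.str.normalize('NFKD').str.casefold().duplicated(keep=False)

print(pd.DataFrame({'exact': exact, 'lower': lowered, 'folded': folded}))
```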