Search for a partial string match in a data frame column from a list - Pandas - Python
I have a list:
things = ['A1','B2','C3']
I have a pandas data frame with a column containing values separated by semicolons. Some of the rows will contain a match with one of the items in the list above. It won't be a perfect match, since the column holds other parts of a string as well; for example, a row in that column may have 'Wow;Here;This= A1 ;10001;0'.
I want to save the rows that contain a match with items from the list, and then create a new data frame with those selected rows (it should have the same headers). This is what I tried:
import re
for_new_df =[]
for x in df['COLUMN']:
    for mp in things:
        if df[df['COLUMN'].str.contains(mp)]:
            for_new_df.append(mp) #This won't save the whole row - help here too, please.
This code gave me an error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I'm very new to coding, so the more explanation and detail in your answer, the better! Thanks in advance.
You can avoid the loop by joining your list of words to create a regex and then using str.contains:
pat = '|'.join(things)
for_new_df = df[df['COLUMN'].str.contains(pat)]
This should just work.
So the regex pattern becomes 'A1|B2|C3', and this will match any string that contains at least one of these substrings, wherever it occurs in the string.
Example:
In [65]:
things = ['A1','B2','C3']
pat = '|'.join(things)
df = pd.DataFrame({'a':['Wow;Here;This=A1;10001;0', 'B2', 'asdasda', 'asdas']})
df[df['a'].str.contains(pat)]
Out[65]:
                          a
0  Wow;Here;This=A1;10001;0
1                        B2
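Two caveats with the joined pattern are worth a quick sketch: list items containing regex metacharacters (such as '.' or '+') would be interpreted as regex syntax, and missing values in the column produce NaN in the mask rather than True/False. The snippet below is only an illustration under those assumptions; the sample frame with a None entry is made up, and re.escape and the na=False argument are additions that are not part of the answer above.

```python
import re
import pandas as pd

things = ['A1', 'B2', 'C3']

# re.escape guards against items that contain regex metacharacters
pat = '|'.join(re.escape(t) for t in things)

# Hypothetical sample data; None stands in for a missing value
df = pd.DataFrame({'a': ['Wow;Here;This=A1;10001;0', 'B2', None, 'asdas']})

# na=False treats missing values as "no match", so the mask is purely boolean
mask = df['a'].str.contains(pat, na=False)
matches = df[mask]
print(matches)
```

For the plain alphanumeric items in the question, escaping changes nothing and the pattern is still 'A1|B2|C3'.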
As to why it failed:
if df[df['COLUMN'].str.contains(mp)]
In this line, the expression
df[df['COLUMN'].str.contains(mp)]
returns a df masked by the boolean array from your inner str.contains. if doesn't understand how to evaluate a whole DataFrame (or an array of booleans) as a single True or False, hence the error. If you think about it, what should it do if only one value is True, or all but one are True? It expects a scalar, not an array-like value.
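To make that concrete, here is a small sketch with made-up data showing the failing truth test next to one way of reducing the boolean Series to a single scalar with .any(). The sample rows and the names raised and found are hypothetical and only serve the illustration.

```python
import pandas as pd

things = ['A1', 'B2', 'C3']
# Hypothetical sample data in the shape described in the question
df = pd.DataFrame({'COLUMN': ['Wow;Here;This= A1 ;10001;0', 'nothing here', 'x;C3;y']})

# `if` calls bool() on its argument; for a DataFrame that raises ValueError
try:
    bool(df[df['COLUMN'].str.contains('A1')])
    raised = False
except ValueError:
    raised = True

# Reducing the boolean Series with .any() gives `if` a single scalar to test
found = [mp for mp in things if df['COLUMN'].str.contains(mp).any()]
print(raised, found)
```

This only tells you which list items occur somewhere in the column; selecting the rows themselves is still done with the boolean mask.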
Pandas is actually amazing, but I don't find it very easy to use. However, it does have many functions designed to make life easy, including tools for searching through huge data frames.
Though it may not be a full solution to your problem, this may help set you off on the right foot. I have assumed that you know which column you are searching in (column A in my example).
import pandas as pd
df = pd.DataFrame({'A' : pd.Categorical(['Wow;Here;This=A1;10001;0', 'Another;C3;Row=Great;100', 'This;D6;Row=bad100']),
                   'B' : 'foo'})
print(df) # Original data frame
print()
print(df['A'].str.contains('A1|B2|C3')) # Boolean array showing matches for col A
print()
print(df[df['A'].str.contains('A1|B2|C3')]) # Matching rows
The output:
                          A    B
0  Wow;Here;This=A1;10001;0  foo
1  Another;C3;Row=Great;100  foo
2        This;D6;Row=bad100  foo

0     True
1     True
2    False
Name: A, dtype: bool

                          A    B
0  Wow;Here;This=A1;10001;0  foo
1  Another;C3;Row=Great;100  foo
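Since the question asks for a new data frame with the same headers, the matching rows can be kept as their own frame. The sketch below assumes plain string values rather than the Categorical used above; the matched column and the use of str.extract to record which list item was found are illustrative additions, not something the question requires.

```python
import pandas as pd

things = ['A1', 'B2', 'C3']
pat = '|'.join(things)

df = pd.DataFrame({'A': ['Wow;Here;This=A1;10001;0',
                         'Another;C3;Row=Great;100',
                         'This;D6;Row=bad100'],
                   'B': 'foo'})

# Keep the matching rows as an independent frame with the same column headers
new_df = df[df['A'].str.contains(pat)].copy()

# Hypothetical extra column: the leftmost list item found in each row
new_df['matched'] = new_df['A'].str.extract('(' + pat + ')', expand=False)
print(new_df)
```

The .copy() call makes new_df independent of df, so adding a column to it does not trigger a chained-assignment warning.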