Searching a data frame column for partial string matches from a list - Pandas - Python

Question

I have a list:

things = ['A1','B2','C3']

I have a pandas data frame with a column containing values separated by semicolons. Some of the rows will contain a match with one of the items in the list above. It won't be a perfect match, since the column holds other parts of a string as well; for example, a row in that column may have 'Wow;Here;This= A1 ;10001;0'.

I want to save the rows that contain a match with items from the list, and then create a new data frame with those selected rows (it should have the same headers). This is what I tried:

import re

for_new_df =[]

for x in df['COLUMN']:
    for mp in things:
        if df[df['COLUMN'].str.contains(mp)]:
            for_new_df.append(mp)  #This won't save the whole row - help here too, please.

This code gave me an error:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I'm very new to coding, so the more explanation and detail in your answer, the better! Thanks in advance.

Answer 1

You can avoid the loop by joining your list of words to create a regex and using str.contains:

pat = '|'.join(things)
for_new_df = df[df['COLUMN'].str.contains(pat)]

This should just work.

The regex pattern becomes 'A1|B2|C3', and it matches any string that contains at least one of these substrings, anywhere in the string.

Example:

In [65]:
things = ['A1','B2','C3']
pat = '|'.join(things)
df = pd.DataFrame({'a':['Wow;Here;This=A1;10001;0', 'B2', 'asdasda', 'asdas']})
df[df['a'].str.contains(pat)]

Out[65]:
                          a
0  Wow;Here;This=A1;10001;0
1                        B2
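
One caveat with the joined pattern: each list item is interpreted as a regex, so an item containing a character such as '.' or '+' would not be matched literally. A minimal sketch of one way to guard against that, assuming the items should be treated as plain text (the item 'C.3' and the extra rows are made up for illustration):

```python
import re
import pandas as pd

# 'C.3' is a hypothetical item containing a regex metacharacter
things = ['A1', 'B2', 'C.3']

# Escape each item so it is matched literally, then join with '|' (regex "or")
pat = '|'.join(re.escape(t) for t in things)

df = pd.DataFrame({'a': ['Wow;Here;This=A1;10001;0', 'B2', 'CX3', 'C.3;x']})

# Keep only the rows whose column 'a' contains any of the items
matches = df[df['a'].str.contains(pat)]
print(matches)
```

Without the escaping, the pattern 'C.3' would also match 'CX3', because '.' stands for any character in a regex.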

As to why it failed:

if df[df['COLUMN'].str.contains(mp)]:

this line:

df[df['COLUMN'].str.contains(mp)]

returns a df masked by the boolean array from the inner str.contains. The if statement doesn't know how to evaluate a whole DataFrame as a single true/false value, hence the error. If you think about it, what should it do if only one value is True, or all but one are True? It expects a scalar, not an array-like value.
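
If you do want to keep a loop, the explanation above suggests a repair: build one boolean per row instead of testing a DataFrame in an if. A minimal sketch, assuming a column named 'COLUMN' as in the question (the sample rows and the 'OTHER' column are invented for illustration):

```python
import pandas as pd

things = ['A1', 'B2', 'C3']
df = pd.DataFrame({'COLUMN': ['Wow;Here;This= A1 ;10001;0', 'nothing here', 'x;C3;y'],
                   'OTHER': [1, 2, 3]})

# Putting a DataFrame directly in an `if` raises the ValueError from the question
try:
    if df[df['COLUMN'].str.contains('A1')]:
        pass
except ValueError as err:
    error_message = str(err)

# Loop-based repair: each str.contains call returns one boolean per row;
# combine them with | so a row is kept if it matches any item.
mask = pd.Series(False, index=df.index)
for mp in things:
    mask = mask | df['COLUMN'].str.contains(mp, regex=False)

for_new_df = df[mask]  # whole rows, same column headers as df
print(for_new_df)
```

Here regex=False tells str.contains to treat each item as a plain substring. The single joined pattern shown earlier does the same job in one call; the loop is only meant to show what the mask contains.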

Answer 2

Pandas is actually amazing, but I don't find it very easy to use. However, it does have many functions designed to make life easy, including tools for searching through huge data frames.

Though it may not be a full solution to your problem, this may help set you off on the right foot. I have assumed that you know which column you are searching in, column A in my example.

import pandas as pd

df = pd.DataFrame({'A' : pd.Categorical(['Wow;Here;This=A1;10001;0', 'Another;C3;Row=Great;100', 'This;D6;Row=bad100']),
                   'B' : 'foo'})
print(df)  # Original data frame
print()
print(df['A'].str.contains('A1|B2|C3'))  # Boolean array showing matches for col A
print()
print(df[df['A'].str.contains('A1|B2|C3')])  # Matching rows

The output:

                          A    B
0  Wow;Here;This=A1;10001;0  foo
1  Another;C3;Row=Great;100  foo
2        This;D6;Row=bad100  foo

0     True
1     True
2    False
Name: A, dtype: bool

                          A    B
0  Wow;Here;This=A1;10001;0  foo
1  Another;C3;Row=Great;100  foo
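
One thing to watch for with real data: if the searched column has missing values, then depending on the pandas version and column dtype, the mask from str.contains can contain missing entries too, and using it to select rows may fail. A small sketch of one way around this, assuming missing entries should simply count as "no match" (the data with a None entry is hypothetical):

```python
import pandas as pd

# Hypothetical data with a missing value in the searched column
df = pd.DataFrame({'A': ['Wow;Here;This=A1;10001;0', None, 'This;D6;Row=bad100'],
                   'B': 'foo'})

# na=False makes missing entries count as "no match", so the mask is purely boolean
mask = df['A'].str.contains('A1|B2|C3', na=False)

matching_rows = df[mask]
print(matching_rows)
```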

Answer 1: score 9, accepted, 2016-07-12 15:50:46

Answer 2: score 2, 2016-07-12 16:36:38
