Extracting elements from a pandas DataFrame using regular expressions

Question

From the following data frame:

import pandas as pd

d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}

df = pd.DataFrame.from_dict(d)

My ultimate goal is to extract the letters a, b or c (as strings) into a pandas Series. For that I am using the .str.findall() method, which mirrors re.findall(), as shown below:

# import the module
import re
# define the patterns
pat = 'a|b|c'

# extract the patterns from the elements in the specified column
df['col1'].str.findall(pat)

The problem is that the output, i.e. the letter a, b or c in each row, comes back wrapped in a single-element list, as shown below:

Out[301]: 
0    [a]
1    [b]
2    [c]
3    [a]

Instead, I would like to have the letters a, b or c as plain strings, as shown below:

0    a
1    b
2    c
3    a

I know that if I combine re.search() with .group() I can get a string, but if I do:

df['col1'].str.search(pat).group()

I get the following error message:

AttributeError: 'StringMethods' object has no attribute 'search'

Using .str.split() won't do the job because, in my original dataframe, I want to capture strings that might contain the delimiter (e.g. I might want to capture ab).

Does anyone know a simple solution for this, ideally one that avoids explicit iteration such as a for loop or a list comprehension?

Answer 1

Use str.extract with a capturing group:

import pandas as pd

d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}

df = pd.DataFrame.from_dict(d)

result = df['col1'].str.extract('(a|b|c)')

print(result)

Output

   0
0  a
1  b
2  c
3  a
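One detail worth noting: with its default expand=True, str.extract returns a DataFrame, while the question asks for a Series. A minimal sketch, assuming the same data as above (the variable name letters is only illustrative), passes expand=False so that a single capturing group yields a Series:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']})

# With one capturing group, expand=False returns a Series rather than a DataFrame
letters = df['col1'].str.extract('(a|b|c)', expand=False)

print(type(letters).__name__)  # Series
print(letters.tolist())        # ['a', 'b', 'c', 'a']
```

Rows without a match would come back as NaN rather than raising an error.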

Answer 2

Fix your code by taking the first element of each list with .str[0]:

pat = 'a|b|c'
df['col1'].str.findall(pat).str[0]
Out[309]: 
0    a
1    b
2    c
3    a
Name: col1, dtype: object
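A point that may matter on real data: when a row contains no match, findall returns an empty list, and .str[0] then yields NaN for that row instead of raising. The sketch below illustrates this; the value 'x-1515' is a hypothetical non-matching entry that is not part of the original data.

```python
import pandas as pd

# 'x-1515' is a made-up value that matches none of a, b or c
s = pd.Series(['a-1524112-124', 'x-1515', 'c-584854'])

matches = s.str.findall('a|b|c')   # the middle row becomes an empty list
first = matches.str[0]             # out-of-range positions become NaN

print(first.isna().tolist())  # [False, True, False]
```

If missing values are not acceptable downstream, they can be filtered out afterwards, for example with dropna().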

Answer 3

Simply try str.split() and keep the first piece of each split, like this: df["col1"].str.split("-", n=1, expand=True)[0]

import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
# split on the first "-" only, then keep column 0 (the part before the delimiter)
df['col1'] = df["col1"].str.split("-", n=1, expand=True)[0]
print(df.head())

Output:

  col1
0    a
1    b
2    c
3    a
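The question mentions that the wanted string may itself contain the delimiter, which is where splitting falls short. The sketch below compares the two approaches under an assumption: the label 'a-b' and the row 'a-b-1515' are hypothetical and only stand in for such a case. Regex alternation is tried left to right, so the longer alternative is listed first.

```python
import pandas as pd

# Hypothetical data: the first label, 'a-b', contains the '-' delimiter
s = pd.Series(['a-b-1515', 'b-1515', 'a-15154'])

# Splitting on the first '-' cuts the label 'a-b' down to 'a'
by_split = s.str.split('-', n=1, expand=True)[0]

# Anchored alternation with the longest option first keeps 'a-b' intact
by_extract = s.str.extract(r'^(a-b|a|b|c)', expand=False)

print(by_split.tolist())    # ['a', 'b', 'a']
print(by_extract.tolist())  # ['a-b', 'b', 'a']
```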

Answer 1
1 2019-01-07 15:19:11

Answer 2
0 Accepted 2019-01-07 15:19:51

Answer 3
0 2019-01-07 15:20:26
