
Use regular expression to extract elements from a pandas data frame

From the following data frame:

import pandas as pd

d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}

df = pd.DataFrame.from_dict(d)

My ultimate goal is to extract the letters a, b or c (as strings) into a pandas Series. For that I am using the .str.findall() method, which works like re.findall(), as shown below:

# import the module
import re
# define the patterns
pat = 'a|b|c'

# extract the patterns from the elements in the specified column
df['col1'].str.findall(pat)

The problem is that the output, i.e. the letter a, b or c in each row, comes back wrapped in a single-element list, as shown below:

Out[301]: 
0    [a]
1    [b]
2    [c]
3    [a]

I would instead like to have the letters a, b or c as plain strings, as shown below:

0    a
1    b
2    c
3    a

I know that if I combine re.search() with .group() I can get a string, but if I do:

df['col1'].str.search(pat).group()

I will get the following error message:

AttributeError: 'StringMethods' object has no attribute 'search'

Using .str.split() won't do the job because, in my original dataframe, I want to capture strings that might contain the delimiter (e.g. I might want to capture ab).

Does anyone know a simple solution for that, perhaps one that avoids iterative operations such as a for loop or a list comprehension?

Use str.extract with a capturing group:

import pandas as pd

d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}

df = pd.DataFrame.from_dict(d)

result = df['col1'].str.extract('(a|b|c)')

print(result)

Output

   0
0  a
1  b
2  c
3  a
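
Note that the result above is a DataFrame with a single column labelled 0, not a Series. A minimal sketch, assuming the same sample data, of getting a Series directly by passing expand=False (the variable name letters is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']})

# With exactly one capturing group, expand=False returns a Series
# instead of a one-column DataFrame
letters = df['col1'].str.extract('(a|b|c)', expand=False)

print(type(letters).__name__)  # Series
print(letters.tolist())        # ['a', 'b', 'c', 'a']
```

This should match the shape asked for in the question, one string per row.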

Fix your code

pat = 'a|b|c'
df['col1'].str.findall(pat).str[0]
Out[309]: 
0    a
1    b
2    c
3    a
Name: col1, dtype: object
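
One thing worth checking with this approach is what happens when a row has no match at all. The sketch below adds a hypothetical row 'x-999' (not in the original data) to illustrate: findall returns an empty list for it, and .str[0] then yields a missing value rather than raising an error.

```python
import pandas as pd

# 'x-999' is a hypothetical row added only to show the no-match case
s = pd.Series(['a-1524112-124', 'b-1515', 'x-999'])

# findall gives a list per row; .str[0] picks the first element of each list
first = s.str.findall('a|b|c').str[0]

print(first.tolist())
print(first.isna().tolist())  # [False, False, True]
```

If missing values are not acceptable downstream, they can be handled afterwards with the usual tools such as dropna() or fillna().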

Simply try str.split() and keep the first piece, like this: df["col1"].str.split("-", n=1, expand=True)[0]

import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
# expand=True returns a two-column DataFrame; keep column 0 (the text before the first "-")
df['col1'] = df["col1"].str.split("-", n=1, expand=True)[0]
print(df.head())

Output:

  col1
0    a
1    b
2    c
3    a
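
The question states that split is not enough because some of the strings to capture may contain the delimiter. The sketch below illustrates the difference with a hypothetical label 'a-b' (an assumption made for this example, not taken from the original data): splitting on the first "-" truncates it, while an anchored str.extract pattern can keep it, provided the longer alternative is listed first.

```python
import pandas as pd

# 'a-b-777' is a hypothetical row whose label 'a-b' contains the delimiter
s = pd.Series(['a-1524112-124', 'b-1515', 'a-b-777'])

# Splitting on the first '-' cuts the hypothetical label down to 'a'
by_split = s.str.split('-', n=1).str[0]

# Alternatives are tried left to right, so 'a-b' must come before 'a';
# '^' anchors the match at the start of each string
by_regex = s.str.extract(r'^(a-b|a|b|c)', expand=False)

print(by_split.tolist())  # ['a', 'b', 'a']
print(by_regex.tolist())  # ['a', 'b', 'a-b']
```

Listing the longer alternative first matters because Python's re module stops at the first alternative that matches, not the longest one.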
