Use regular expression to extract elements from a pandas data frame
From the following data frame:
import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
My ultimate goal is to extract the letters a, b or c (as strings) into a pandas Series. For that I am using .str.findall(), the pandas counterpart of re.findall(), as shown below:
# import the module
import re
# define the patterns
pat = 'a|b|c'
# extract the patterns from the elements in the specified column
df['col1'].str.findall(pat)
The problem is that the output, i.e. the letter a, b or c in each row, comes back wrapped in a single-element list, as shown below:
Out[301]:
0 [a]
1 [b]
2 [c]
3 [a]
I would like to have the letters a, b or c as plain strings instead, as shown below:
0 a
1 b
2 c
3 a
I know that if I combine re.search() with .group() I can get a string, but if I do:
df['col1'].str.search(pat).group()
I will get the following error message:
AttributeError: 'StringMethods' object has no attribute 'search'
Using .str.split() won't do the job because, in my original dataframe, I want to capture strings that might contain the delimiter (e.g. I might want to capture ab).
Does anyone know a simple solution for that, perhaps avoiding iterative operations such as a for loop or list comprehension?
Fix your code
pat = 'a|b|c'
df['col1'].str.findall(pat).str[0]
Out[309]:
0 a
1 b
2 c
3 a
Name: col1, dtype: object
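As a possible alternative to indexing into the lists returned by findall, pandas also provides Series.str.extract, which is roughly the vectorized counterpart of re.search() followed by .group(1). The sketch below is only an illustration and not part of the original answer; the variable name letters is an assumption, and the pattern is wrapped in parentheses because str.extract requires a capture group.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']})

# str.extract needs a capture group; expand=False returns a Series
# holding the first match per row (NaN where nothing matches)
letters = df['col1'].str.extract(r'(a|b|c)', expand=False)

print(letters.tolist())
```

Rows without a match come back as NaN rather than raising an error, which may or may not be what you want.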
Simply try str.split() like this:
df["col1"].str.split("-", n = 1, expand = True)
import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
# expand=True returns a DataFrame with two columns; keep only the first one
df['col1'] = df["col1"].str.split("-", n = 1, expand = True)[0]
print(df.head())
Output:
col1
0 a
1 b
2 c
3 a
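The question mentions that the value to capture may itself contain the delimiter, in which case splitting can cut it short. The sketch below contrasts the two approaches under that assumption; the sample values and the token 'a-b' are hypothetical and not taken from the original data, and the names by_split and by_regex are illustrative.

```python
import pandas as pd

# Hypothetical data: the first value starts with the token 'a-b'
s = pd.Series(['a-b-1524', 'b-1515', 'c-584854'])

# Splitting on the first '-' truncates 'a-b' to 'a'
by_split = s.str.split('-', n=1).str[0]

# An anchored pattern lists the longer token first so it wins over 'a'
by_regex = s.str.extract(r'^(a-b|a|b|c)', expand=False)

print(by_split.tolist())
print(by_regex.tolist())
```

With alternation, the regex engine tries the options left to right, so putting the longer token first is what lets 'a-b' survive intact here.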