
Python using regex to extract parts of a string in pandas column

I've got a pandas df column called 'Raw' whose format is inconsistent. The strings it contains look like this:

'(1T XXX, Europe)'
'(2T YYYY, Latin America)'
'(3T ZZ/ZZZZ, Europe)'
'(4T XXX XXX, Africa)'

The only consistent things about the strings in 'Raw' are that they start with a digit (right after the opening parenthesis), include a comma in the middle followed by a whitespace, and are enclosed in parentheses.

Now, I'd like to create two extra columns (Model and Region) in my dataframe:

  • 'Model' would contain the beginning of the string, i.e. everything between the first parenthesis and the comma
  • 'Region' would contain the end of the string, i.e. everything between the whitespace after the comma and the final parenthesis

How do I do that using regex?

Since there's only one comma and everything is between parentheses, in your case you can use .str.split() instead, after slicing appropriately:

model_region = df.Raw.str[1:-1].str.split(', ', expand = True)

But if you insist:

model_region = df.Raw.str.extract(r'\((.*), (.*)\)', expand = True)

Then:

df['Model'] = model_region[0]
df['Region'] = model_region[1]
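
To make the two options above easier to compare, here is a minimal end-to-end sketch. The DataFrame is built from the four sample strings in the question; the names by_split and by_regex are made up for this illustration, and n=1 is an extra precaution (not part of the answer above) so that only the first ', ' is used as the separator.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({'Raw': ['(1T XXX, Europe)',
                           '(2T YYYY, Latin America)',
                           '(3T ZZ/ZZZZ, Europe)',
                           '(4T XXX XXX, Africa)']})

# Option 1: drop the surrounding parentheses, then split once on ', '
by_split = df['Raw'].str[1:-1].str.split(', ', n=1, expand=True)

# Option 2: capture both parts with a regex (raw string, so the
# backslashes reach the regex engine unchanged)
by_regex = df['Raw'].str.extract(r'\((.*), (.*)\)', expand=True)

# Either result can be assigned to the new columns
df['Model'] = by_split[0]
df['Region'] = by_split[1]
```

Both approaches should give the same two columns for strings of this shape; the split version simply avoids the regex engine.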

Try this: \\(([^,]*), ([^)]*)\\)

See: https://regex101.com/r/fCetWg/1
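
As a quick check of that pattern outside pandas, here is a small sketch using only the re module. It assumes the doubled backslashes above are string-literal escaping, so the regex itself is written here as a raw string with single backslashes; the names pattern, samples and pairs are illustrative.

```python
import re

# [^,]* stops at the first comma, [^)]* stops at the closing parenthesis
pattern = re.compile(r'\(([^,]*), ([^)]*)\)')

samples = ['(1T XXX, Europe)',
           '(2T YYYY, Latin America)',
           '(3T ZZ/ZZZZ, Europe)',
           '(4T XXX XXX, Africa)']

# Each match yields a (model, region) tuple
pairs = [pattern.search(s).groups() for s in samples]
for model, region in pairs:
    print(model, '|', region)
```

Compared with (.*), the negated character classes cannot run past the comma or the closing parenthesis, which may be safer if a string ever contains more than one of either.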

import re

s = '(3T ZZ/ZZZZ, Europe)'
m=re.search(r'\((.*), (.*)\)',s)
print(m.groups())
Model=re.findall(r"(?<=\().+(?=\,)",s)
Region=re.findall(r"(?<=\, ).+(?=\))",s)

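The first regex (Model) uses a lookbehind for the opening parenthesis "(" in front of the model and a lookahead for the "," that ends it. The second regex (Region) matches whatever lies between ", " and the closing ")". Note that re.findall returns a list of matches, so Model and Region are lists here.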
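
If plain strings are wanted rather than lists, a possible variation is to use re.search with the same lookaround patterns and take the matched text. This is only a sketch; the lowercase names model and region are chosen here for illustration.

```python
import re

s = '(3T ZZ/ZZZZ, Europe)'

# (?<=\() requires '(' just before the match, (?=,) requires ',' just after
model = re.search(r'(?<=\().+(?=,)', s).group(0)

# (?<=, ) requires ', ' just before the match, (?=\)) requires ')' just after
region = re.search(r'(?<=, ).+(?=\))', s).group(0)

print(model)
print(region)
```

The lookarounds only assert what surrounds the match, so the parenthesis and the comma are not part of the returned text.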

import pandas as pd

string_list = ['(1T XXX, Europe)',
               '(2T YYYY, Latin America)',
               '(3T ZZ/ZZZZ, Europe)',
               '(4T XXX XXX, Africa)']
df = pd.DataFrame(string_list)
# Column 0 of the result holds the model, column 1 the region
df = df[0].str.extract(r"\(([^,]*), ([^)]*)\)", expand=False)
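
A variation on the snippet above, sketched under the assumption that the column is called 'Raw' as in the question: with named groups, str.extract uses the group names as column names, so the result can be joined straight onto the original frame. The name parts is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Raw': ['(1T XXX, Europe)',
                           '(2T YYYY, Latin America)',
                           '(3T ZZ/ZZZZ, Europe)',
                           '(4T XXX XXX, Africa)']})

# Named groups (?P<name>...) become the column names of the result
parts = df['Raw'].str.extract(r'\((?P<Model>[^,]*), (?P<Region>[^)]*)\)')

# Attach the Model and Region columns next to Raw
df = df.join(parts)
print(df)
```

Rows that do not match the pattern would get NaN in both new columns rather than raising an error.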

If the comma is a reliable separator of your string parts, then you do not need regexp. If df is your dataframe:

# Split on ', ' (comma plus space) so that Region has no leading space
df['Model'] = [x.split(', ')[0].replace('(', '') for x in df['Raw']]
df['Region'] = [x.split(', ')[1].replace(')', '') for x in df['Raw']]

If you want to use regexp, it would look something like:

import re

s = '(1T XXX, Europe)'
# Raw string; '/' is allowed in the model and the space after the comma is skipped
m = re.match(r'\(([\w\s/]+),\s*([\w\s]+)\)', s)
model = m.group(1)
region = m.group(2)
