Python: extracting part of a string in a pandas column using a regular expression

Question

I've got a pandas df column called 'Raw' whose format is inconsistent. The strings it contains look like this:

'(1T XXX, Europe)'
'(2T YYYY, Latin America)'
'(3T ZZ/ZZZZ, Europe)'
'(4T XXX XXX, Africa)'

The only consistent things about the strings in 'Raw' are that they start with a digit, include a comma in the middle followed by a whitespace, and contain parentheses as well.

Now, I'd like to create two extra columns (Model and Region) in my dataframe:

'Model' would contain the beginning of the string, i.e. everything between the first parenthesis and the comma
'Region' would contain the end of the string, i.e. everything between the whitespace after the comma and the final parenthesis

How do I do that using regex?

Answer 1

Since there's only one comma and everything is between parentheses, in your case use .str.split() instead, after slicing off the parentheses:

model_region = df.Raw.str[1:-1].str.split(', ', expand = True)

But if you insist on a regex:

model_region = df.Raw.str.extract(r'\((.*), (.*)\)', expand = True)

Then:

df['Model'] = model_region[0]
df['Region'] = model_region[1]
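
For reference, here is a minimal end-to-end sketch of both approaches on the four sample strings from the question. The DataFrame construction and the names by_split and by_regex are assumptions made for illustration; only the column name 'Raw' comes from the question.

```python
import pandas as pd

# Sample data copied from the question; the DataFrame itself is assumed.
df = pd.DataFrame({'Raw': ['(1T XXX, Europe)',
                           '(2T YYYY, Latin America)',
                           '(3T ZZ/ZZZZ, Europe)',
                           '(4T XXX XXX, Africa)']})

# Approach 1: drop the surrounding parentheses, then split on ', '.
by_split = df['Raw'].str[1:-1].str.split(', ', expand=True)

# Approach 2: capture the two parts with a regex (raw string for the backslashes).
by_regex = df['Raw'].str.extract(r'\((.*), (.*)\)', expand=True)

# Either result can be assigned; both have the parts in columns 0 and 1.
df['Model'] = by_split[0]
df['Region'] = by_split[1]
print(df[['Model', 'Region']])
```

On this sample both approaches should give the same two columns. The split version relies on every string having exactly one ', ' separator.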

Answer 2

Try this: \(([^,]*), ([^)]*)\)

See: https://regex101.com/r/fCetWg/1
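
A possible way to plug this pattern into pandas is sketched below. The named groups (Model, Region) and the variable names raw, pattern and parts are additions for illustration and are not part of the answer above; with named groups, .str.extract() uses the group names as column names.

```python
import pandas as pd

# Two of the sample strings from the question; the Series itself is assumed.
raw = pd.Series(['(1T XXX, Europe)', '(4T XXX XXX, Africa)'], name='Raw')

# [^,]* stops at the first comma, [^)]* stops at the closing parenthesis.
# Named groups become the column names of the extracted DataFrame.
pattern = r'\((?P<Model>[^,]*), (?P<Region>[^)]*)\)'
parts = raw.str.extract(pattern)
print(parts)
```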

Answer 3

import re

s = '(3T ZZ/ZZZZ, Europe)'
m = re.search(r'\((.*), (.*)\)', s)
print(m.groups())
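
Note that re.search() returns None when the string does not match, so calling m.groups() directly would raise an AttributeError on malformed input. A small sketch of one way to guard against that follows; the helper name split_raw and the (None, None) fallback are hypothetical choices, not part of the answer above.

```python
import re

PATTERN = re.compile(r'\((.*), (.*)\)')

def split_raw(s):
    """Return (model, region), or (None, None) when s does not match."""
    m = PATTERN.search(s)
    return m.groups() if m else (None, None)

print(split_raw('(3T ZZ/ZZZZ, Europe)'))
print(split_raw('no parentheses here'))
```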

Answer 4

import re
s = '(3T ZZ/ZZZZ, Europe)'
Model = re.findall(r"(?<=\().+(?=,)", s)
Region = re.findall(r"(?<=, ).+(?=\))", s)

The first regex uses a lookbehind for the opening parenthesis "(" in front of the model and a lookahead for the "," that closes it. The second regex matches any string between ", " and ")". Note that re.findall returns a list of matches, not a single string.
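
To apply the same lookaround patterns to a whole column, something along these lines may work. The Series raw and the use of .str.findall(...).str[0] are assumptions for illustration; .str.extract() is not used here because it requires at least one capture group.

```python
import pandas as pd

# Two of the sample strings from the question; the Series itself is assumed.
raw = pd.Series(['(2T YYYY, Latin America)', '(3T ZZ/ZZZZ, Europe)'])

# findall returns a list per row; .str[0] picks the first match (NaN if none).
model = raw.str.findall(r'(?<=\().+(?=,)').str[0]
region = raw.str.findall(r'(?<=, ).+(?=\))').str[0]
print(model.tolist(), region.tolist())
```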

Answer 5

import pandas as pd

string_list = ['(1T XXX, Europe)',
               '(2T YYYY, Latin America)',
               '(3T ZZ/ZZZZ, Europe)',
               '(4T XXX XXX, Africa)']
df = pd.DataFrame(string_list)
# Two capture groups, so the result is a DataFrame with columns 0 and 1
df = df[0].str.extract(r"\(([^,]*), ([^)]*)\)", expand=False)
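
The snippet above replaces df with the extracted parts. If the original column should be kept, one option is to name the extracted columns and join them back, roughly as follows. The column name 'Raw' is taken from the question; the variable name parts and the shortened sample are assumptions for illustration.

```python
import pandas as pd

# Shortened sample from the question, stored under the 'Raw' column name.
df = pd.DataFrame({'Raw': ['(1T XXX, Europe)', '(2T YYYY, Latin America)']})

parts = df['Raw'].str.extract(r'\(([^,]*), ([^)]*)\)')
parts.columns = ['Model', 'Region']

# join aligns on the index, so each row keeps its original Raw value.
df = df.join(parts)
print(df)
```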

Answer 6

If the comma is a reliable separator of your string parts, then you do not need a regexp. If df is your dataframe:

# Split on ', ' (comma plus space) so Region has no leading whitespace
df['Model'] = [x.split(', ')[0].replace('(', '') for x in df['Raw']]
df['Region'] = [x.split(', ')[1].replace(')', '') for x in df['Raw']]

If you want to use a regexp, it would look something like:

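import re

s = '(1T XXX, Europe)'
# [^,]+ and [^)]+ also accept characters such as '/'; \s* drops the space after the comma
m = re.match(r'\(([^,]+),\s*([^)]+)\)', s)
model = m.group(1)
region = m.group(2)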
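
One caveat when choosing the character class: a pattern built on [\w\s]+ only accepts word characters and whitespace, so it would not match a model such as 'ZZ/ZZZZ' from the question. The sketch below compares it with a negated class on the four sample strings; the names narrow and broad are made up for this illustration.

```python
import re

samples = ['(1T XXX, Europe)',
           '(2T YYYY, Latin America)',
           '(3T ZZ/ZZZZ, Europe)',
           '(4T XXX XXX, Africa)']

# Only word characters and whitespace are allowed inside each part.
narrow = re.compile(r'\(([\w\s]+),([\w\s]+)\)')
# Anything except the delimiter is allowed inside each part.
broad = re.compile(r'\(([^,]+),\s*([^)]+)\)')

for s in samples:
    print(s, bool(narrow.match(s)), bool(broad.match(s)))
```

The negated classes make fewer assumptions about what a model or region name may contain, which seems safer given that the question describes the format as inconsistent.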


6 answers

Answer 1
3 2017-07-05 09:32:42

Answer 2
1 2017-07-05 09:32:16

Answer 3
0 2017-07-05 09:35:33

Answer 4
0 2017-07-05 09:39:39

Answer 5
0 2017-07-05 09:42:53

Answer 6
0 2017-07-05 09:47:09
