Regex and pandas: extract part of a string when a name matches

Question

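I have a pandas DataFrame containing instances of web chat between two people: a customer and a service desk operator.

The customer's name is always announced in the first line of the web chat, when the customer enters the conversation.

Example 1:

In: df['log'][0]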

Out: [14:40:48] You are joining a chat with James[14:40:48] James: Hello, I\'m looking to find out more about the services and products you offer.[14:41:05] Greg: Thank you for contacting us. [17:41:14] Greg: Could I start by asking what services lines or products you are interested in knowing more about, please?[14:41:23] James: I would like to know more about your gardening and guttering service.[14:43:20] James: hello?[14:43:32] Greg: thank you, for more information on those please visit www.example.com/more_examples.[14:44:12] James: Thanks[14:44:38] James has exited the session.

Example 2:

In: df['log'][1]

Out: [09:01:25] You are joining a chat with Roy Andrews[09:01:25] Roy Andrews: I\'m asking on behalf of partner whether she is able to still claim warranty on a coffee machine she purchased a year and a half ago? [09:02:00] Jamie: Thank you for contacting us. Could I start by asking for the type of coffee machine purchased please, and whether she still has a receipt?[09:02:23] Roy Andrews: BRX0403, she no longer has a receipt.[09:05:30] Jamie: Thank you, my interpretation is that she would not be able to claim as she is no longer under warranty. [09:08:46] Jamie: for more information on our product warranty policy please see www.brandx.com/warranty-policy/information[09:09:13] Roy Andrews: Thanks for the links, I will let her know.[09:09:15] Roy Andrews has exited the session.

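The names in the chats always vary, because different customers use the web chat service.

A customer can enter a chat with one or more names, for example: James, Ravi, Roy Andrews.

Requirement:

I would like to separate all of the customer's chat lines (e.g. the lines typed by James and by Roy Andrews) from the df['log'] column into a new column, df['text_analysis'].

For Example 1 above, I would like to see:

In: df['text_analysis'][0]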

Out: [14:40:48] You are joining a chat with James[14:40:48] James: Hello, I\'m looking to find out more about the services and products you offer.[14:41:23] James: I would like to know more about your gardening and guttering service.[14:43:20] James: hello?[14:44:12] James: Thanks

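EDIT: The ideal solution would extract the substrings shown in the example above and omit the final timestamped line, [14:44:38] James has exited the session.

What I have tried so far: I extracted the customer names from the df['log'] column into a new column called df['names'] using: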

df['names'] = df['log'].apply(lambda x: x.split(' ')[7].split('[')[0])
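One caveat with splitting on spaces is that a multi-word name such as Roy Andrews gets cut down to its first word. A minimal sketch of an alternative, assuming every log opens with the greeting "You are joining a chat with <name>" immediately followed by the next timestamp; the helper name extract_name and the shortened sample logs are illustrative only:

```python
import re

# Assumption: every log opens with "You are joining a chat with <name>"
# immediately followed by the next "[hh:mm:ss]" timestamp.
NAME_PATTERN = re.compile(r"You are joining a chat with (.+?)\[")

def extract_name(log):
    """Return the customer name announced in the greeting, or None."""
    match = NAME_PATTERN.search(log)
    return match.group(1) if match else None

# Shortened sample logs in the same shape as the ones in the question.
logs = [
    "[14:40:48] You are joining a chat with James[14:40:48] James: Hello.",
    "[09:01:25] You are joining a chat with Roy Andrews[09:01:25] Roy Andrews: Hi.",
]
names = [extract_name(log) for log in logs]
print(names)
```

On the DataFrame this could be applied as df['names'] = df['log'].apply(extract_name), or the same pattern could be passed to df['log'].str.extract(r'You are joining a chat with (.+?)\[', expand=False).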

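I would like to use the names in the df['names'] column in the pandas str.split() function, something like:

df['log'].str.split(df['names']) but this doesn't work, and even if the split did happen in this scenario, I don't think it would correctly separate the customer's lines from the service operator's.

I have also tried incorporating the names into a regex-type solution: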

df['log'].str.extract('([^.]*{}[^.]*)').format(df['log']))

But this doesn't work either (I am guessing because .extract() doesn't support format).

Any help would be appreciated.
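Since the name differs from row to row, one option is to format the name into the pattern before the regex is run, one row at a time. The following is only a sketch under that assumption; customer_lines is a hypothetical helper, the sample log is shortened, and a message that itself contains a "[" would be cut short by this pattern:

```python
import re

def customer_lines(log, name):
    """Return the messages typed by `name`, without their timestamps."""
    # re.escape guards against names containing regex metacharacters.
    pattern = r"\] {}: ([^\[]*)".format(re.escape(name))
    return re.findall(pattern, log)

# Shortened sample log in the same shape as the ones in the question.
log = ("[14:40:48] You are joining a chat with James"
       "[14:40:48] James: Hello.[14:41:05] Greg: Thank you. "
       "[14:41:23] James: Thanks[14:44:38] James has exited the session.")
lines = customer_lines(log, "James")
print(lines)
```

With both columns available, the per-row call could look like [customer_lines(l, n) for l, n in zip(df['log'], df['names'])].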

Answer 1

Using regex, where longs is your first log string:

import re
re.match(r'^.*(?=\[)', longs).group()

Result:

"[14:40:48] You are joining a chat with James[14:40:48] James: Hello, I'm looking to find out more about the services and products you offer.[14:41:05] Greg: Thank you for contacting us. [17:41:14] Greg: Could I start by asking what services lines or products you are interested in knowing more about, please?[14:41:23] James: I would like to know more about your gardening and guttering service.[14:43:20] James: hello?[14:43:32] Greg: thank you, for more information on those please visit www.example.com/more_examples.[14:44:12] James: Thanks"

You can wrap this regex call up and apply it to your DataFrame:

df['text_analysis'] = df['log'].apply(lambda x: re.match(r'^.*(?=\[)', x).group())

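Explanation: the regex string '^.*(?=\[)' means: start at the beginning (^), match any number of any characters (.*), and end at a [ without including it ((?=\[)). Because .* is greedy and matches the longest possible string, the match runs from the start up to the last [, which is excluded.

The individual lines can be extracted with: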
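The effect of greediness is easy to check on a short string. The sample below is made up for illustration; it compares the greedy .* with the lazy .*? in front of the same lookahead:

```python
import re

# Made-up three-line chat in the same shape as the logs in the question.
s = "[10:00:00] Ann: Hi[10:00:05] Bob: Hello[10:00:09] Ann has exited the session."

# Greedy: .* runs as far as it can while a "[" still follows,
# so the match stops just before the LAST timestamp.
greedy = re.match(r"^.*(?=\[)", s).group()

# Lazy: .*? stops at the first position followed by "[",
# which is the very start of the string, so the match is empty.
lazy = re.match(r"^.*?(?=\[)", s).group()

print(repr(greedy))
print(repr(lazy))
```

Note that . does not match a newline by default, so a log containing line breaks would need the re.DOTALL flag for this approach to behave the same way.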

import re
# longs is the full log string used above; replace James with the customer's name
customerspeak = re.findall(r'(?<=\[(?:\d{2}:){2}\d{2}\]) James:[^\[]*', longs)

Output:

[" James: Hello, I'm looking to find out more about the services and products you offer.",
 ' James: I would like to know more about your gardening and guttering service.',
 ' James: hello?',
 ' James: Thanks']

If you want them on a single line, you can use ''.join(customerspeak)
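To get closer to the output requested in the question (timestamps kept, greeting kept, operator lines and the exit message dropped) without hard-coding the name, the name can be read from the greeting first. This is a sketch rather than a definitive implementation; customer_transcript is a hypothetical helper, the sample log is shortened, and it assumes timestamps always look like [hh:mm:ss] and that messages contain no "[":

```python
import re

TIMESTAMP = r"\[\d{2}:\d{2}:\d{2}\]"

def customer_transcript(log):
    """Keep the greeting plus every timestamped line typed by the customer."""
    greeting = re.search(r"You are joining a chat with (.+?)\[", log)
    if greeting is None:
        return ""
    name = re.escape(greeting.group(1))
    # Match either the greeting line or a "<name>: message" line,
    # each together with its leading timestamp.
    pattern = r"{ts} (?:You are joining a chat with {name}|{name}:[^\[]*)".format(
        ts=TIMESTAMP, name=name
    )
    return "".join(re.findall(pattern, log))

# Shortened sample log in the same shape as the ones in the question.
log = ("[14:40:48] You are joining a chat with James"
       "[14:40:48] James: Hello.[14:41:05] Greg: Thank you. "
       "[14:41:23] James: Thanks[14:44:38] James has exited the session.")
result = customer_transcript(log)
print(result)
```

Applied to the whole column, this would be df['text_analysis'] = df['log'].apply(customer_transcript).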

Solution 1
0 Accepted 2018-10-31 02:39:38