
Regex and pandas: extract partial string on name match

I have a pandas data frame containing instances of web chat between two people, the customer and the service desk operator.

The customer's name is always announced in the first line of the web chat as the customer enters the conversation.

Example 1:

In: df['log'][0]

Out: [14:40:48] You are joining a chat with James[14:40:48] James: Hello, I\'m looking to find out more about the services and products you offer.[14:41:05] Greg: Thank you for contacting us. [17:41:14] Greg: Could I start by asking what services lines or products you are interested in knowing more about, please?[14:41:23] James: I would like to know more about your gardening and guttering service.[14:43:20] James: hello?[14:43:32] Greg: thank you, for more information on those please visit www.example.com/more_examples.[14:44:12] James: Thanks[14:44:38] James has exited the session.

Example 2:

In: df['log'][1]

Out: [09:01:25] You are joining a chat with Roy Andrews[09:01:25] Roy Andrews: I\'m asking on behalf of partner whether she is able to still claim warranty on a coffee machine she purchased a year and a half ago? [09:02:00] Jamie: Thank you for contacting us. Could I start by asking for the type of coffee machine purchased please, and whether she still has a receipt?[09:02:23] Roy Andrews: BRX0403, she no longer has a receipt.[09:05:30] Jamie: Thank you, my interpretation is that she would not be able to claim as she is no longer under warranty. [09:08:46] Jamie: for more information on our product warranty policy please see www.brandx.com/warranty-policy/information[09:09:13] Roy Andrews: Thanks for the links, I will let her know.[09:09:15] Roy Andrews has exited the session.

The names in the chat always vary as different customers use the web chat service.

A customer can enter the chat under a name of one or more words. Examples: James, Ravi, Roy Andrews.

Requirements:

I would like to separate all instances of customer chat (e.g. chat by James and Roy Andrews) from the df['log'] column into a new column df['text_analysis'].

From example 1 above this would look like:

In: df['text_analysis'][0]

Out: [14:40:48] You are joining a chat with James[14:40:48] James: Hello, I\'m looking to find out more about the services and products you offer.[14:41:23] James: I would like to know more about your gardening and guttering service.[14:43:20] James: hello?[14:44:12] James: Thanks

EDIT: The optimal solution would extract the substrings as provided in the example above and omit the final time stamp [14:44:38] James has exited the session.

What I have tried so far: I have extracted the customer names from the df['log'] column into a new column called df['names'] using:

df['names'] = df['log'].apply(lambda x: x.split(' ')[7].split('[')[0])
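Note that the positional split above takes only the eighth space-separated token, so a two-word name such as Roy Andrews would come back as just "Roy". A minimal sketch of an alternative, assuming every log starts with the greeting shown in the examples and that names never contain a `[`; the sample frame below is made up for illustration:

```python
import pandas as pd

# Hypothetical sample data mirroring the shape of the logs in the question.
df = pd.DataFrame({
    "log": [
        "[14:40:48] You are joining a chat with James[14:40:48] James: Hello[14:41:05] Greg: Hi",
        "[09:01:25] You are joining a chat with Roy Andrews[09:01:25] Roy Andrews: Hi[09:02:00] Jamie: Hello",
    ]
})

# Capture everything between the greeting and the next "[" (the next timestamp).
# expand=False returns a Series because the pattern has a single capture group.
df["names"] = df["log"].str.extract(
    r"You are joining a chat with ([^\[]+)", expand=False
)
print(df["names"].tolist())
```

Rows whose log lacks the greeting would get NaN in df['names'], so they may need separate handling.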

I wanted to use the names in the df['names'] column in a str.split() pandas function, something along the lines of:

df['log'].str.split(df['names'])

However, this does not work, and even if the split did occur under this scenario, I think it would not properly split the customer and service operator chats apart.

Also I have tried incorporating the names into a regex type solution:

df['log'].str.extract('([^.]*{}[^.]*)').format(df['log']))

But this does not work either (I'm guessing because .extract() does not support format).

Any help would be appreciated.

Use regex, where longs is your first example string:

import re
re.match(r'^.*(?=\[)', longs).group()

Result:

"[14:40:48] You are joining a chat with James[14:40:48] James: Hello, I'm looking to find out more about the services and products you offer.[14:41:05] Greg: Thank you for contacting us. [17:41:14] Greg: Could I start by asking what services lines or products you are interested in knowing more about, please?[14:41:23] James: I would like to know more about your gardening and guttering service.[14:43:20] James: hello?[14:43:32] Greg: thank you, for more information on those please visit www.example.com/more_examples.[14:44:12] James: Thanks"

You can apply this regex to your dataframe:

df['text_analysis'] = df['log'].apply(lambda x: re.match(r'^.*(?=\[)', x).group())

Explanation: the regex string '^.*(?=\[)' means: starting from the beginning ^, match any number of any character .*, and end just before a [ without including it (?=\[). Since .* is greedy and takes the longest possible match, the result runs from the beginning up to the last [, and does not include that [.
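To make the greedy behaviour concrete, here is a small sketch on a shortened, made-up log string. It contrasts the greedy `.*` with the lazy `.*?`, which stops at the first `[` it can reach:

```python
import re

# Shortened, made-up log string for illustration only.
longs = "[14:40:48] James: Hello[14:41:05] Greg: Hi[14:44:38] James has exited the session."

# Greedy: .* grabs as much as possible, so the lookahead lands on the last "[".
greedy = re.match(r'^.*(?=\[)', longs).group()

# Lazy: .*? grabs as little as possible, so the lookahead lands on the first "[".
# The string itself starts with "[", so the lazy match is empty.
lazy = re.match(r'^.*?(?=\[)', longs).group()

print(repr(greedy))
print(repr(lazy))
```

Keep in mind that `.` does not match newlines by default, and that re.match returns None when a string contains no `[` at all, so calling .group() on such a row would raise an AttributeError.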

Individual lines can be extracted this way, where s is the log string:

import re
customerspeak = re.findall(r'(?<=\[(?:\d{2}:){2}\d{2}\]) James:[^\[]*', s)

Output:

[" James: Hello, I'm looking to find out more about the services and products you offer.",
 ' James: I would like to know more about your gardening and guttering service.',
 ' James: hello?',
 ' James: Thanks']

If you want these in the same line, you can use ''.join(customerspeak).
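The pattern above hard-codes James and drops the timestamps. A possible way to generalise it to a whole column is sketched below; `customer_lines` and `NAME_RE` are hypothetical names introduced here, the sample frame is made up, and the sketch assumes each log opens with the greeting line and that message text never contains a `[`:

```python
import re
import pandas as pd

# Hypothetical sample data shaped like the logs in the question.
df = pd.DataFrame({
    "log": [
        "[14:40:48] You are joining a chat with James"
        "[14:40:48] James: Hello[14:41:05] Greg: Hi"
        "[14:44:12] James: Thanks[14:44:38] James has exited the session.",
        "[09:01:25] You are joining a chat with Roy Andrews"
        "[09:01:25] Roy Andrews: Hi[09:02:00] Jamie: Hello"
        "[09:09:15] Roy Andrews has exited the session.",
    ]
})

# The customer name is whatever follows the greeting, up to the next "[".
NAME_RE = re.compile(r"You are joining a chat with ([^\[]+)")

def customer_lines(log):
    """Return the customer's timestamped messages joined into one string."""
    found = NAME_RE.search(log)
    if found is None:
        return ""  # no greeting, so the customer name is unknown
    name = re.escape(found.group(1))  # escape in case the name has regex characters
    # A timestamp, the customer's name and a colon, then text up to the next "[".
    pattern = r"\[\d{2}:\d{2}:\d{2}\] " + name + r":[^\[]*"
    return "".join(re.findall(pattern, log))

df["text_analysis"] = df["log"].apply(customer_lines)
print(df["text_analysis"].tolist())
```

Because the pattern requires a colon right after the name, the closing "has exited the session." line is skipped. The greeting line is skipped too, so it would have to be prepended separately if it is wanted in the output.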

