
Regex and pandas: extract partial string on name match

I have a pandas data frame containing instances of web chat between two people, the customer and the service desk operator.

The customer's name is always announced in the first line of the web chat, as the customer enters the conversation.

Example 1:

In: df['log'][0]

Out: [14:40:48] You are joining a chat with James[14:40:48] James: Hello, I'm looking to find out more about the services and products you offer.[14:41:05] Greg: Thank you for contacting us. [17:41:14] Greg: Could I start by asking what services lines or products you are interested in knowing more about, please?[14:41:23] James: I would like to know more about your gardening and guttering service.[14:43:20] James: hello?[14:43:32] Greg: thank you, for more information on those please visit www.example.com/more_examples.[14:44:12] James: Thanks[14:44:38] James has exited the session.

Example 2:

In: df['log'][1]

Out: [09:01:25] You are joining a chat with Roy Andrews[09:01:25] Roy Andrews: I'm asking on behalf of partner whether she is able to still claim warranty on a coffee machine she purchased a year and a half ago? [09:02:00] Jamie: Thank you for contacting us. Could I start by asking for the type of coffee machine purchased please, and whether she still has a receipt?[09:02:23] Roy Andrews: BRX0403, she no longer has a receipt.[09:05:30] Jamie: Thank you, my interpretation is that she would not be able to claim as she is no longer under warranty. [09:08:46] Jamie: for more information on our product warranty policy please see www.brandx.com/warranty-policy/information[09:09:13] Roy Andrews: Thanks for the links, I will let her know.[09:09:15] Roy Andrews has exited the session.

The names in the chat always vary as different customers use the web chat service.

A customer can enter the chat under a name of one or more words, for example: James, Ravi, Roy Andrews.

Requirements:

I would like to separate all instances of customer chat (e.g. chat by James and Roy Andrews) from the df['log'] column into a new column, df['text_analysis'].

From example 1 above this would look like:

In: df['text_analysis'][0]

Out: [14:40:48] You are joining a chat with James[14:40:48] James: Hello, I'm looking to find out more about the services and products you offer.[14:41:23] James: I would like to know more about your gardening and guttering service.[14:43:20] James: hello?[14:44:12] James: Thanks

EDIT: The optimal solution would extract the substrings as shown in the example above and omit the final timestamped entry, [14:44:38] James has exited the session.

What I have tried so far: I have extracted the customer names from the df['log'] column into a new column called df['names'] using:

df['names'] = df['log'].apply(lambda x: x.split(' ')[7].split('[')[0])

I wanted to use the names in the df['names'] column in a pandas str.split() call, something along the lines of:

df['log'].str.split(df['names'])

However, this does not work, and even if the split did occur I think it would not properly separate the customer and service operator chats.

Also I have tried incorporating the names into a regex type solution:

df['log'].str.extract('([^.]*{}[^.]*)').format(df['log']))

But this does not work either (I'm guessing because .extract() does not support format).

Any help would be appreciated.

Use a regex. Here longs is the string from your first example:

import re
re.match(r'^.*(?=\[)', longs).group()

Result:

"[14:40:48] You are joining a chat with James[14:40:48] James: Hello, I'm looking to find out more about the services and products you offer.[14:41:05] Greg: Thank you for contacting us. [17:41:14] Greg: Could I start by asking what services lines or products you are interested in knowing more about, please?[14:41:23] James: I would like to know more about your gardening and guttering service.[14:43:20] James: hello?[14:43:32] Greg: thank you, for more information on those please visit www.example.com/more_examples.[14:44:12] James: Thanks"

You can apply this regex to every row of your dataframe:

df['text_analysis'] = df['log'].apply(lambda x: re.match(r'^.*(?=\[)', x).group())

Explanation: the regex string r'^.*(?=\[)' means: starting from the beginning ^ , match any number of any character .* , and stop at a position that is followed by [ , without including that [ in the match (?=\[) . Because .* is greedy, the match runs from the beginning of the string up to the last [ , which cuts off the final timestamped entry.
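To see the greedy behaviour in isolation, here is a minimal sketch on a shortened, made-up log string. The variable name short_log and its contents are illustrative assumptions, not data from the question:

```python
import re

# Shortened, hypothetical log with three timestamped entries.
short_log = "[14:40:48] James: Hello[14:41:05] Greg: Hi[14:44:38] James has exited the session."

# Greedy .* first runs to the end of the string, then backtracks to the
# last position that is immediately followed by '['.
trimmed = re.match(r'^.*(?=\[)', short_log).group()
print(trimmed)
```

Note that only the final entry is dropped; the operator's entry (Greg) is still part of the match.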

Individual lines can be extracted this way (s is the chat log string):

import re
customerspeak = re.findall(r'(?<=\[(?:\d{2}:){2}\d{2}\]) James:[^\[]*', s)

output:

[" James: Hello, I'm looking to find out more about the services and products you offer.",
 ' James: I would like to know more about your gardening and guttering service.',
 ' James: hello?',
 ' James: Thanks']

If you want these in a single string, use ''.join(customerspeak).
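Since the customer name differs from row to row, the pattern has to be built per row rather than hard-coding James. The following is only a sketch under some assumptions: every log opens with the "You are joining a chat with <name>" entry, timestamps are always hh:mm:ss, and the helper name customer_lines and the sample string are made up for illustration. It reads the full name from the joining entry (the split(' ')[7] approach in the question keeps only the first word of a name such as Roy Andrews) and keeps the timestamps, as in the desired output.

```python
import re

TIMESTAMP = r'\[\d{2}:\d{2}:\d{2}\]'

def customer_lines(log):
    """Return the joining entry plus every '<name>:' entry, timestamps included."""
    # Assumption: the log opens with "[hh:mm:ss] You are joining a chat with <name>[".
    intro = re.match(TIMESTAMP + r' You are joining a chat with (.+?)(?=\[)', log)
    if intro is None:
        return ''  # no joining entry found, nothing to extract
    # Escape the name in case it contains regex metacharacters.
    name = re.escape(intro.group(1))
    # A customer entry: timestamp, space, name, colon, then text up to the next '['.
    entries = re.findall(TIMESTAMP + ' ' + name + r':[^\[]*', log)
    return intro.group(0) + ''.join(entries)

# Hypothetical, shortened log in the same layout as the question's examples.
sample_log = (
    "[09:01:25] You are joining a chat with Roy Andrews"
    "[09:01:25] Roy Andrews: Hi "
    "[09:02:00] Jamie: Hello"
    "[09:09:13] Roy Andrews: Thanks"
    "[09:09:15] Roy Andrews has exited the session."
)
result = customer_lines(sample_log)
print(result)
```

On a dataframe this could be applied as df['text_analysis'] = df['log'].apply(customer_lines). One limitation of the [^\[]* part: a customer message that itself contains a [ character would be cut short at that point.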
