I have a column with data like this:
Ticket NO: 123456789; Location ID:ABC123; Type:Network;
Ticket No. 132123456, Location ID:ABC444; Type:App
Tickt#222256789; Location ID:AMC121; Type:Network;
This is what I am trying:
new = data["Description"].str.split(";", n=2, expand=True)  # n=2 yields three columns
data["Ticket"] = new[0]
data["Location"] = new[1]
data["Type"] = new[2]
# Drop the old column
data.drop(columns=["Description"], inplace=True)
I can split on ";", but how do I split on both ";" and ","?
Here is a more general solution that lets you comfortably add as much processing as you like. Let's start by defining an example dataframe for easy debugging:
import re
import pandas as pd

df = pd.DataFrame({'Description': [
    'Ticket NO: 123456789 , Location ID:ABC123; Type:Network;',
    'Ticket NO: 123456789 ; Location ID:ABC123; Type:Network;']})
Then, let's define our processing function, where you can do anything you like:
def process(row):
    # Split the description on either a comma or a semicolon
    parts = re.split(r'[,;]', row)
    return pd.Series({'Ticket': parts[0], 'Location': parts[1], 'Type': parts[2]})
In addition to splitting on , or ; and separating the result into the three sections, you can add code that strips whitespace, removes whatever is to the left of the colons, and so on. For example, try:
def process(row):
    parts = re.split(r'[,;]', row)
    data = {}
    for part in parts:
        for field in ['Ticket', 'Location', 'Type']:
            if field.lower() in part.lower():
                # Keep the text after the colon (assumes a 'label: value' part)
                data[field] = part.split(':')[1].strip()
    return pd.Series(data)
Finally, apply to get the result:
df['Description'].apply(process)
This is much more readable and easily maintainable than doing everything in a single regex, especially as you might end up needing additional processing.
The result is a dataframe with one row per description and the columns Ticket, Location and Type.
To add this output to the original dataframe, simply run:
df[['Ticket', 'Location', 'Type']] = df['Description'].apply(process)
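The second process function above relies on every part containing a colon and the word "Ticket" spelled correctly, which does not hold for all rows in the question ("Ticket No. 132123456", "Tickt#222256789"). As a rough sketch of the kind of extra processing this approach makes room for, the variant below uses one small regex per field. The names FIELD_PATTERNS and process_row and the patterns themselves are illustrative assumptions, not part of the answer above.

```python
import re

import pandas as pd

# Hypothetical per-field patterns; each one captures the value after its label.
FIELD_PATTERNS = {
    'Ticket': re.compile(r'ticke?t\D*(\d+)', re.I),
    'Location': re.compile(r'location\s*id\W*(\w+)', re.I),
    'Type': re.compile(r'type\W*(\w+)', re.I),
}


def process_row(text):
    """Split on ',' or ';' and pull one value out of each recognised part."""
    # Start with None for every field so a missing field still gets a column.
    data = dict.fromkeys(FIELD_PATTERNS)
    for part in re.split(r'[,;]', text):
        for field, pattern in FIELD_PATTERNS.items():
            match = pattern.search(part)
            if match:
                data[field] = match.group(1)
    return pd.Series(data)


df = pd.DataFrame({'Description': [
    'Ticket NO: 123456789; Location ID:ABC123; Type:Network;',
    'Ticket No. 132123456, Location ID:ABC444; Type:App',
    'Tickt#222256789; Location ID:AMC121; Type:Network;']})

result = df['Description'].apply(process_row)
print(result)
```

Keeping the patterns in a dictionary means a new field only needs one more entry, which is the maintainability argument made above.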
You can use
new = data["Description"].str.split("[;,]", n = 2, expand = True)
new.columns = ['Ticket', 'Location', 'Type']
Output:
>>> new
                 Ticket             Location            Type
0  Ticket NO: 123456789   Location ID:ABC123   Type:Network;
1  Ticket No. 132123456   Location ID:ABC444        Type:App
2       Tickt#222256789   Location ID:AMC121   Type:Network;
The [;,] regex matches either a ; or a , char, and n=2 limits the number of splits to two.
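Because n=2 stops after two splits, the trailing ; stays in the last column, and the pieces keep their leading spaces. A minimal sketch of one way to tidy that up, assuming pandas 1.4 or later (where str.split accepts the regex keyword); the sample frame is rebuilt from the rows in the question:

```python
import pandas as pd

data = pd.DataFrame({'Description': [
    'Ticket NO: 123456789; Location ID:ABC123; Type:Network;',
    'Ticket No. 132123456, Location ID:ABC444; Type:App',
    'Tickt#222256789; Location ID:AMC121; Type:Network;']})

# regex=True makes the intent explicit: "[;,]" is a character class, not a literal string
new = data['Description'].str.split(r'[;,]', n=2, expand=True, regex=True)
new.columns = ['Ticket', 'Location', 'Type']

# Trim surrounding spaces and the trailing ';' left over in the last column
new = new.apply(lambda col: col.str.strip(' ;'))
print(new)
```

The labels ("Ticket NO:", "Location ID:", "Type:") are still part of each value at this point; removing them needs a further step such as a regex extract.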
Another solution, using a regex with Series.str.extract:
new[['Ticket', 'Location', 'Type']] = data['Description'].str.extract(r"(?i)Ticke?t\D*(\d+)\W*Location ID\W*(\w+)\W*Type:(\w+)")
>>> new
      Ticket Location     Type
0  123456789   ABC123  Network
1  132123456   ABC444      App
2  222256789   AMC121  Network
See the regex demo. Details:
(?i) - case insensitive flag
Ticke?t - Ticket with an optional e
\D* - zero or more non-digit chars
(\d+) - Group 1: one or more digits
\W* - zero or more non-word chars
Location ID - a string
\W* - zero or more non-word chars
(\w+) - Group 2: one or more word chars
\W* - zero or more non-word chars
Type: - a string
(\w+) - Group 3: one or more word chars
Here is one approach using str.extract:
df[['Ticket', 'Location', 'Type']] = df['Description'].str.extract(r"[Ticket\sNO:.#](\d+).*ID:([A-Z0-9]+).*Type:([A-Za-z]+)", flags=re.I)
print(df[['Ticket', 'Location', 'Type']])
Output:
      Ticket Location     Type
0  123456789   ABC123  Network
1  132123456   ABC444      App
2  222256789   AMC121  Network
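Note that [Ticket\sNO:.#] in the pattern above is a character class: it matches a single character from that set, not the word "Ticket". It works here only because the digits are always preceded by a space or #. As a hedged alternative, the extract-based answers can be written with named groups, in which case str.extract uses the group names as column names. The variable names pattern and extracted are illustrative; the regex is the one from the earlier answer with group names added.

```python
import pandas as pd

df = pd.DataFrame({'Description': [
    'Ticket NO: 123456789; Location ID:ABC123; Type:Network;',
    'Ticket No. 132123456, Location ID:ABC444; Type:App',
    'Tickt#222256789; Location ID:AMC121; Type:Network;']})

# Named groups become the column names of the extracted frame
pattern = (
    r"(?i)Ticke?t\D*(?P<Ticket>\d+)"
    r"\W*Location ID\W*(?P<Location>\w+)"
    r"\W*Type:(?P<Type>\w+)"
)

extracted = df['Description'].str.extract(pattern)
# Attach the new columns next to the original Description column
df = df.join(extracted)
print(df[['Ticket', 'Location', 'Type']])
```

Rows that do not match the pattern come back as NaN in all three columns, so it is worth checking for missing values before dropping Description.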