
Pandas Function to Split multi-line text column into multiple columns

I have a column (stud_info) in the below format

stud_info = """Name: Mark
Address: 
PHX, AZ
Hobbies: 
1. Football
2. Programming
3. Squash"""

(Image: source data)

The stud_info column in the raw data contains multiline text. I need to split it into three columns (Name, Address, and Hobbies). A simple split can be done with a lambda function, but here each value spans several lines and the column names are embedded in the data (i.e., the labels Name, Address, and Hobbies should not appear in the resulting column values). The final columns should look like this:

(Image: final data)

Please suggest a way to do it using pandas.

Given:

import pandas as pd

df = pd.DataFrame({'stud_info': {0: 'Name: Mark\nAddress: \nPHX, AZ\nHobbies: \n1. Football\n2. Programming\n3. Squash'}})

We can define a regular expression for your particular format and use the pd.Series.str.extract method to split the capture groups into separate columns. For an explanation of the pattern, see Regexr.

import re

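# Raw string, so \s and \n are passed to the regex engine unchanged
# (a plain string literal triggers an invalid-escape warning for \s).
pattern = r'Name:\s(.+)\nAddress:\s\n(.+)\nHobbies:\s\n(.+)'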
# We need flags=re.DOTALL to allow the final group to encompass multiple lines.
df[['Name', 'Address', 'Hobbies']] = df.stud_info.str.extract(pattern, flags=re.DOTALL)
print(df[['Name', 'Address', 'Hobbies']])

Output:

   Name  Address                                 Hobbies
0  Mark  PHX, AZ  1. Football\n2. Programming\n3. Squash
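
As a variation on the answer above, str.extract also accepts named capture groups, in which case the resulting columns take their names from the groups. The sketch below assumes every row follows exactly the same layout as the question's sample; the second row is made-up data added only to show that the pattern is applied row by row.

```python
import re
import pandas as pd

# Two rows in the question's format; the second row is hypothetical sample data.
df = pd.DataFrame({
    "stud_info": [
        "Name: Mark\nAddress: \nPHX, AZ\nHobbies: \n1. Football\n2. Programming\n3. Squash",
        "Name: Jane\nAddress: \nNYC, NY\nHobbies: \n1. Chess",
    ]
})

# Named groups (?P<...>) become the column names of the extracted frame.
pattern = (
    r"Name:\s(?P<Name>.+)\n"
    r"Address:\s\n(?P<Address>.+)\n"
    r"Hobbies:\s\n(?P<Hobbies>.+)"
)

# re.DOTALL lets the last group span the remaining lines.
extracted = df["stud_info"].str.extract(pattern, flags=re.DOTALL)

# Attach the new columns to the original frame by index.
result = df.join(extracted)
print(result[["Name", "Address", "Hobbies"]])
```

Rows that do not match the pattern come back as NaN in all three columns, so it may be worth checking for missing values after the extraction.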

My solution:

import pandas as pd 
import re

txt = """Name: Mark
Address: 
PHX, AZ
Hobbies: 
1. Football
2. Programming
3. Squash"""

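# Raw string, so \s, \n, \w and \W reach the regex engine unchanged.
# [\w\W]* matches any character, including newlines, so no DOTALL flag is needed.
pattern = re.compile(r'Name:\s(.+)\nAddress:\s\n(.+)\nHobbies:\s\n([\w\W]*)')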

re_match = pattern.match(txt)
df = pd.DataFrame([list(re_match.groups())], columns=['Name', 'Address', 'Hobbies'])
print(df)

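
The snippet above parses a single string. A minimal sketch of how the same compiled pattern might be applied to a whole column follows; the helper name parse_stud_info, the COLUMNS list, and the second (malformed) row are assumptions made for illustration and are not part of the original answer.

```python
import re
import pandas as pd

pattern = re.compile(r"Name:\s(.+)\nAddress:\s\n(.+)\nHobbies:\s\n([\w\W]*)")
COLUMNS = ["Name", "Address", "Hobbies"]

def parse_stud_info(text):
    """Return a dict of the three fields, or all-None if the text does not match."""
    match = pattern.match(text)
    if match is None:
        return dict.fromkeys(COLUMNS)
    return dict(zip(COLUMNS, match.groups()))

# Hypothetical input: one well-formed row and one row in an unexpected format.
raw = pd.Series(
    [
        "Name: Mark\nAddress: \nPHX, AZ\nHobbies: \n1. Football\n2. Programming\n3. Squash",
        "not in the expected format",
    ],
    name="stud_info",
)

# Build one record per row and keep the original index for easy joining.
parsed = pd.DataFrame(
    [parse_stud_info(text) for text in raw], columns=COLUMNS, index=raw.index
)
print(parsed)
```

Returning empty fields instead of raising keeps one bad row from aborting the whole conversion; whether that is the right choice depends on how clean the source data is.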
