简体   繁体   中英

Splitting column by multiple custom delimiters in Python

I need to split a column called Creative where each cell contains samples such as:

pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)

Where each two-letter code preceding each bubbled section ( ) is the title of the desired column, and are the same in every row. The only data that changes is what is inside the bubbles. I want the data to look like:

pn io ta pt cn cs
2021 302 Yes Blue John Doe

I tried

 df[['Creative', 'Creative Size']] = df['Creative'].str.split('cs(',expand=True)

and

df['Creative Size'] = df['Creative Size'].str.replace(')','')

but got an error, error: missing ), unterminated subpattern at position 2 , assuming it has something to do with regular expressions.

Is there an easy way to split these ? Thanks.

Use extract with named capturing groups (see here ):

import pandas as pd

# toy example
df = pd.DataFrame(data=[["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)"]], columns=["Creative"])

# extract with a named capturing group
res = df["Creative"].str.extract(
    r"pn\((?P<pn>\d+)\)io\((?P<io>\d+)\)ta\((?P<ta>\w+)\)pt\((?P<pt>\w+)\)cn\((?P<cn>\w+)\)cs\((?P<cs>\w+)\)",
    expand=True)

print(res)

Output

     pn   io   ta    pt    cn   cs
0  2021  302  Yes  Blue  John  Doe

I'd use regex to generate a list of dictionaries via comprehensions. The idea is to create a list of dictionaries that each represent rows of the desired dataframe, then constructing a dataframe out of it. I can build it in one nested comprehension:

import re
rows = [{r[0]:r[1] for r in re.findall(r'(\w{2})\((.+)\)', c)} for c in df['Creative']]
subtable = pd.DataFrame(rows)
for col in subtable.columns:
    df[col] = subtable[col].values

Basically, I regex search for instances of ab(*) and capture the two-letter prefix and the contents of the parenthesis and store them in a list of tuples. Then I create a dictionary out of the list of tuples, each of which is essentially a row like the one you display in your question. Then, I put them into a data frame and insert each of those columns into the original data frame. Let me know if this is confusing in any way!

David

Try with extractall :

names = df["Creative"].str.extractall("(.*?)\(.*?\)").loc[0][0].tolist()
output = df["Creative"].str.extractall("\((.*?)\)").unstack()[0].set_axis(names, axis=1)

>>> output
     pn   io   ta    pt    cn   cs
0  2021  302  Yes  Blue  John  Doe
1  2020  301   No   Red  Jane  Doe
Input df:
df = pd.DataFrame({"Creative": ["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)", 
                                "pn(2020)io(301)ta(No)pt(Red)cn(Jane)cs(Doe)"]})

We can use str.findall to extract matching column name-value pairs

pd.DataFrame(map(dict, df['Creative'].str.findall(r'(\w+)\((\w+)')))

     pn   io   ta    pt    cn   cs
0  2021  302  Yes  Blue  John  Doe

Using regular expressions, different way of packaging final DataFrame:

import re
import pandas as pd

txt = 'pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)'

data = list(zip(*re.findall('([^\(]+)\(([^\)]+)\)', txt))
df = pd.DataFrame([data[1]], columns=data[0])

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM