Splitting column by multiple custom delimiters in Python

Question

I need to split a column called Creative where each cell contains a string such as:

pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)

Each two-letter code preceding a parenthesized section ( ) is the title of the desired column, and the codes are the same in every row. The only data that changes is what is inside the parentheses. I want the data to look like:

pn	io	ta	pt	cn	cs
2021	302	Yes	Blue	John	Doe

I tried

df[['Creative', 'Creative Size']] = df['Creative'].str.split('cs(',expand=True)

and

df['Creative Size'] = df['Creative Size'].str.replace(')','')

but got the error "missing ), unterminated subpattern at position 2". I assume it has something to do with regular expressions.

Is there an easy way to split these? Thanks.

Answer 1

Use str.extract with named capturing groups:

import pandas as pd

# toy example
df = pd.DataFrame(data=[["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)"]], columns=["Creative"])

# extract with a named capturing group
res = df["Creative"].str.extract(
    r"pn\((?P<pn>\d+)\)io\((?P<io>\d+)\)ta\((?P<ta>\w+)\)pt\((?P<pt>\w+)\)cn\((?P<cn>\w+)\)cs\((?P<cs>\w+)\)",
    expand=True)

print(res)

Output

     pn   io   ta    pt    cn   cs
0  2021  302  Yes  Blue  John  Doe
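
Since the same two-letter codes appear in the same order in every row, the long pattern can also be generated rather than typed by hand. The following is only a sketch under that assumption: the list `codes` and the permissive value pattern `[^)]*` (anything up to the closing parenthesis) are illustrative choices, not part of the answer above.

```python
import re
import pandas as pd

df = pd.DataFrame({"Creative": ["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)"]})

# Assumption: the codes and their order are known and identical in every row.
codes = ["pn", "io", "ta", "pt", "cn", "cs"]

# Build one named group per code, e.g. pn\((?P<pn>[^)]*)\)
# [^)]* accepts any value that does not contain a closing parenthesis.
pattern = "".join(rf"{re.escape(code)}\((?P<{code}>[^)]*)\)" for code in codes)

res = df["Creative"].str.extract(pattern, expand=True)
print(res)
```

The named groups become the column names, so `res` has the columns pn, io, ta, pt, cn and cs. All extracted values are strings; convert them afterwards if numeric columns are needed.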

Answer 2

I'd use a regex to build a list of dictionaries with comprehensions. Each dictionary represents one row of the desired dataframe, and the dataframe is then constructed from the list. It all fits in one nested comprehension:

import re
import pandas as pd

# Non-greedy (.+?) so each match ends at the first closing parenthesis
rows = [{r[0]: r[1] for r in re.findall(r'(\w{2})\((.+?)\)', c)} for c in df['Creative']]
subtable = pd.DataFrame(rows)
# Copy each extracted column into the original dataframe
for col in subtable.columns:
    df[col] = subtable[col].values

Basically, I search each cell for instances of ab(...) and capture the two-letter prefix and the contents of the parentheses as a list of tuples. Each list of tuples becomes a dictionary, which is essentially one row like the one you display in your question. Then I put the rows into a data frame and copy each of its columns into the original data frame. Let me know if this is confusing in any way!
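
To make those steps concrete, here is a small sketch of the intermediate values for a single cell. It uses a non-greedy `(.+?)` so that each match stops at the first closing parenthesis; the variable names `cell`, `pairs`, `row` and `greedy` are only illustrative.

```python
import re

cell = "pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)"

# Step 1: findall returns one (prefix, contents) tuple per match.
pairs = re.findall(r"(\w{2})\((.+?)\)", cell)
print(pairs)
# [('pn', '2021'), ('io', '302'), ('ta', 'Yes'), ('pt', 'Blue'), ('cn', 'John'), ('cs', 'Doe')]

# Step 2: the list of tuples becomes a dictionary, i.e. one row.
row = dict(pairs)
print(row)

# For comparison: a greedy (.+) runs to the last closing parenthesis,
# so the whole cell collapses into a single match.
greedy = re.findall(r"(\w{2})\((.+)\)", cell)
print(greedy)
# [('pn', '2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe')]
```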

David

Answer 3

Try with extractall:

# raw strings avoid invalid-escape warnings for \( and \)
names = df["Creative"].str.extractall(r"(.*?)\(.*?\)").loc[0][0].tolist()
output = df["Creative"].str.extractall(r"\((.*?)\)").unstack()[0].set_axis(names, axis=1)

>>> output
     pn   io   ta    pt    cn   cs
0  2021  302  Yes  Blue  John  Doe
1  2020  301   No   Red  Jane  Doe

Input df:

df = pd.DataFrame({"Creative": ["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)", 
                                "pn(2020)io(301)ta(No)pt(Red)cn(Jane)cs(Doe)"]})

Answer 4

We can use str.findall to extract the matching column name-value pairs:

pd.DataFrame(map(dict, df['Creative'].str.findall(r'(\w+)\((\w+)')))

     pn   io   ta    pt    cn   cs
0  2021  302  Yes  Blue  John  Doe

Answer 5

Using regular expressions, with a different way of packaging the final DataFrame:

import re
import pandas as pd

txt = 'pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)'

# findall returns (code, value) tuples; zip(*...) transposes them into (codes, values)
data = list(zip(*re.findall(r'([^\(]+)\(([^\)]+)\)', txt)))
df = pd.DataFrame([data[1]], columns=data[0])
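
The snippet above parses a single string. A possible way to apply the same idea to a whole column is sketched below; it assumes a dataframe with a Creative column as in the question, and the names `pair`, `rows`, `parsed` and `result` are made up for this example.

```python
import re
import pandas as pd

df = pd.DataFrame({"Creative": [
    "pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)",
    "pn(2020)io(301)ta(No)pt(Red)cn(Jane)cs(Doe)",
]})

# Code: everything before "(". Value: everything up to the next ")".
pair = re.compile(r"([^()]+)\(([^)]*)\)")

# One dictionary per row, built from the (code, value) tuples.
rows = [dict(pair.findall(cell)) for cell in df["Creative"]]

# Reuse the original index so the new columns line up with df.
parsed = pd.DataFrame(rows, index=df.index)
result = df.join(parsed)
print(result)
```

Here `join` keeps the original Creative column next to the extracted ones; drop it afterwards if it is no longer needed.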
