
Split column based on input string into multiple columns in pandas python

I have the pandas DataFrame below and I am trying to split col1 into multiple columns based on the split_format string.

Inputs:

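import pandas as pd

split_format = 'id-id1_id2|id3'

data = {'col1': ['a-a1_a2|a3', 'b-b1_b2|b3', 'c-c1_c2|c3', 'd-d1_d2|d3'],
        'col2': [20, 21, 19, 18]}
# Keep df a plain DataFrame: .style.hide_index() returns a Styler (and was removed in pandas 2.0)
df = pd.DataFrame(data)
print(df.to_string(index=False))  # display without the index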

col1        col2
a-a1_a2|a3   20
b-b1_b2|b3   21
c-c1_c2|c3   19
d-d1_d2|d3   18

Expected Output:

id  id1 id2 id3 col2
 a   a1  a2  a3  20
 b   b1  b2  b3  21
 c   c1  c2  c3  19
 d   d1  d2  d3  18

Note: the special characters and column names in split_format can vary.

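I think I have figured it out: split both split_format and col1 on runs of non-alphanumeric characters, so the pieces of split_format become the column names and the pieces of col1 become the values.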

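import re

# Split both the format string and the column on runs of non-alphanumeric characters
pattern = r'[^0-9a-zA-Z]+'
col_name = re.split(pattern, split_format)
df[col_name] = df['col1'].str.split(pattern, expand=True)
del df['col1']
df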



   col2 id  id1 id2 id3
0   20  a   a1  a2  a3
1   21  b   b1  b2  b3
2   19  c   c1  c2  c3
3   18  d   d1  d2  d3
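
One caveat with splitting on any non-alphanumeric run is that it ignores which separator appears where. A minimal sketch of a stricter variant, assuming every name in split_format is a valid Python identifier (regex group names require that) and that every value follows the format exactly, turns split_format into a regular expression with named groups and passes it to Series.str.extract. The names tokens, pattern, parts and out are only illustrative.

```python
import re
import pandas as pd

split_format = 'id-id1_id2|id3'
df = pd.DataFrame({'col1': ['a-a1_a2|a3', 'b-b1_b2|b3', 'c-c1_c2|c3', 'd-d1_d2|d3'],
                   'col2': [20, 21, 19, 18]})

# Alternate between column names (alphanumeric runs) and separators (everything else)
tokens = re.findall(r'[0-9a-zA-Z]+|[^0-9a-zA-Z]+', split_format)

# Names become lazy named groups; separators are escaped and matched literally
pattern = ''.join(
    f'(?P<{tok}>.*?)' if tok.isalnum() else re.escape(tok)
    for tok in tokens
)
# Anchor both ends so the last lazy group takes the rest of the string
pattern = f'^{pattern}$'

# Named groups become the column names of the extracted frame
parts = df['col1'].str.extract(pattern)
out = pd.concat([parts, df[['col2']]], axis=1)
print(out)
```

Rows that do not match the format come back as NaN instead of being silently split into the wrong number of pieces, which may or may not be what you want.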

I collect the separator symbols from split_format and then recursively split the string on each symbol in turn. After each split I flatten the resulting list and recurse on it until every symbol has been consumed.

import pandas as pd

split_format = 'id-id1_id2|id3'

data = {'col1': ['a-a1_a2|a3', 'b-b1_b2|b3', 'c-c1_c2|c3', 'd-d1_d2|d3'],
        'col2': [20, 21, 19, 18]}
df = pd.DataFrame(data)

# Collect the separator symbols in order of appearance
symbols = []
for x in split_format:
    if not x.isalnum():
        symbols.append(x)

result = []
def parseTree(stringlist, symbols, result):
    # Base case: no symbols left, so the pieces are the final values
    if len(symbols) == 0:
        result.extend(stringlist)
        return
    # Split every piece on the next symbol, flatten, and recurse
    token = symbols.pop(0)
    elements = []
    for item in stringlist:
        elements.append(item.split(token))

    flat_list = [item for sublist in elements for item in sublist]
    parseTree(flat_list, symbols, result)

# DataFrame.append was removed in pandas 2.0, so collect the rows first
rows = []
for value in df['col1']:
    result = []
    # parseTree consumes the symbol list, so pass a fresh copy for each row
    parseTree([value], symbols.copy(), result)
    rows.append(result)

df2 = pd.DataFrame(rows, columns=["id", "id1", "id2", "id3"])
df2['col2'] = df['col2']
print(df2)

output:

  id id1 id2 id3  col2
0  a  a1  a2  a3    20
1  b  b1  b2  b3    21
2  c  c1  c2  c3    19
3  d  d1  d2  d3    18


 