
Split a Pandas column with lists of tuples into separate columns

I have data in a pandas DataFrame and I'm trying to separate and extract data from a specific column, col. The values in col are all lists of varying length that store 4-value tuples (they were previously dictionaries with four key-value pairs). The values always appear in the same relative order within each tuple.

For each of those tuples, I'd like a separate row in the final DataFrame, with each of the tuple's values stored in its own new column.

The DataFrame df looks like this:

ID    col
A     [(123, 456, 111, False), (124, 456, 111, True), (125, 456, 111, False)]
B     []
C     [(123, 555, 333, True)]

I need to split col into four columns and also lengthen the DataFrame so that each tuple has its own row in df2. DataFrame df2 should look like this:

ID   col1  col2  col3  col4
A    123   456   111   False
A    124   456   111   True
A    125   456   111   False
B    None  None  None  None
C    123   555   333   True

I have a loop-based workaround that seems to get the job done, but I'd like to find a more efficient approach that I can run on a huge data set, perhaps using vectorization or NumPy if possible. Here's what I have so far:

import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C'], 
                   'col': [[('123', '456', '111', False),
                            ('124', '456', '111', True),
                            ('125', '456', '111', False)],
                           [],
                           [('123', '555', '333', True)]]
                   })
final_rows = []

for index, row in df.iterrows():
    if not row.col:   # empty list: keep the ID and pad the four value columns with None
        final_rows.append([row.ID, None, None, None, None])
    for tup in row.col:
        new_row = [row.ID]
        vals = list(tup)
        new_row.extend(vals)
        final_rows.append(new_row)

df2 = pd.DataFrame(final_rows, columns=['ID', 'col1', 'col2', 'col3', 'col4'])

Here is another solution you can try, using explode + concat:

df_ = df.explode('col').reset_index(drop=True)

pd.concat(
    [df_[['ID']], pd.DataFrame(df_['col'].tolist()).add_prefix('col')], axis=1
)

  ID col0  col1  col2   col3
0  A  123   456   111  False
1  A  124   456   111   True
2  A  125   456   111  False
3  B  NaN  None  None   None
4  C  123   555   333   True
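
Note that add_prefix labels the columns col0 through col3 rather than the col1 through col4 asked for in the question, and that the row for B depends on how the DataFrame constructor pads the missing value that explode leaves behind for an empty list. A minimal sketch of the same explode + concat idea, naming the columns explicitly and replacing missing entries with a placeholder tuple, might look like the following. The names filler, exploded, and parts are illustrative and not part of the answer above, and the sketch assumes every tuple has exactly four values.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C'],
                   'col': [[('123', '456', '111', False),
                            ('124', '456', '111', True),
                            ('125', '456', '111', False)],
                           [],
                           [('123', '555', '333', True)]]
                   })

# One row per tuple; an empty list becomes a single missing value
exploded = df.explode('col').reset_index(drop=True)

# Assumption: tuples always hold four values, so pad missing rows with four Nones
filler = (None, None, None, None)
tuples = [t if isinstance(t, tuple) else filler for t in exploded['col']]

# Build the four value columns with the names requested in the question
parts = pd.DataFrame(tuples,
                     columns=['col1', 'col2', 'col3', 'col4'],
                     index=exploded.index)

df2 = pd.concat([exploded[['ID']], parts], axis=1)
print(df2)
```

Padding the rows up front keeps the result independent of how the constructor treats a stray missing value among the tuples.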

Try explode followed by apply(pd.Series), then merge the result back into the DataFrame:

import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C'],
                   'col': [[('123', '456', '111', False),
                            ('124', '456', '111', True),
                            ('125', '456', '111', False)],
                           [],
                           [('123', '555', '333', True)]]
                   })
# Explode into Rows
new_df = df.explode('col').reset_index(drop=True)  

# Merge Back Together
new_df = new_df.merge(
    # Turn into Multiple Columns
    new_df['col'].apply(pd.Series),
    left_index=True,
    right_index=True) \
    .drop(columns=['col'])  # Drop Old Col Column

# Rename Columns
new_df.columns = ['ID', 'col1', 'col2', 'col3', 'col4']

# For Display
print(new_df)

Output:

  ID col1 col2 col3   col4
0  A  123  456  111  False
1  A  124  456  111   True
2  A  125  456  111  False
3  B  NaN  NaN  NaN    NaN
4  C  123  555  333   True
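
If apply(pd.Series) turns out to be slow on a large frame (it builds one Series per row), one alternative is to flatten the lists in plain Python and repeat the IDs with NumPy, skipping explode entirely. This is only a sketch, assuming every tuple has exactly four values; filler, lists, and lengths are hypothetical names, and it has not been benchmarked against the answers above.

```python
from itertools import chain

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C'],
                   'col': [[('123', '456', '111', False),
                            ('124', '456', '111', True),
                            ('125', '456', '111', False)],
                           [],
                           [('123', '555', '333', True)]]
                   })

# Assumption: four values per tuple; empty lists get one placeholder tuple
# so that their ID still appears in the result
filler = (None, None, None, None)
lists = [lst if lst else [filler] for lst in df['col']]

# Repeat each ID once per tuple in its list
lengths = [len(lst) for lst in lists]
ids = np.repeat(df['ID'].to_numpy(), lengths)

# Flatten all tuples into one list of rows and build the frame in one call
df2 = pd.DataFrame(list(chain.from_iterable(lists)),
                   columns=['col1', 'col2', 'col3', 'col4'])
df2.insert(0, 'ID', ids)
print(df2)
```

The only per-row Python work here is the two list comprehensions; the frame itself is built once from the flattened rows.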


 