I have data in a pandas dataframe and I'm trying to separate and extract data out of a specific column col
. The values in col
are all lists of various sizes that store 4-value tuples (previous 4 key-value dictionaries). These values are always in the same relative order for the tuple.
For each of those tuples, I'd like to have a separate row in the final dataframe as well as having the respective value from the tuple stored in a new column.
The DataFrame df
looks like this:
ID col
A [(123, 456, 111, False), (124, 456, 111, true), (125, 456, 111, False)]
B []
C [(123, 555, 333, True)]
I need to split col
into four columns but also lengthen the dataframe for each record so each tuple has its own row in df2
. DataFrame d2
should look like this:
ID col1 col2 col3 col4
A 123 456 111 False
A 124 456 111 True
A 125 456 111 False
B None None None None
C 123 555 333 True
I have some sort of workaround loop-based code that seems to get the job done but I'd like to find a better and more efficient way that I can run on a huge data set. Perhaps using vectorization or NumPy
if possible. Here's what I have so far:
import pandas as pd
df = pd.DataFrame({'ID': ['A', 'B', 'C'],
'col': [[('123', '456', '111', False),
('124', '456', '111', True),
('125', '456', '111', False)],
[],
[('123', '555', '333', True)]]
})
final_rows = []
for index, row in df.iterrows():
if not row.col: # if list is empty
final_rows.append(row.ID)
for tup in row.col:
new_row = [row.ID]
vals = list(tup)
new_row.extend(vals)
final_rows.append(new_row)
df2 = pd.DataFrame(final_rows, columns=['ID', 'col1', 'col2', 'col3', 'col4'])
Here is another solution, you can try out using explode
+ concat
df_ = df.explode('col').reset_index(drop=True)
pd.concat(
[df_[['ID']], pd.DataFrame(df_['col'].tolist()).add_prefix('col')], axis=1
)
ID col0 col1 col2 col3
0 A 123 456 111 False
1 A 124 456 111 True
2 A 125 456 111 False
3 B NaN None None None
4 C 123 555 333 True
Try explode followed by apply ( pd.Series ) then merge back to the DataFrame:
import pandas as pd
df = pd.DataFrame({'ID': ['A', 'B', 'C'],
'col': [[('123', '456', '111', False),
('124', '456', '111', True),
('125', '456', '111', False)],
[],
[('123', '555', '333', True)]]
})
# Explode into Rows
new_df = df.explode('col').reset_index(drop=True)
# Merge Back Together
new_df = new_df.merge(
# Turn into Multiple Columns
new_df['col'].apply(pd.Series),
left_index=True,
right_index=True) \
.drop(columns=['col']) # Drop Old Col Column
# Rename Columns
new_df.columns = ['ID', 'col1', 'col2', 'col3', 'col4']
# For Display
print(new_df)
Output:
ID col1 col2 col3 col4
0 A 123 456 111 False
1 A 124 456 111 True
2 A 125 456 111 False
3 B NaN NaN NaN NaN
4 C 123 555 333 True
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.