I have a Pandas DataFrame which looks like so:
user_id item_timestamp item_cashtags item_sectors item_industries
0 406225 1483229353 SPY Financial Exchange Traded Fund
1 406225 1483229353 ERO Financial Exchange Traded Fund
2 406225 1483229350 CAKE|IWM|SDS|SPY|X|SPLK|QQQ Services|Financial|Financial|Financial|Basic M... Restaurants|Exchange Traded Fund|Exchange Trad...
3 619769 1483229422 AAPL Technology Personal Computers
4 692735 1483229891 IVOG Financial Exchange Traded Fund
I'd like to split the cashtags, sectors and industries columns by |
. Each cashtag corresponds to a sector which corresponds to an industry, so they are of equal amounts.
I'd like the output to be such that each cashtag, sector and industry have their own row, with the item_timestamp
and user_id
copying over, ie:
user_id item_timestamp item_cashtags item_sectors item_industries
2 406225 1483229350 CAKE|IWM|SDS Services|Financial|Financial Restaurants|Exchange Traded Fund|Exchange Traded Fund
would become:
user_id item_timestam item_cashtags item_sectors item_industries
406225 1483229350 CAKE Services Restaurants
406225 1483229350 IWM Financial Exchange Traded Fund
406225 1483229350 SDS Financial Exchange Traded Fund
My problem is that this is a conditional split which I'm not sure how to do in Pandas
If the frame is not to large, one easy option is to just loop through the rows. But I agree it is not the most pandamic
way to do it, and definitely not the most performing one.
from copy import copy
result = []
for idx, row in df.iterrows():
d = dict(row)
for cat1, cat2 in zip(d['cat1'].split('|'), d['cat2'].split('|')):
# here you can add an if to filter on certain categories
dd = copy(d)
dd['cat1'] = cat1
dd['cat2'] = cat2
result.append(dd)
pd.DataFrame(result) # convert back
Okay, I don't know how performant this will be, but here's another approach
# test_data
df_dict = {
"user_id": [406225, 406225],
"item_timestamp": [1483229350, 1483229353],
"item_cashtags": ["CAKE|IWM|SDS", "SPY"],
"item_sectors": ["Services|Financial|Financial", "Financial"],
"item_industries": [
"Restaurants|Exchange Traded Fund|Exchange Traded Fund",
"Exchange Traded Fund"
]
}
df = pd.DataFrame(df_dict)
# which columns to split; all others should be "copied" over
split_cols = ["item_cashtags", "item_sectors", "item_industries"]
copy_cols = [col for col in df.columns if col not in split_cols]
# for each column, split on |. This gives a list, so values is an array of lists
# summing values concatenates these into one long list
new_df_dict = {col: df[col].str.split("|").values.sum() for col in split_cols}
# n_splits tells us how many times to replicate the values from the copied columns
# so that they'll match with the new number of rows from splitting the other columns
n_splits = df.item_cashtags.str.count("\|") + 1
# we turn each value into a list so that we can easily replicate them the proper
# number of times, then concatenate these lists like with the split columns
for col in copy_cols:
new_df_dict[col] = (df[col].map(lambda x: [x]) * n_splits).values.sum()
# now make a df back from the dict of columns
new_df = pd.DataFrame(new_df_dict)
# new_df
# item_cashtags item_sectors item_industries user_id item_timestamp
# 0 CAKE Services Restaurants 406225 1483229350
# 1 IWM Financial Exchange Traded Fund 406225 1483229350
# 2 SDS Financial Exchange Traded Fund 406225 1483229350
# 3 SPY Financial Exchange Traded Fund 406225 1483229353
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.