I have a dataframe like this:
id_1  id_desc    cat_1  cat_2
111   ask        ele    phone
222   ask hr     ele    phone
333   ask hr dk  ele    phone
444   askh       ele    phone
If cat_1 and cat_2 are the same for multiple id_1 values, that association needs to be captured in a new column.
I need output like this:
id_1  id_desc    cat_1  cat_2  id_2
111   ask        ele    phone  222
111   ask        ele    phone  333
111   ask        ele    phone  444
222   ask hr     ele    phone  111
222   ask hr     ele    phone  333
222   ask hr     ele    phone  444
333   ask hr dk  ele    phone  111
333   ask hr dk  ele    phone  222
333   ask hr dk  ele    phone  444
How can I get this done in Python?
I wasn't able to come up with anything particularly elegant, but this should get the job done:
import pandas as pd
import numpy as np

df = pd.DataFrame([[111, 'ask', 'ele', 'phone'],
                   [222, 'ask_hr', 'ele', 'phone'],
                   [333, 'ask_hr_dk', 'ele', 'phone'],
                   [444, 'askh', 'ele', 'phone']],
                  columns=['id_1', 'id_desc', 'cat_1', 'cat_2'])

grouped = df.groupby(by=['cat_1', 'cat_2'])  # group by the columns you want to be identical
data = []  # a list to store the expanded data of each group

# In your example, this loop is not needed, but this generalizes to more than 1 pair
# of cat_1 and cat_2 values
for group in grouped.groups:
    n_rows = grouped.get_group(group).shape[0]  # how many unique id's in a group
    all_data = np.tile(grouped.get_group(group).values, (n_rows, 1))  # tile the data n_rows times
    ids = np.repeat(grouped.get_group(group)['id_1'].values, n_rows)  # repeat each id n_rows times
    data += [np.c_[all_data, ids]]  # concat the two sets of data and add to list

df_2 = pd.DataFrame(np.concatenate(data), columns=['id_1', 'id_desc', 'cat_1', 'cat_2', 'id_2'])
The basic idea is to group your data by the cat_1 and cat_2 columns (using groupby), use np.tile to make as many copies of each group as there are unique values of id_1 in the group, and concatenate the result with the id_1 values repeated by np.repeat (one value per copy of the group).
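To see why tiling one array and repeating the other produces every pairing, here is a minimal sketch on a bare list of ids. It uses a 1-D array instead of the full group data, and the variable names are only for illustration:

```python
import numpy as np

ids = np.array([111, 222, 333])
n = len(ids)

# np.tile repeats the whole array: 111 222 333 111 222 333 ...
tiled = np.tile(ids, n)

# np.repeat repeats each element in place: 111 111 111 222 222 222 ...
repeated = np.repeat(ids, n)

# Stacking them as columns gives every (id_1, id_2) combination exactly once
pairs = np.c_[tiled, repeated]
print(pairs)
```

Because one column cycles fast and the other slow, the rows enumerate the full cross product of the ids, including the self-pairs such as (111, 111).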
If you don't want id_1 to ever be the same as id_2, just select the rows where they don't match:
df_2 = df_2[df_2['id_1'] != df_2['id_2']]
And if you want them sorted on id_1:
df_2 = df_2.sort_values('id_1')
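As a possible alternative to the groupby and np.tile approach, the same pairing can be sketched as a self-merge on the category columns. This is a different technique from the one above, not a restatement of it, and the names partners and pairs are just illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    [[111, 'ask', 'ele', 'phone'],
     [222, 'ask_hr', 'ele', 'phone'],
     [333, 'ask_hr_dk', 'ele', 'phone'],
     [444, 'askh', 'ele', 'phone']],
    columns=['id_1', 'id_desc', 'cat_1', 'cat_2'])

# Right-hand side of the join: the key columns plus the id, renamed to id_2
partners = df[['id_1', 'cat_1', 'cat_2']].rename(columns={'id_1': 'id_2'})

# Self-join on the category columns: each row meets every row of its group
pairs = df.merge(partners, on=['cat_1', 'cat_2'])

# Drop the self-pairs, then sort for readability
pairs = (pairs[pairs['id_1'] != pairs['id_2']]
         .sort_values(['id_1', 'id_2'])
         .reset_index(drop=True))
print(pairs)
```

With the sample data this should give 12 rows, three partner ids for each of the four id_1 values, and it keeps the original column dtypes rather than converting everything to object as the NumPy concatenation does.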