
Creating a new column and rows in a PySpark dataframe

I have a dataframe like this:

id_1    id_desc    cat_1    cat_2
111      ask        ele     phone
222      ask hr     ele     phone
333      ask hr dk  ele     phone
444      askh       ele     phone

If cat_1 and cat_2 are the same for multiple values of id_1, that association needs to be captured in a new column.

I need an output like this:

id_1    id_desc        cat_1    cat_2   id_2
111      ask             ele    phone   222
111      ask             ele    phone   333
111      ask             ele    phone   444
222      ask hr          ele    phone   111
222      ask hr          ele    phone   333
222      ask hr          ele    phone   444
333      ask hr dk       ele    phone   111
333      ask hr dk       ele    phone   222
333      ask hr dk       ele    phone   444

How can this be done in Python?

I wasn't able to come up with anything particularly elegant, but this should get the job done:

import pandas as pd
import numpy as np

df = pd.DataFrame([[111, 'ask', 'ele', 'phone'], 
                   [222, 'ask hr', 'ele', 'phone'], 
                   [333, 'ask hr dk', 'ele', 'phone'], 
                   [444, 'askh', 'ele', 'phone']], 
                  columns=['id_1', 'id_desc', 'cat_1', 'cat_2'])

grouped = df.groupby(by=['cat_1', 'cat_2'])  # group by the columns you want to be identical

data = []  # a list to store all unique groups

# In your example, this loop is not needed, but this generalizes to more than 1 pair
# of cat_1 and cat_2 values
for group_key in grouped.groups:
    group_df = grouped.get_group(group_key)
    n_rows = group_df.shape[0]                        # how many unique ids in the group
    all_data = np.tile(group_df.values, (n_rows, 1))  # tile the group's rows n_rows times
    ids = np.repeat(group_df['id_1'].values, n_rows)  # repeat each id n_rows times
    data.append(np.c_[all_data, ids])                 # pair the tiled rows with the ids

df_2 = pd.DataFrame(np.concatenate(data), columns=['id_1', 'id_desc', 'cat_1', 'cat_2', 'id_2'])

The basic idea is to group your data by the cat_1 and cat_2 columns (using groupby ), use np.tile to create copies of each group as many times as there are rows in the group, use np.repeat to repeat each id_1 value that many times, and concatenate the two so that every row in a group gets paired with every id_1 in that group.
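As an aside, the tile/repeat pairing is equivalent to a self-merge on the grouping columns: merging the dataframe with itself on cat_1 and cat_2 produces every cross-pairing within a group directly. A sketch of that alternative (same column names as above):

```python
import pandas as pd

df = pd.DataFrame([[111, 'ask', 'ele', 'phone'],
                   [222, 'ask hr', 'ele', 'phone'],
                   [333, 'ask hr dk', 'ele', 'phone'],
                   [444, 'askh', 'ele', 'phone']],
                  columns=['id_1', 'id_desc', 'cat_1', 'cat_2'])

# Self-merge on the category columns; the right side contributes only id_2
right = df[['id_1', 'cat_1', 'cat_2']].rename(columns={'id_1': 'id_2'})
df_2 = df.merge(right, on=['cat_1', 'cat_2'])

# Drop the rows that pair an id with itself
df_2 = df_2[df_2['id_1'] != df_2['id_2']].reset_index(drop=True)
```

This avoids the explicit loop and the NumPy juggling, at the cost of building all pairs (including self-pairs) before filtering.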

If you don't want id_1 to ever be the same as id_2 , just select the rows where they don't match:

df_2 = df_2[df_2['id_1'] != df_2['id_2']] 

And if you want them sorted on id_1 :

df_2.sort_values('id_1', inplace=True)

