
Creating a new column and rows in dataframe pyspark

I have a dataframe like this:

id_1    id_desc    cat_1    cat_2
111      ask        ele     phone
222      ask hr     ele     phone
333      ask hr dk  ele     phone
444      askh       ele     phone

If cat_1 and cat_2 are the same for more than one id_1, that association needs to be captured as a new column.

The output I need looks like this:

id_1    id_desc        cat_1    cat_2   id_2
111      ask             ele    phone   222
111      ask             ele    phone   333
111      ask             ele    phone   444
222      ask hr          ele    phone   111
222      ask hr          ele    phone   333
222      ask hr          ele    phone   444
333      ask hr dk       ele    phone   111
333      ask hr dk       ele    phone   222
333      ask hr dk       ele    phone   444

How can this be done in python?

I can't think of anything particularly elegant, but this should do the job:

import pandas as pd
import numpy as np

df = pd.DataFrame([[111, 'ask', 'ele', 'phone'], 
                   [222, 'ask_hr', 'ele', 'phone'], 
                   [333, 'ask_hr_dk', 'ele', 'phone'], 
                   [444, 'askh', 'ele', 'phone']], 
                   columns=['id_1', 'id_desc', 'cat_1', 'cat_2'])

grouped = df.groupby(by=['cat_1', 'cat_2'])  # group by the columns you want to be identical

data = []  # a list to store all unique groups

# In your example, this loop is not needed, but this generalizes to more than 1 pair
# of cat_1 and cat_2 values
for group in grouped.groups:  
    n_rows = grouped.get_group(group).shape[0]  # how many unique id's in a group
    all_data = np.tile(grouped.get_group(group).values, (n_rows, 1))  # tile the data n_row times
    ids = np.repeat(grouped.get_group(group)['id_1'].values, n_rows)  # repeat the ids n_row times
    data += [np.c_[all_data, ids]]  # concat the two sets of data and add to list

df_2 = pd.DataFrame(np.concatenate(data), columns=['id_1', 'id_desc', 'cat_1', 'cat_2', 'id_2'])

The basic idea is to group the data by the cat_1 and cat_2 columns (using groupby), use np.tile to make as many copies of each group as there are unique id_1 values in that group, and then concatenate the result with those unique id_1 values repeated (one value per copy of the group's rows), which becomes the id_2 column.
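To see how the tile/repeat pairing lines up, here is a tiny illustration of the same pattern using only the id_1 values of the example group (the array name is just for illustration):

import numpy as np

ids = np.array([111, 222, 333, 444])  # id_1 values of one (cat_1, cat_2) group
print(np.tile(ids, 4))    # [111 222 333 444 111 222 333 444 ...] -> id_1 cycles through the copied rows
print(np.repeat(ids, 4))  # [111 111 111 111 222 222 222 222 ...] -> becomes the id_2 column
# Pairing the two element-wise yields every ordered (id_1, id_2) combination
# within the group, including the self-pairs that are filtered out below.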

If you don't want id_1 and id_2 to ever be equal, just select the rows where they don't match:

df_2 = df_2[df_2['id_1'] != df_2['id_2']] 

And if you want the result ordered by id_1:

df_2.sort_values('id_1', inplace=True)
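
Since the question title mentions PySpark, the same pairing can also be expressed there as a self-join on cat_1 and cat_2. This is only a rough sketch (the SparkSession setup and the id_2 alias are assumptions, not part of the original answer):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

sdf = spark.createDataFrame(
    [(111, 'ask', 'ele', 'phone'),
     (222, 'ask hr', 'ele', 'phone'),
     (333, 'ask hr dk', 'ele', 'phone'),
     (444, 'askh', 'ele', 'phone')],
    ['id_1', 'id_desc', 'cat_1', 'cat_2'])

# Join the frame to itself on the category columns; the right side only
# contributes its id, renamed to id_2.
right = sdf.select(F.col('id_1').alias('id_2'), 'cat_1', 'cat_2')
result = (sdf.join(right, on=['cat_1', 'cat_2'])
             .where(F.col('id_1') != F.col('id_2'))   # drop self-pairs
             .select('id_1', 'id_desc', 'cat_1', 'cat_2', 'id_2')
             .orderBy('id_1', 'id_2'))
result.show()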

