Creating a new column and rows in dataframe pyspark
I have a dataframe like this:

id_1  id_desc    cat_1  cat_2
111   ask        ele    phone
222   ask_hr     ele    phone
333   ask_hr_dk  ele    phone
444   askh       ele    phone

If cat_1 and cat_2 are the same for several values of id_1, that association needs to be captured as a new column. The desired output is:
id_1  id_desc    cat_1  cat_2  id_2
111   ask        ele    phone  222
111   ask        ele    phone  333
111   ask        ele    phone  444
222   ask_hr     ele    phone  111
222   ask_hr     ele    phone  333
222   ask_hr     ele    phone  444
333   ask_hr_dk  ele    phone  111
333   ask_hr_dk  ele    phone  222
333   ask_hr_dk  ele    phone  444
How can I do this in python?
I couldn't come up with anything particularly elegant, but this should do the job:
import pandas as pd
import numpy as np

df = pd.DataFrame([[111, 'ask', 'ele', 'phone'],
                   [222, 'ask_hr', 'ele', 'phone'],
                   [333, 'ask_hr_dk', 'ele', 'phone'],
                   [444, 'askh', 'ele', 'phone']],
                  columns=['id_1', 'id_desc', 'cat_1', 'cat_2'])

grouped = df.groupby(by=['cat_1', 'cat_2'])  # group by the columns you want to be identical

data = []  # a list to store all unique groups

# In your example this loop is not needed, but it generalizes to more than
# one pair of cat_1 and cat_2 values
for group in grouped.groups:
    n_rows = grouped.get_group(group).shape[0]  # how many unique ids in a group
    all_data = np.tile(grouped.get_group(group).values, (n_rows, 1))  # tile the data n_rows times
    ids = np.repeat(grouped.get_group(group)['id_1'].values, n_rows)  # repeat each id n_rows times
    data += [np.c_[all_data, ids]]  # concatenate the two arrays and add to the list

df_2 = pd.DataFrame(np.concatenate(data), columns=['id_1', 'id_desc', 'cat_1', 'cat_2', 'id_2'])
The basic idea is to group the data by the cat_1 and cat_2 columns (using groupby), make as many copies of each group as there are unique id_1 values in that group (using np.tile), and then pair the copies with the repeated id_1 values (one id per copy of the group).
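The tile/repeat pairing above can be seen on a tiny standalone example: tiling cycles through the values while repeating holds each value fixed, so together the two columns enumerate every ordered pair.

```python
import numpy as np

ids = np.array([111, 222, 333])

tiled = np.tile(ids, 3)       # cycles: 111, 222, 333, 111, 222, 333, ...
repeated = np.repeat(ids, 3)  # blocks: 111, 111, 111, 222, 222, 222, ...

# Side by side, each row is one (id_1, id_2) pair, including self-pairs
pairs = np.c_[tiled, repeated]
print(pairs)
```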
If you don't want rows where id_1 equals id_2, simply select the rows that don't match:
df_2 = df_2[df_2['id_1'] != df_2['id_2']]
And if you want the result sorted on id_1:
df_2.sort_values('id_1', inplace=True)
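For comparison (this is not part of the original answer, just a sketch of an alternative), the same pairing can also be expressed as a pandas self-merge on the category columns, which avoids the explicit loop:

```python
import pandas as pd

df = pd.DataFrame([[111, 'ask', 'ele', 'phone'],
                   [222, 'ask_hr', 'ele', 'phone'],
                   [333, 'ask_hr_dk', 'ele', 'phone'],
                   [444, 'askh', 'ele', 'phone']],
                  columns=['id_1', 'id_desc', 'cat_1', 'cat_2'])

# Join the frame to itself on (cat_1, cat_2): every id_1 pairs with every
# id_1 sharing the same category values, then drop the self-pairs.
right = df[['id_1', 'cat_1', 'cat_2']].rename(columns={'id_1': 'id_2'})
merged = df.merge(right, on=['cat_1', 'cat_2'])
merged = merged[merged['id_1'] != merged['id_2']].sort_values('id_1')
print(merged)
```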