Pandas: populate DataFrame columns for each row from arrays stored in a column

Question

I have a large Pandas DataFrame with the following structure:

import pandas as pd

data = {'id': [3, 5, 9, 12], 
        'names': ["{name1,name2,name3}", "{name1,name3}", "{name1,name2}", "{name2,name1,name3}"],
        'values':["{N,Y,N}", "{N,N}", "{Y,N}", "{N,Y,Y}"]
       }

df = pd.DataFrame(data)
df

Note that the names are not always in the same order, and not every name appears for each id. However, within each row the order of the values matches the order of the names.
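Because the conversion depends on names and values lining up position by position, it may be worth confirming that every row holds the same number of items in both columns. The following is only a minimal sketch on the sample data; the variable names `n_names`, `n_values`, and `mismatched` are illustrative and not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [3, 5, 9, 12],
    'names': ["{name1,name2,name3}", "{name1,name3}", "{name1,name2}", "{name2,name1,name3}"],
    'values': ["{N,Y,N}", "{N,N}", "{Y,N}", "{N,Y,Y}"],
})

# Number of comma-separated items in each cell: number of commas + 1.
n_names = df['names'].str.count(',') + 1
n_values = df['values'].str.count(',') + 1

# ids of rows whose counts differ; such rows cannot be paired by position.
mismatched = df.loc[n_names != n_values, 'id'].tolist()
print(mismatched)
```

An empty list means every row can be paired by position. The check assumes that no cell is empty and that no individual name or value contains a comma.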

I would like to turn this table into the following structure as efficiently as possible:

data = {'id': [3, 5, 9, 12], 
        'name1': ["N", "N", "Y", "Y"],
        'name2': ["Y", " ", "N", "N"],
        'name3': ["N", "N", " ", "Y"],
       }

df = pd.DataFrame(data)
df

Currently I accomplish this with the following subroutine, which goes through the df row by row, builds lists of the names and values, and then writes those values into new columns. This works correctly, but it is very slow (estimated at ~14 hrs) since my df is large (~2e5 rows) and each row or id can have up to 194 names, i.e. "{name1, name2, ..., name193, name194}".

def add_name_cols(df, title_col, value_col):
    nRows = len(df)
    for index,row in df.iterrows(): # parse rows and replace characters
        title_spl = [ i for i in row[title_col].replace('{','').replace('}','').split(',') ]
        value_spl = [ i for i in row[value_col].replace('{','').replace('}','').split(',') ]
        i = 0
        for t in title_spl: # add value in correct column for this row
            print('Progress rows: {0:2.2f}%, Progress columns: {1:2.2f}%'.format(float(index)/float(nRows)*100, float(i)/float(194)*100), end='\r')
            df.loc[index,t] = value_spl[i]
            i += 1
    return df

df_new = add_name_cols(df, 'names', 'values')
df_new

Is there a way to accomplish this manipulation using more of Pandas' built-in methods that would expedite this process?

Answer 1

Use string methods and the dict constructor inside a list comprehension.

(i) Convert df[['names','values']] to a list of lists.

(ii) Iterate over each pair (i.e. each row of df), use str.strip and str.split to turn the two strings into a pair of lists, then zip the lists together and pass the result to the dict constructor.

(iii) Pass the resulting list of dictionaries to pd.DataFrame. Names missing from a row come out as NaN, which fillna('') replaces with an empty string.
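The steps above can be traced on a single row before reading the compact version. This is a minimal sketch: `pair`, `row_dict`, and `frame` are illustrative names, and the second dictionary passed to pd.DataFrame is written by hand to stand in for a row that lacks name2.

```python
import pandas as pd

# One [names, values] pair, as produced by step (i) for the row with id 12.
pair = ["{name2,name1,name3}", "{N,Y,Y}"]

# Step (ii): strip the braces and split each string on commas.
names, values = [x.strip('{}').split(',') for x in pair]

# zip pairs the i-th name with the i-th value; dict turns the pairs into a mapping.
row_dict = dict(zip(names, values))
print(row_dict)

# Step (iii): pd.DataFrame aligns the dictionaries by key, so a name that is
# absent from a row ends up as NaN in that row.
frame = pd.DataFrame([row_dict, {'name1': 'N', 'name3': 'N'}])
print(frame)
```

The full solution applies the same logic to every row inside one list comprehension.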

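# Build one {name: value} dict per row; pd.DataFrame aligns the keys into columns
records = [dict(zip(*[x.strip('{}').split(',') for x in pair]))
           for pair in df[['names','values']].to_numpy().tolist()]
temp = pd.DataFrame(records).fillna('')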
df[temp.columns] = temp
df = df.drop(['names','values'], axis=1)

Output:

   id name1 name2 name3
0   3     N     Y     N
1   5     N           N
2   9     Y     N      
3  12     Y     N     Y
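Since the question asks specifically about Pandas' built-in methods, here is a different technique from the list comprehension above: split both columns with the .str accessor, explode them together, and pivot the names into columns. It is a sketch under stated assumptions rather than a drop-in replacement. It assumes pandas 1.3 or later (needed for multi-column explode), that every row has equally many names and values, and that each (id, name) pair occurs only once. The names `long` and `wide` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [3, 5, 9, 12],
    'names': ["{name1,name2,name3}", "{name1,name3}", "{name1,name2}", "{name2,name1,name3}"],
    'values': ["{N,Y,N}", "{N,N}", "{Y,N}", "{N,Y,Y}"],
})

# Strip the braces and split each string into a list, one column at a time.
long = df.assign(
    names=df['names'].str.strip('{}').str.split(','),
    values=df['values'].str.strip('{}').str.split(','),
)

# Explode both list columns together: one (id, name, value) triple per row.
# Passing a list of columns to explode requires pandas >= 1.3.
long = long.explode(['names', 'values'])

# Turn the names into columns; ids lacking a name get NaN, replaced here by ''.
wide = (
    long.pivot(index='id', columns='names', values='values')
        .fillna('')
        .rename_axis(columns=None)
        .reset_index()
)
print(wide)
```

Note that pivot sorts the result by id and orders the name columns alphabetically, so the row order can differ from the input. Whether this is faster than the list comprehension on ~2e5 rows has not been measured here and would need to be timed on the real data.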

Solution 1
1 2022-01-07 01:23:31