Pandas, populate dataframe columns for each row from array in column

I have a large Pandas DataFrame with the following structure:

import pandas as pd

data = {'id': [3, 5, 9, 12],
        'names': ["{name1,name2,name3}", "{name1,name3}", "{name1,name2}", "{name2,name1,name3}"],
        'values': ["{N,Y,N}", "{N,N}", "{Y,N}", "{N,Y,Y}"]
       }

df = pd.DataFrame(data)
df

Note that the names are not always in the same order, and not every name is included for each id; however, the order of the values does correspond to the order of the names within each row.
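To make that correspondence concrete, here is a minimal sketch of how a single row's two strings pair up. The helper name `parse_row` is hypothetical (it is not part of the original post), and it assumes the strings contain no spaces after the commas, as in the sample data.

```python
# Hypothetical helper: turn one row's brace-delimited strings into a
# {name: value} mapping. Assumes no whitespace around the commas.
def parse_row(names, values):
    name_list = names.strip('{}').split(',')    # "{a,b}" -> ['a', 'b']
    value_list = values.strip('{}').split(',')
    # Values are matched to names by position within the same row.
    return dict(zip(name_list, value_list))

# The row with id 12 lists name2 first, so its first value belongs to name2.
row_12 = parse_row("{name2,name1,name3}", "{N,Y,Y}")
print(row_12)
```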

I would like to turn this table into the following structure as efficiently as possible:

data = {'id': [3, 5, 9, 12], 
        'name1': ["N", "N", "Y", "Y"],
        'name2': ["Y", " ", "N", "N"],
        'name3': ["N", "N", " ", "Y"],
       }

df = pd.DataFrame(data)
df

Currently I am accomplishing this with the following subroutine, which goes through the df row by row, builds lists of the names and values, and then writes those values into new columns. This works correctly, but it is very slow (estimated at ~14 hrs) because my df is large (~2e5 rows) and each row or id can have up to 194 names, i.e. "{name1, name2, ..., name193, name194}".

def add_name_cols(df, title_col, value_col):
    nRows = len(df)
    for index,row in df.iterrows(): # parse rows and replace characters
        title_spl = [ i for i in row[title_col].replace('{','').replace('}','').split(',') ]
        value_spl = [ i for i in row[value_col].replace('{','').replace('}','').split(',') ]
        i = 0
        for t in title_spl: # add value in correct column for this row
            print('Progress rows: {0:2.2f}%, Progress columns: {1:2.2f}%'.format(float(index)/float(nRows)*100, float(i)/float(194)*100), end='\r')
            df.loc[index,t] = value_spl[i]
            i += 1
    return df

df_new = add_name_cols(df, 'names', 'values')
df_new

Is there a way to accomplish this manipulation using more of Pandas' built-in methods that would expedite this process?

Use string methods and the dict constructor inside a list comprehension.

(i) Convert df[['names','values']] to a list of lists.

(ii) Iterate over each pair (i.e. each row of df), use str.strip and str.split to turn it into a pair of lists, then unpack that pair into zip and pass the result to the dict constructor.

(iii) Pass the resulting list of dictionaries to pd.DataFrame and fill the missing entries with empty strings.

temp = pd.DataFrame([
    dict(zip(*[x.strip('{}').split(',') for x in pair]))  # one {name: value} dict per row
    for pair in df[['names','values']].to_numpy().tolist()
]).fillna('')  # names missing from a row become empty strings
df[temp.columns] = temp
df = df.drop(['names','values'], axis=1)

Output:

   id name1 name2 name3
0   3     N     Y     N
1   5     N           N
2   9     Y     N      
3  12     Y     N     Y
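
For reuse, the same steps could be wrapped in a function. The following is only a sketch under a few assumptions: the function name `expand_name_cols` and its `fill` parameter are hypothetical, it assumes none of the parsed names collides with an existing column such as `id`, and it joins on the index rather than assigning columns in place. Passing `fill=' '` reproduces the single-space placeholder used in the desired table from the question, whereas the code above fills with empty strings.

```python
import pandas as pd

def expand_name_cols(df, names_col='names', values_col='values', fill=''):
    """Hypothetical wrapper around the list-comprehension approach."""
    # One {name: value} dict per row, paired by position within the row.
    records = [
        dict(zip(n.strip('{}').split(','), v.strip('{}').split(',')))
        for n, v in zip(df[names_col], df[values_col])
    ]
    # Reuse df's index so the join below lines rows up correctly.
    wide = pd.DataFrame(records, index=df.index).fillna(fill)
    return df.drop(columns=[names_col, values_col]).join(wide)

data = {'id': [3, 5, 9, 12],
        'names': ["{name1,name2,name3}", "{name1,name3}",
                  "{name1,name2}", "{name2,name1,name3}"],
        'values': ["{N,Y,N}", "{N,N}", "{Y,N}", "{N,Y,Y}"]}
df = pd.DataFrame(data)

result = expand_name_cols(df, fill=' ')
print(result)
```

The per-row work here is plain Python string handling, and the DataFrame is built once at the end instead of being written to cell by cell with df.loc, which is the costly part of the original loop.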
