Filling in a new data frame based on two other data frames

My current code works but is very slow, so I am looking for a more efficient way to solve the problem below.

First, here is a dummy dataset.

import numpy as np
import pandas as pd    
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

df1= {'a0' : [1,2,2,1,3], 'a1' : [2,3,3,2,4], 'a2' : [3,4,4,3,5], 'a3' : [4,5,5,4,6], 'a4' : [5,6,6,5,7]}

df2 = {'b0' : [3,6,6,3,8], 'b1' : [6,8,8,6,9], 'b2' : [8,9,9,8,7], 'b3' : [9,7,7,9,2], 'b4' : [7,2,2,7,1]}

df1 = pd.DataFrame(df1)

df2 = pd.DataFrame(df2)

My actual dataset has more than 100,000 rows and 15 columns. What I want to do is a little complicated to explain, but here goes.

Goal: I want to create a new df from the two dfs above.

1) Find the global min and max of df1. Since the values are sorted within each row, column 'a0' always holds the row minimum and column 'a4' the row maximum. Therefore, I take the minimum of column 'a0' and the maximum of column 'a4'.

Min = df1['a0'].min()
Max = df1['a4'].max()

Min
Max
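
The shortcut above relies on every row of df1 being sorted. A minimal sketch of how that assumption could be checked on the dummy data; the names rows_sorted, global_min and global_max are introduced here for illustration only:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a0': [1, 2, 2, 1, 3], 'a1': [2, 3, 3, 2, 4],
                    'a2': [3, 4, 4, 3, 5], 'a3': [4, 5, 5, 4, 6],
                    'a4': [5, 6, 6, 5, 7]})

# True when every row is non-decreasing from left to right
rows_sorted = bool((np.diff(df1.to_numpy(), axis=1) >= 0).all())

# Global extremes computed without relying on the row order,
# for comparison with df1['a0'].min() and df1['a4'].max()
global_min = int(df1.to_numpy().min())
global_max = int(df1.to_numpy().max())
```

If rows_sorted is False for the real data, the order-independent global_min and global_max are the safer values to use.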

2) Then I create a data frame filled with 0s whose columns are the integers from Min to Max inclusive. In this case, 1 through 7.

column = []
for i in np.arange(Min, Max+1):
    column.append(i)

newdf = pd.DataFrame(0, index = df1.index, columns=column)

3) The third step is to find where the values from df2 will go:

I want to loop through each value in df1 and match it with the column of the same name in the new df, in the same row.

For example, in row 0 the values across the columns of df1 are [1,2,3,4,5]. So in row 0 of newdf, columns 1, 2, 3, 4 and 5 will be filled with the corresponding values from df2.

4) Lastly, each value of df2 at the same (row, column) position is written to the place found in step 3.

So, the very first row of the new df will look like this:

output = {'1' : [3], '2' : [6], '3' : [8], '4' : [9], '5' : [7], '6' : [0], '7' : [0]}

output = pd.DataFrame(output)

Columns 6 and 7 are not updated because 6 and 7 do not appear in the first row of df1.

Here is my code for this process:
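
As a small illustration of steps 3 and 4 for row 0 only, the mapping can be sketched like this. The names lo, hi and row0 are made up for the example, and it assumes the values within a row of df1 are unique:

```python
import pandas as pd

df1 = pd.DataFrame({'a0': [1, 2, 2, 1, 3], 'a1': [2, 3, 3, 2, 4],
                    'a2': [3, 4, 4, 3, 5], 'a3': [4, 5, 5, 4, 6],
                    'a4': [5, 6, 6, 5, 7]})
df2 = pd.DataFrame({'b0': [3, 6, 6, 3, 8], 'b1': [6, 8, 8, 6, 9],
                    'b2': [8, 9, 9, 8, 7], 'b3': [9, 7, 7, 9, 2],
                    'b4': [7, 2, 2, 7, 1]})

lo = int(df1.to_numpy().min())   # global minimum of df1
hi = int(df1.to_numpy().max())   # global maximum of df1

# A single zero-filled row whose labels are the integers lo..hi
row0 = pd.Series(0, index=range(lo, hi + 1))

# The values of df1 in row 0 act as labels; the values of df2 in row 0 are written there
row0.loc[df1.iloc[0].to_numpy()] = df2.iloc[0].to_numpy()
```

Labels that do not occur in row 0 of df1 keep their initial 0.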

for rowidx in range(0, len(df1)):
    for columnidx in range(0,len(df1.columns)):
        new_column = df1[str(df1.columns[columnidx])][rowidx] 
        newdf.loc[newdf.index[rowidx], new_column] = df2['b' + df1.columns[columnidx][1:]][rowidx]

I think this does the job, but as I said, my actual dataset is huge: 2999999 rows, and the Min-to-Max range is 282, which means 282 columns in the new data frame.

So the code above runs forever. If there is a faster way to do this, please help me with it. I think I learned about something like map and reduce, but I don't know whether that applies here or whether there are other ways to do this.

Thank you.

The idea is to give both DataFrames the same default column names, concat the two DataFrame.stack-ed Series, append the first column (0, which holds the df1 values) to the index, and drop the second index level (the original column position), so that the result can be reshaped with unstack:

df1.columns = range(len(df1.columns))
df2.columns = range(len(df2.columns))

newdf = (pd.concat([df1.stack(), df2.stack()], axis=1)
           .set_index(0, append=True)
           .reset_index(level=1, drop=True)[1]
           .unstack(fill_value=0)
           .rename_axis(None, axis=1))
print (newdf)
   1  2  3  4  5  6  7
0  3  6  8  9  7  0  0
1  0  6  8  9  7  2  0
2  0  6  8  9  7  2  0
3  3  6  8  9  7  0  0
4  0  0  8  9  7  2  1
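
To make the chain easier to follow, here is the same approach split into intermediate steps. This is only a sketch of the answer's code on the dummy data; the names a, b, pairs, keyed and wide are introduced for illustration:

```python
import pandas as pd

a = pd.DataFrame({'a0': [1, 2, 2, 1, 3], 'a1': [2, 3, 3, 2, 4],
                  'a2': [3, 4, 4, 3, 5], 'a3': [4, 5, 5, 4, 6],
                  'a4': [5, 6, 6, 5, 7]})
b = pd.DataFrame({'b0': [3, 6, 6, 3, 8], 'b1': [6, 8, 8, 6, 9],
                  'b2': [8, 9, 9, 8, 7], 'b3': [9, 7, 7, 9, 2],
                  'b4': [7, 2, 2, 7, 1]})

# Same positional column names in both frames so the stacked indexes align
a.columns = range(a.shape[1])
b.columns = range(b.shape[1])

# Step 1: one row per (row label, column position);
# column 0 holds the value from a, column 1 the value from b
pairs = pd.concat([a.stack(), b.stack()], axis=1)

# Step 2: move the value from a into the index, drop the column-position level,
# and keep column 1 as a Series indexed by (row label, value from a)
keyed = pairs.set_index(0, append=True).reset_index(level=1, drop=True)[1]

# Step 3: pivot the values from a into columns, filling the gaps with 0
wide = keyed.unstack(fill_value=0).rename_axis(None, axis=1)
```

Note that unstack needs each (row label, value) pair to be unique, so this relies on no value being repeated within a row of df1.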

Another solution:

comp =[pd.Series(a, index=df1.loc[i]) for i, a in enumerate(df2.values)]
df = pd.concat(comp, axis=1).T.fillna(0).astype(int)
print (df)
   1  2  3  4  5  6  7
0  3  6  8  9  7  0  0
1  0  6  8  9  7  2  0
2  0  6  8  9  7  2  0
3  3  6  8  9  7  0  0
4  0  0  8  9  7  2  1

Or:

comp = [dict(zip(x, y)) for x, y in zip(df1.values, df2.values)]
c = pd.DataFrame(comp).fillna(0).astype(int)
print (c)
   1  2  3  4  5  6  7
0  3  6  8  9  7  0  0
1  0  6  8  9  7  2  0
2  0  6  8  9  7  2  0
3  3  6  8  9  7  0  0
4  0  0  8  9  7  2  1
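
A further option, not part of the answer above, is to skip the Python-level loops and write all values at once with NumPy integer-array indexing. This is a minimal sketch on the dummy data, assuming the values of df1 are integers and are unique within each row; the names keys, vals, lo, hi and out are illustrative:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a0': [1, 2, 2, 1, 3], 'a1': [2, 3, 3, 2, 4],
                    'a2': [3, 4, 4, 3, 5], 'a3': [4, 5, 5, 4, 6],
                    'a4': [5, 6, 6, 5, 7]})
df2 = pd.DataFrame({'b0': [3, 6, 6, 3, 8], 'b1': [6, 8, 8, 6, 9],
                    'b2': [8, 9, 9, 8, 7], 'b3': [9, 7, 7, 9, 2],
                    'b4': [7, 2, 2, 7, 1]})

keys = df1.to_numpy()            # target column labels, one per cell
vals = df2.to_numpy()            # values to write, same shape as keys
lo, hi = int(keys.min()), int(keys.max())

# Zero-filled result with one column per integer in lo..hi
out = np.zeros((len(df1), hi - lo + 1), dtype=vals.dtype)

# Row numbers as a column vector so they broadcast against keys
rows = np.arange(len(df1))[:, None]

# Cell (i, j) of df2 goes to row i, column offset keys[i, j] - lo
out[rows, keys - lo] = vals

newdf = pd.DataFrame(out, index=df1.index, columns=np.arange(lo, hi + 1))
```

On the dummy data this produces the same table as the solutions above. How it compares in speed and memory on the full dataset has not been measured here.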
