Filling in a new data frame based on two other data frames
I am looking for an efficient way to solve the problem below, because my current code is too slow.
First of all, let me provide a dummy dataset.
import numpy as np
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
df1= {'a0' : [1,2,2,1,3], 'a1' : [2,3,3,2,4], 'a2' : [3,4,4,3,5], 'a3' : [4,5,5,4,6], 'a4' : [5,6,6,5,7]}
df2 = {'b0' : [3,6,6,3,8], 'b1' : [6,8,8,6,9], 'b2' : [8,9,9,8,7], 'b3' : [9,7,7,9,2], 'b4' : [7,2,2,7,1]}
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
My actual dataset has more than 100,000 rows and 15 columns. What I want to do is a little complicated to explain, but here we go.
Goal: I want to create a new df from the two dfs above.
1) Find the global min and max of df1. Since the values are sorted within each row, column 'a0' always holds the row minimum and column 'a4' the row maximum. Therefore, I take the minimum of column 'a0' and the maximum of column 'a4'.
Min = df1['a0'].min()
Max = df1['a4'].max()
Min
Max
2) Then I create a data frame filled with 0s whose columns are the integers from Min to Max inclusive. In this case, 1 through 7.
column = []
for i in np.arange(Min, Max+1):
    column.append(i)
newdf = pd.DataFrame(0, index = df1.index, columns=column)
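Steps 1 and 2 can also be written more compactly. The following is only a sketch under assumptions: the names `lo` and `hi` are illustrative and not from the original post, and the min and max are taken over every cell rather than over 'a0' and 'a4', so the result does not depend on each row being sorted.

```python
import pandas as pd

df1 = pd.DataFrame({'a0': [1, 2, 2, 1, 3], 'a1': [2, 3, 3, 2, 4],
                    'a2': [3, 4, 4, 3, 5], 'a3': [4, 5, 5, 4, 6],
                    'a4': [5, 6, 6, 5, 7]})

# Global min and max over all cells (hypothetical names lo/hi).
lo = int(df1.to_numpy().min())
hi = int(df1.to_numpy().max())

# range(lo, hi + 1) supplies the column labels directly, so no append loop is needed.
newdf = pd.DataFrame(0, index=df1.index, columns=range(lo, hi + 1))
print(newdf.shape)  # (5, 7)
```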
3) The third step is to find the place where the values from df2 will go:
I want to loop through each value in df1 and match it with the column of the same name in the same row of the new df.
For example, going through each column of row 0 gives the values [1,2,3,4,5]. Row 0 of newdf will then be filled in columns 1, 2, 3, 4, 5 with the corresponding values from df2.
4) Lastly, each corresponding value in df2 (same position) is written to the place found in step 3.
So, the very first row of the new df will look like this:
output = {'1' : [3], '2' : [6], '3' : [8], '4' : [9], '5' : [7], '6' : [0], '7' : [0]}
output = pd.DataFrame(output)
Columns 6 and 7 are not updated because the very first row of df1 does not contain 6 or 7.
Here is my code for this process:
for rowidx in range(0, len(df1)):
    for columnidx in range(0, len(df1.columns)):
        new_column = df1[str(df1.columns[columnidx])][rowidx]
        newdf.loc[newdf.index[rowidx], new_column] = df2['b' + df1.columns[columnidx][1:]][rowidx]
I think this does the job, but as I said, my actual dataset is huge: 2999999 rows, and the Min to Max range is 282, which means 282 columns in the new data frame.
So the code above runs forever. If there is a faster way to do this, please help me with it. I think I learned about something like map and reduce, but I don't know whether that applies here or whether there are other ways to do this.
Thank you.
The idea is to create default column names in both DataFrames, then concat the DataFrame.stack-ed Series, append the first column (0) to the index, and remove the second index level, which makes it possible to use DataFrame.unstack:
df1.columns = range(len(df1.columns))
df2.columns = range(len(df2.columns))
newdf = (pd.concat([df1.stack(), df2.stack()], axis=1)
.set_index(0, append=True)
.reset_index(level=1, drop=True)[1]
.unstack(fill_value=0)
.rename_axis(None, axis=1))
print (newdf)
1 2 3 4 5 6 7
0 3 6 8 9 7 0 0
1 0 6 8 9 7 2 0
2 0 6 8 9 7 2 0
3 3 6 8 9 7 0 0
4 0 0 8 9 7 2 1
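To make the chained expression above easier to follow, here is the same pipeline split into intermediate steps. The variable names `stacked`, `s`, and `wide` are illustrative only and do not appear in the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({'a0': [1, 2, 2, 1, 3], 'a1': [2, 3, 3, 2, 4],
                    'a2': [3, 4, 4, 3, 5], 'a3': [4, 5, 5, 4, 6],
                    'a4': [5, 6, 6, 5, 7]})
df2 = pd.DataFrame({'b0': [3, 6, 6, 3, 8], 'b1': [6, 8, 8, 6, 9],
                    'b2': [8, 9, 9, 8, 7], 'b3': [9, 7, 7, 9, 2],
                    'b4': [7, 2, 2, 7, 1]})

# Give both frames the same positional labels so the stacked Series align.
df1.columns = range(len(df1.columns))
df2.columns = range(len(df2.columns))

# Long format: index is (row, position); column 0 holds the df1 value
# (the target column label) and column 1 holds the df2 value (the cell content).
stacked = pd.concat([df1.stack(), df2.stack()], axis=1)

# Move the df1 value into the index and drop the position level, leaving
# a Series of df2 values indexed by (row, target column label).
s = stacked.set_index(0, append=True).reset_index(level=1, drop=True)[1]

# Pivot the inner index level into columns, filling the gaps with 0.
wide = s.unstack(fill_value=0).rename_axis(None, axis=1)
print(wide.iloc[0].tolist())  # [3, 6, 8, 9, 7, 0, 0]
```

Note that unstack needs each (row, label) pair to be unique, which holds here because no value repeats within a row of df1.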
Another solution:
comp =[pd.Series(a, index=df1.loc[i]) for i, a in enumerate(df2.values)]
df = pd.concat(comp, axis=1).T.fillna(0).astype(int)
print (df)
1 2 3 4 5 6 7
0 3 6 8 9 7 0 0
1 0 6 8 9 7 2 0
2 0 6 8 9 7 2 0
3 3 6 8 9 7 0 0
4 0 0 8 9 7 2 1
Or:
comp = [dict(zip(x, y)) for x, y in zip(df1.values, df2.values)]
c = pd.DataFrame(comp).fillna(0).astype(int)
print (c)
1 2 3 4 5 6 7
0 3 6 8 9 7 0 0
1 0 6 8 9 7 2 0
2 0 6 8 9 7 2 0
3 3 6 8 9 7 0 0
4 0 0 8 9 7 2 1
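For a dataset of the size described in the question, a plain NumPy approach might also be worth trying. This is a different technique from the solutions above (integer array indexing instead of stack/unstack or per-row dicts), sketched under two assumptions: the values in df1 are integers, and no value repeats within a row. The names `vals`, `fill`, `out`, and `result` are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a0': [1, 2, 2, 1, 3], 'a1': [2, 3, 3, 2, 4],
                    'a2': [3, 4, 4, 3, 5], 'a3': [4, 5, 5, 4, 6],
                    'a4': [5, 6, 6, 5, 7]})
df2 = pd.DataFrame({'b0': [3, 6, 6, 3, 8], 'b1': [6, 8, 8, 6, 9],
                    'b2': [8, 9, 9, 8, 7], 'b3': [9, 7, 7, 9, 2],
                    'b4': [7, 2, 2, 7, 1]})

vals = df1.to_numpy()   # target column labels, one row per record
fill = df2.to_numpy()   # values to write at those labels
lo, hi = vals.min(), vals.max()

# One zero-filled array covering every label from lo to hi inclusive.
out = np.zeros((len(df1), hi - lo + 1), dtype=fill.dtype)

# Row numbers as a column vector; they broadcast against the label array,
# so every (row, label - lo) cell is assigned in a single operation.
rows = np.arange(len(df1))[:, None]
out[rows, vals - lo] = fill

result = pd.DataFrame(out, index=df1.index, columns=np.arange(lo, hi + 1))
print(result.iloc[4].tolist())  # [0, 0, 8, 9, 7, 2, 1]
```

A dense int64 array of 2999999 rows by 282 columns takes roughly 6.8 GB, so a smaller dtype passed to np.zeros may be needed if memory is tight.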