Compare 2 Pandas dataframes, row by row, cell by cell

I have 2 dataframes, df1 and df2, and want to do the following, storing the results in df3:

for each row in df1:
    for each row in df2:
        create a new row in df3 (called "df1-1, df2-1" or whatever) to store results
        for each cell (column) in df1:
            for the cell in df2 whose column name is the same as for the cell in df1:
                compare the cells (using some comparing function func(a, b)) and,
                depending on the result of the comparison, write the result into the
                appropriate column of the "df1-1, df2-1" row of df3

For example, something like:

df1
A   B    C      D
foo bar  foobar 7
gee whiz herp   10

df2
A   B   C      D
zoo car foobar 8

df3
df1-df2 A             B              C                   D
foo-zoo func(foo,zoo) func(bar,car)  func(foobar,foobar) func(7,8)
gee-zoo func(gee,zoo) func(whiz,car) func(herp,foobar)   func(10,8)

I've started with this:

for r1 in df1.iterrows():
    for r2 in df2.iterrows():
        for c1 in r1:
            for c2 in r2:

but am not sure what to do with it, and would appreciate some help.

To continue the discussion in the comments: you can use vectorization, which is one of the selling points of a library like pandas or NumPy. Ideally, you shouldn't ever be calling iterrows(). To be a little more explicit with my suggestion:

# with df1 and df2 provided as above, an example
df3 = df1['A'] * 3 + df2['A']

# recall that df2 has only one row: pandas aligns the two Series on their index
# labels and fills NaN where df2 has no matching row (df3 is a Series here)
df3
0    foofoofoozoo
1             NaN
Name: A, dtype: object

# more generally

# we know that df1 and df2 share column names, so we can initialize df3 with those names
# (requires: import pandas as pd)
df3 = pd.DataFrame(columns=df1.columns)
for colName in df1:
    # func must accept two Series and return a Series
    df3[colName] = func(df1[colName], df2[colName])
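
Note that the loop above pairs rows by index label, so with the example frames only row 0 of df1 meets row 0 of df2. If every row of df1 should be compared with every row of df2, as the question describes, one possible sketch is a cross join followed by a column-wise comparison. Here `func` is a hypothetical stand-in (elementwise equality), the `_1`/`_2` suffixes are illustrative, and `how="cross"` assumes pandas 1.2 or later.

```python
import pandas as pd

# The example frames from the question.
df1 = pd.DataFrame({"A": ["foo", "gee"], "B": ["bar", "whiz"],
                    "C": ["foobar", "herp"], "D": [7, 10]})
df2 = pd.DataFrame({"A": ["zoo"], "B": ["car"], "C": ["foobar"], "D": [8]})

def func(a, b):
    # Hypothetical comparison: elementwise equality of two aligned Series.
    return a == b

# Pair every row of df1 with every row of df2; shared column names get suffixes.
pairs = df1.merge(df2, how="cross", suffixes=("_1", "_2"))

# Label each pair with the two A values, e.g. "foo-zoo".
labels = (pairs["A_1"] + "-" + pairs["A_2"]).to_numpy()
df3 = pd.DataFrame(index=pd.Index(labels, name="df1-df2"))

for col in df1.columns:
    # to_numpy() drops the integer index so the assignment is positional.
    df3[col] = func(pairs[col + "_1"], pairs[col + "_2"]).to_numpy()
```

The comparison itself stays vectorized: `func` is called once per column on whole Series, not once per cell.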

Now, you could even apply different functions to different columns by, say, creating lambda functions and then zipping them with the column names:

# some example functions
colAFunc = lambda x, y: x + y
colBFunc = lambda x, y: x - y
# ...
columnFunctions = [colAFunc, colBFunc, ...]

# initialize df3 as above
df3 = pd.DataFrame(columns=df1.columns)
for func, colName in zip(columnFunctions, df1.columns):
    df3[colName] = func(df1[colName], df2[colName])
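
The zip above relies on the list order matching the column order. A variation that makes the pairing explicit is a dict keyed by column name. This is only a sketch: the two frames and both lambdas below are hypothetical, chosen so that the rows line up one to one.

```python
import pandas as pd

# Hypothetical frames with matching index labels, so rows pair up 1:1.
df1 = pd.DataFrame({"A": ["foo", "gee"], "D": [7, 10]})
df2 = pd.DataFrame({"A": ["zoo", "gee"], "D": [8, 10]})

# One comparison per column, keyed by column name (illustrative choices).
column_functions = {
    "A": lambda x, y: x == y,  # are the strings equal?
    "D": lambda x, y: x - y,   # numeric difference
}

df3 = pd.DataFrame(index=df1.index)
for col_name, func in column_functions.items():
    df3[col_name] = func(df1[col_name], df2[col_name])
```

Columns without an entry in the dict are simply skipped, which may or may not be what you want.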

The only "gotcha" that comes to mind is that you need to be sure that your function is applicable to the data in your columns. For instance, if you were to do something like df1['A'] - df2['A'] (with df1, df2 as you have provided), that would raise a TypeError, as the subtraction of two strings is undefined. Just something to be aware of.
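
One way to guard against this, sketched here under the assumption that only numeric columns should be subtracted, is to select the columns by dtype before applying the operation. The name `numeric_cols` is illustrative.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# The example frames from the question.
df1 = pd.DataFrame({"A": ["foo", "gee"], "B": ["bar", "whiz"],
                    "C": ["foobar", "herp"], "D": [7, 10]})
df2 = pd.DataFrame({"A": ["zoo"], "B": ["car"], "C": ["foobar"], "D": [8]})

# Keep only the columns that are numeric in both frames.
numeric_cols = [c for c in df1.columns
                if is_numeric_dtype(df1[c]) and is_numeric_dtype(df2[c])]

# Subtraction aligns on index labels; row 1 has no partner in df2, so it is NaN.
diff = df1[numeric_cols] - df2[numeric_cols]
```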


Edit, re: your comment: That is doable as well. Iterate over the columns of whichever dataframe has more of them, so you don't run into a KeyError, and throw an if statement in there:

# all the other jazz
# let's say df1 is [['A', 'B', 'C']] and df2 is [['A', 'B', 'C', 'D']]
# so iterate over df2 columns
for colName in df2:
    if colName not in df1:
        df3[colName] = np.nan # be sure to import numpy as np
    else:
        df3[colName] = func(df1[colName], df2[colName])  
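
For reference, a self-contained version of the same idea might look like the following. The frames and `func` (elementwise equality) are hypothetical stand-ins, and both frames are given the same number of rows so that index alignment does not add NaN values of its own.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: df2 has an extra column D that df1 lacks.
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})
df2 = pd.DataFrame({"A": [1, 0], "B": [3, 0], "C": [0, 6], "D": [7, 8]})

def func(a, b):
    # Hypothetical comparison: elementwise equality of two aligned Series.
    return a == b

df3 = pd.DataFrame(index=df2.index)
for col_name in df2:
    if col_name not in df1:
        # No counterpart in df1, so there is nothing to compare.
        df3[col_name] = np.nan
    else:
        df3[col_name] = func(df1[col_name], df2[col_name])
```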
