Python Pandas: compare two data-frames along one column and return content of rows of both data frames in another data frame

  1. I am working with two CSV files, imported as the DataFrames df1 and df2.
  2. df1 has 50,000 rows and df2 has 150,000 rows.
  3. I want to compare the 'time' of df2 with that of df1 (iterating through each row), find the difference in time, return the values of all columns for the closest matching row, and save them in df3 (time synchronization).
  4. For example, 35427949712 ('time' in df1) is nearest or equal to 35427949712 ('time' in df2), so I would like to take the corresponding contents of df1 ('velocity_x' and 'yaw') and df2 ('velocity' and 'yawrate') and save them in df3.
  5. For this I used two techniques, shown in the code below.
  6. Code 1 takes a very long time to execute (72 hours), which is not practical since I have a lot of CSV files.
  7. Code 2 gives me a "memory error" and the kernel dies.

It would be great to get a more robust solution to this problem, taking into account computation time, memory, and processing power (Intel Core i7-6700HQ, 8 GB RAM).

Here is the sample data:

import pandas as pd
df1 = pd.DataFrame({'time': [35427889701, 35427909854, 35427929709,35427949712, 35428009860], 
                    'velocity_x':[12.5451, 12.5401,12.5351,12.5401,12.5251],
                   'yaw' : [-0.0787806, -0.0784749, -0.0794889,-0.0795915,-0.0795472]})

df2 = pd.DataFrame({'time': [35427929709, 35427949712, 35427009860,35427029728, 35427049705], 
                    'velocity':[12.6583, 12.6556,12.6556,12.6556,12.6444],
                    'yawrate' : [-0.0750492, -0.0750492, -0.074351,-0.074351,-0.074351]})

df3 = pd.DataFrame(columns=['time','velocity_x','yaw','velocity','yawrate'])

Code 1

for index, row in df1.iterrows():
    min=100000
    for indexer, rows in df2.iterrows():
        if abs(float(row['time'])-float(rows['time']))<min:
            min = abs(float(row['time'])-float(rows['time']))
            #storing the position 
            pos = indexer
    df3.loc[index,'time'] = df1['time'][pos]
    df3.loc[index,'velocity_x'] = df1['velocity_x'][pos]
    df3.loc[index,'yaw'] = df1['yaw'][pos]
    df3.loc[index,'velocity'] = df2['velocity'][pos]
    df3.loc[index,'yawrate'] = df2['yawrate'][pos]

Code 2

df1['key'] = 1
df2['key'] = 1
df1.rename(index=str, columns ={'time' : 'time_x'}, inplace=True)

df = df2.merge(df1, on='key', how ='left').reset_index()
df['diff'] = df.apply(lambda x: abs(x['time']  - x['time_x']), axis=1)
df.sort_values(by=['time', 'diff'], inplace=True)

df=df.groupby(['time']).first().reset_index()[['time', 'velocity_x', 'yaw', 'velocity', 'yawrate']]

You're looking for pandas.merge_asof. It allows you to combine two DataFrames on a key, in this case time, without requiring the keys to match exactly. You can choose a direction for prioritizing the match, but in this case it's clear that you want nearest:

A "nearest" search selects the row in the right DataFrame whose 'on' key is closest in absolute distance to the left's key.

One caveat is that both DataFrames need to be sorted by the key for merge_asof to work.

import pandas as pd

pd.merge_asof(df2.sort_values('time'), df1.sort_values('time'), on='time', direction='nearest')
#          time  velocity   yawrate  velocity_x       yaw
#0  35427009860   12.6556 -0.074351     12.5451 -0.078781
#1  35427029728   12.6556 -0.074351     12.5451 -0.078781
#2  35427049705   12.6444 -0.074351     12.5451 -0.078781
#3  35427929709   12.6583 -0.075049     12.5351 -0.079489
#4  35427949712   12.6556 -0.075049     12.5401 -0.079591
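
The sorting requirement can be checked on a small example. This is only a sketch: the frames `left` and `right` and their columns `a` and `b` are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical toy frames; `left` is deliberately unsorted on 'time'.
left = pd.DataFrame({'time': [30, 10, 20], 'a': [1, 2, 3]})
right = pd.DataFrame({'time': [12, 19, 31], 'b': [0.1, 0.2, 0.3]})

try:
    pd.merge_asof(left, right, on='time', direction='nearest')
    unsorted_ok = True
except ValueError as exc:
    # merge_asof refuses keys that are not sorted
    unsorted_ok = False
    print('unsorted input rejected:', exc)

# Sorting the left frame first makes the same call work.
merged = pd.merge_asof(left.sort_values('time'), right,
                       on='time', direction='nearest')
print(merged)
```

Sorting inside the call, as in the answer's snippet, leaves the original df1 and df2 untouched.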

Just be careful about which DataFrame you choose as the left or right frame, as that changes the result. In this case I'm selecting the time in df1 that is closest in absolute distance to the time in df2.
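
To illustrate how the choice of left frame changes the result, here is a small sketch using the sample data from the question. The names `by_df2` and `by_df1` are mine, chosen only for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'time': [35427889701, 35427909854, 35427929709, 35427949712, 35428009860],
                    'velocity_x': [12.5451, 12.5401, 12.5351, 12.5401, 12.5251],
                    'yaw': [-0.0787806, -0.0784749, -0.0794889, -0.0795915, -0.0795472]})
df2 = pd.DataFrame({'time': [35427929709, 35427949712, 35427009860, 35427029728, 35427049705],
                    'velocity': [12.6583, 12.6556, 12.6556, 12.6556, 12.6444],
                    'yawrate': [-0.0750492, -0.0750492, -0.074351, -0.074351, -0.074351]})

df1_sorted = df1.sort_values('time')
df2_sorted = df2.sort_values('time')

# df2 on the left: one output row per df2 timestamp, with the nearest df1 row attached.
by_df2 = pd.merge_asof(df2_sorted, df1_sorted, on='time', direction='nearest')

# df1 on the left: one output row per df1 timestamp, with the nearest df2 row attached.
by_df1 = pd.merge_asof(df1_sorted, df2_sorted, on='time', direction='nearest')

print(by_df2)
print(by_df1)
```

In both cases the 'time' column of the result holds the left frame's timestamps, so the left frame decides which timestamps the synchronized output is aligned to.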

You also need to be careful if you have duplicated on keys in the right df, because for exact matches merge_asof only merges the last sorted row of the right df to the left df instead of creating multiple entries for each exact match. If that's a problem, you can instead merge the exact keys first to get all of the combinations, and then merge the remainder with asof.
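
One possible way to implement the "exact keys first, then asof for the remainder" idea is sketched below. The toy frames `left` and `right` and every variable name are assumptions made for illustration, and this is not necessarily how the answer's author would write it.

```python
import pandas as pd

# Hypothetical toy data: the key 10 appears twice in the right frame.
left = pd.DataFrame({'time': [10, 20, 30], 'a': ['x', 'y', 'z']})
right = pd.DataFrame({'time': [10, 10, 24, 31], 'b': [1, 2, 3, 4]})

# Plain merge_asof returns exactly one row per left row,
# so only one of the duplicated right rows survives.
asof_only = pd.merge_asof(left, right, on='time', direction='nearest')

# Step 1: an ordinary inner merge keeps every exact-match combination.
exact = left.merge(right, on='time', how='inner')

# Step 2: left rows without an exact match are paired with their nearest right row.
remainder = left[~left['time'].isin(right['time'])]
nearest = pd.merge_asof(remainder, right, on='time', direction='nearest')

# Step 3: stack both parts and restore the time order.
combined = (pd.concat([exact, nearest], ignore_index=True)
              .sort_values('time', kind='mergesort')
              .reset_index(drop=True))
print(combined)
```

Note that a remainder row whose nearest key happens to be a duplicated one would still receive a single match in step 2.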

Just a side note (not an answer):

    min_delta=100000
    for indexer, rows in df2.iterrows():
        if abs(float(row['time'])-float(rows['time']))<min_delta:
            min_delta = abs(float(row['time'])-float(rows['time']))
            #storing the position
            pos = indexer

can be written as

    diff = np.abs(row['time'] - df2['time'])
    pos = np.argmin(diff)

(avoid for loops over DataFrame rows whenever you can)

And don't name your variables after a built-in (min).
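
Put together, the side note's idea might look like the sketch below. The toy data, the `time_df2` column, and the use of `to_numpy()` (so that `np.argmin` returns a plain position) are my own choices. The outer loop still visits every df1 row, so merge_asof is probably the better fit for frames of the size described in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real CSV contents.
df1 = pd.DataFrame({'time': [10, 20, 30], 'velocity_x': [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({'time': [31, 9, 22], 'velocity': [0.3, 0.1, 0.2]})

df2_time = df2['time'].to_numpy()

rows = []
for row in df1.itertuples(index=False):
    # Vectorized inner step: distance from this df1 time to every df2 time.
    min_pos = int(np.argmin(np.abs(df2_time - row.time)))
    rows.append({'time': row.time,
                 'velocity_x': row.velocity_x,
                 'time_df2': int(df2_time[min_pos]),
                 'velocity': df2['velocity'].iloc[min_pos]})

df3 = pd.DataFrame(rows)
print(df3)
```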
