[英]Erroneous merging of two data frames
Hi I have two dataframes嗨,我有两个数据框
df1 df1
>>> print(df1)
Unnamed: 0 Latitude Longitude Altitude
0 0 -89.9 -180.0 0.0
1 1 -89.9 -179.0 0.0
2 2 -89.9 -178.0 0.0
3 3 -89.9 -177.0 0.0
4 4 -89.9 -176.0 0.0
... ... ... ... ...
64795 64795 89.1 175.0 267.0
64796 64796 89.1 176.0 275.0
64797 64797 89.1 177.0 283.0
64798 64798 89.1 178.0 288.0
64799 64799 89.1 179.0 292.0
[64800 rows x 4 columns]
df2 df2
>>> print(df2)
Unnamed: 0 Latitude Longitude Value
0 0 -89.9 -180.0 -7.414481
1 1 -89.9 -179.0 -7.413804
2 2 -89.9 -178.0 -7.413334
3 3 -89.9 -177.0 -7.413073
4 4 -89.9 -176.0 -7.413023
... ... ... ... ...
64795 64795 89.1 175.0 -7.790705
64796 64796 89.1 176.0 -7.783264
64797 64797 89.1 177.0 -7.776208
64798 64798 89.1 178.0 -7.769540
64799 64799 89.1 179.0 -7.763257
[64800 rows x 4 columns]
I want to merge the two dataframes and change the values from df2 as a function of the height from df1.我想合并两个数据框并将 df2 的值更改为 df1 的高度的函数。 The cooridinates are the same for both dfs, so why is there an increase in data
两个dfs的坐标相同,为什么数据会增加
def recomb_data(df1, df2):
dfdiff = pd.merge(df1, df2,
on=['Latitude', 'Longitude'],
how='outer')
dfdiff["a_diffO"] = (dfdiff["Altitude"] *-0.002)
sum_column = (dfdiff["Value"]+ dfdiff["a_diffO"])
dfdiff['ValueHeight'] = sum_column
return (dfdiff)
However, the dataframe has significantly moredata.但是,数据框有更多的数据。
>>> print(dfdiff)
Unnamed: 0_x Latitude Longitude Altitude Unnamed: 0_y Value a_diffO ValueHeight
0 0 -89.9 -180.0 0.0 0 -7.414481 -0.000 -7.414481
1 1 -89.9 -179.0 0.0 1 -7.413804 -0.000 -7.413804
2 2 -89.9 -178.0 0.0 2 -7.413334 -0.000 -7.413334
3 3 -89.9 -177.0 0.0 3 -7.413073 -0.000 -7.413073
4 4 -89.9 -176.0 0.0 4 -7.413023 -0.000 -7.413023
... ... ... ... ... ... ... ... ...
67551 64795 89.1 175.0 267.0 64795 -7.790705 -0.534 -8.324705
67552 64796 89.1 176.0 275.0 64796 -7.783264 -0.550 -8.333264
67553 64797 89.1 177.0 283.0 64797 -7.776208 -0.566 -8.342208
67554 64798 89.1 178.0 288.0 64798 -7.769540 -0.576 -8.345540
67555 64799 89.1 179.0 292.0 64799 -7.763257 -0.584 -8.347257
[67556 rows x 8 columns]
What am I doing wrong?我究竟做错了什么? Am I using the wrong merging method or what could be the cause of the error?
我使用了错误的合并方法还是导致错误的原因是什么?
my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# from Subfunctions.Extract_elevation import dfl
# from Subfunctions.Difference_calculation import dfdiff, dfdiffe
from mpl_toolkits.basemap import Basemap
def load_elev():
el = pd.read_csv(r'Df1')
el.drop_duplicates(subset=['Longitude', 'Latitude'])
return el
def Load_grid():
grd = pd.read_csv(r'Df2')
grd.drop_duplicates(subset=['Longitude', 'Latitude'])
return grd
def recomb_data(el, grd):
dfdiff = pd.merge(el, grd,
on=['Latitude', 'Longitude'],
how='inner')
dfdiff["a_diffO"] = (dfdiff["Altitude"] *-0.002)
sum_column = (dfdiff["Value"]+ dfdiff["a_diffO"])
dfdiff['ValueHeight'] = sum_column
return (dfdiff)
el = load_elev()
grd = Load_grid()
print(el)
print(grd)
(el.sort_values(['Latitude', 'Longitude'])[['Latitude', 'Longitude']].values == grd.sort_values(['Latitude', 'Longitude'])[['Latitude', 'Longitude']].values).all()
dfdiff = recomb_data(el, grd)
print(dfdiff)
If you are sure the latitude and longitude columns are identical between the two datasets, then the only explanation for a larger result after the merge is that some rows contain duplicate coordinates.如果您确定两个数据集之间的纬度和经度列相同,那么合并后较大结果的唯一解释是某些行包含重复的坐标。 Try using drop_duplicates() on both inputs before the merge to see if that changes the result.
在合并之前尝试在两个输入上使用 drop_duplicates() 以查看是否会更改结果。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.