Compare two csv files on given columns and build a third from specific columns of the matching rows

Question

one.csv:

12.23496740, -11.95760385, 3, 5, 11.1, 4
12.58295928, -11.39857395, 4, 7, 12.3, 6
12.42572572, -11.09478502, 2, 5, 12.3, 8
12.58300286, -11.95762569, 5, 11, 3.4, 7

two.csv:

12.5830, -11.3986, .2, 4
12.4257, -11.0948, .7, 3

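I want to match the two csv files on columns 0 and 1, and end up with an output csv file that contains the corresponding values of column 4 from one.csv and column 2 from two.csv, like this:

three.csv: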

12.5830, -11.3986, 12.3, .2
12.4257, -11.0948, 12.3, .7

Answer 1

I would read both csv files into lists of lists, so that you have csv1 and csv2. Then loop over all of them and do something like this:

for e1 in csv1:
    for e2 in csv2:
        distance = d(e1[0], e1[1], e2[0], e2[1])  # using a function call to your distance formula

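To save the results you can use a dictionary, which makes it easy to write them out later. So, when saving a new entry (column 4 of one.csv and column 2 of two.csv, as the question asks):

output_dict[(e1[0], e1[1])] = [e1[4], e2[2]]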
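
The approach above could be sketched roughly as follows. Several things here are assumptions that are not in the answer: d is taken to be plain Euclidean distance, a pair counts as a match when the distance is below a hypothetical tolerance tol, the dictionary is keyed on the coordinates from two.csv so that it mirrors the desired output, and the file contents are inlined as strings to keep the example self-contained. The names read_rows and match_rows are illustrative only.

```python
import csv
import io
import math

def read_rows(handle):
    """Parse a CSV stream into a list of lists of floats."""
    return [[float(v) for v in row] for row in csv.reader(handle) if row]

def d(x1, y1, x2, y2):
    """Euclidean distance (an assumption; use whatever formula you need)."""
    return math.hypot(x1 - x2, y1 - y2)

def match_rows(csv1, csv2, tol=0.001):
    """Pair up rows whose first two columns lie within tol of each other."""
    output_dict = {}
    for e1 in csv1:
        for e2 in csv2:
            if d(e1[0], e1[1], e2[0], e2[1]) < tol:
                # key: coordinates from two.csv
                # value: column 4 of one.csv, column 2 of two.csv
                output_dict[(e2[0], e2[1])] = [e1[4], e2[2]]
    return output_dict

# Inlined sample data from the question (stands in for the real files).
one_text = """12.23496740, -11.95760385, 3, 5, 11.1, 4
12.58295928, -11.39857395, 4, 7, 12.3, 6
12.42572572, -11.09478502, 2, 5, 12.3, 8
12.58300286, -11.95762569, 5, 11, 3.4, 7
"""
two_text = """12.5830, -11.3986, .2, 4
12.4257, -11.0948, .7, 3
"""

csv1 = read_rows(io.StringIO(one_text))
csv2 = read_rows(io.StringIO(two_text))
result = match_rows(csv1, csv2)
for (x, y), (v1, v2) in result.items():
    print(x, y, v1, v2)
```

With real files, io.StringIO(...) would be replaced by open('one.csv', newline='') and open('two.csv', newline='').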

Answer 2

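I'm not sure exactly where the problem lies. If you want an algorithm that computes the distance between two sets of coordinates, feel free to use the following code: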

from math import radians, cos, sin, asin, sqrt

def haversine(lat1, lng1, lat2, lng2, metric=False):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    earths_radius_km = 6378.1  # equatorial radius; the mean radius is about 6371 km
    # convert decimal degrees to radians 
    lat1, lng1, lat2, lng2 = map(radians, [lat1, lng1, lat2, lng2])
    # haversine formula 
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlng/2)**2
    c = 2 * asin(sqrt(a)) 
    km = earths_radius_km * c
    if not metric:
        km_to_miles = 0.621371192
        dist = km * km_to_miles
        units = 'miles'
    else:
        dist = km
        units = 'km'
    return dist, units

if __name__ == '__main__':
    print('Please call from within another script')
    # example...
    lat1, lng1, lat2, lng2 = 51.0820266, 1.1834209, 52.4931226, -2.1786751
    print('e.g. distance in km is:', haversine(lat1, lng1, lat2, lng2, True))
    print('e.g. distance in miles is:', haversine(lat1, lng1, lat2, lng2))

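If I understand correctly, you want to loop over the coordinates in one file and find the closest match in the other? If so, for each value in the first set, initialize min_distance to an arbitrarily high value such as 1000000. Then loop over the second set of coordinates, calling the formula above (or whichever distance function you prefer), and reset min_distance to the result whenever the result is smaller than the current min_distance. At the same time, store the extra values you need from the second list in a temp variable, so that they are overwritten each time a smaller distance is found. Once all iterations of the inner loop are done, store the data you need in a list before starting the next iteration of the outer loop.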
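
The loop described above might look something like this. It is only a sketch under a few assumptions: the rows are already parsed into tuples of floats, the default distance is plain Euclidean (any function with the same four-argument shape, such as a wrapper around haversine, could be passed instead), and float("inf") stands in for the "arbitrarily high" starting value. The names nearest_matches, first and second are illustrative.

```python
import math

def euclidean(x1, y1, x2, y2):
    """Straight-line distance between two (x, y) points."""
    return math.hypot(x2 - x1, y2 - y1)

def nearest_matches(first, second, distance=euclidean):
    """For each row in first, find the closest row in second."""
    results = []
    for a in first:
        min_distance = float("inf")  # reset for every outer iteration
        best = None                  # temp variable, overwritten on improvement
        for b in second:
            dist = distance(a[0], a[1], b[0], b[1])
            if dist < min_distance:
                min_distance = dist
                best = b
        # inner loop finished: keep what we need before moving on
        results.append((a, best, min_distance))
    return results

# (x, y, value) rows taken from the question's sample data
first = [(12.5830, -11.3986, 0.2), (12.4257, -11.0948, 0.7)]
second = [(12.23496740, -11.95760385, 11.1),
          (12.58295928, -11.39857395, 12.3),
          (12.42572572, -11.09478502, 12.3),
          (12.58300286, -11.95762569, 3.4)]

pairs = nearest_matches(first, second)
for a, b, dist in pairs:
    print(a[0], a[1], b[2], a[2])
```

Note that this always returns the nearest row, however far away it is; if unmatched rows should be dropped, a maximum distance check would have to be added.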

Answer 3

I don't think this is a great answer, but here is a workaround for your problem:

import sys
import math

def dist(point1, point2):
  return math.sqrt((point1[0]-point2[0])**2 + (point1[1]-point2[1])**2)

one = []
two = []

with open('one.csv', 'r') as f:
    for line in f.readlines():
        x, y, _, _, _4, _ = line.split(',')
        one.append((float(x), float(y), float(_4)))

with open('two.csv', 'r') as f:
    for line in f.readlines():
        x, y, _2, _ = line.split(',')
        two.append((float(x), float(y), float(_2)))

with open('three.csv', 'w') as f:
    for point in two:
        nearest = None
        distance = sys.float_info.max
        for point2 in one:
            d = dist(point2, point)
            if d < distance:
                distance = d
                nearest = point2
        f.write("%f, %f, %f, %f\n" % (point[0], point[1], nearest[2], point[2]))

This writes the following output to three.csv:

12.583000, -11.398600, 12.300000, 0.200000
12.425700, -11.094800, 12.300000, 0.700000

If you need different formatting, just change the last line of the snippet.
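
As one possible way to adjust that formatting, the sketch below keeps four decimal places for the coordinates and uses a compact representation for the two values. The helper name format_row and the places parameter are made up for illustration. The output is close to what the question shows, except that 0.2 keeps its leading zero.

```python
def format_row(x, y, value_one, value_two, places=4):
    """Format one output line: fixed decimals for coordinates, compact values."""
    return f"{x:.{places}f}, {y:.{places}f}, {value_one:g}, {value_two:g}"

line = format_row(12.583, -11.3986, 12.3, 0.2)
print(line)
```

In the snippet above, the f.write call would then become f.write(format_row(point[0], point[1], nearest[2], point[2]) + "\n").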

Answer 4

This can be solved with numpy:

import numpy as np

def compare_files( f1name, f2name, f3name, ctc1, ctc2, columns, TOL=0.001 ):
    f1 = np.loadtxt( f1name, delimiter=',' )
    f2 = np.loadtxt( f2name, delimiter=',' )
    # check[a, b] is True when row a of f1 and row b of f2 agree within TOL
    # on every compared column pair
    check = np.logical_and.reduce( [np.absolute(np.subtract.outer(f1[:,i], f2[:,j])) < TOL for i,j in zip(ctc1,ctc2)] )
    # row indices of the matching pairs: first half for f1, second half for f2
    newshape = (2,f1.shape[0],f2.shape[0])
    ind = np.indices(check.shape)[np.vstack((check,check)).reshape(newshape)]
    ind1 = ind[:len(ind)//2]
    ind2 = ind[len(ind)//2:]

    # look the arrays up by name instead of calling eval()
    files = {'f1': (f1, ind1), 'f2': (f2, ind2)}
    new = np.concatenate( [files[f][0][files[f][1], c][:,None] for f,c in columns], axis=1 )
    np.savetxt(f3name, new, delimiter=',', fmt='%f')

The function is general and can be used for the case described in your question like this:

f1name = 'one.csv'
f2name = 'two.csv'
f3name = 'three.csv'
ctc1 = [0,1] # columns to compare from file 1
#       ^ ^
#       | | # these arrows just emphasize which column is compared with which
#       v v
ctc2 = [0,1] # columns to compare from file 2
columns = [['f2',0], # file 2 column 0
           ['f2',1], # file 2 column 1
           ['f1',4], # file 1 column 4
           ['f2',2]] # file 2 column 2
TOL = 0.001
compare_files( f1name, f2name, f3name, ctc1, ctc2, columns, TOL )

ctc1 and ctc2 tell the function which columns to compare (ctc stands for "columns to compare"). columns tells it how to build the new file. In this example the new file is built from column 0 of f2, then column 1 of f2, then column 4 of f1, then column 2 of f2.
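
To make the comparison step more concrete, here is a small sketch of how a tolerance mask can be built with np.subtract.outer, using only the first two columns of the sample data. It assumes the same TOL of 0.001. It uses np.nonzero to read off the matching row pairs, which is a simpler stand-in for the np.indices construction in the function above, not the answer's own code.

```python
import numpy as np

# first two columns of one.csv and two.csv from the question
f1 = np.array([[12.23496740, -11.95760385],
               [12.58295928, -11.39857395],
               [12.42572572, -11.09478502],
               [12.58300286, -11.95762569]])
f2 = np.array([[12.5830, -11.3986],
               [12.4257, -11.0948]])
TOL = 0.001

# entry [a, b] compares row a of f1 with row b of f2, one column at a time
close_col0 = np.absolute(np.subtract.outer(f1[:, 0], f2[:, 0])) < TOL
close_col1 = np.absolute(np.subtract.outer(f1[:, 1], f2[:, 1])) < TOL

# a pair of rows matches only if every compared column is within TOL
check = np.logical_and(close_col0, close_col1)

# row numbers of the matching pairs in f1 and f2
rows1, rows2 = np.nonzero(check)
print(rows1, rows2)
```

The last row of f1 agrees with the first row of f2 on column 0 but not on column 1, so it is dropped once the two masks are combined.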

Tested with one.csv:

12.23496740, -11.95760385, 3, 5, 11.1, 4
12.58295928, -11.39857395, 4, 7, 12.3, 6
12.42572572, -11.09478502, 2, 5, 12.3, 8
12.58300286, -11.95762569, 5, 11, 3.4, 7

and two.csv:

12.43, -11.0948, .7, 3
12.43, -11.0948, .7, 3
12.4257, -11.0948, .7, 3
12.43, -11.0948, .7, 3
12.5830, -11.3986, .2, 4

gives three.csv:

12.583000,-11.398600,12.300000,0.200000
12.425700,-11.094800,12.300000,0.700000

4 answers

Answer 1
0 2013-06-25 17:14:34

Answer 2
0 2013-06-25 17:15:50

Answer 3
0 2013-06-25 17:21:13

Answer 4
0 2013-06-25 17:38:50
