
Speeding up processing 5 million rows of coordinate data

I have a CSV file with two columns (latitude, longitude) that contains over 5 million rows of geolocation data. I need to identify the points which are not within 5 miles of any other point in the list, and output everything back into another CSV that has an extra column ( CloseToAnotherPoint ) which is True if another point is within 5 miles, and False if there isn't.

Here is my current solution using geopy (not making any web calls, just using the function to calculate distance):

from geopy.point import Point
from geopy.distance import vincenty
import csv


class CustomGeoPoint(object):
    def __init__(self, latitude, longitude):
        self.location = Point(latitude, longitude)
        self.close_to_another_point = False


try:
    output = open('output.csv','w')
    writer = csv.writer(output, delimiter = ',', quoting=csv.QUOTE_ALL)
    writer.writerow(['Latitude', 'Longitude', 'CloseToAnotherPoint'])

    # 5 miles
    close_limit = 5
    geo_points = []

    with open('geo_input.csv', newline='') as geo_csv:
        reader = csv.reader(geo_csv)
        next(reader, None) # skip the headers
        for row in reader:
            geo_points.append(CustomGeoPoint(row[0], row[1]))

    # for every point, look at every point until one is found within 5 miles
    for geo_point in geo_points:
        for geo_point2 in geo_points:
            dist = vincenty(geo_point.location, geo_point2.location).miles
            if 0 < dist <= close_limit: # (0,close_limit]
                geo_point.close_to_another_point = True
                break
        writer.writerow([geo_point.location.latitude, geo_point.location.longitude,
                         geo_point.close_to_another_point])

finally:
    output.close()

As you might be able to tell from looking at it, this solution is extremely slow. So slow in fact that I let it run for 3 days and it still didn't finish!

I've thought about trying to split the data into chunks (multiple CSV files or something) so that the inner loop doesn't have to look at every other point, but then I would have to figure out how to check the borders of each section against the borders of its adjacent sections, and that just seems overly complex; I'm afraid it would be more of a headache than it's worth.

So, any pointers on how to make this faster?

Let's look at what you're doing.

  1. You read all the points into a list named geo_points .

    Now, can you tell me whether the list is sorted? Because if it was sorted, we definitely want to know that. Sorting is valuable information, especially when you're dealing with 5 million of anything.

  2. You loop over all the geo_points . That's 5 million, according to you.

  3. Within the outer loop, you loop again over all 5 million geo_points .

  4. You compute the distance in miles between the two loop items.

  5. If the distance is less than your threshold, you record that information on the first point, and stop the inner loop.

  6. When the inner loop stops, you write information about the outer loop item to a CSV file.

Notice a couple of things. First, you're looping 5 million times in the outer loop. And then you're looping 5 million times in the inner loop.

This is what O(n²) means.

The next time you see someone talking about "Oh, this is O(log n) but that other thing is O(n log n)," remember this experience - you're running an n² algorithm where n in this case is 5,000,000. Sucks, dunnit?

Anyway, you have some problems.

Problem 1: You'll eventually wind up comparing every point against itself, which has a distance of zero, meaning every point will be marked as within whatever distance threshold. If your program ever finishes, all the cells will be marked True.

Problem 2: When you compare point #1 with, say, point #12345, and they are within the threshold distance of each other, you record that information about point #1. But you don't record the same information about the other point. You know that point #12345 (geo_point2) is reflexively within the threshold of point #1, but you don't write that down. So you're missing a chance to skip over 5 million comparisons.

Problem 3: If you compare point #1 and point #2, and they are not within the threshold distance, what happens when you compare point #2 with point #1? Your inner loop starts from the beginning of the list every time, but you know that you have already compared the start of the list with the end of the list. You can reduce your problem space by half just by making your outer loop go i in range(0, 5million) and your inner loop go j in range(i+1, 5million) .
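As a rough sketch (not the question's code), the halved loop with symmetric marking from Problems 2 and 3 might look like this, with is_close standing in for whatever distance test you use:

```python
def flag_close_pairs(points, is_close):
    """Mark every point that has at least one close neighbor.

    The inner loop starts at i + 1, so each unordered pair is examined
    exactly once, and a hit is recorded on *both* points of the pair.
    """
    n = len(points)
    close = [False] * n
    for i in range(n):
        for j in range(i + 1, n):
            if is_close(points[i], points[j]):
                close[i] = True
                close[j] = True
    return close
```

With plain numbers and a tolerance of 2, flag_close_pairs([0, 1, 10], lambda a, b: abs(a - b) <= 2) marks the first two points and not the third. It is still O(n²) in the worst case, but it does half the distance calculations.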

Answers?

Consider your latitude and longitude on a flat plane. You want to know if there's a point within 5 miles. Let's think about a 10-mile square, centered on your point #1. That's a square centered on (X1, Y1), with a top left corner at (X1 - 5 miles, Y1 + 5 miles) and a bottom right corner at (X1 + 5 miles, Y1 - 5 miles). Now, if a point is within that square, it might not be within 5 miles of your point #1. But you can bet that if it's outside that square, it's more than 5 miles away.

As @SeverinPappadeaux points out, distance on a spheroid like Earth is not quite the same as distance on a flat plane. But so what? Set your square a little bigger to allow for the difference, and proceed!
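A cheap prefilter along those lines might look like the sketch below; the 69-miles-per-degree figure and the 5% slack factor are my own assumptions, not anything from the question:

```python
from math import cos, radians

MILES_PER_DEG_LAT = 69.0  # roughly constant everywhere on Earth (assumed)

def might_be_close(lat1, lon1, lat2, lon2, limit_miles=5.0, slack=1.05):
    """Cheap square test: False means 'definitely more than limit_miles apart'.

    True only means the pair survives to the expensive distance check.
    The square is padded by `slack` to cover the flat-plane approximation.
    """
    band = slack * limit_miles / MILES_PER_DEG_LAT  # degrees of latitude
    if abs(lat1 - lat2) > band:
        return False
    # Degrees of longitude shrink by cos(latitude) away from the equator,
    # so the longitude window has to widen accordingly.
    lon_band = band / max(cos(radians(lat1)), 1e-9)
    return abs(lon1 - lon2) <= lon_band
```

Only the pairs that pass this test need the expensive vincenty call.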

Sorted List

This is why sorting is important. If all the points were sorted by X, then Y (or Y, then X - whatever) and you knew it, you could really speed things up. Because you could simply stop scanning when the X (or Y) coordinate got too big, and you wouldn't have to go through 5 million points.

How would that work? Same way as before, except your inner loop would have some checks like this:

five_miles = ... # Whatever math, plus an error allowance!
list_len = len(geo_points) # Don't call this 5 million times

for i, pi in enumerate(geo_points):

    if pi.close_to_another_point:
        continue   # Remember if close to an earlier point

    pi0max = pi[0] + five_miles
    pi1min = pi[1] - five_miles
    pi1max = pi[1] + five_miles

    for j in range(i+1, list_len):
        pj = geo_points[j]
        # Assumes geo_points is sorted on [0] then [1]
        if pj[0] > pi0max:
            # Can't possibly be close enough, nor any later points
            break
        if pj[1] < pi1min or pj[1] > pi1max:
            # Can't be close enough, but a later point might be
            continue

        # Now do "real" comparison using accurate functions.
        if ...:
            pi.close_to_another_point = True
            pj.close_to_another_point = True
            break

What am I doing there? First, I'm getting some numbers into local variables. Then I'm using enumerate to give me an i value and a reference to the outer point (what you called geo_point ). Then, I'm quickly checking to see if we already know that this point is close to another one.

If not, we'll have to scan. So I'm only scanning "later" points in the list, because I know the outer loop scans the early ones, and I definitely don't want to compare a point against itself. I'm using a few temporary variables to cache the results of computations involving the outer loop. Within the inner loop, I do some stupid comparisons against the temporaries. They can't tell me if the two points are close to each other, but I can check if they're definitely not close and skip ahead.

Finally, if the simple checks pass, then go ahead and do the expensive checks. If a check actually passes, be sure to record the result on both points, so we can skip doing the second point later.

Unsorted List

But what if the list is not sorted?

@RootTwo points you at a kD tree (where D is for "dimensional" and k in this case is "2"). The idea is really simple, if you already know about binary search trees: you cycle through the dimensions, comparing X at even levels in the tree and comparing Y at odd levels (or vice versa). The idea would be this:

def insert_node(node, treenode, depth=0):
    dimension = depth % 2  # even/odd -> lat/long
    dn = node.coord[dimension]
    dt = treenode.coord[dimension]

    if dn < dt:
        # go left
        if treenode.left is None:
            treenode.left = node
        else:
            insert_node(node, treenode.left, depth+1)
    else:
        # go right
        if treenode.right is None:
            treenode.right = node
        else:
            insert_node(node, treenode.right, depth+1)

What would this do? This would get you a searchable tree where points could be inserted in O(log n) time. That means O(n log n) for the whole list, which is way better than n squared! (The log base 2 of 5 million is basically 23. So n log n is 5 million times 23, compared with 5 million times 5 million!)

It also means you can do a targeted search. Since the tree is ordered, it's fairly straightforward to look for "close" points (the Wikipedia link from @RootTwo provides an algorithm).

Advice

My advice is to just write code to sort the list, if needed. It's easier to write, easier to check by hand, and it's a separate pass you will only need to make one time.

Once you have the list sorted, try the approach I showed above. It's close to what you were doing, and it should be easy for you to understand and code.

As the answer to Python calculate lots of distances quickly points out, this is a classic use case for kD trees.
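For illustration, here is one way a kD-tree pass over this problem could look using scipy.spatial.cKDTree. Converting each (lat, long) to 3-D Cartesian coordinates sidesteps the wraparound at ±180° longitude; the Earth-radius constant and chord-length conversion are my own additions, not part of the question:

```python
import numpy as np
from scipy.spatial import cKDTree

R = 3958.8  # mean Earth radius in miles (assumed)

def close_to_another(lats, longs, limit_miles=5.0):
    """Boolean array: True where a point has a neighbor within
    limit_miles (great-circle) of it."""
    lat = np.radians(np.asarray(lats, dtype=float))
    lon = np.radians(np.asarray(longs, dtype=float))
    # Points on a sphere of radius R; Euclidean distance = chord length.
    xyz = np.column_stack((R * np.cos(lat) * np.cos(lon),
                           R * np.cos(lat) * np.sin(lon),
                           R * np.sin(lat)))
    # An arc of limit_miles corresponds to this straight-line chord.
    chord = 2.0 * R * np.sin(limit_miles / (2.0 * R))
    tree = cKDTree(xyz)
    close = np.zeros(len(xyz), dtype=bool)
    for i, j in tree.query_pairs(chord):
        close[i] = close[j] = True
    return close
```

Two points 0.01° of latitude apart (about 0.7 miles) flag each other; an isolated point stays False. query_pairs visits each close pair once, so this also bakes in the halved-loop and symmetric-marking ideas from above.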

An alternative is to use a sweep line algorithm, as shown in the answer to How do I match similar coordinates using Python?

Here's the sweep line algorithm adapted for your question. On my laptop, it takes < 5 minutes to run through 5M random points.

import operator as op
import sortedcontainers     # handy library on Pypi
import time

from collections import namedtuple
from math import cos, radians
from random import uniform

Point = namedtuple("Point", "lat long has_close_neighbor")

miles_per_degree = 69

number_of_points = 5000000
data = [Point(uniform( -88.0,  88.0),     # lat
              uniform(-180.0, 180.0),     # long
              True
             )
        for _ in range(number_of_points)
       ]

start = time.time()
# Note: lat is first in Point, so data is sorted by .lat then .long.
data.sort()

print(time.time() - start)

# Parameter that determines the size of a sliding latitude window
# and therefore how close two points need to be to get flagged.
threshold = 5.0  # miles
lat_span = threshold / miles_per_degree
coarse_threshold = (.98 * lat_span)**2   # in (degrees of latitude)**2

# Sliding latitude window.  Within the window, observations are
# ordered by longitude.
window = sortedcontainers.SortedListWithKey(key=op.attrgetter('long'))

# lag_pt is the 'southernmost' point within the sliding window.
point = iter(data)
lag_pt = next(point)

milepost = len(data)//10

# lead_pt is the 'northernmost' point in the sliding window.
for i, lead_pt in enumerate(data):
    if i == milepost:
        print('.', end=' ')
        milepost += len(data)//10

    # Latitude of lead_pt represents the leading edge of the window.
    window.add(lead_pt)

    # Remove observations further than the trailing edge of window.
    while lead_pt.lat - lag_pt.lat > lat_span:
        window.discard(lag_pt)
        lag_pt = next(point)

    # Calculate 'east-west' width of the window at the latitude of lead_pt
    long_span = lat_span / cos(radians(lead_pt.lat))
    east_long = lead_pt.long + long_span
    west_long = lead_pt.long - long_span

    # Check all observations in the sliding window within
    # long_span of lead_pt.
    for other_pt in window.irange_key(west_long, east_long):

        if other_pt != lead_pt:
            # lead_pt is at the top center of a box 2 * long_span wide by
            # 1 * lat_span tall.  other_pt is in that box. If desired,
            # put additional fine-grained 'closeness' tests here.

            # Coarse check (in degrees of latitude): if a point is within
            # ~98% of the threshold distance, we don't need to check the
            # distance to any more neighbors.
            average_lat = (other_pt.lat + lead_pt.lat) / 2
            delta_lat   = other_pt.lat - lead_pt.lat
            delta_long  = (other_pt.long - lead_pt.long) * cos(radians(average_lat))

            if delta_lat**2 + delta_long**2 <= coarse_threshold:
                break

            # put vincenty test here
            #if 0 < vincenty(lead_pt, other_pt).miles <= close_limit:
            #    break

    else:
        data[i] = data[i]._replace(has_close_neighbor=False)

print()      
print(time.time() - start)

If you sort the list by latitude (n log(n)), and the points are roughly evenly distributed, it will bring it down to about 1000 candidate points within 5 miles of latitude for each point (napkin math, not exact). By only looking at the points that are near in latitude, the runtime goes from n² to n·log(n) + 0.0004·n². Hopefully this speeds it up enough.

I would give pandas a try. Pandas is made for efficient handling of large amounts of data. That may help with the efficiency of the csv portion anyhow. But from the sounds of it, you've got yourself an inherently inefficient problem to solve. You take point 1 and compare it against 4,999,999 other points. Then you take point 2 and compare it with 4,999,998 other points and so on. Do the math. That's 12.5 trillion comparisons you're doing. If you can do 1,000,000 comparisons per second, that's 144 days of computation. If you can do 10,000,000 comparisons per second, that's 14 days. For plain additions in straight Python, 10,000,000 operations can take something like 1.1 seconds, but I doubt your comparisons are as fast as an add operation. So give it at least a fortnight or two.
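The napkin math behind those figures checks out:

```python
n = 5_000_000
pairs = n * (n - 1) // 2    # each unordered pair of distinct points, once
# pairs == 12_499_997_500_000, i.e. ~12.5 trillion comparisons

seconds = pairs / 1_000_000     # at 1,000,000 comparisons per second
days = seconds / 86_400         # ~144.7 days
```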

Alternately, you could come up with an alternate algorithm, though I don't have any particular one in mind.

I would redo the algorithm in three steps:

  1. Use great-circle distance, and assume a 1% error, so make the limit equal to 1.01 * limit.

  2. Code the great-circle distance as an inlined function; this test should be fast.

  3. You'll get some false positives, which you could further test with vincenty.
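Step 2 might look like the following sketch - a plain-Python haversine with no geopy object overhead; the Earth-radius constant is my own assumption:

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean radius (assumed)

def great_circle_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance via the haversine formula, in miles."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2.0 * EARTH_RADIUS_MILES * asin(sqrt(a))
```

One degree of longitude at the equator comes out to about 69.1 miles, as expected.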

This is just a first pass, but I've sped it up by half so far by using great_circle() instead of vincenty() , and cleaning up a couple of other things. The difference is explained here, and the loss in accuracy is about 0.17%:

from geopy.point import Point
from geopy.distance import great_circle
import csv


class CustomGeoPoint(Point):
    def __init__(self, latitude, longitude):
        super(CustomGeoPoint, self).__init__(latitude, longitude)
        self.close_to_another_point = False


def isCloseToAnother(pointA, points):
    for pointB in points:
        dist = great_circle(pointA, pointB).miles
        if 0 < dist <= CLOSE_LIMIT:  # (0, close_limit]
            return True

    return False


CLOSE_LIMIT = 5  # miles

with open('geo_input.csv', 'r') as geo_csv:
    reader = csv.reader(geo_csv)
    next(reader, None)  # skip the headers

    geo_points = sorted(map(lambda x: CustomGeoPoint(x[0], x[1]), reader))

    with open('output.csv', 'w') as output:
        writer = csv.writer(output, delimiter=',', quoting=csv.QUOTE_ALL)
        writer.writerow(['Latitude', 'Longitude', 'CloseToAnotherPoint'])

        # for every point, look at every point until one is found within 5 miles
        for point in geo_points:
            point.close_to_another_point = isCloseToAnother(point, geo_points)
            writer.writerow([point.latitude, point.longitude,
                             point.close_to_another_point])

I'm going to improve this further.

Before:

$ time python geo.py

real    0m5.765s
user    0m5.675s
sys     0m0.048s

After:

$ time python geo.py

real    0m2.816s
user    0m2.716s
sys     0m0.041s

A better solution, building on Oscar Smith's idea: you have a CSV file, so just sort it in Excel (it is very efficient). Then use binary search in your program to find the cities within 5 miles (you can make a small change to the binary search method so it breaks as soon as it finds one city satisfying your condition). Another improvement is to set up a map to remember pairs of cities when you find one city is within range of another. For example, when you find city A is within 5 miles of city B, use the map to store the pair (B is the key and A is the value). So the next time you meet B, search for it in the map first; if it has a corresponding value, you do not need to check it again. But this may use more memory, so be careful about it. Hope it helps you.
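A minimal sketch of the binary-search step described above, using the standard library's bisect module (the 69-miles-per-degree figure is my assumption): with the points sorted by latitude, each lookup narrows to the slice whose latitude lies within the 5-mile band.

```python
import bisect

MILES_PER_DEG_LAT = 69.0  # assumed conversion factor

def latitude_window(sorted_lats, lat, limit_miles=5.0):
    """Indices of candidates whose latitude is within limit_miles of lat.

    sorted_lats must be sorted ascending; only the returned candidates
    need the full (expensive) distance check.
    """
    band = limit_miles / MILES_PER_DEG_LAT  # degrees of latitude
    lo = bisect.bisect_left(sorted_lats, lat - band)
    hi = bisect.bisect_right(sorted_lats, lat + band)
    return range(lo, hi)
```

Each call costs O(log n) plus the (usually small) window it returns, instead of a scan over all 5 million rows.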

This problem can be solved with a VP tree . These allow querying data with distances that are a metric obeying the triangle inequality.

The big advantage of VP trees over a kD tree is that they can be blindly applied to geographic data anywhere in the world without having to worry about projecting it to a suitable 2D space. In addition, a true geodesic distance can be used (no need to worry about the differences between geodesic distances and distances in the projection).
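To make the idea concrete, here is a toy VP tree with a range query. It uses plain Euclidean distance as the metric and a naive vantage-point choice; a real implementation would plug in a geodesic distance function (as this answer does with GeographicLib) and pick vantage points more carefully:

```python
from math import dist  # Euclidean metric, as a stand-in for geodesic distance

class VPNode:
    def __init__(self, point, radius=0.0, inside=None, outside=None):
        self.point = point      # the vantage point
        self.radius = radius    # median distance from point to the rest
        self.inside = inside    # subtree with dist(point, p) <  radius
        self.outside = outside  # subtree with dist(point, p) >= radius

def build(points):
    if not points:
        return None
    vp, rest = points[0], points[1:]
    if not rest:
        return VPNode(vp)
    dists = [dist(vp, p) for p in rest]
    mu = sorted(dists)[len(dists) // 2]  # median distance splits the rest
    inside = [p for p, d in zip(rest, dists) if d < mu]
    outside = [p for p, d in zip(rest, dists) if d >= mu]
    return VPNode(vp, mu, build(inside), build(outside))

def range_query(node, q, r, out):
    """Append to out every stored point within distance r of q."""
    if node is None:
        return
    d = dist(q, node.point)
    if d <= r:
        out.append(node.point)
    # The triangle inequality prunes whole subtrees:
    if d - r < node.radius:     # a hit could lie inside the ball
        range_query(node.inside, q, r, out)
    if d + r >= node.radius:    # a hit could lie outside the ball
        range_query(node.outside, q, r, out)
```

For the 5-mile question, you would query each point with r = 5 miles and mark it True when anything other than the point itself comes back.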

Here's my test: generate 5 million points randomly and uniformly on the globe. Put these into a VP tree.

Looping over all the points, query the VP tree to find any neighbor a distance in (0km, 10km] away. (0km is not included in this set to avoid the query point being found.) Count the number of points with no such neighbor (which is 229573 in my case).

Cost of setting up the VP tree = 5000000 * 20 distance calculations.

Cost of the queries = 5000000 * 23 distance calculations.

Time for setup and queries is 5m 7s.

I am using C++ with GeographicLib for calculating distances, but the algorithm can of course be implemented in any language, and here's the python version of GeographicLib .

ADDENDUM: The C++ code implementing this approach is given here .
