
What is the most efficient way to repeatedly search a large text file (800 MB) for certain numbers?

The large file is 12 million lines of text such as this:

81.70,  89.86,  717.985
81.74,  89.86,  717.995
81.78,  89.86,  718.004
81.82,  89.86,  718.014
81.86,  89.86,  718.024
81.90,  89.86,  718.034

This is latitude, longitude, and distance from the nearest coastline (respectively).

My code takes the coordinates of known places (for example, Mexico City: -99.1, 19.4) and searches the large file, line by line, to output that coordinate's distance from the nearest coastline.

I append each matching line to a list because many lines meet the long/lat criteria; I later average their distances from the coastline.

Each coordinate takes about 12 seconds to retrieve. My entire script takes 14 minutes to complete.

Here's what I have been using:

long = "-99.1"
lat = "19.4"
country_d2s = []

# collect every line that contains the specified long and lat values
with open(r"C:\Users\jason\OneDrive\Desktop\s1186prXbF0O", 'r') as dist2sea:
    for line in dist2sea:
        if long in line and lat in line and line.startswith(long):
            country_d2s.append(line)

I am looking for a way to search through the file much more quickly and/or to rewrite the file in a format that is easier to work with.

Use a database with a key composed of the latitude and longitude. If you're looking for a lightweight DB that can be shared as a file, there's SqliteDict or bsddb3. That would be much faster than reading the text file each time the program is run.
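A minimal sketch of the SqliteDict variant, assuming string keys of the form "long,lat" and placeholder filenames (neither comes from the question):

from sqlitedict import SqliteDict   # pip install sqlitedict

# One-time build: store each distance under a "long,lat" string key.
with SqliteDict("coast.sqlite", autocommit=False) as db:
    with open("d2s.txt") as dist2sea:               # placeholder filename
        for line in dist2sea:
            lat, lon, dist = (s.strip() for s in line.split(","))
            db[f"{lon},{lat}"] = float(dist)
    db.commit()

# Every later run: a key lookup instead of a 12-million-line scan.
with SqliteDict("coast.sqlite") as db:
    print(db["-99.10,19.40"])       # assumes a key exists at exactly this precision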

Import your data into a SQLite database, then create an index on (latitude, longitude). An index lookup should take milliseconds. To read the data, use Python's sqlite3 module.
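A hedged sketch of that approach; the file, table, and column names are placeholders, and the final query averages over a small window around the target coordinate (the window size is an assumption):

import sqlite3

conn = sqlite3.connect("coast.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS d2s (lat REAL, lon REAL, dist REAL)")

# One-time import of the 12-million-line text file.
with open("d2s.txt") as f:                                   # placeholder filename
    rows = (tuple(map(float, line.split(","))) for line in f)
    cur.executemany("INSERT INTO d2s VALUES (?, ?, ?)", rows)

# The index is what turns the full scan into a fast lookup.
cur.execute("CREATE INDEX IF NOT EXISTS idx_latlon ON d2s (lat, lon)")
conn.commit()

# Average coastline distance for the grid cells around Mexico City.
cur.execute(
    "SELECT AVG(dist) FROM d2s WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?",
    (19.40, 19.50, -99.20, -99.10),
)
print(cur.fetchone()[0])
conn.close()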

Comments:

  • It's unclear whether you are relying on your long/lat being given as XX.Y while the file stores XX.YY, i.e. using the prefix match as a kind of fuzzy-matching technique.
  • I also cannot tell how you plan to execute this: load + [run] x 1000 versus [load + run] x 1000, which would inform which solution you want to use.

That being said, if you want to do very fast exact lookups, one option is to load the entire thing into memory as a mapping, e.g. {(long, lat): coast_distance, ...}. Since floats are not good dictionary keys, it would be better to use strings, integers, or fractions for this.
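For example, a minimal sketch of that mapping keyed on the raw coordinate strings (the filename is a placeholder):

coast = {}
with open("d2s.txt") as f:                       # placeholder filename
    for line in f:
        lat, lon, dist = (s.strip() for s in line.split(","))
        coast[(lon, lat)] = float(dist)          # string keys avoid float-key issues

print(coast[("-99.10", "19.40")])                # exact lookup, no file scan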

If you want to do fuzzy matching, there are data structures (and a number of packages) that would solve that issue, such as k-d trees for nearest-neighbour lookups.
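For instance, a hedged sketch using scipy's cKDTree, with a placeholder filename and treating lat/long as plain Euclidean coordinates (an approximation):

import numpy as np
from scipy.spatial import cKDTree

data = np.loadtxt("d2s.txt", delimiter=",")      # columns: lat, lon, dist
tree = cKDTree(data[:, :2])                      # index on (lat, lon)

dists, idx = tree.query([19.4, -99.1], k=4)      # 4 nearest grid points to Mexico City
print(data[idx, 2].mean())                       # average coastline distance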

If you want the initial load time to be faster, you can do things like writing a binary pickle and loading that directly instead of parsing the text file each time. A database is also a simple solution to this.
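A short sketch of the pickle idea, reusing the coast mapping built in the previous sketch (filenames are placeholders):

import pickle

# One-time conversion: serialize the parsed mapping to disk.
with open("coast.pkl", "wb") as f:
    pickle.dump(coast, f, protocol=pickle.HIGHEST_PROTOCOL)

# Every later run: unpickling is much faster than re-parsing the text file.
with open("coast.pkl", "rb") as f:
    coast = pickle.load(f)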

You could partition the file into 10-by-10-degree patches. That reduces the search space by a factor of 648 (36 longitude bands x 18 latitude bands), yielding 648 files of roughly 18,500 lines each and cutting the search time to about 0.02 seconds.
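A hedged sketch of the one-time partitioning step, with a placeholder input file and an assumed naming scheme for the patch files:

import math
import os

os.makedirs("patches", exist_ok=True)
handles = {}                                     # at most 648 open patch files

with open("d2s.txt") as f:                       # placeholder filename
    for line in f:
        lat, lon, _ = (float(s) for s in line.split(","))
        key = (math.floor(lat / 10), math.floor(lon / 10))
        if key not in handles:
            handles[key] = open(f"patches/{key[0]}_{key[1]}.txt", "w")
        handles[key].write(line)                 # line still ends with '\n'

for h in handles.values():
    h.close()

A lookup then opens only the single patch covering the query coordinate and scans its roughly 18,500 lines.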

As you are doing exact matches on lat/long, you could instead use any on-disk key-value store; Python has at least one built in (the dbm module). If you were doing nearest-neighbour or metric-space searches, there are spatial databases that support those.
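A minimal sketch with the built-in dbm module, assuming string keys of the form "long,lat" and placeholder filenames:

import dbm

# One-time build of the on-disk key-value store.
with dbm.open("coast", "c") as db:
    with open("d2s.txt") as f:                   # placeholder filename
        for line in f:
            lat, lon, dist = (s.strip() for s in line.split(","))
            db[f"{lon},{lat}"] = dist

# Later exact lookups read straight from disk.
with dbm.open("coast", "r") as db:
    print(float(db["-99.10,19.40"].decode()))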

If you are using Python, I recommend PySpark. In this particular case you can use the mapPartitions function and join the results. This may help: How does the pyspark mapPartitions function work?

PySpark is useful when working with very large amounts of data because it splits the work into N partitions and uses your processor's full power.
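A hedged sketch of that idea; the filename, the prefix-matching logic, and the averaging step are assumptions carried over from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coast-distance").getOrCreate()
lines = spark.sparkContext.textFile("d2s.txt")   # placeholder filename

def matching_distances(partition, lon="-99.1", lat="19.4"):
    # Runs once per partition; yields coastline distances for matching rows.
    for line in partition:
        la, lo, dist = (s.strip() for s in line.split(","))
        if lo.startswith(lon) and la.startswith(lat):
            yield float(dist)

dists = lines.mapPartitions(matching_distances).collect()
print(sum(dists) / len(dists) if dists else None)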

Hope it helps you.
