
BASH: extract data with conditions from multiple CSV files, sort data and remove duplicates. Need to speed up the process

I have an amateur site with tracks for bikers. One aim of this site is to provide POIs (Points of Interest) along each GPX track. I have hundreds of database files, with thousands of entries in each. At the moment I have a script that searches for POIs within a range of about 1 km from a specific GPX point, appends the results to a file, sorts them, and removes duplicates. It works, but since I need to repeat the process up to 5,000 times for a single track I plan to add to my site, it is a very slow operation. Here is the code I arranged:

#!/bin/sh

POI_FILES="poi_files.list"

find Db_Poi_Base/ -type f >"${POI_FILES}"

# Search inside the files listed in poi_files.list, not the list file itself
xargs awk -F, '$1>12.295 \
    && $1<12.579 \
    && $2>46.186 \
    && $2<46.289 {
        print $0 ",",FILENAME
    }' < "${POI_FILES}" > "poi_base.txt"

# Looping from this point to end of script
xargs awk -F, '$1>13.136357 \
    && $1<13.156357 \
    && $2>45.679686 \
    && $2<45.699686 {
        print $0 ",",FILENAME
    }' < "${POI_FILES}" > "poi_base1.txt"

if [ -s "poi_base.txt" ] && [ -s "poi_base1.txt" ]
then
    cat poi_base.txt poi_base1.txt |
        awk '!seen[$0]++' > poi_base2.txt
fi

if [ -s "poi_base2.txt" ]
then
    sed 's/\r//' poi_base2.txt > poi_base.txt
fi

The first AWK searches inside the identified files for entries matching the given coordinate bounds, then appends a comma and the filename of the specific database where the data was found, and finally saves everything to a text file.

The second AWK does the same as the first; the only difference is the name of the output text file.

After that, the two data files are merged and passed to AWK to remove duplicates and create a third file.

Finally, the data is checked to remove the ^M character (carriage return) that may exist on some lines, and the resulting file is given the same name as in the first step; this lets me loop with only 3 files.

Starting from the second AWK, the script loops with changed parameters until the end of the points to analyze.
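Incidentally, the merge, duplicate removal and carriage-return cleanup can be collapsed into one pipeline (a minimal sketch reusing the file names above):

# Merge, strip CRs, and deduplicate in a single pass
cat poi_base.txt poi_base1.txt | tr -d '\r' | awk '!seen[$0]++' > poi_base2.txt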

The original GPX file has the following structure:

Index   Lat (decimal)   Lat (DMS)           Lon (decimal)   Lon (DMS)
0   45.689686   N   45° 41.381160' N    13.146357   E   13° 8.781420' E
1   45.689657   N   45° 41.379420' N    13.146311   E   13° 8.778660' E

and the POI database files have the following structure (a three-column CSV file, comma separated):

9.107150,39.219720,Moto Race Srl Accessori Abbigliament Cagliari>39070651423
9.141090,39.236280,Il Centauro Dei Fratelli Sanna Cocco Cagliari>39070492692
9.176310,39.241830,Planet Motors Aprilia Conce Quartu Sant'elena>39070881179

At the moment, to generate my script, I import the data into an open-source spreadsheet and manage it with some formulas: for example, adding +0.01 and -0.01 (about 1 km) to Lat and Lon to obtain the search range...
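That spreadsheet step could also be done directly in the shell; a minimal sketch (the lon/lat values and the 0.01-degree half-width are just examples):

lon=13.146357 lat=45.689686
awk -v lon="$lon" -v lat="$lat" 'BEGIN {
    d = 0.01    # roughly 1 km at these latitudes
    printf "$1>%s && $1<%s && $2>%s && $2<%s\n", lon-d, lon+d, lat-d, lat+d
}'

This prints the ready-made awk filter expression for one point.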

As everything works well, even if incredibly slowly, what I would like to know is whether there is a way to speed up the process. I have also tried to work with arrays, as follows:

IFS=$'\n'&&i=5491&&readarray -t my_array1 < <(find Db_Poi_Base/ -type f -exec awk -F, '$1>13.136357 && $1<13.156357 && $2>45.679686 && $2<45.699686 {print $0 ",",FILENAME}' {} \;)&&echo estimated duration: 91.5 minutes&&echo line: $i ;((i=i-1))

readarray -t my_array2 < <(find Db_Poi_Base/ -type f -exec awk -F, '$1>12.946311 && $1<13.346311 && $2>45.489657 && $2<45.889657 {print $0 ",",FILENAME}' {} \;)&& my_array1+=("${my_array2[@]}")&&echo line: $i ;((i=i-1))

The first part of the first line (IFS=$'\n'&&i=5491&&) is used to avoid splitting on carriage returns and to set a variable for a countdown showing that the script is in progress. The last part of every line (&&echo line: $i ;((i=i-1))) decreases the variable after echoing the line number.

The array solution works only partially, as it doesn't sort and remove duplicates.
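One way to close that gap (a sketch, keeping the array name above) is to feed the merged array back through sort -u:

# Sort and deduplicate the merged array in place
readarray -t my_array1 < <(printf '%s\n' "${my_array1[@]}" | sort -u)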

The expected output is as follows:

13.143540,45.688900,UD Lignano Sabbiadoro -(H R) Desiree>39043171415, Db_Poi_Base/Ristoranti_Senza_Glutine/Ristoranti_Senza_Glutine.csv
13.140550,45.688220,T La Conchiglia Lignano Sabbiadoro UD>39043173861, Db_Poi_Base/Tabacchi/Tabacchi.csv
13.142680,45.689260,T Sebastianis Laura Lignano Sabbiadoro UD>390431720656, Db_Poi_Base/Tabacchi/Tabacchi.csv
13.144810,45.691160,T Di Bella Salvatore Lignano Sabbiadoro UD>39043173820, Db_Poi_Base/Tabacchi/Tabacchi.csv
13.143750,45.691250,T Gusso Luigi Lignano Sabbiadoro UD>39043170187, Db_Poi_Base/Tabacchi/Tabacchi.csv
13.146660,45.691940,T Zamolo Metullio Lignano Sabbiadoro UD>39043170777, Db_Poi_Base/Tabacchi/Tabacchi.csv
13.148370,45.693080,T Passilongo Mario Lignano Sabbiadoro UD>39043171922, Db_Poi_Base/Tabacchi/Tabacchi.csv

The above is my primary problem. As a secondary curiosity (not a pressing one, since the spreadsheet approach is fast enough): could the GPX file be read directly by the script, for example with a loop that reads every row of the external GPX file and feeds the relevant columns into the command sequence?
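A minimal sketch of that idea, assuming the whitespace-separated GPX layout shown above (index in column 1, decimal latitude in column 2, decimal longitude in column 7) and a hypothetical track.gpx file:

# Read each GPX row, plug its columns into the awk filter, deduplicate at the end
while read -r idx lat _ _ _ _ lon _; do
    xargs awk -F, -v lon="$lon" -v lat="$lat" \
        '$1>lon-0.01 && $1<lon+0.01 && $2>lat-0.01 && $2<lat+0.01 {print $0 ",",FILENAME}' \
        < "${POI_FILES}"
done < track.gpx | awk '!seen[$0]++' > poi_base.txt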

Thanks for any suggestion.

Since the format of both files is identical, you could remove duplicates by simply doing this:

cat poi_base.txt poi_base1.txt | sort | uniq > poi_base2.txt

As for the general concept of the project, I would suggest that each point of interest (POI) be assigned a unique ID, something like this (one way to generate them is sketched after the example):

poi_00903|9.107150,39.219720,Moto Race Srl Accessori Abbigliament Cagliari>39070651423
poi_00904|9.141090,39.236280,Il Centauro Dei Fratelli Sanna Cocco Cagliari>39070492692
poi_00905|9.176310,39.241830,Planet Motors Aprilia Conce Quartu Sant'elena>39070881179
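A minimal way to stamp such IDs onto the existing CSV files might be (a sketch; the poi_ prefix, the zero padding and the poi_index.csv name are just this answer's conventions):

# Number every POI line across all files with a zero-padded unique ID
awk '{printf "poi_%05d|%s\n", NR, $0}' Db_Poi_Base/*/*.csv > poi_index.csv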

That ID may or may not be related to the ID of the bike/walking path.

Perform a single massive task of identifying the IDs of POIs within specified ranges of each path's coordinate checkpoints, and list those as additional fields on each point's line in modified GPX files (*.gpx_poi); a sketch of this pass follows the example below.

For example, if the original GPX file was named

Montessori_053.gpx

and one position line was

37   45.689657   N   45° 41.379420' N    13.146311   E   13° 8.778660' E

you could give the related POI file the name

Montessori_053.gpx_poi

and the related line could be formatted either (with coords) as

37   45.689657   N   45° 41.379420' N    13.146311   E   13° 8.778660' E ,poi_00903 poi_00904 poi_00907,poi_00792 poi_00795,poi_00348 poi_01009 poi_01201

or without coords as

37,poi_00903 poi_00904 poi_00907,poi_00792 poi_00795,poi_00348 poi_01009 poi_01201

where the 1st, 2nd and 3rd groupings are the POIs within 1 km, 3 km and 10 km respectively.
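A hedged sketch of that pre-compilation pass, using the poi_index.csv format sketched above, only the 1 km ring, and a fixed 0.01-degree offset (the 3 km and 10 km groups would follow the same pattern):

# For each track point, collect the POI IDs within ~1 km as one extra field
while read -r idx lat _ _ _ _ lon _; do
    ids=$(awk -F'[|,]' -v lon="$lon" -v lat="$lat" \
        '$2>lon-0.01 && $2<lon+0.01 && $3>lat-0.01 && $3<lat+0.01 {printf "%s ", $1}' \
        poi_index.csv)
    printf '%s,%s\n' "$idx" "$ids"
done < Montessori_053.gpx > Montessori_053.gpx_poi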

Having the POIs pre-analyzed and pre-compiled will make the application much more responsive, because you've reduced the lookup effort to simple path-index matching to locate the relevant POI IDs that need to be extracted and displayed.
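The runtime lookup for a given track point then reduces to something like:

# Fetch the pre-computed POI IDs for track point 37 (without-coords format)
grep '^37,' Montessori_053.gpx_poi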

You might also want to create an "Itinerary Planner" where you present a unique reference to all the POIs identified along a given path, each with a checkbox. The user could then save that raw+checked list for later reference. When the user chooses "Start Trek" for their trip, it would highlight the checked POIs with a pin-flag label at the closest point along the route. You might want to give the option to revisit the original list and change the check-mark selections, in case users change their minds about what they want to visit or when. Just some thoughts.

You have a very good idea there. Good luck with your project!

I found a solution that works; it is not elegant, but it cuts the process time from 2 hours to 5 minutes.

First I use the following command to create a file that reduces the total data, extracting only the data in a specific range (max and min latitude and longitude):

find Db_Poi_Base/ -type f -exec awk -F, '$1>13.011409 && $1<13.783151 && $2>45.641559 && $2<46.10597 {print $0 ",",FILENAME}' {} > "poi_base-test.txt" \;

Then I changed the first part of the command used above, setting it to search inside that specific file:

find poi_base-test.txt -type f -exec awk -F, '$1>13.136357 && $1<13.156357 && $2>45.679686 && $2<45.699686 {print $0}' {} > "poi_base.txt" \;&&i=5491&&echo finishing in: $t minutes&&echo line: $i ;((i=i-1));((t=i/45))

The result is correct, and it is now fast enough for my needs.
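For reference, the remaining per-point loop could in principle be folded into a single awk pass that loads all the search points first; a hedged sketch, assuming a hypothetical points.csv with one lon,lat pair per line:

# One pass over the pre-filtered file: test every line against every point
awk -F, 'NR==FNR { lon[NR]=$1; lat[NR]=$2; n=NR; next }
{
    for (i = 1; i <= n; i++)
        if ($1 > lon[i]-0.01 && $1 < lon[i]+0.01 &&
            $2 > lat[i]-0.01 && $2 < lat[i]+0.01) { print; next }
}' points.csv poi_base-test.txt > poi_base.txt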
