Finding text in a string matching patterns

I have a text/CSV file that contains, amongst others, rows that look like this:

05:21:20PM   Driving 46 84.0         Some Road; Some Ext 1; in SomePLace; Long 38 12 40.6 E Lat 29 2 47.2 S

There are other rows containing data that I am not after.

I am only looking to extract the timestamp and the lat/long.

The only things constant in the rows I am interested in are the timestamp at the beginning, which is always 8 characters followed by AM or PM, and the lat/long, which starts with the word "Long" and ends in an "S".

Is there any way to run through this file, strip out only these two pieces of text, and concatenate them into a new row, while ignoring all rows that do not have both the timestamp as the first entry AND the lat/long part at the end? (Some rows have a timestamp at the beginning but no lat/long.)

Use the csv module to parse out the rows, then split the last column on ; to get the lat/long coordinates:

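import csv

# Python 3: open in text mode; newline='' lets csv handle line endings
with open(inputfilename, newline='') as inputfh:
    reader = csv.reader(inputfh, delimiter='\t')
    for row in reader:
        if len(row) < 3:
            continue  # skip rows without the expected columns
        timestamp = row[0]
        lat_long = row[2].rpartition(';')[-1].strip()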

This assumes that the file is tab-separated and that the latitude/longitude entry is always the last semicolon-separated value in the third column.
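The loop above only pulls out the two values. A minimal sketch of the filtering and concatenation the question asks for could look like the following. It rests on the same assumptions (tab-separated columns, lat/long as the last semicolon-separated value of the third column), and the function name `extract_rows` and the in-memory sample rows are made up for illustration:

```python
import csv
import io

def extract_rows(lines):
    """Yield 'timestamp lat_long' for rows that contain both pieces."""
    reader = csv.reader(lines, delimiter='\t')
    for row in reader:
        if len(row) < 3:
            continue  # not enough columns to hold a lat/long entry
        timestamp = row[0].strip()
        lat_long = row[2].rpartition(';')[-1].strip()
        # Keep the row only if both ends look as described in the question.
        if timestamp.endswith(('AM', 'PM')) and lat_long.startswith('Long'):
            yield timestamp + ' ' + lat_long

# Hypothetical sample: one wanted row and two rows that should be ignored.
sample = io.StringIO(
    "05:21:20PM\tDriving 46 84.0\tSome Road; Some Ext 1; in SomePLace; "
    "Long 38 12 40.6 E Lat 29 2 47.2 S\n"
    "05:22:00PM\tStopped\tSome Road; no position\n"
    "Header line without tabs\n"
)
results = list(extract_rows(sample))
print(results)
```

To write the rows to a file instead of collecting them, iterate over `extract_rows(inputfh)` and write each string followed by a newline.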

I do not recommend regular expressions if your data really is in CSV format: they are the wrong tool for CSV, and the result will not be pretty. But because your data does not look like true CSV, parsing it with a regular expression might be an option, and this code works for the sample you have provided:

import re

with open('inputfilename') as f:
    for line in f:
        # raw string so the backslashes reach the regex engine unchanged
        mat = re.match(r"(\d+):(\d+):(\d+)([AP]M).*Long\s+([^EW]+[EW]).*Lat\s+([^NS]+[NS])", line)
        if mat is not None:
            print(mat.groups())

result:

('05', '21', '20', 'PM', '38 12 40.6 E', '29 2 47.2 S')

Further processing of this result is left as an exercise, but it could look like this:

hour, minute, second, am_pm, long, lat = mat.groups()
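
If the goal is just the single concatenated row from the question, the pattern can capture the whole timestamp and the whole "Long ... Lat ..." tail as two groups instead of six. This is only a sketch: the names `PATTERN` and `extract` are hypothetical, and it assumes the position is always the last thing on the line:

```python
import re

# Two groups: the full timestamp at the start of the line and the
# full position at the end of the line.
PATTERN = re.compile(
    r"(\d{2}:\d{2}:\d{2}[AP]M)"                     # e.g. 05:21:20PM
    r".*?"                                          # anything in between
    r"(Long\s+[^EW]+[EW]\s+Lat\s+[^NS]+[NS])\s*$"   # e.g. Long ... E Lat ... S
)

def extract(line):
    """Return 'timestamp position', or None if the line does not match."""
    mat = PATTERN.match(line)
    if mat is None:
        return None
    return ' '.join(mat.groups())

sample = ("05:21:20PM   Driving 46 84.0         Some Road; Some Ext 1; "
          "in SomePLace; Long 38 12 40.6 E Lat 29 2 47.2 S")
print(extract(sample))
# prints: 05:21:20PM Long 38 12 40.6 E Lat 29 2 47.2 S
```

Because `re.match` anchors at the start of the line and the pattern ends with `$`, rows that have a timestamp but no position (or the reverse) return `None` and can be skipped.
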
Plain string methods such as split and find also work on the sample line:

>>> s = "05:21:20PM   Driving 46 84.0         Some Road; Some Ext 1; in SomePLace; Long 38 12 40.6 E Lat 29 2 47.2 S"
>>> date = s.split(" ")[0]
>>> date
'05:21:20PM'
>>> long_start = "Long"
>>> lat_start = "Lat"
>>> longitude = s[s.find(long_start) + len(long_start): s.find(lat_start)].strip()
>>> longitude
'38 12 40.6 E'
>>> latitude = s[s.find(lat_start) + len(lat_start):].strip()
>>> latitude
'29 2 47.2 S'
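
One thing to watch with this approach is that `str.find` returns -1 when the marker is absent, so slicing with the result would silently produce garbage on rows that lack the lat/long. A hedged sketch that wraps the session above in a function and skips such rows (the name `extract_with_find` is made up for illustration):

```python
def extract_with_find(line):
    """Return (timestamp, longitude, latitude), or None if a piece is missing."""
    parts = line.split()
    if not parts or not parts[0].endswith(('AM', 'PM')):
        return None  # no timestamp in first position
    long_pos = line.find('Long ')
    lat_pos = line.find('Lat ', long_pos + 1)
    if long_pos == -1 or lat_pos == -1:
        return None  # find() returns -1 when the marker is not present
    longitude = line[long_pos + len('Long '):lat_pos].strip()
    latitude = line[lat_pos + len('Lat '):].strip()
    return parts[0], longitude, latitude

sample = ("05:21:20PM   Driving 46 84.0         Some Road; Some Ext 1; "
          "in SomePLace; Long 38 12 40.6 E Lat 29 2 47.2 S")
print(extract_with_find(sample))
# prints: ('05:21:20PM', '38 12 40.6 E', '29 2 47.2 S')
```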
