
Regex to split line data into year / temperature readings

I'm writing a Python script to parse some data files I have into geojson data.

Right now, I have a number of lines that each start with a year followed by 12 temperature readings (one for each month). For example:

1983   5.2  -0.4   5.7   9.8  13.7  18.1  22.1  19.8  15.1  10.2   4.8   1.1 
1984   1.9   0.5   2.8   8.9  13.7  15.0  16.9  19.2  13.5  11.3   4.6   0.7 
1985  -5.0  -2.8   4.0   8.8  15.6  15.2  19.0  18.4  14.3   9.9   2.0   4.4 
1986   0.4  -6.4   3.8   7.4  15.9  17.4  19.4  18.2  12.3  10.3   7.1   2.5 

Etc. I'm trying to write a regex ideally so that the year will go into the first capture group and then either all the temperatures will go into the next group, or they will go into individual groups. In the first situation, I'll just split based on spaces and then parse them individually. In the second, I'll just parse each capture group one by one.

I've tried this right now and it's not working (scaled down example to demonstrate):

import re
reYear = re.compile("([0-9][0-9][0-9][0-9])([\s]*[\-]*[0-9]+[\s]*)*")
line = "1983   5.2  -0.4   5.7   9.8  13.7  18.1  22.1  19.8  15.1  10.2   4.8   1.1"
data = reYear.search(line)
print("GROUP 0: %s" % data.group(0))
print("GROUP 1: %s" % data.group(1))

This is the output I get:

GROUP 0: 1983   5
GROUP 1: 1983

I thought this might work because the first () group says capture 4 digits, and the second says capture some instances of either a minus sign (or not), some numbers, and then some whitespace. However I don't really know what I'm doing. Appreciate any help.

Thank you!

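Your pattern stops at "5" because [0-9]+ cannot match the decimal point in 5.2, so the repeated group ends there. I suggest using .* to match the remainder of the line. Also, \d{4} is the simplest way to match four digits: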

import re

# Regex: (four digits) whitespace (the rest of the line)
# A raw string keeps the backslashes intact for the regex engine
reYear = re.compile(r"(\d{4})\s+(.*)")
line = "1983   5.2  -0.4   5.7   9.8  13.7  18.1  22.1  19.8  15.1  10.2   4.8   1.1"
data = reYear.search(line)

# Group 0 is everything
print("GROUP 0: %s" % data.group(0))

print("GROUP 1: %s" % data.group(1))
print("GROUP 2: %s" % data.group(2))

This outputs:

GROUP 0: 1983   5.2  -0.4   5.7   9.8  13.7  18.1  22.1  19.8  15.1  10.2   4.8   1.1
GROUP 1: 1983
GROUP 2: 5.2  -0.4   5.7   9.8  13.7  18.1  22.1  19.8  15.1  10.2   4.8   1.1

Having said all that, you could just split the whole line on whitespace and take the first element as the year, and not use a regex at all.
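
A minimal sketch of that no-regex approach, assuming every line holds a year followed by numeric readings separated by whitespace (parse_line is just an illustrative name, not something from your script):

```python
def parse_line(line):
    """Split one record into (year, list of temperature readings)."""
    fields = line.split()  # no argument: split on any run of whitespace
    year = int(fields[0])
    temps = [float(t) for t in fields[1:]]  # float() handles the minus sign
    return year, temps


line = "1983   5.2  -0.4   5.7   9.8  13.7  18.1  22.1  19.8  15.1  10.2   4.8   1.1"
year, temps = parse_line(line)
print(year)
print(temps)
```

Because split() with no argument also discards leading and trailing whitespace, lines that end in a stray space parse the same way. If a file might contain malformed lines, you would probably want to check len(temps) == 12 before using the result.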
