
Regex to split line data into year / temperature readings

I'm writing a Python script to parse some data files I have into GeoJSON data.

Right now, I have a number of lines that each start with a year followed by 12 temperature readings (one for each month), for example:

1983   5.2  -0.4   5.7   9.8  13.7  18.1  22.1  19.8  15.1  10.2   4.8   1.1 
1984   1.9   0.5   2.8   8.9  13.7  15.0  16.9  19.2  13.5  11.3   4.6   0.7 
1985  -5.0  -2.8   4.0   8.8  15.6  15.2  19.0  18.4  14.3   9.9   2.0   4.4 
1986   0.4  -6.4   3.8   7.4  15.9  17.4  19.4  18.2  12.3  10.3   7.1   2.5 

Etc. Ideally, I'm trying to write a regex that puts the year into the first capture group and then either puts all the temperatures into the next group or puts each one into its own group. In the first case, I'll split on whitespace and parse the values individually. In the second, I'll parse each capture group one by one.

I've tried this so far and it's not working (scaled-down example to demonstrate):

import re
reYear = re.compile("([0-9][0-9][0-9][0-9])([\s]*[\-]*[0-9]+[\s]*)*")
line = "1983   5.2  -0.4   5.7   9.8  13.7  18.1  22.1  19.8  15.1  10.2   4.8   1.1"
data = reYear.search(line)
print("GROUP 0: %s" % data.group(0))
print("GROUP 1: %s" % data.group(1))

This is the output I get:

GROUP 0: 1983   5
GROUP 1: 1983

I thought this might work because the first () group says to capture 4 digits, and the second says to capture some instances of an optional minus sign, some digits, and then some whitespace. However, I don't really know what I'm doing. I'd appreciate any help.

Thank you!

I suggest using .* to match the remainder of the line. Also, \d{4} is the simplest way to match four digits:

import re

# Regex: (four digits) whitespace (the rest of the line)
reYear = re.compile(r"(\d{4})\s+(.*)")  # raw string so \d and \s reach the regex engine unchanged
line = "1983   5.2  -0.4   5.7   9.8  13.7  18.1  22.1  19.8  15.1  10.2   4.8   1.1"
data = reYear.search(line)

# Group 0 is everything
print("GROUP 0: %s" % data.group(0))

print("GROUP 1: %s" % data.group(1))
print("GROUP 2: %s" % data.group(2))

This outputs:

GROUP 0: 1983   5.2  -0.4   5.7   9.8  13.7  18.1  22.1  19.8  15.1  10.2   4.8   1.1
GROUP 1: 1983
GROUP 2: 5.2  -0.4   5.7   9.8  13.7  18.1  22.1  19.8  15.1  10.2   4.8   1.1    
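As for why the original pattern stopped at "1983   5": its repeated group only allows whitespace, minus signs and digits, so the match ends at the first decimal point. If you would rather validate the whole line than accept anything with .*, a pattern that allows an optional fractional part is one option. The following is only a sketch; the names row_pattern and reading_pattern are made up for illustration, and it assumes every reading looks like an optional minus sign, digits, and an optional decimal part:

```python
import re

line = "1983   5.2  -0.4   5.7   9.8  13.7  18.1  22.1  19.8  15.1  10.2   4.8   1.1"

# Group 1: the four-digit year.
# Group 2: one or more readings, each preceded by whitespace.
row_pattern = re.compile(r"^(\d{4})((?:\s+-?\d+(?:\.\d+)?)+)\s*$")

# A single reading: optional minus sign, digits, optional decimal part.
reading_pattern = re.compile(r"-?\d+(?:\.\d+)?")

match = row_pattern.match(line)
if match is None:
    raise ValueError("line does not look like a year followed by readings")

year = int(match.group(1))
temperatures = [float(t) for t in reading_pattern.findall(match.group(2))]

print(year, temperatures)
```

A repeated capture group only keeps its last match in Python's re module, which is why the readings are pulled out of group 2 with findall instead of relying on one group per repetition.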

Having said all that, you could just split the whole line on whitespace and take the first element as the year, without using a regex at all.
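The whitespace-splitting approach could look roughly like this. The helper name parse_row is hypothetical, and the sketch assumes every line is well formed (a year followed by numeric readings), so it does no error handling:

```python
def parse_row(line):
    """Split a line on whitespace into (year, list of temperatures)."""
    fields = line.split()  # split() with no arguments collapses runs of whitespace
    year = int(fields[0])
    temperatures = [float(value) for value in fields[1:]]
    return year, temperatures


rows = [
    "1983   5.2  -0.4   5.7   9.8  13.7  18.1  22.1  19.8  15.1  10.2   4.8   1.1 ",
    "1984   1.9   0.5   2.8   8.9  13.7  15.0  16.9  19.2  13.5  11.3   4.6   0.7 ",
]

for row in rows:
    year, temperatures = parse_row(row)
    print(year, temperatures)
```

Because split() ignores leading and trailing whitespace, the trailing spaces in the data lines need no special handling.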
