Regex to find digits not followed by certain characters
I currently have a text field that contains times to be used for scheduling. Because it is a free-text field, the data is unstructured and comes in many different formats. Examples of data include:
Mon-Wed 6-7:30pm
Tuesday/Thurs 5:00 - 6:30
M/T/W 3:30 -7
F 4-5
As such, I am trying to write a parser to turn these into usable data points. I am working on the time components at the moment. In order to structure the data and be able to pass it into the dateutil parser, I want to "fill out" all times: 6 would become 6:00, 7 would become 7:00, etc. To do so I am trying to use the regex:
reg = re.compile('[\d]([^:]|$)')
The idea is to match any digit that either does not have a : after it or is at the end of the line. However, I realized that this matches too many data points: in the first example it would also match the '3' and the '0' of 7:30.
What would be a better way to convert this data to a usable format?
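To make the over-matching concrete, here is a minimal sketch that runs the pattern from the question against the first sample string. The names `sample` and `matches` are only illustrative, and the pattern is written as a raw string.

```python
import re

# The pattern from the question: a digit followed by any non-colon
# character, or by the end of the string.
reg = re.compile(r'\d([^:]|$)')

sample = 'Mon-Wed 6-7:30pm'

# Collect the full text of every match (group 0), not just the inner group.
matches = [m.group(0) for m in reg.finditer(sample)]
print(matches)  # ['6-', '30']
```

The '7' is skipped because a colon follows it, but the '3' of 7:30 is followed by '0', which is not a colon, so it matches too. Each match also consumes the character after the digit.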
I would do it in two stages, harnessing one interesting feature of re.split.
Sample data:
line1 = 'Mon-Wed 6-7:30pm'
line2 = 'Tuesday/Thurs 5:00 - 6:30'
line3 = 'M/T/W 3:30 -7'
line4 = 'F 4-5'
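Function:
import re

def add_zeros(line):
    # Split on complete times; the capturing group keeps them in the list
    parts = re.split(r'(\d{1,2}:\d{1,2})', line)
    # Even-indexed parts hold the remaining text: pad bare hours with :00
    parts[::2] = [re.sub(r'(\d{1,2})', r'\1:00', p) for p in parts[::2]]
    return ''.join(parts)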
Usage:
print(add_zeros(line1)) # Mon-Wed 6:00-7:30pm
print(add_zeros(line2)) # Tuesday/Thurs 5:00 - 6:30
print(add_zeros(line3)) # M/T/W 3:30 -7:00
print(add_zeros(line4)) # F 4:00-5:00
Explanation:
The pattern passed to re.split is wrapped in a capturing group, so re.split returns a list in which the odd-indexed elements are the separators. With the pattern I used, the separators are the "ready" times (those that do not need zero-padding). I then apply re.sub to every even-indexed element of the list (the text between the "ready" times), treating every 1- or 2-digit number as an hour and replacing it with the number followed by :00.
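The re.split behaviour described above can be checked in isolation. This is a small sketch; the variable names `without_group` and `with_group` are only illustrative.

```python
import re

sample = 'M/T/W 3:30 -7'

# Without a capturing group, the matched separators are dropped.
without_group = re.split(r'\d{1,2}:\d{1,2}', sample)

# With a capturing group, the matched separators are kept and land
# on the odd indexes of the result.
with_group = re.split(r'(\d{1,2}:\d{1,2})', sample)

print(without_group)  # ['M/T/W ', ' -7']
print(with_group)     # ['M/T/W ', '3:30', ' -7']
```

Because the text between separators always sits on the even indexes, the slice parts[::2] picks out exactly the pieces that may still contain bare hours.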
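You can use a negative lookbehind and a negative lookahead: (?<!(:)\d)\d(?!(:|\d))
https://regex101.com/r/nAQh3e/4
This selects a digit that is not preceded by a colon and a digit, and is not followed by a colon or another digit.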
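As a rough sketch of how this lookaround pattern could be used from Python, the snippet below appends :00 to each digit it selects. The helper name `fill_with_lookarounds` is hypothetical, and the inner capturing groups are dropped because they are not needed for the substitution.

```python
import re

# A digit that is not preceded by ":<digit>" and is not followed by
# a colon or another digit, i.e. the last digit of a bare hour.
lone_digit = re.compile(r'(?<!:\d)\d(?!:|\d)')

def fill_with_lookarounds(line):
    # \g<0> re-inserts the matched digit, then ":00" is appended to it.
    return lone_digit.sub(r'\g<0>:00', line)

print(fill_with_lookarounds('Mon-Wed 6-7:30pm'))  # Mon-Wed 6:00-7:30pm
print(fill_with_lookarounds('M/T/W 3:30 -7'))     # M/T/W 3:30 -7:00
```

For a two-digit bare hour such as 10, only the final digit matches, so appending :00 after it still yields 10:00.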
I think it will be much easier to find the incomplete times after replacing the complete times with a placeholder. Then we can fix the incomplete times and substitute the placeholders back with the original times.
Here is a simple implementation; you can tweak it to match your needs:
import re

texts = ["Mon-Wed 6-7:30pm",
         "Tuesday/Thurs 5:00 - 6:30",
         "M/T/W 3:30 -7",
         "F 4-5"]

def get_placeholder_replacer(replaced_strings):
    # Build a callback that records each match and returns a "{}" placeholder
    def replace_with_placeholder(x):
        replaced_strings.append(x[0])
        return "{}"
    return replace_with_placeholder

ptrn_correct_time = re.compile(r"\d+:\d+")    # complete times, e.g. 7:30
ptrn_incorrect_time = re.compile(r"\d{1,2}")  # bare hours, e.g. 6

for text in texts:
    replaced_strings = []
    placeholder_replacer = get_placeholder_replacer(replaced_strings)
    # Hide the complete times behind placeholders
    new_text = ptrn_correct_time.sub(placeholder_replacer, text)
    # Pad the remaining bare hours with :00
    new_text = ptrn_incorrect_time.sub(lambda x: "{}:00".format(x[0]), new_text)
    # Put the original complete times back
    print(new_text.format(*replaced_strings))
## Output
# Mon-Wed 6:00-7:30pm
# Tuesday/Thurs 5:00 - 6:30
# M/T/W 3:30 -7:00
# F 4:00-5:00
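The same idea can probably be folded into a single substitution by letting the regex try the complete form first and fall back to a bare hour. This is only a sketch under that assumption, not part of the answer above; `time_pattern`, `fill_out_times` and `fix` are hypothetical names.

```python
import re

# Alternation is tried left to right: a complete "H:MM" time wins,
# otherwise a bare 1- or 2-digit hour is matched.
time_pattern = re.compile(r'\d{1,2}:\d{1,2}|\d{1,2}')

def fill_out_times(text):
    def fix(match):
        found = match.group(0)
        # Complete times pass through unchanged; bare hours get ":00".
        return found if ':' in found else found + ':00'
    return time_pattern.sub(fix, text)

for text in ["Mon-Wed 6-7:30pm", "Tuesday/Thurs 5:00 - 6:30",
             "M/T/W 3:30 -7", "F 4-5"]:
    print(fill_out_times(text))
```

Since this variant does not go through str.format, literal { or } characters in the input text need no special handling.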