Regex to find numbers that do not contain certain characters

Question

I currently have a text field that contains information about times that are to be used for scheduling purposes. As it is a text field, the data is unstructured and comes in many different formats. Examples of data include:

Mon-Wed 6-7:30pm
Tuesday/Thurs 5:00 - 6:30
M/T/W 3:30 -7
F 4-5

As such, I am trying to write a parser to turn these into usable data points. I am working on the time components at the moment. In order to structure the data and be able to pass it into the dateutil parser, I want to "fill out" all times: 6 would become 6:00, 7 would become 7:00, etc. To do so I am trying to use the regular expression:

reg = re.compile('[\d]([^:]|$)')

The idea is to get any digit that either does not have a : after it or is at the end of the line. However, I realized that this matches too much: in the first example it would also get the '3' and the '0' of 7:30.
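The over-matching is easy to reproduce. The following is a minimal sketch that runs the pattern from the question on the first sample line; the variable names `sample` and `matches` are illustrative only.

```python
import re

# The pattern from the question: a digit followed by a non-colon
# character, or a digit at the end of the line.
reg = re.compile(r'[\d]([^:]|$)')

sample = 'Mon-Wed 6-7:30pm'

# group(0) is the whole match: the digit plus the character consumed after it.
matches = [m.group(0) for m in reg.finditer(sample)]
print(matches)  # ['6-', '30']
```

The minutes are matched because the '3' is followed by '0', which is not a colon, so the pattern cannot tell a bare hour from the minutes of a complete time.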

What would be a better way to convert this data to a usable format?

Answer 1

I would do it in two stages, harnessing one interesting feature of re.split. Sample data:

line1 = 'Mon-Wed 6-7:30pm'
line2 = 'Tuesday/Thurs 5:00 - 6:30'
line3 = 'M/T/W 3:30 -7'
line4 = 'F 4-5'

Function:

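import re

def add_zeros(line):
    # The capturing group makes re.split keep the complete times as list items.
    parts = re.split(r'(\d{1,2}:\d{1,2})',line)
    # Even-indexed items hold no complete times, so pad every bare hour there.
    parts[::2] = [re.sub(r'(\d{1,2})',r'\1:00',p) for p in parts[::2]]
    return ''.join(parts)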

Usage:

print(add_zeros(line1)) # Mon-Wed 6:00-7:30pm
print(add_zeros(line2)) # Tuesday/Thurs 5:00 - 6:30
print(add_zeros(line3)) # M/T/W 3:30 -7:00
print(add_zeros(line4)) # F 4:00-5:00

Explanation:

I wrap the pattern passed to re.split in a capturing group. With a capturing group, re.split returns a list in which the odd-indexed elements are the separators. With the pattern I used in re.split, the separators are the "ready" times (which already have minutes and need no zero-padding). I then use re.sub on every even-indexed element of the list (the parts without "ready" times), treating every 1- or 2-digit number as an hour and replacing it with the number followed by :00.
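To see the odd/even layout that the explanation relies on, here is a small sketch that prints what re.split returns for two of the sample lines. The names `pattern`, `parts_a` and `parts_b` are illustrative and not part of the answer.

```python
import re

# Same pattern as in add_zeros; the parentheses form a capturing group,
# which is what makes re.split include the separators in its result.
pattern = r'(\d{1,2}:\d{1,2})'

parts_a = re.split(pattern, 'M/T/W 3:30 -7')
print(parts_a)  # ['M/T/W ', '3:30', ' -7']

parts_b = re.split(pattern, 'Tuesday/Thurs 5:00 - 6:30')
print(parts_b)  # ['Tuesday/Thurs ', '5:00', ' - ', '6:30', '']
```

The complete times sit at indexes 1, 3, and so on, while everything else (including a trailing empty string when the line ends with a complete time) sits at the even indexes, which is why only `parts[::2]` is rewritten.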

Answer 2

You can use a negative lookbehind and a negative lookahead: (?<!(:)\\d)\\d(?!(:|\\d)) https://regex101.com/r/nAQh3e/4 This selects digits that have no digit before or after them and also no :
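A minimal sketch of how this lookaround idea could be applied with re.sub in Python. It assumes the pattern is written with single backslashes in a raw string and drops the capturing groups, which should not change what is matched; the helper name `add_zeros_lookaround` is hypothetical.

```python
import re

# A digit that is not preceded by ":<digit>" and not followed by ":" or a digit.
# Assumption: this is a group-free rewrite of the pattern given in the answer.
ptrn = re.compile(r'(?<!:\d)\d(?![:\d])')

def add_zeros_lookaround(line):
    # \g<0> is the matched digit itself; append ":00" to it.
    return ptrn.sub(r'\g<0>:00', line)

for line in ['Mon-Wed 6-7:30pm',
             'Tuesday/Thurs 5:00 - 6:30',
             'M/T/W 3:30 -7',
             'F 4-5']:
    print(add_zeros_lookaround(line))
# Mon-Wed 6:00-7:30pm
# Tuesday/Thurs 5:00 - 6:30
# M/T/W 3:30 -7:00
# F 4:00-5:00
```

The pattern matches only the last digit of a bare hour, so the replacement appends :00 after it rather than rewriting the whole number.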

Answer 3

I think it will be much easier to find the incorrect times after replacing the correct times with a placeholder. We can then fix the incorrect time format and substitute the placeholders back with the actual times.

Here is a simple implementation; you can tweak it to match your needs.

import re

texts = ["Mon-Wed 6-7:30pm",
"Tuesday/Thurs 5:00 - 6:30",
"M/T/W 3:30 -7",
"F 4-5",]

def get_placeholder_replacer(replaced_strings):
    def replace_with_placeholder(x):
        replaced_strings.append(x[0])
        return "{}"
    return replace_with_placeholder


ptrn_correct_time = re.compile(r"\d+:\d+")
ptrn_incorrect_time = re.compile(r"\d{1,2}")

for text in texts:
    replaced_strings = []
    placeholder_replacer = get_placeholder_replacer(replaced_strings)
    new_text = ptrn_correct_time.sub(placeholder_replacer,text)
    new_text = ptrn_incorrect_time.sub(lambda x: "{}:00".format(x[0]), new_text)

    print(new_text.format(*replaced_strings))

## Output
# Mon-Wed 6:00-7:30pm
# Tuesday/Thurs 5:00 - 6:30
# M/T/W 3:30 -7:00
# F 4:00-5:00

Answer 1
3, accepted, 2019-08-13 18:12:16

Answer 2
2, 2019-08-13 17:26:37

Answer 3
0, 2019-08-13 17:52:27
