Regex to find digits not followed by certain characters

I currently have a text field that contains information about times to be used for scheduling purposes. Because it is a text field, the data is unstructured and comes in many different formats. Examples of the data include:

  • Mon-Wed 6-7:30pm
  • Tuesday/Thurs 5:00 - 6:30
  • M/T/W 3:30 -7
  • F 4-5

As such, I am trying to write a parser to turn these into usable data points. I am working on the time components at the moment. In order to structure the data and be able to pass it into the dateutil parser, I want to "fill out" all times: 6 would become 6:00, 7 would become 7:00, and so on. To do so I am trying to use the regular expression:

reg = re.compile('[\d]([^:]|$)')

The idea is to get any digit that either does not have a : after it, or is at the end of the line. However, I realized that this matches too much: in the first example it would also pick up the '3' and the '0' of 7:30.
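To make the over-matching concrete, here is a small check of that pattern against the first example. It is only an illustration: it uses finditer so the whole match is visible rather than just the captured group, and the names sample and matches are made up for this sketch.

```python
import re

# The pattern from the question, written as a raw string.
reg = re.compile(r'[\d]([^:]|$)')

sample = 'Mon-Wed 6-7:30pm'
# group(0) is the whole match: the digit plus the character consumed after it.
matches = [m.group(0) for m in reg.finditer(sample)]
print(matches)  # ['6-', '30']
```

The '30' match comes from the minutes of 7:30, and every match also swallows the character after the digit, which would make a later substitution awkward.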

What would be a better way to convert this data to a usable format?

I would do it in a two-stage manner, harnessing one interesting feature of re.split. Sample data:

line1 = 'Mon-Wed 6-7:30pm'
line2 = 'Tuesday/Thurs 5:00 - 6:30'
line3 = 'M/T/W 3:30 -7'
line4 = 'F 4-5'

Function:

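import re

def add_zeros(line):
    # The capturing group makes re.split keep the already complete H:MM times.
    parts = re.split(r'(\d{1,2}:\d{1,2})', line)
    # Even-indexed parts hold the remaining text; pad bare hours there.
    parts[::2] = [re.sub(r'(\d{1,2})', r'\1:00', p) for p in parts[::2]]
    return ''.join(parts)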

Usage:

print(add_zeros(line1)) # Mon-Wed 6:00-7:30pm
print(add_zeros(line2)) # Tuesday/Thurs 5:00 - 6:30
print(add_zeros(line3)) # M/T/W 3:30 -7:00
print(add_zeros(line4)) # F 4:00-5:00

Explanation:

The pattern I pass to re.split is wrapped in a capturing group, so re.split keeps the separators: it returns a list in which the odd-indexed elements are the separators. With the pattern I used, the separators are the "ready" times (those that already have minutes and need no padding). I then use re.sub on every even-indexed element of the list (the text without "ready" times), treating every 1- or 2-digit number as an hour and replacing it with the number followed by :00.
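As a quick illustration of the re.split behaviour this relies on, the sketch below splits one of the sample lines and prints the resulting list. The variable names are chosen only for this example.

```python
import re

line = 'Tuesday/Thurs 5:00 - 6:30'
# With a capturing group in the pattern, the matched separators stay in the result.
parts = re.split(r'(\d{1,2}:\d{1,2})', line)
print(parts)       # ['Tuesday/Thurs ', '5:00', ' - ', '6:30', '']
print(parts[1::2]) # ['5:00', '6:30']  (odd indexes: times that are already complete)
print(parts[::2])  # ['Tuesday/Thurs ', ' - ', '']  (even indexes: everything else)
```

Because the complete times sit only at odd indexes, a substitution applied to the even-indexed pieces cannot touch them.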

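You can use a negative lookbehind and a negative lookahead: (?<!(:)\d)\d(?!(:|\d)) https://regex101.com/r/nAQh3e/4 This selects a digit that is not preceded by a colon and a digit, and is not followed by a colon or another digit.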
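A minimal Python sketch of how such a lookaround pattern might be applied with re.sub is shown below. The helper name pad_hours and the constant name are hypothetical, the inner groups are dropped because they are not needed for the substitution, and the sketch assumes minutes are always written with two digits.

```python
import re

# Assumed Python form of the lookaround pattern; names here are hypothetical.
# Matches the last digit of a number that has no ':' after it and is not
# itself the final minute digit (which is preceded by ':' and a digit).
BARE_HOUR_END = re.compile(r'(?<!:\d)\d(?!:|\d)')

def pad_hours(text):
    # \g<0> re-inserts the matched digit, then ':00' is appended after it.
    return BARE_HOUR_END.sub(r'\g<0>:00', text)

print(pad_hours('Mon-Wed 6-7:30pm'))           # Mon-Wed 6:00-7:30pm
print(pad_hours('Tuesday/Thurs 5:00 - 6:30'))  # Tuesday/Thurs 5:00 - 6:30
print(pad_hours('M/T/W 3:30 -7'))              # M/T/W 3:30 -7:00
print(pad_hours('F 4-5'))                      # F 4:00-5:00
```

Under the two-digit-minutes assumption this also handles two-digit hours, since only the final digit of a bare number matches; a time written like 7:5 would be padded incorrectly.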

I think it is much easier to find the incomplete times after replacing the complete ones with a placeholder. We can then fix the incomplete times and finally substitute the placeholders back with the original times.

Here is a simple implementation; you can tweak it to match your needs.

import re

texts = ["Mon-Wed 6-7:30pm",
"Tuesday/Thurs 5:00 - 6:30",
"M/T/W 3:30 -7",
"F 4-5",]

def get_placeholder_replacer(replaced_strings):
    def replace_with_placeholder(x):
        replaced_strings.append(x[0])
        return "{}"
    return replace_with_placeholder


ptrn_correct_time = re.compile(r"\d+:\d+")
ptrn_incorrect_time = re.compile(r"\d{1,2}")

for text in texts:
    replaced_strings = []
    placeholder_replacer = get_placeholder_replacer(replaced_strings)
    new_text = ptrn_correct_time.sub(placeholder_replacer,text)
    new_text = ptrn_incorrect_time.sub(lambda x: "{}:00".format(x[0]), new_text)

    print(new_text.format(*replaced_strings))

## Output
# Mon-Wed 6:00-7:30pm
# Tuesday/Thurs 5:00 - 6:30
# M/T/W 3:30 -7:00
# F 4:00-5:00
