
Python regex 'negative' pattern matching

I am working on a large batch of text strings, trying to match date-times and convert them to MM-DD-YYYY format using the strptime() function.

However, some 5-digit serial numbers appear in the texts (e.g., 90481) and mislead my .findall() call into treating them as date-times. How can I exclude them by adding a ^() type of condition?

What they have in common is that they are all 5 digits long, so I tried ^(?!\d{5}), but it didn't turn out well. What's the best way to tackle this set of numbers?
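A minimal sketch of the symptom, assuming a made-up sample string and a simplified two-alternative pattern (neither is the actual .findall() pattern from the question). It suggests why the attempt falls short: ^ pins the negative lookahead to the very start of the string, so the check never runs at the position where the serial number sits.

```python
import re

# Hypothetical sample text; 90481 is the serial number quoted in the question.
text = "Serial 90481 was logged on 05/10/2001."

# Assumed, simplified pattern: a slash-separated date or a bare 4-digit year.
naive = r"\d{1,2}/\d{1,2}/\d{2,4}|\d{4}"
naive_matches = re.findall(naive, text)
print(naive_matches)  # ['9048', '05/10/2001'] -> '9048' comes from inside 90481

# With ^ in front, the whole pattern can only match at position 0 of the string,
# so the lookahead is tested there and nowhere else.
anchored = r"^(?!\d{5})(?:\d{1,2}/\d{1,2}/\d{2,4}|\d{4})"
anchored_matches = re.findall(anchored, text)
print(anchored_matches)  # [] -> the real date is lost as well
```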

Thank you.

Note 1: I have read this post, but can't seem to get it.

Note 2: About the date formats, which someone asked about in the comment section.

There are many date formats in the data frame I am working on, for example:

 05/10/2001; 05/10/01; 5/10/09; 6/2/01
 May-10-2001; May 10, 2010; March 25, 2001; Mar. 25, 2001; Mar 25 2001;
 25 Mar 2001; 25 March 2001; 25 Mar. 2001; 25 March, 2001
 Mar 25th, 2001; Mar 25th, 2001; Mar 12nd, 2001
 Feb 2001; Sep 2001; Oct 2001
 5/2001; 11/2001
 2001; 2015

So I have a rather long .findall(r' ') call, but the main point is to keep those 5-digit serial numbers from being selected.

Sincerely,

You could use \b in your regex to prevent a match from being found partway through a number with more digits. Place one at the start and one at the end, and make sure they are not included in the scope of the | (OR) operation by wrapping the rest in a non-capturing group.

I removed some months to keep it short:

\b(?:\d{1,2}\/\d{1,2}\/\d{2,4}|(?:Jan|Feb|Mar|Apr|   |Nov|Dec)[a-z]*-\d{2}-\d{2,4})\b
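As a rough sketch, the pattern above could be used with Python's re module as follows. Several things here are assumptions made for illustration and are not part of the answer: the month list is filled in with the standard three-letter abbreviations, a bare \d{4} alternative is added to cover the year-only values from the question, the sample text is invented, and the helper to_mm_dd_yyyy with its short FORMATS list (month-first dates only) is hypothetical.

```python
import re
from datetime import datetime

# Assumption: standard English abbreviations fill the months the answer left out.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# \b sits outside the non-capturing group, so it applies to every alternative.
# (Escaping "/" as "\/" is optional in Python, so it is written plainly here.)
DATE_RE = re.compile(
    r"\b(?:"
    r"\d{1,2}/\d{1,2}/\d{2,4}"                   # 05/10/2001, 5/10/09
    r"|(?:" + MONTHS + r")[a-z]*-\d{2}-\d{2,4}"  # May-10-2001
    r"|\d{4}"                                    # assumed extra: bare year
    r")\b"
)

# Hypothetical sample text containing the serial number from the question.
text = "Serial 90481: seen 05/10/2001, again May-10-2001, and in 2015."
matches = DATE_RE.findall(text)
print(matches)  # ['05/10/2001', 'May-10-2001', '2015'] -> 90481 is skipped

# Hypothetical, deliberately incomplete list of strptime formats to try in order.
FORMATS = ("%m/%d/%Y", "%m/%d/%y", "%b-%d-%Y", "%B-%d-%Y")

def to_mm_dd_yyyy(raw):
    """Return raw as MM-DD-YYYY, or None if no format in FORMATS fits."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%m-%d-%Y")
        except ValueError:
            continue
    return None

converted = [to_mm_dd_yyyy(m) for m in matches]
print(converted)  # ['05-10-2001', '05-10-2001', None]
```

The bare year comes back as None because no month or day is available for it; how such partial dates should be filled in is left open here.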
