Python Regular expression Lookahead overshooting pattern

I'm trying to pull out the data contained in an FTP LIST response.

I'm using regex within Python 2.7.

test = "-rw-r--r--   1 owner    group        75148624 Jan  6  2015 somename.csv-rw-r--r--   1 owner    group       223259072 Feb 26  2015 somename.csv-rw-r--r--   1 owner    group         4041411 Jun  5  2015 somename-adjusted.csv-rw-r--r--   1 owner    group         2879228 May 13  2015 somename.csv-rw-r--r--   1 owner    group        11832668 Feb 13  2015 somename.csv-rw-r--r--   1 owner    group         1510522 Feb 19  2015 somename.csv-rw-r--r--   1 owner    group         2826664 Feb 25  2015 somename.csv-rw-r--r--   1 owner    group          582985 Feb 26  2015 somename.csv-rw-r--r--   1 owner    group          212427 Feb 26  2015 somename.csv-rw-r--r--   1 owner    group         3015592 Feb 27  2015 somename.csv-rw-r--r--   1 owner    group          103576 Feb 27  2015    somename-corrected.csv"

(now without code formatting so you can see it without scrolling)

test = "-rw-r--r-- 1 owner group 75148624 Jan 6 2015 somename.csv-rw-r--r-- 1 owner group 223259072 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 4041411 Jun 5 2015 somename-adjusted.csv-rw-r--r-- 1 owner group 2879228 May 13 2015 somename.csv-rw-r--r-- 1 owner group 11832668 Feb 13 2015 somename.csv-rw-r--r-- 1 owner group 1510522 Feb 19 2015 somename.csv-rw-r--r-- 1 owner group 2826664 Feb 25 2015 somename.csv-rw-r--r-- 1 owner group 582985 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 212427 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 3015592 Feb 27 2015 somename.csv-rw-r--r-- 1 owner group 103576 Feb 27 2015 somename-corrected.csv"

I've tried various incarnations of the following:

from re import compile
ftp_list_re = compile('(?P<permissions>[d-][rwx-]{9})[\s]{1,20}'
                      '(?P<links>[0-9]{1,8})[\s]{1,20}'
                      '(?P<owner>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
                      '(?P<group>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
                      '(?P<size>[0-9]{1,16})[\s]{1,20}'
                      '(?P<month>[A-Za-z]{0,3})[\s]{1,20}'
                      '(?P<date>[0-9]{1,2})[\s]{1,20}'
                      '(?P<timeyear>[0-9:]{4,5})[\s]{1,20}'
                      '(?P<filename>[\s\w\.\-]+)(?=[drwx\-]{10})')

with the last line as

'(?P<filename>.+)(?=[drwx\-]{10})')

'(?P<filename>.+(?=[drwx\-]{10}))')

and originally,

'(?P<filename>[\s\w\.\-]+(?=[drwx\-]{10}|$))') 

so I can capture the last entry,

but regardless, I keep getting the following output:

ftp_list_re.findall(test)

[('-rw-r--r--',
  '1',
  'owner',
  'group',
  '75148624',
  'Jan',
  '6',
  '2015',
  'somename.csv-rw-r--r--   1 owner    group       223259072 Feb 26  2015     somename.csv-rw-r--r--   1 owner    group         4041411 Jun  5  2015 somename-adjusted.csv-rw-r--r--   1 owner    group         2879228 May 13  2015 somename.csv-rw-r--r--   1 owner    group        11832668 Feb 13  2015 somename.csv-rw-r--r--   1 owner    group         1510522 Feb 19  2015 somename.csv-rw-r--r--   1 owner    group         2826664 Feb 25  2015 somename.csv-rw-r--r--   1 owner    group          582985 Feb 26  2015 somename.csv-rw-r--r--   1 owner    group          212427 Feb 26  2015 somename.csv-rw-r--r--   1 owner    group         3015592 Feb 27  2015 somename.csv')]

What am I doing wrong?

You should make the sub-pattern before the lookahead non-greedy. Your regex can also be shortened a bit, like this:

(?P<permissions>[d-][rwx-]{9})\s{1,20}(?P<links>\d{1,8})\s{1,20}(?P<owner>[\w-]{1,16})\s{1,20}(?P<group>[\w-]{1,16})\s{1,20}(?P<size>\d{1,16})\s{1,20}(?P<month>[A-Za-z]{0,3})\s{1,20}(?P<date>\d{1,2})\s{1,20}(?P<timeyear>[\d:]{4,5})\s{1,20}(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)

Or using compile:

from re import compile

ftp_list_re = compile(r'(?P<permissions>[d-][rwx-]{9})\s{1,20}'
   r'(?P<links>\d{1,8})\s{1,20}'
   r'(?P<owner>[\w-]{1,16})\s{1,20}'
   r'(?P<group>[\w-]{1,16})\s{1,20}'
   r'(?P<size>\d{1,16})\s{1,20}'
   r'(?P<month>[A-Za-z]{0,3})\s{1,20}'
   r'(?P<date>\d{1,2})\s{1,20}'
   r'(?P<timeyear>[\d:]{4,5})\s{1,20}'
   r'(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)')

RegEx Demo

Code:

import re
p = re.compile(r'(?P<permissions>[d-][rwx-]{9})\s{1,20}(?P<links>\d{1,8})\s{1,20}(?P<owner>[\w-]{1,16})\s{1,20}(?P<group>[\w-]{1,16})\s{1,20}(?P<size>[0-9]{1,16})\s{1,20}(?P<month>[A-Za-z]{0,3})\s{1,20}(?P<date>[0-9]{1,2})\s{1,20}(?P<timeyear>[\d:]{4,5})\s{1,20}(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)')
test_str = u"-rw-r--r--   1 owner    group        75148624 Jan  6  2015 somename.csv-rw-r--r--   1 owner    group       223259072 Feb 26  2015 somename.csv-rw-r--r--   1 owner    group         4041411 Jun  5  2015 somename-adjusted.csv-rw-r--r--   1 owner    group         2879228 May 13  2015 somename.csv-rw-r--r--   1 owner    group        11832668 Feb 13  2015 somename.csv-rw-r--r--   1 owner    group         1510522 Feb 19  2015 somename.csv-rw-r--r--   1 owner    group         2826664 Feb 25  2015 somename.csv-rw-r--r--   1 owner    group          582985 Feb 26  2015 somename.csv-rw-r--r--   1 owner    group          212427 Feb 26  2015 somename.csv-rw-r--r--   1 owner    group         3015592 Feb 27  2015 somename.csv-rw-r--r--   1 owner    group          103576 Feb 27  2015 somename-corrected.csv"

re.findall(p, test_str)
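Because the pattern uses named groups, the matches can also be read as dictionaries instead of positional tuples. The following is a minimal sketch, assuming Python 3; the names `FTP_LIST_RE`, `sample`, and `entries` are illustrative and not part of the original answer, and `sample` is a shortened three-entry version of the listing.

```python
import re

# Same fields as the pattern above, split over several raw strings for readability.
FTP_LIST_RE = re.compile(
    r'(?P<permissions>[d-][rwx-]{9})\s{1,20}'
    r'(?P<links>\d{1,8})\s{1,20}'
    r'(?P<owner>[\w-]{1,16})\s{1,20}'
    r'(?P<group>[\w-]{1,16})\s{1,20}'
    r'(?P<size>\d{1,16})\s{1,20}'
    r'(?P<month>[A-Za-z]{0,3})\s{1,20}'
    r'(?P<date>\d{1,2})\s{1,20}'
    r'(?P<timeyear>[\d:]{4,5})\s{1,20}'
    r'(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)'
)

# Shortened sample: three entries glued together without line breaks.
sample = (
    "-rw-r--r--   1 owner    group        75148624 Jan  6  2015 somename.csv"
    "-rw-r--r--   1 owner    group         4041411 Jun  5  2015 somename-adjusted.csv"
    "-rw-r--r--   1 owner    group          103576 Feb 27  2015 somename-corrected.csv"
)

# finditer yields match objects; groupdict() maps each group name to its text.
entries = [m.groupdict() for m in FTP_LIST_RE.finditer(sample)]

for entry in entries:
    print(entry['filename'], entry['size'])
```

Each dictionary has one key per named group, so `entry['filename']` can be used where `findall` would require remembering that the filename is the ninth tuple element.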

Regular expression quantifiers are by default "greedy", which means that they will "eat" as much as possible.

[\s\w\.\-]+

means: match at least one, AND AS MANY AS POSSIBLE, whitespace, word, dot, or dash characters. The lookahead stops it from consuming the entire input (more precisely, the regex engine consumes the entire input and then backtracks as needed), so it swallows every file entry except the last, which the lookahead insists must be left over.

Adding a ? after a quantifier (*?, +?, ??, and so on) makes the quantifier "lazy" or "reluctant". This changes the meaning of "+" from "match at least one and as many as possible" to "match at least one and no more than necessary".

Therefore changing that last + to +? should fix your problem.

The problem wasn't with the lookahead, which works just fine, but with the last subexpression before it.
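The difference can be reproduced on a much smaller input. The sketch below assumes Python 3 and uses a made-up three-entry string, reduced to a permission spec and a name; `listing`, `greedy`, and `lazy` are illustrative names. The two patterns are identical except for the quantifier on the name group.

```python
import re

# Three glued-together entries, reduced to "<permissions> <name>" for brevity.
listing = "-rw-r--r-- a.csv-rw-r--r-- b.csv-rw-r--r-- c.csv"

# Identical patterns except for the quantifier on the name group.
greedy = re.compile(r'[d-][rwx-]{9}\s(?P<name>[\s\w.-]+)(?=[drwx-]{10})')
lazy = re.compile(r'[d-][rwx-]{9}\s(?P<name>[\s\w.-]+?)(?=[drwx-]{10})')

greedy_names = [m.group('name') for m in greedy.finditer(listing)]
lazy_names = [m.group('name') for m in lazy.finditer(listing)]

# Greedy: runs to the end, then backtracks to the LAST permission spec.
print(greedy_names)  # ['a.csv-rw-r--r-- b.csv']
# Lazy: stops at the FIRST position where a permission spec follows.
print(lazy_names)    # ['a.csv', 'b.csv']
```

In both cases the final entry (`c.csv`) is missing, because nothing follows it that could satisfy the lookahead.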

EDIT:

Even with this change, your regular expression will not parse the last file entry. This is because the regular expression INSISTS that a permission spec must follow the filename. To fix this, we must let the lookahead also succeed when no permission spec follows, that is, at the end of the string (every entry BUT the last still has to be followed by one). Making the following change will fix that:

ftp_list_re = compile('(?P<permissions>[d-][rwx-]{9})[\s]{1,20}'
                      '(?P<links>[0-9]{1,8})[\s]{1,20}'
                      '(?P<owner>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
                      '(?P<group>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
                      '(?P<size>[0-9]{1,16})[\s]{1,20}'
                      '(?P<month>[A-Za-z]{0,3})[\s]{1,20}'
                      '(?P<date>[0-9]{1,2})[\s]{1,20}'
                      '(?P<timeyear>[0-9:]{4,5})[\s]{1,20}'
                      '(?P<filename>[\s\w\.\-]+?)(?=(?:(?:[drwx\-]{10})|$))')

What I have done here (besides making that last + lazy) is to make the lookahead check two possibilities: either a permission specification OR the end of the string. The ?: markers prevent those parentheses from capturing (otherwise you would end up with undesired extra data in your matches).
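The effect of `?:` is easiest to see with `findall`, which returns one tuple element per capturing group. This is a small sketch, assuming Python 3 and a made-up two-entry string; the pattern is trimmed to two groups, and the names `non_capturing` and `capturing` are illustrative.

```python
import re

# Two glued-together entries, reduced to "<permissions> <name>".
listing = "-rw-r--r-- a.csv-rw-r--r-- b.csv"

# Lookahead alternatives wrapped in a non-capturing group (?:...).
non_capturing = re.compile(r'([d-][rwx-]{9})\s([\s\w.-]+?)(?=(?:[drwx-]{10}|$))')
# Same pattern, but the parentheses inside the lookahead capture.
capturing = re.compile(r'([d-][rwx-]{9})\s([\s\w.-]+?)(?=([drwx-]{10}|$))')

without_extra = non_capturing.findall(listing)
with_extra = capturing.findall(listing)

print(without_extra)  # [('-rw-r--r--', 'a.csv'), ('-rw-r--r--', 'b.csv')]
print(with_extra)     # [('-rw-r--r--', 'a.csv', '-rw-r--r--'), ('-rw-r--r--', 'b.csv', '')]
```

With the capturing variant, every tuple gains a third element holding whatever the lookahead saw: the next entry's permission spec, or an empty string at the end of the input.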

Fixed your last line; the filename group was not working. See the fixed regex and the demo below:

(?P<permissions>[d-][rwx-]{9})[\s]{1,20}
                      (?P<links>[0-9]{1,8})[\s]{1,20}
                      (?P<owner>[0-9A-Za-z_-]{1,16})[\s]{1,20}
                      (?P<group>[0-9A-Za-z_-]{1,16})[\s]{1,20}
                      (?P<size>[0-9]{1,16})[\s]{1,20}
                      (?P<month>[A-Za-z]{0,3})[\s]{1,20}
                      (?P<date>[0-9]{1,2})[\s]{1,20}
                      (?P<timeyear>[0-9:]{4,5})[\s]{1,20}
                      (?P<filename>[\w\-]+\.\w+)

Demo here:
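As a local check that does not depend on an online demo, a pattern laid out across several lines like this can be compiled in Python with `re.VERBOSE`, which ignores whitespace and line breaks outside character classes. The sketch below assumes Python 3, writes the dot before the extension as `\.`, and uses a shortened two-entry sample; `FTP_LIST_VERBOSE_RE`, `sample`, and `names` are illustrative names.

```python
import re

# re.VERBOSE ignores the indentation and line breaks and allows comments.
FTP_LIST_VERBOSE_RE = re.compile(r'''
    (?P<permissions>[d-][rwx-]{9})[\s]{1,20}
    (?P<links>[0-9]{1,8})[\s]{1,20}
    (?P<owner>[0-9A-Za-z_-]{1,16})[\s]{1,20}
    (?P<group>[0-9A-Za-z_-]{1,16})[\s]{1,20}
    (?P<size>[0-9]{1,16})[\s]{1,20}
    (?P<month>[A-Za-z]{0,3})[\s]{1,20}
    (?P<date>[0-9]{1,2})[\s]{1,20}
    (?P<timeyear>[0-9:]{4,5})[\s]{1,20}
    (?P<filename>[\w\-]+\.\w+)   # name, one literal dot, extension
''', re.VERBOSE)

# Shortened sample: two entries glued together without line breaks.
sample = (
    "-rw-r--r--   1 owner    group        75148624 Jan  6  2015 somename.csv"
    "-rw-r--r--   1 owner    group         4041411 Jun  5  2015 somename-adjusted.csv"
)

names = [m.group('filename') for m in FTP_LIST_VERBOSE_RE.finditer(sample)]
print(names)  # ['somename.csv', 'somename-adjusted.csv']
```

This variant needs no lookahead because the filename group stops at the first character after the extension that is not a word character. The trade-off is that it only handles names of the form name.extension: a name containing spaces or no dot will not match, and a following entry that starts with `d` (a directory) would be absorbed into the extension.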

With the PyPI regex module, which allows splitting on an empty match, you can do the same thing more simply, without having to describe every field:

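import regex

# `test` is the listing string from the question.
fields = ('permissions', 'links', 'owner', 'group', 'size', 'month', 'day', 'year', 'filename')
# Zero-width pattern: matches at the start of each permission string.
p = regex.compile(r'(?=[d-](?:[r-][w-][x-]){3})', regex.V1)
# maxsplit is len(fields) - 1, so a filename containing spaces stays in one piece.
res = [dict(zip(fields, x.split(None, len(fields) - 1))) for x in p.split(test)[1:]]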
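On Python 3.7 and later, the standard library's `re.split` also splits on patterns that can match an empty string, so the same idea works without a third-party module. The following is a sketch under that assumption (it does not apply to the Python 2.7 mentioned in the question); `FIELDS`, `ENTRY_START_RE`, `parse_listing`, and the shortened `listing` are illustrative names.

```python
import re

FIELDS = ('permissions', 'links', 'owner', 'group', 'size',
          'month', 'day', 'year', 'filename')

# Zero-width lookahead: matches at each position where a permission string starts.
ENTRY_START_RE = re.compile(r'(?=[d-](?:[r-][w-][x-]){3})')


def parse_listing(listing):
    """Split a glued-together LIST string into one dict per entry."""
    # Splitting on an empty match requires Python 3.7+; empty chunks are dropped.
    chunks = [c for c in ENTRY_START_RE.split(listing) if c.strip()]
    # Split on whitespace at most len(FIELDS) - 1 times so the filename stays whole.
    return [dict(zip(FIELDS, chunk.strip().split(None, len(FIELDS) - 1)))
            for chunk in chunks]


# Shortened sample: two entries glued together without line breaks.
listing = (
    "-rw-r--r--   1 owner    group        75148624 Jan  6  2015 somename.csv"
    "-rw-r--r--   1 owner    group         4041411 Jun  5  2015 somename-adjusted.csv"
)

result = parse_listing(listing)
for entry in result:
    print(entry['filename'], entry['size'])
```

Like the `regex` version, this relies on the permission string never occurring inside a filename, and the `year` key holds a time such as `12:30` for entries whose listing shows a time instead of a year.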
