
Python using a regex to find start position of a substring

I need to find the position of a substring within a string.

The substring is the characters ",0*" followed by two characters that are each in [0-9] or [A-F], for example:

kdjrnnj,0*B3;,w0l44
       ^^^^^
qui8ecc),0*21qxxcd4))
        ^^^^^

The substring is always exactly 5 characters in length. There is always some (unknown) number of characters before the substring. There may or may not be characters after the substring.

I'd like to use re.something to find the starting position of my substring within the string. My regex knowledge is quite poor; if someone could tell me how to do this, you'd save me hours of hacking around.

Thanks

Use the match object's start() method:

>>> import re
>>> r = re.compile(r',0\*[0-9A-F]{2}')
>>> m = r.search("kdjrnnj,0*B3;,w0l44")
>>> if m: print(m.start())
...
7
>>> m = r.search("qui8ecc),0*21qxxcd4))")
>>> if m: print(m.start())
...
8

Next step is to remove everything after the substring

You don't need the index for that; it can be done with a regex too:

>>> strs = "qui8ecc),0*21qxxcd4))"
>>> re.search(r'.*?,0\*[0-9A-F]{2}', strs).group()
'qui8ecc),0*21'

>>> r = re.compile(r'.*?,0\*[0-9A-F]{2}')
>>> m = r.search("kdjrnnj,0*B3;,w0l44")
>>> if m: print(m.group())
...
kdjrnnj,0*B3

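re.search is faster than re.sub here (pattern is not defined in this answer; it is presumably the compiled look-behind pattern used with re.sub() in the other answer):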

>>> strs = 'kdjrnnj,0*B3;,w0l44'
>>> %timeit r.search(strs).group()
100000 loops, best of 3: 1.42 us per loop
>>> %timeit pattern.sub('', strs)
100000 loops, best of 3: 2.79 us per loop

>>> strs = 'kdjrnnj,0*B3;,w0l44'*1000
>>> %timeit r.search(strs).group()
100000 loops, best of 3: 1.43 us per loop
>>> %timeit pattern.sub('', strs)
10000 loops, best of 3: 59.9 us per loop

>>> strs = 'kdjrnnj'*1000 + ',0*B3;,w0l44'
>>> %timeit r.search(strs).group()
1000 loops, best of 3: 260 us per loop
>>> %timeit pattern.sub('', strs)
1000 loops, best of 3: 410 us per loop
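
The %timeit lines above only work inside IPython. As a rough sketch, the same comparison can be run with the standard timeit module. The names search_re and sub_re are hypothetical, and their definitions are assumptions, since the transcript does not show how r and pattern were built. Absolute timings will vary by machine and Python version, so none are quoted here.

```python
import re
import timeit

# Assumed definitions: the transcript does not show how `r` and `pattern` were compiled.
search_re = re.compile(r'.*?,0\*[0-9A-F]{2}')    # match up to and including the marker
sub_re = re.compile(r'(?<=,0\*[0-9A-F]{2}).*')   # match everything after the marker

strs = 'kdjrnnj,0*B3;,w0l44'

# Both approaches should yield the same truncated string.
via_search = search_re.search(strs).group()
via_sub = sub_re.sub('', strs)
print(via_search, via_sub)

# Time each approach over the same number of calls.
search_time = timeit.timeit(lambda: search_re.search(strs).group(), number=10000)
sub_time = timeit.timeit(lambda: sub_re.sub('', strs), number=10000)
print('search: %.4fs  sub: %.4fs' % (search_time, sub_time))
```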

Python's re.search() returns a match object when a match is found; it includes a .start() method that gives you the position of the match:

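import re

pattern = re.compile(r',0\*[0-9A-F]{2}')

inputstring = 'kdjrnnj,0*B3;,w0l44'  # sample input from the question
match = pattern.search(inputstring)
if match:
    print(match.start())  # index where the substring begins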

Note the \* though; an asterisk (*) is a regular expression metacharacter, so it needs to be escaped with a backslash to match a literal *.

The [0-9A-F] defines a character class that matches any character in the two named ranges, and the {2} following the class limits it to matching exactly two characters.

Demo:

>>> import re
>>> pattern = re.compile(r',0\*[0-9A-F]{2}')
>>> match = pattern.search('kdjrnnj,0*B3;,w0l44')
>>> match.start()
7
>>> match.group()
',0*B3'
>>> match = pattern.search('qui8ecc),0*21qxxcd4))')
>>> match.start()
8
>>> match.group()
',0*21'
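
To see why the backslash matters, here is a small sketch comparing the escaped pattern with an unescaped variant on the first sample string from the question. The names sample, escaped and unescaped are illustrative only.

```python
import re

sample = 'kdjrnnj,0*B3;,w0l44'

# \* matches a literal asterisk.
escaped = re.compile(r',0\*[0-9A-F]{2}')
# Without the backslash, 0* means "zero or more '0' characters".
unescaped = re.compile(r',0*[0-9A-F]{2}')

print(escaped.search(sample).start())   # position of ",0*B3"
print(unescaped.search(sample))         # no match: the literal * is never consumed
```

The unescaped version finds nothing in this string because no part of the pattern can match the * character itself.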

If you need to remove everything after this string, use re.sub() instead:

pattern = re.compile(r'(?<=,0\*[0-9A-F]{2}).*')

newstring = pattern.sub('', oldstring)

This uses a look-behind assertion: it finds the position immediately preceded by your pattern, then matches everything that follows, and the re.sub() call then removes what was matched from the input string. Python's re module only accepts fixed-width look-behind patterns, which this one is (always five characters).

Demo:

>>> pattern = re.compile(r'(?<=,0\*[0-9A-F]{2}).*')
>>> pattern.sub('', 'kdjrnnj,0*B3;,w0l44')
'kdjrnnj,0*B3'
>>> pattern.sub('', 'qui8ecc),0*21qxxcd4))')
'qui8ecc),0*21'

Note how everything after ,0*B3 and ,0*21 is gone now.
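
As an alternative sketch that avoids the look-behind, the match object's end() method can be used to slice the string. The helper name truncate_after_marker is hypothetical, and returning the input unchanged when no marker is found is an assumption about the desired behaviour.

```python
import re

MARKER = re.compile(r',0\*[0-9A-F]{2}')

def truncate_after_marker(text):
    """Return text up to and including the first marker.

    If no marker is present, the text is returned unchanged (an assumed policy).
    """
    match = MARKER.search(text)
    if match is None:
        return text
    # match.end() is the index just past the last character of the match.
    return text[:match.end()]

print(truncate_after_marker('kdjrnnj,0*B3;,w0l44'))
print(truncate_after_marker('qui8ecc),0*21qxxcd4))'))
```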

The regex for this should be very simple: .*,0\*[0-9A-F]{2}

Use re.search():

re.search(r',0\*[0-9A-F][0-9A-F]', your_string).start()
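
Chaining .start() directly onto re.search() raises AttributeError when nothing matches, because re.search() returns None in that case. A minimal guarded sketch follows, assuming the asterisk is escaped. The function name marker_start and the -1 return value (mirroring str.find) are illustrative choices, not part of the answer.

```python
import re

def marker_start(your_string):
    """Return the index where the marker begins, or -1 if it is absent."""
    match = re.search(r',0\*[0-9A-F][0-9A-F]', your_string)
    # re.search() returns None when there is no match, so check before calling .start().
    return match.start() if match else -1

print(marker_start('kdjrnnj,0*B3;,w0l44'))
print(marker_start('nothing to find'))
```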
