Using regex to extract string position in Python

I'm trying to extract the position (index) of a substring using regex. I need to use regex because the string won't be exactly the same each time. I want to get the position of the substring (either the starting or the ending position), so I can take the 1,000 characters following that substring.

For example, if I had "while foreign currencies are traded frequently, very little money is made by most.", I want to find the position of "foreign currencies" so I can get all the words after it.

f5 is the text.

I've tried:

p = re.compile("((^\s*|\.\s*)foreign\s*(currency|currencies))?")
for m in p.finditer(f5):
    print m.start(), m.group()

to get the location. This gives me (0,0), even though I've checked that the regex picks up what I'm looking for in the text.

I've also tried:

location = re.search(r"((^\s*|\.\s*)foreign\s*(currency|currencies))?", f5)
print location

Output is <_sre.SRE_Match at 0x297d3328>

If I try

location.span() 

I get (0,0) again.

Basically, I want to convert <_sre.SRE_Match at 0x297d3328> into an integer that gives the location of the search term.

I've spent half a day searching for a solution. Thanks for any help.

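Your pattern includes everything before the word "foreign" (the (^\s*|\.\s*) prefix), so Python treats that as part of your match. If you want to discard it, simply remove it from your search string. Note also that the trailing ? makes the entire outer group optional, so the pattern can succeed by matching the empty string at position 0, which is why span() reports (0,0).

Try: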

 import re
 # s is the text to search (f5 in the question)
 p = re.compile(r'foreign\s+(currency|currencies)?')
 m = p.search(s)
 m.start()  # index where the match begins

This also works with finditer:

 for m in p.finditer(s):
     print(m.start())  # start index of each match
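To see where the (0,0) comes from, here is a minimal sketch that runs the question's pattern and a stricter one side by side. The sample text is the sentence from the question; the variable names (optional, empty_span, start, end) are just illustrative.

```python
import re

# Sample sentence from the question; f5 stands in for the asker's real text.
f5 = ("while foreign currencies are traded frequently, "
      "very little money is made by most.")

# The question's pattern: the trailing ? makes the whole group optional,
# so an empty match at index 0 succeeds before the regex looks any further.
optional = re.search(r"((^\s*|\.\s*)foreign\s*(currency|currencies))?", f5)
empty_span = optional.span()

# Without the optional wrapper, the match lands on the phrase itself.
m = re.search(r"foreign\s+(currency|currencies)", f5)
start, end = m.start(), m.end()

print(empty_span, start, end)
```

m.start() and m.end() are plain integers, so they can be used directly as slice indices on f5.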

In addition to the previous solutions/comments, if you want all the words after the match, you can just do something like:

>>> location = re.search(r".*foreign\s*currenc(y|ies)(.*)", f5)
>>> location.group(2)
' are traded frequently, very little money is made by most.'

The .group(2) part corresponds to the (.*) in the regexp.
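An alternative to capturing the tail with (.*) is to slice the text at the end of the match, which also makes the 1,000-character limit from the question easy to apply. The sketch below assumes a hypothetical helper named text_after (not part of the re module) and guards against re.search returning None when nothing matches.

```python
import re

def text_after(pattern, text, limit=1000):
    """Return up to `limit` characters following the first match, or None."""
    m = re.search(pattern, text)
    if m is None:  # re.search returns None when nothing matches
        return None
    return text[m.end():m.end() + limit]

f5 = ("while foreign currencies are traded frequently, "
      "very little money is made by most.")

tail = text_after(r"foreign\s+currenc(?:y|ies)", f5)
missing = text_after(r"domestic\s+bonds", f5)

print(repr(tail))
print(missing)
```

Unlike the (.*) capture, slicing is not affected by newlines in the text, since the dot is not involved in collecting the tail.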

I don't have much experience with Python, so I can't directly answer your question. But if you want the substring starting with the match, why don't you just match the rest of the string, or remove everything before the match?

Example 1:

Match foreign currenc(y|ies) followed by every other character in the string. I used the s modifier (re.DOTALL in Python) so that the dot matches newlines as well.

foreign\s+currenc(?:y|ies).*
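In Python, the s modifier corresponds to the re.DOTALL flag. A small sketch of the difference, using a made-up two-line sample text:

```python
import re

# Made-up sample with a newline after the phrase of interest.
text = "rates rose.\nforeign currency moved\non Tuesday."
pattern = r"foreign\s+currenc(?:y|ies).*"

# By default the dot stops at the first newline.
without_flag = re.search(pattern, text).group()

# With re.DOTALL the dot also matches newlines, so the match runs to the end.
with_flag = re.search(pattern, text, flags=re.DOTALL).group()

print(repr(without_flag))
print(repr(with_flag))
```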

Example 2:

Replace the match of this expression with an empty string. It lazily matches everything up to the point where the lookahead for foreign currenc(y|ies) succeeds.

.*?(?=foreign\s+currenc(?:y|ies))
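In Python this replacement could be done with re.sub. The sketch below uses a made-up sample text, passes count=1 so only the leading part is removed, and adds re.DOTALL on the assumption that the text before the phrase may span several lines.

```python
import re

# Made-up sample; the phrase of interest is on the second line.
text = "rates rose.\nwhile foreign currencies are traded frequently."

# Lazily consume everything up to (but not including) the phrase, then drop it.
trimmed = re.sub(
    r".*?(?=foreign\s+currenc(?:y|ies))",
    "",
    text,
    count=1,
    flags=re.DOTALL,
)

print(trimmed)
```

If the phrase does not occur at all, the pattern never matches and re.sub returns the text unchanged.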

Note: I changed (currency|currencies) to currenc(?:y|ies) because it is slightly more efficient.
