
Python Regex to match a string as a pattern and return number

I have some lines that represent some data in a text file. They are all of the following format:

s = 'TheBears      SUCCESS Number of wins : 14'

They all begin with the name, then whitespace, then the text 'SUCCESS Number of wins : ', and finally the number of wins, n1. There are multiple strings, each with a different name and value. I am trying to write a program that can parse any of these strings and return the name of the dataset and the numerical value at the end of the string. I am trying to use regular expressions to do this and I have come up with the following:

import re
def winnumbers(s):
    pattern = re.compile(r"""(?P<name>.*?)     #starting name
                             \s*SUCCESS        #whitespace and success
                             \s*Number\s*of\s*wins  #whitespace and strings
                             \s*\:\s*(?P<n1>.*?)""",re.VERBOSE)
    match = pattern.match(s)

    name = match.group("name")
    n1 = match.group("n1")

    return (name, n1)

So far, my program can return the name, but the trouble comes after that. All the lines contain the text "SUCCESS Number of wins : ", so my thinking was to find a way to match this text. But I realize that my method of matching an exact substring isn't correct right now. Is there any way to match a whole substring as part of the pattern? I have been reading quite a bit about regular expressions lately but haven't found anything like this. I'm still really new to programming and I appreciate any assistance.

Eventually, I will use float() to return n1 as a number, but I left that out because the pattern doesn't find the number in the first place right now, so the conversion would only raise an error.

Try this one out:

((\S+)\s+SUCCESS Number of wins : (\d+))

These are the results:

>>> regex = re.compile(r"((\S+)\s+SUCCESS Number of wins : (\d+))")
>>> r = regex.search(string)
>>> r
<_sre.SRE_Match object at 0xc827cf478a56b350>
>>> regex.match(string)
<_sre.SRE_Match object at 0xc827cf478a56b228>

# List the groups found
>>> r.groups()
(u'TheBears SUCCESS Number of wins : 14', u'TheBears', u'14')

# List the named dictionary objects found
>>> r.groupdict()
{}

# Run findall
>>> regex.findall(string)
[(u'TheBears SUCCESS Number of wins : 14', u'TheBears', u'14')]
# So you can do this for the name and number:
>>> fullstring, name, number = r.groups()

If you don't need the full string, just remove the surrounding parentheses.
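For comparison, here is a minimal sketch of the same idea applied to the verbose, named-group pattern from the question. In the original pattern the final group `(?P<n1>.*?)` is lazy and nothing follows it, so it is satisfied by the empty string; matching digits explicitly with `\d+` avoids that. The sketch assumes the name contains no whitespace and the number of wins is a non-negative integer, and returning `None` for non-matching lines is a choice made here, not something the question specifies.

```python
import re

# The question's verbose pattern, with the trailing lazy group replaced
# by an explicit digit match. Assumes names contain no whitespace.
PATTERN = re.compile(r"""(?P<name>\S+)          # starting name
                         \s+SUCCESS             # whitespace and SUCCESS
                         \s+Number\s+of\s+wins  # literal words, flexible spacing
                         \s*:\s*(?P<n1>\d+)     # colon, then the number of wins
                      """, re.VERBOSE)

def winnumbers(s):
    match = PATTERN.match(s)
    if match is None:
        # Line does not follow the expected format.
        return None
    return match.group("name"), int(match.group("n1"))

print(winnumbers('TheBears      SUCCESS Number of wins : 14'))
```

Swap `int` for `float` if the values may not be whole numbers; the pattern would then also need to accept a decimal point.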

I believe that there is no actual need to use a regex here. You can use the following code if it is acceptable to you (I am posting it so that you have another option):

dict((line[:line.lower().index('success')].strip(), line[line.lower().index('wins :') + len('wins :'):].strip()) for line in text.split('\n') if 'success' in line.lower())

Or, if you are sure that all words are separated by single spaces:

output={}
for line in text.splitlines():  # iterate over lines, not characters
    if 'success' in line.lower():
        words = line.strip().split(' ')
        output[words[0]] = words[-1]

If the text in the middle is always constant, there is no need for a regular expression. The built-in string methods will be more efficient and easier to develop, debug and maintain. In this case, you can just use the built-in split() method to get the pieces, and then clean the two pieces as appropriate:

>>> def winnumber(s):
...     parts = s.split('SUCCESS Number of wins : ')
...     return (parts[0].strip(), int(parts[1]))
... 
>>> winnumber('TheBears      SUCCESS Number of wins : 14')
('TheBears', 14)

Note that I have output the number of wins as an integer (as presumably this will always be a whole number), but you can easily substitute float(), or any other conversion function, for int() if you desire.

Edit: Obviously this will only work for single lines; if you call the function with several lines at once it will raise an error. To process an entire file, I'd use map():

>>> list(map(winnumber, open(filename, 'r')))
[('TheBears', 14), ('OtherTeam', 6)]

Also, I'm not sure of your end use for this code, but you might find it easier to work with the outputs as a dictionary: 另外,我不确定该代码的最终用途,但是您可能会发现将输出用作字典更容易:

>>> dict(map(winnumber, open(filename, 'r')))
{'OtherTeam': 6, 'TheBears': 14}
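The map() calls above assume every line in the file has the expected format; a blank or unrelated line would make int() fail. A hedged sketch of a more tolerant variant follows. The names `MARKER` and `parse_wins` are made up for illustration, and the sample list stands in for an open file.

```python
MARKER = 'SUCCESS Number of wins : '

def winnumber(s):
    # Same split-based approach as above.
    parts = s.split(MARKER)
    return (parts[0].strip(), int(parts[1]))

def parse_wins(lines):
    # Build {name: wins}, skipping lines that lack the marker text.
    return dict(winnumber(line) for line in lines if MARKER in line)

# Stand-in for the lines of a file, including a blank line.
sample = [
    'TheBears      SUCCESS Number of wins : 14\n',
    '\n',
    'OtherTeam     SUCCESS Number of wins : 6\n',
]
print(parse_wins(sample))
```

With a real file, `with open(filename) as f: results = parse_wins(f)` would pass the lines in the same way and close the file afterwards.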
