Splitting a list with certain parameters in Python. Using re.findall

import re

def processFile(filename='Names.txt', encode='utf-8'):
    listOfPlayers = []
    listOfInfo = []
    count = 0
    with open(filename, 'r', encoding = encode) as f:
        for line in f.readlines():
            if count == 0:
                listOfInfo.append(line.strip())
                count += 1
            elif count == 1:
                listOfInfo.append(line.strip())
                listOfPlayers.append(listOfInfo)
                count -= 1
                listOfInfo = []
    return listOfPlayers

def splitStats(listOfPlayers):
    newList = []
    for item in (i[1] for i in listOfPlayers):
        m = re.findall('[A-Z][a-z]*', item)
        newList.append(m)
    print(newList)    

def main():
    lOP = processFile()
    splitStats(lOP)

if __name__ == '__main__':
    main()

I'm trying to look at some soccer stats. I took some data from a webpage and am trying to split each player's entry into their position, country, where they transferred from, where they transferred to, and the fee that was paid for them.

My Names.txt file looks like:

Donyell Malen
AttackerNetherlandsArsenalAjaxUndisclosed
Petr Cech
GoalkeeperCzech Rep.ArsenalChelsea14million
Scott Sinclair
MidfielderEnglandAston VillaManchester City3.4million

The listOfPlayers returned by my processFile is a list of lists, with the player's name at index zero and the rest of the information at index one, like this:

[['Donyell Malen', 'AttackerNetherlandsArsenalAjaxUndisclosed'], ['Petr Cech', 'GoalkeeperCzech Rep.ArsenalChelsea14million'], ['Scott Sinclair', 'MidfielderEnglandAston VillaManchester City3.4million'],

I'm trying to go through each item and split up the string at index 1. I found the re.findall() method, but I have searched the API docs for an hour and still don't have a clear picture of how to split on capital letters (although the code above does that) while keeping any two words separated by a space together as one string, i.e. "Aston Villa" should be kept together. I also need to know how to keep the fees, i.e. turn "3.4million" into "3.4 million".

I know this is a pretty long question, but I wanted to give a good overview to see whether I'm going about this all wrong or whether I'm on the right track and just need help with re.findall(). Thanks!

You could use the following pattern:

r"(?:[A-Z]|[0-9]+(?:\.[0-9]+)?)[a-z]*(?: [A-Z][a-z]*)*"

It's fairly complex because it handles all the special cases in one expression: a field starts with either a capital letter or a number (optionally with a decimal part), continues with lowercase letters, and may be followed by further capitalised words separated by single spaces. If you want to learn how to write such expressions, dig into the documentation for the re module: https://docs.python.org/2/library/re.html
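As a minimal sketch of how this pattern might be applied to the sample lines from the question: the escaped decimal point and the names `FIELD_PATTERN`, `split_fields` and `add_fee_space` are assumptions made for illustration, not part of the original code.

```python
import re

# Pattern from this answer; the decimal point is escaped (an assumption)
# so that it only matches a literal "." inside a number such as "3.4".
FIELD_PATTERN = re.compile(
    r"(?:[A-Z]|[0-9]+(?:\.[0-9]+)?)[a-z]*(?: [A-Z][a-z]*)*"
)

def split_fields(info):
    """Hypothetical helper: split one run-together stats line into fields."""
    return FIELD_PATTERN.findall(info)

def add_fee_space(field):
    """Hypothetical helper: insert a space between a number and its unit."""
    return re.sub(r"(?<=[0-9])(?=[a-z])", " ", field)

lines = [
    "AttackerNetherlandsArsenalAjaxUndisclosed",
    "GoalkeeperCzech Rep.ArsenalChelsea14million",
    "MidfielderEnglandAston VillaManchester City3.4million",
]

for line in lines:
    print([add_fee_space(field) for field in split_fields(line)])
# ['Attacker', 'Netherlands', 'Arsenal', 'Ajax', 'Undisclosed']
# ['Goalkeeper', 'Czech Rep', 'Arsenal', 'Chelsea', '14 million']
# ['Midfielder', 'England', 'Aston Villa', 'Manchester City', '3.4 million']
```

Note that the trailing period of "Czech Rep." is dropped, because the pattern has no branch that matches a period after a word.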

I think what you'll want to look into is a negative (and/or positive) "lookbehind" in your regex. I'm thinking of something like this:

([A-Z][a-z]*)((?<!\s)[A-Z][a-z\s]*(?<=\s)[A-Z][a-z]*)*

But I'm terrible at regex and can see at a glance that this is sloppy, so I look forward to someone correcting me :) Anyway, while I'm sure this can be done much better, the

(?<!\s)

represents a lookbehind that succeeds whenever the previous character is NOT a whitespace character, and similarly:

(?<=\s)

represents a lookbehind that succeeds whenever the previous character IS a whitespace character.
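To make the two assertions concrete, here is a small sketch on part of the sample line. The variable names are only illustrative, and the two patterns are simplified demonstrations rather than the full expression above.

```python
import re

text = "Aston VillaManchester City"

# (?<=\s): keep only capitalised words that directly follow whitespace.
after_space = re.findall(r"(?<=\s)[A-Z][a-z]*", text)

# (?<!\s): keep only capitalised words NOT directly preceded by whitespace.
# A word at the very start of the string also qualifies, because there is
# no previous character at all.
not_after_space = re.findall(r"(?<!\s)[A-Z][a-z]*", text)

print(after_space)      # ['Villa', 'City']
print(not_after_space)  # ['Aston', 'Manchester']
```

Neither assertion consumes any characters; it only checks what sits immediately before the current position.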

I went to https://regex101.com/ and entered the regex I gave at the top as the pattern, with the line:

MidfielderEnglandAston VillaManchester City3.4million

as the text to match against, and it looked pretty promising. I didn't address the digits you need to account for in the '3.4million' field, but I hope this is helpful; I can't spend any more time digging in :/
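One possible way to carry the lookaround idea through to the digits is to split at the boundaries instead of matching the fields. This is only a sketch under stated assumptions: `BOUNDARY` and `split_on_boundaries` are hypothetical names, and splitting on an empty match with re.split needs Python 3.7 or later.

```python
import re

# Split at zero-width positions where a new field appears to start:
#   - a lowercase letter or "." followed by a capital letter, or
#   - a lowercase letter followed by a digit (the start of the fee).
# A space before a capital is not a boundary, so "Aston Villa" stays whole.
BOUNDARY = re.compile(r"(?<=[a-z.])(?=[A-Z])|(?<=[a-z])(?=[0-9])")

def split_on_boundaries(info):
    """Hypothetical helper: split one stats line at field boundaries."""
    return BOUNDARY.split(info)

print(split_on_boundaries("MidfielderEnglandAston VillaManchester City3.4million"))
# ['Midfielder', 'England', 'Aston Villa', 'Manchester City', '3.4million']
print(split_on_boundaries("GoalkeeperCzech Rep.ArsenalChelsea14million"))
# ['Goalkeeper', 'Czech Rep.', 'Arsenal', 'Chelsea', '14million']
```

This keeps the period in "Czech Rep.", but it would wrongly split a name with an internal capital such as "McDonald", so it is worth checking against the real data.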

Best of luck! Regex is super fun and powerful, and I wish I knew more!
