
Splitting a list with certain parameters in Python. Using re.findall

import re

def processFile(filename='Names.txt', encode='utf-8'):
    # Names.txt alternates between a player-name line and an info line.
    listOfPlayers = []
    listOfInfo = []
    count = 0  # 0: expecting a name line, 1: expecting an info line
    with open(filename, 'r', encoding=encode) as f:
        for line in f:
            if count == 0:
                listOfInfo.append(line.strip())
                count += 1
            elif count == 1:
                listOfInfo.append(line.strip())
                # Both lines collected: store the [name, info] pair and reset.
                listOfPlayers.append(listOfInfo)
                count -= 1
                listOfInfo = []
    return listOfPlayers

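def splitStats(listOfPlayers):
    newList = []
    # Look only at the info string (index 1) of each [name, info] pair.
    for item in (i[1] for i in listOfPlayers):
        # Matches one capital letter followed by any lowercase letters.
        m = re.findall(r'[A-Z][a-z]*', item)
        newList.append(m)
    print(newList)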

def main():
    lOP = processFile()
    splitStats(lOP)

if __name__ == '__main__':
    main()

I'm looking at some soccer stats that I took from a webpage, and I'm trying to split each player's record into their position, country, where they transferred from, where they transferred to, and the money that was paid for them.

My Names.txt file looks like:

Donyell Malen
AttackerNetherlandsArsenalAjaxUndisclosed
Petr Cech
GoalkeeperCzech Rep.ArsenalChelsea14million
Scott Sinclair
MidfielderEnglandAston VillaManchester City3.4million

My processFile function returns listOfPlayers, a list of lists, with the player's name at index 0 and the rest of the information at index 1, like this:

[['Donyell Malen', 'AttackerNetherlandsArsenalAjaxUndisclosed'], ['Petr Cech', 'GoalkeeperCzech Rep.ArsenalChelsea14million'], ['Scott Sinclair', 'MidfielderEnglandAston VillaManchester City3.4million']]

I'm trying to go through each item and split up the string at index 1. I found the re.findall() method, but after searching the API docs for an hour I still don't have a clear picture of how to do more than split on capital letters (the code above already does that). I need to keep any two words with a space between them as one string, i.e. "Aston Villa" should be kept together, and I need to turn fees such as "3.4million" into "3.4 million".

I know this is a pretty long question, but I wanted to give a good overview just to see if I was going about this all wrong or if I'm on the right track and just need help with the re.findall(). Thanks!

You could use the following pattern:

"(?:[A-Z]|[0-9]+(?:\.[0-9]+)?)[a-z]*(?: [A-Z][a-z]*)*"

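It's fairly complex because it handles all the special cases in one expression: each match starts with either a capital letter or a number (optionally with a decimal part), continues with lowercase letters, and then takes in any further capitalized words that follow a space. If you are interested in how to write such expressions, dig into the documentation for the re module: https://docs.python.org/2/library/re.html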
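To see what this answer's pattern does with the sample lines from the question, here is a minimal sketch. It writes the decimal point as `\.` so that it only matches a literal dot. The `split_stats` helper and the `FEE` pattern are hypothetical names added for illustration; they are one possible way to turn "3.4million" into "3.4 million", not part of the original answer.

```python
import re

# Pattern from this answer, with the decimal point written as \. (literal dot).
PATTERN = re.compile(r"(?:[A-Z]|[0-9]+(?:\.[0-9]+)?)[a-z]*(?: [A-Z][a-z]*)*")

# Hypothetical helper pattern: a number followed directly by a lowercase unit.
FEE = re.compile(r"([0-9]+(?:\.[0-9]+)?)([a-z]+)")


def split_stats(info):
    """Split one info line into fields and put a space inside a numeric fee."""
    fields = PATTERN.findall(info)
    fee = FEE.fullmatch(fields[-1]) if fields else None
    if fee:
        # e.g. "3.4million" -> "3.4 million"; "Undisclosed" is left alone.
        fields[-1] = f"{fee.group(1)} {fee.group(2)}"
    return fields


samples = [
    "AttackerNetherlandsArsenalAjaxUndisclosed",
    "GoalkeeperCzech Rep.ArsenalChelsea14million",
    "MidfielderEnglandAston VillaManchester City3.4million",
]
for sample in samples:
    print(split_stats(sample))
# ['Attacker', 'Netherlands', 'Arsenal', 'Ajax', 'Undisclosed']
# ['Goalkeeper', 'Czech Rep', 'Arsenal', 'Chelsea', '14 million']
# ['Midfielder', 'England', 'Aston Villa', 'Manchester City', '3.4 million']
```

Note that the trailing period of "Czech Rep." is dropped, because the pattern has no branch that matches a dot after letters.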

I think what you'll want to look into is a negative (and/or positive) "lookbehind" in your regex. I'm thinking of something like this:

([A-Z][a-z]*)((?<!\s)[A-Z][a-z\s]*(?<=\s)[A-Z][a-z]*)*

But I'm terrible at regex and can see at a glance that this is sloppy, so I look forward to someone correcting me :) Anyway, while I'm sure this can be done much better, the

(?<!\s)

is a negative lookbehind: it matches only where the previous character is NOT a whitespace character. Likewise,

(?<=\s)

is a positive lookbehind: it matches only where the previous character IS a whitespace character.
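The difference between the two lookbehinds is easiest to see in isolation. This small sketch applies each one to a fragment of the sample data; the variable names are only illustrative.

```python
import re

text = "Aston VillaManchester City"

# (?<=\s): capitalized words directly preceded by a whitespace character.
after_space = re.findall(r"(?<=\s)[A-Z][a-z]*", text)

# (?<!\s): capitalized words NOT preceded by whitespace.
# The very start of the string qualifies, since nothing precedes it.
not_after_space = re.findall(r"(?<!\s)[A-Z][a-z]*", text)

print(after_space)      # ['Villa', 'City']
print(not_after_space)  # ['Aston', 'Manchester']
```

A lookbehind only checks the preceding character; it does not consume it, so the space itself never appears in the matches.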

I went to https://regex101.com/ and tried the regex from the top of this answer against the line:

MidfielderEnglandAston VillaManchester City3.4million

and it was looking pretty promising. I didn't address the digits you need to account for in the '3.4million' attribute, but I hope this is helpful; I can't spend any more time digging in :/
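One way to take the lookbehind idea further and cover the digits is to split at zero-width boundaries instead of matching the fields. This is a sketch rather than the regex from this answer: `BOUNDARY` and `split_fields` are hypothetical names, it assumes Python 3.7 or later (where `re.split` can split on empty matches), and it assumes no field has a capital letter in the middle of a word.

```python
import re

# A boundary is a position with no width where either
#   - a lowercase letter or "." is followed by a capital letter, or
#   - a lowercase letter is followed by a digit.
# The "." inside "3.4" is not a boundary, because a digit follows it.
BOUNDARY = re.compile(r"(?<=[a-z.])(?=[A-Z])|(?<=[a-z])(?=[0-9])")


def split_fields(info):
    """Split an info line at the boundaries; spaces inside names are kept."""
    return BOUNDARY.split(info)


print(split_fields("MidfielderEnglandAston VillaManchester City3.4million"))
# ['Midfielder', 'England', 'Aston Villa', 'Manchester City', '3.4million']
print(split_fields("GoalkeeperCzech Rep.ArsenalChelsea14million"))
# ['Goalkeeper', 'Czech Rep.', 'Arsenal', 'Chelsea', '14million']
```

Because nothing is consumed at a boundary, "Czech Rep." keeps its period and two-word names stay together.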

Best of luck! Regex is super fun and powerful, and I wish I knew more!

