
Python regex text extraction

The text input looks something like this: West Team 4, Eastern 3\n

-------Update--------

The input is a txt file containing team names and scores, like football results. Each line holds two names with their scores, so the whole file looks something like this:

West Team 4, Eastern 5
Nott Team 2, Eastern 3
West wood 1, Eathan 2
West Team 4, Eas 5

I am using with open to read the file line by line, so there will be a \n at the end of each line.

I would like to extract each line of text into something like:

['West Team', 'Eastern']

What I currently have in mind is to use a regex:

result = re.sub("[\n^\s$\d]", "", text).split(",")

This code results in:

['WestTeam','Eastern']

I'm sure that my regex is not correct. I want to remove '\n' and any number, including the space in front of the number, but not the space in the middle of the name.

I am open to any suggestion that achieves this result; it doesn't necessarily have to use regex.
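
For reference, a small check of why the attempted pattern glues the words together. Inside a character class, \s matches every whitespace character (including the space inside the name), while ^ (when it is not the first character) and $ are treated as literal characters rather than anchors. The second input string below is a made-up example used only to show that:

```python
import re

text = "West Team 4, Eastern 3\n"

# \s inside [...] deletes *every* whitespace character, so the space
# inside "West Team" goes away together with the one before the score.
attempt = re.sub(r"[\n^\s$\d]", "", text).split(",")
print(attempt)  # ['WestTeam', 'Eastern']

# ^ (not in first position) and $ are ordinary characters inside a class,
# so they are removed from the input like any other listed character.
literal_check = re.sub(r"[\n^\s$\d]", "", "a^b$c 1")
print(literal_check)  # abc
```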

You can use a non-regex approach to keep any letters/spaces after splitting with a comma:

text = "West Team 4, Eastern 3\n"
print( ["".join(c for c in x if c.isalpha() or c.isspace()).strip() for x in text.split(',')]  )
# => ['West Team', 'Eastern']

Or a regex approach to remove any chars other than ASCII letters and spaces matched with the [^a-zA-Z\s]+ pattern:

import re
rx = re.compile(r'[^a-zA-Z\s]+')
print( [rx.sub("", x).strip() for x in text.split(',')]  )
# => ['West Team', 'Eastern']

Another similar solution can be used to extract one or more non-digit char chunks after an optional comma + whitespaces:

print(re.findall(r',?\s*(\D*[^\d\s])', text))
# => ['West Team', 'Eastern']

See the Python demo.

In case there are consecutive non-letter chunks, you can use:

import re
text = "West Team 4, Eastern 3\n, test 23 99 test"
rx = re.compile(r'[^\W\d_]+')
print( [" ".join(rx.findall(x)) for x in text.split(',')]  )

See the Python demo, which yields ['West Team', 'Eastern', 'test test']. The [^\W\d_]+ pattern matches one or more Unicode letters.
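
To illustrate the difference between the ASCII-only pattern and the Unicode-letter pattern, here is a minimal sketch. The sample line with accented team names is made up for this comparison and is not part of the question's data:

```python
import re

ascii_only = re.compile(r'[^a-zA-Z\s]+')    # deletes everything except ASCII letters and whitespace
unicode_letters = re.compile(r'[^\W\d_]+')  # matches runs of Unicode letters

# Hypothetical line "São Paulo 3, Köln 2", written with escapes so the
# accented letters are unambiguous single code points.
line = "S\u00e3o Paulo 3, K\u00f6ln 2\n"

ascii_result = [ascii_only.sub("", x).strip() for x in line.split(',')]
unicode_result = [" ".join(unicode_letters.findall(x)) for x in line.split(',')]

print(ascii_result)    # ['So Paulo', 'Kln']  (accented letters are lost)
print(unicode_result)  # ['São Paulo', 'Köln']
```

If the team names may contain non-ASCII letters, the second pattern is probably the safer choice.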

There are many ways to do this, but looking at your data, rstrip() works quite nicely:

s = 'West Team 4, Eastern 3\n'
lst = [x.rstrip('\n 0123456789') for x in s.split(', ')]
print(lst)

Or, perhaps better, take the digits from the string module:

from string import digits
s = 'West Team 4, Eastern 3\n'
lst = [x.rstrip(digits+'\n ') for x in s.split(', ')]
print(lst)

Both options print:

['West Team', 'Eastern']
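
Since the question reads the file line by line, the same rstrip() idea could be applied per line roughly as follows. This is only a sketch: io.StringIO stands in for the real file so the snippet runs on its own, and the helper name team_names is made up for illustration.

```python
import io
from string import digits

# Stand-in for the real file; with an actual file you would iterate over
# the object returned by open(...) in exactly the same way.
sample_file = io.StringIO(
    "West Team 4, Eastern 5\n"
    "Nott Team 2, Eastern 3\n"
    "West wood 1, Eathan 2\n"
    "West Team 4, Eas 5\n"
)

def team_names(line):
    """Return the names on one line, stripping trailing digits, spaces and the newline."""
    return [part.rstrip(digits + "\n ") for part in line.split(", ")]

all_names = [team_names(line) for line in sample_file]
print(all_names)
# [['West Team', 'Eastern'], ['Nott Team', 'Eastern'], ['West wood', 'Eathan'], ['West Team', 'Eas']]
```

Note that rstrip() removes every trailing digit and space, so a team name that itself ends in a number would lose that number as well.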

re.findall might also work well here:

import re

inp = "West Team 4, Eastern 3\n"
matches = re.findall(r'(\w+(?: \w+)*) \d+', inp)
print(matches)  # ['West Team', 'Eastern']

The split version, using re.split:

inp = "West Team 4, Eastern 3\n"
matches = [x for x in re.split(r'\s+\d+\s*,?\s*', inp) if x != '']
print(matches)  # ['West Team', 'Eastern']

You want to:

  • remove '\n' and
  • any number including the space in front of the number
  • but not the space in the middle of the name.

Functions to use:

  • For constant parts, you could just replace them using str.replace().
  • For all dynamic matches, we need a regex and substitute each match with an empty string using re.sub().
  • For the surroundings, we can use str.strip() to remove leading and trailing whitespace such as \n.

Code

import re

text = "West Team 4, Eastern 3\n"

cleaned = re.sub(r'\s+\d+', '', text)  # remove numbers (of any length) with their leading spaces
cleaned = cleaned.strip()  # remove surrounding whitespace like \n
print(cleaned)

output = cleaned.split(", ")  # split on comma + space so no name keeps a leading space
print(output)

Prints:

West Team, Eastern
['West Team', 'Eastern']
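
Another option keeps the character-class idea from the question, but leaves the spaces alone and strips each piece afterwards:

import re

text = 'West Team 4, Eastern 3\n'

# Remove newlines and digits only; ^ and $ would be literal characters
# inside [...], so they are left out of the class.
result = re.sub(r"[\n\d]", "", text).split(",")

# Remove the leading and trailing spaces:
result = [x.strip() for x in result]
print(result)
# result: ['West Team', 'Eastern']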

You haven't clearly defined the rules for getting the required output from your sample input. However, this will give what you've asked for but may not cover all eventualities:

in_string = 'West Team 4, Eastern 3\n'

result = [' '.join(t.split()[:-1]) for t in in_string.split(',')]

print(result)

Output:

['West Team', 'Eastern']
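
One eventuality the one-liner does not cover is a part with no score, where the last word of the name would be dropped instead. A slightly more defensive sketch, assuming scores are always plain digit strings (the function name names_from_line and the second sample line are made up for illustration):

```python
def names_from_line(line):
    """Drop the final token of each comma-separated part, but only if it is a score."""
    names = []
    for part in line.split(','):
        tokens = part.split()
        if tokens and tokens[-1].isdigit():
            tokens = tokens[:-1]  # the last token is the score, so drop it
        if tokens:  # skip parts that are empty after removing the score
            names.append(' '.join(tokens))
    return names

with_scores = names_from_line('West Team 4, Eastern 3\n')
missing_score = names_from_line('West Team, Eastern 3\n')

print(with_scores)    # ['West Team', 'Eastern']
print(missing_score)  # ['West Team', 'Eastern']
```

With the original one-liner, the second line would give ['West', 'Eastern'], because the last word is removed unconditionally.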

You can remove the digits and collapse any resulting runs of two or more spaces into a single space.

Then split on commas, discard empty values, and trim each item:

import re

s = "West Team 4 , Eastern 3, test 23 99 test\n,"

res = [
    m.strip() for m in re.sub(r"[^\S\n]{2,}", " ", re.sub(r"\d+", "", s)).split(",") if m
]
print(res)

Output:

['West Team', 'Eastern', 'test test']

See a Python demo.
