Repeating a Python regular expression until a certain character

I want to get all of the text up to the point where a ! appears. Example:

some textwfwfdsfosjtortjk\n
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf\n
sfsgdfgdfgdgdfgdg\n
!

The number of lines before the ! varies, so I can't hardcode a regular expression like this:

"+\n^.+\n^.+"

I am using re.MULTILINE, but should I be using re.DOTALL?

Thanks

Why does this need a regular expression?

index = text.find('!')
if index > -1:
    text = text[:index]  # or [:index + 1] to keep the '!', too

So you want to match everything from the beginning of the input up to (but not including) the first ! character? This should do it:

re.match(r'[^!]*', input)

If there are no exclamation points this will match the whole string. If you want to match only strings with ! in them, add a lookahead:

re.match(r'[^!]*(?=!)', input)

The MULTILINE flag is not needed because there are no anchors ( ^ and $ ), and DOTALL isn't needed because there are no dots.
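
A minimal sketch for checking both patterns, with and without the lookahead. The names `sample` and `no_bang` are made up for illustration and are not from the question:

```python
import re

sample = "line one\nline two\n!trailing"
no_bang = "no separator here"

# The negated class [^!] also matches newlines, so no flags are needed.
before = re.match(r'[^!]*', sample).group(0)
print(repr(before))  # 'line one\nline two\n'

# Without the lookahead, a string lacking '!' is matched in full.
print(re.match(r'[^!]*', no_bang).group(0))

# With the lookahead, the same string does not match at all.
print(re.match(r'[^!]*(?=!)', no_bang))  # None
```

Because a failed match returns None, it is worth checking the result before calling .group() when using the lookahead version.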

Following the Python philosophy of "Easier to Ask Forgiveness Than Permission" ( EAFP ), I suggest you create a subroutine which is easy to understand and later maintain, should your separator change.

SEPARATOR = u"!"
def process_string(s):
    try:
        return s[:s.index(SEPARATOR)]
    except ValueError:
        return s

This function will return the string from the beginning up to, and not including, whatever you defined as separator. If the separator is not found, it will return the whole string. The function works regardless of new lines. If your separator changes, simply change SEPARATOR and you are good to go.

ValueError is the exception raised when you request the index of a substring that is not in the string (try it at the interactive prompt: "Hola".index("1") raises ValueError: substring not found). The workflow assumes that most of the time you expect SEPARATOR to be in the string, so you attempt the lookup first without asking for permission (testing whether SEPARATOR is in the string); if that fails (the index method raises ValueError), you ask forgiveness (return the string as originally received). This approach (EAFP) is considered Pythonic when it applies, as it does in this case.
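
As a rough illustration, here is the function again next to a one-line alternative built on str.partition, which should behave the same way for this task. The helper name `before_separator` and the sample strings are hypothetical, added only for the comparison:

```python
SEPARATOR = "!"

def process_string(s):
    # EAFP: try the lookup first, fall back if the separator is missing.
    try:
        return s[:s.index(SEPARATOR)]
    except ValueError:
        return s

def before_separator(s):
    # Hypothetical alternative: str.partition returns (head, sep, tail),
    # and head is the whole string when the separator is absent.
    return s.partition(SEPARATOR)[0]

text = "first line\nsecond line\n!rest"
print(repr(process_string(text)))      # 'first line\nsecond line\n'
print(process_string("no separator"))  # no separator
print(process_string(text) == before_separator(text))  # True
```

Either version works across newlines, since neither treats line breaks specially.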

No regular expressions needed; this is a simple problem.

Look into a 'lookahead' for that particular character you're reading, and match the whole first part as a pattern instead.

I'm not sure exactly how Python's regex engine differs from Ruby's, but you can experiment with the pattern at rubular.com

Maybe something like:

([^!]*(?=\!))

(Just tried this, seems to work)

This should do the job.

re.compile('(.*?)!', re.DOTALL).match(yourString).group(1)

I think you're making this more complex than it needs to be. Your reg exp just needs to say "repeat(any character except !) followed by !". Remember [^!] means "any character except !".

So, like this:

>>> import re
>>> rexp = re.compile("([^!]*)!")
>>> test = """sdasd
... asdasdsa
... asdasdasd
... asdsadsa
... !"""
>>> rexp.findall(test)
['sdasd\nasdasdsa\nasdasdasd\nasdsadsa\n']
>>> 

re.DOTALL should be sufficient:

import re
text = """some textwfwfdsfosjtortjk
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf
sfsgdfgdfgdgdfgdg
!"""
rExp = re.compile(r"(.*)!", re.S)  # greedy: captures everything up to the last '!'
print(rExp.search(text).group(1))

some textwfwfdsfosjtortjk
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf
sfsgdfgdfgdgdfgdg
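
One caveat with the greedy `.*` in the pattern above: if the text contains more than one !, it captures everything up to the last one. A small sketch of the difference, using a made-up input string with two separators:

```python
import re

# Hypothetical input with two '!' characters.
text = "keep this\n!but not this!"

# Greedy: .* runs to the end, then backtracks to the last '!'.
greedy = re.search(r"(.*)!", text, re.S).group(1)

# Non-greedy: .*? stops at the first '!'.
lazy = re.search(r"(.*?)!", text, re.S).group(1)

print(repr(greedy))  # 'keep this\n!but not this'
print(repr(lazy))    # 'keep this\n'
```

If only the text before the first ! is wanted, the non-greedy form (or the `[^!]*` class used in other answers) is the safer choice.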


 