
A more powerful method than Python's find? A regex issue?

I'm looking for a list of strings and their variations within a very large string.

What I want to do is find even the implicit matches between two strings.

For example, if my start string is foo-bar, I want the matching to find Foo-bAr, foo Bar, or even foo(bar... Of course, foo-bar itself should also return a match.


EDIT: More specifically, I need the following matches.

  1. The string itself, case insensitive.
  2. The string with spaces separating any of the characters
  3. The string with parentheses separating any of the characters.

How do I write an expression to meet these conditions?

I realize this might require some tricky regex. The thing is, I have a large list of strings I need to search for, and I feel regex is just the tool for making this as robust as I need.

Perhaps regex isn't the best solution?

Thanks for your help guys. I'm still learning to think in regex.

>>> def findString(inputStr, targetStr):
...     # True if the normalized input occurs anywhere in the normalized target
...     return convertToStringSoup(inputStr) in convertToStringSoup(targetStr)
... 
>>> def convertToStringSoup(testStr):
...     testStr = testStr.lower()
...     testStr = testStr.replace(" ", "")
...     testStr = testStr.replace("(", "")
...     testStr = testStr.replace(")", "")
...     return testStr
... 
>>> 
>>> findString("hello", "hello")
True
>>> findString("hello", "hello1")
True
>>> findString("hello", "hell!o1")
False
>>> findString("hello", "hell( o)1")
True

This should work according to your specification, though it could obviously be optimized. You're asking about regex, which I'm still thinking about, and I'll hopefully edit this answer soon with something good. As long as this isn't too slow, though, I'd stick with it: regexes can be miserable, and readable is often better!

I noticed that you're repeatedly looking in the same big haystack. Obviously, you only have to convert that to "string soup" once!
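A minimal sketch of that idea, written for Python 3: the haystack is normalized once and every needle is then checked against the result. The helper name findAll and the sample strings are made up for illustration.

```python
def convertToStringSoup(testStr):
    # Same normalization as above: lowercase, drop spaces and parentheses.
    return testStr.lower().replace(" ", "").replace("(", "").replace(")", "")

def findAll(needles, haystack):
    # Hypothetical helper: normalize the big haystack once, then reuse it.
    soup = convertToStringSoup(haystack)
    return {n: convertToStringSoup(n) in soup for n in needles}

haystack = "Say H e(l)lo to FOO-bar, but not to hell!o"
found = findAll(["hello", "foo-bar", "jane-doe"], haystack)
print(found)
```

With a long list of needles, this keeps the cost of cleaning the haystack independent of how many strings are searched for.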

Edit: I've been thinking about regex, and any regex you write would either need many clauses, or the text would have to be modified before matching, as I did in this answer. I haven't benchmarked str.find() against re.search(), but I imagine the former would be faster in this case.
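For comparison, here is a rough sketch of what the "many clauses" regex could look like if it is generated rather than written by hand: an optional run of spaces or parentheses is allowed between every pair of characters. The function name soupPattern is hypothetical, and this is only one possible way to encode the three rules.

```python
import re

def soupPattern(needle):
    # Allow any run of spaces or parentheses between consecutive characters.
    body = r"[ ()]*".join(re.escape(ch) for ch in needle)
    return re.compile(body, re.IGNORECASE)

pattern = soupPattern("hello")
hit = pattern.search("say hell( o)1")    # separators inside the word are allowed
miss = pattern.search("say hell!o1")     # '!' is not an allowed separator

m = soupPattern("foo-bar").search("xx F o(o)-bAr yy")
print(bool(hit), bool(miss), m.group(0))
```

Unlike the string-soup approach, the match object reports where the hit sits in the original, unmodified text.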

I'm going to assume that your rules are right, and your examples are wrong, mainly since you added the rules later, as a clarification, after a bunch of questions. So:

EDIT: More specifically, I need the following matches.

  1. The string itself, case insensitive.
  2. The string with spaces separating any of the characters
  3. The string with parentheses separating any of the characters.

The simplest way to do this is to just remove spaces and parens, then do a case-insensitive search on the result. You don't even need regex for that. For example:

haystack.replace(' ', '').replace('(', '').replace(')', '').upper().find(needle.upper())

Try this regex:

[fF][oO]{2}[- ()][bB][aA][rR]

Test:

>>> import re
>>> pattern = re.compile("[fF][oO]{2}[- ()][bB][aA][rR]")
>>> m = pattern.match("foo-bar")
>>> m.group(0)
'foo-bar'
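Note that match() only tries the pattern at the very start of the string. For scanning a large text, search() or finditer() is probably what is wanted; a small sketch with a made-up sample string:

```python
import re

pattern = re.compile("[fF][oO]{2}[- ()][bB][aA][rR]")
text = "intro foo-bar, then Foo Bar and FOO(BAR"

# match() is anchored at position 0, so it finds nothing here.
at_start = pattern.match(text)

# finditer() walks the whole text and yields every non-overlapping hit.
hits = [m.group(0) for m in pattern.finditer(text)]
print(at_start, hits)
```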

With a regex, a case-insensitive search matches upper- and lower-case variants, '[]' matches any one of the characters it contains, and '|' lets you try several alternatives at once. Putting it all together, you can try:

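import re
pairs = ['foo-bar', 'jane-doe']
# Put '-' last in the class so it is a literal hyphen rather than a range.
sep = '[ ()-]'
regex = '|'.join(sep.join(re.escape(s) for s in p.split('-')) for p in pairs)
print(regex)
# your_text_here stands for the large string being searched
results = re.findall(regex, your_text_here, re.IGNORECASE)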
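One detail worth checking when writing such a character class by hand: a hyphen between two characters denotes a range. A class written as `[ -\)]` therefore means "every character from space through ')'", which does not include '-' itself. The short comparison below illustrates the difference; the variable names are only for this example.

```python
import re

range_class = re.compile(r'foo[ -\)]bar')    # ' ' through ')' as a range
literal_class = re.compile(r'foo[ ()-]bar')  # space, parens, literal hyphen

checks = [
    bool(range_class.search('foo-bar')),     # '-' lies outside the range
    bool(range_class.search('foo#bar')),     # '#' lies inside the range
    bool(literal_class.search('foo-bar')),
    bool(literal_class.search('foo#bar')),
]
print(checks)
```

Placing the hyphen first or last in the class (or escaping it) keeps it literal.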

