
Search file for exact match of word list

There are many questions on this topic, some using regex, some using "with open", and others, but I have found none that fits my requirements.

I am opening an XML file which contains strings, one per line, e.g.:

<string name="AutoConf_5">setup is in progress…</string>

I want to iterate over each line in the file and search it for exact matches of words in a list. The current code seems to work and prints matches, but it doesn't do exact matching: for example, 'pass' finds 'passed', and 'pro' finds 'provide', 'process', 'proceed', etc.

def stringRun(self,file):
    str_file = ['admin','premium','pro','paid','pass','password','api']
    with open(file, 'r') as sf:
        for s in sf:
            if any(x in str(s) for x in str_file):
                self.progressBox.AppendText(s)

The "in" operator matches any substring of the line; use the regex function "re.search" instead (this needs "import re" at the top of your module). I haven't tested this in Python, so minor syntax errors may have slipped in, but the general idea is to replace the "if" in your code with:

if any(re.search(x, str(s)) for x in str_file):

Then you can use the power of regex to search for the words in the list with word boundaries. Add r'\b' to the beginning and end of each search string, either in the list itself or for all of them at once in the condition:

if any(re.search(r'\b' + x + r'\b', str(s)) for x in str_file):
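To make the difference between the two checks concrete, here is a minimal sketch using the word list from the question. The helper names substring_hits and whole_word_hits are made up for illustration, and re.escape is added as a precaution in case a word ever contains regex metacharacters.

```python
import re

WORDS = ['admin', 'premium', 'pro', 'paid', 'pass', 'password', 'api']


def substring_hits(line, words=WORDS):
    """Words found with the plain `in` operator (substring match)."""
    return [w for w in words if w in line]


def whole_word_hits(line, words=WORDS):
    """Words found only when delimited by word boundaries."""
    return [w for w in words
            if re.search(r'\b' + re.escape(w) + r'\b', line)]


sample = '<string name="AutoConf_5">setup is in progress…</string>'
print(substring_hits(sample))   # ['pro'], because 'pro' is inside 'progress'
print(whole_word_hits(sample))  # [], no listed word stands alone in the line
```

Note that "\b" treats letters, digits and the underscore as word characters, so 'pro' would not match inside 'pro_version' either.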

If you want an exact match, IMO the best way is to prepare the full strings to match in advance and then compare each line against them.

For instance, you can prepare a mapping between each tagged string and the string you want to match:

tagged = {'<string name="AutoConf_5">{0}</string>'.format(s): s
          for s in str_file}

This dict maps each tagged string you want to match to the actual string.

You can use it like this:

for line in sf:
    line = line.strip()
    if line in tagged:
        self.progressBox.AppendText(tagged[line])

Note: if any of your strings contains "&", "<" or ">", you need to escape those characters, like this:

from xml.sax.saxutils import escape

tagged = {'<string name="AutoConf_5">{0}</string>'.format(escape(s)): s
          for s in str_file}
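As a small, self-contained illustration of why the escaping matters, the sketch below builds the same kind of mapping for a word containing "&". The entry 'R&D' is a made-up example, not part of the original word list.

```python
from xml.sax.saxutils import escape

# 'R&D' is a hypothetical entry, used only to show the escaping
words = ['pass', 'R&D']

tagged = {'<string name="AutoConf_5">{0}</string>'.format(escape(w)): w
          for w in words}

# In the XML file the ampersand appears in its escaped form
line = '<string name="AutoConf_5">R&amp;D</string>\n'
print(tagged.get(line.strip()))  # R&D
```

Without the call to escape, the key would contain a bare "&" and the lookup for this line would fail.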

Another solution is to use lxml to parse your XML tree and find the nodes that match a given XPath expression.
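A rough sketch of the parsing approach follows. To keep it runnable without extra packages it uses the standard library's xml.etree.ElementTree instead of lxml, which is a deliberate swap. The "resources" root element and the 'AutoConf_6' entry are assumptions made for the example, since the question only shows a single line of the file.

```python
import re
import xml.etree.ElementTree as ET

WORDS = ['admin', 'premium', 'pro', 'paid', 'pass', 'password', 'api']
PATTERN = re.compile(
    r'\b(?:' + '|'.join(re.escape(w) for w in WORDS) + r')\b')

# Hypothetical document: the root element and second entry are invented
XML = '''<resources>
    <string name="AutoConf_5">setup is in progress…</string>
    <string name="AutoConf_6">enter your password</string>
</resources>'''


def matching_strings(xml_text):
    """Return (name, matched words) for each <string> containing a listed word."""
    root = ET.fromstring(xml_text)
    hits = []
    for node in root.iter('string'):
        found = PATTERN.findall(node.text or '')  # text is already unescaped
        if found:
            hits.append((node.get('name'), found))
    return hits


print(matching_strings(XML))  # [('AutoConf_6', ['password'])]
```

A parser also takes care of entity unescaping and of strings that span several lines, which a line-by-line scan does not.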

EDIT: match at least one word (from a word list)

You have a list of words. To match the XML content that contains at least one word from this list, you can use a regular expression.

You may encounter 2 difficulties:

  • XML content, read as a plain text file, can contain "&amp;", "&lt;" or "&gt;", so you need to unescape it.
  • some words from your word list may contain RegEx special characters (like "[" or "(") which must be escaped.
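Both difficulties can be seen in a few lines. This is only a sketch: the sample text and the word '[pro]' are invented to show the effect.

```python
import re
from xml.sax.saxutils import unescape

# 1. Entities in the raw file text must be turned back into characters
content = unescape('Terms &amp; conditions &lt;pro&gt;')
print(content)  # Terms & conditions <pro>

# 2. A word with special characters must be escaped before use in a pattern
word = '[pro]'  # hypothetical entry; unescaped, it is a character class
print(bool(re.search(word, 'open')))                   # True (matches 'o')
print(bool(re.search(re.escape(word), 'open')))        # False
print(bool(re.search(re.escape(word), 'tier [pro]')))  # True
```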

First, you can prepare a RegEx (and a function) to find all occurrences of any listed word in a string. To do that, you can use "\b", which matches the empty string, but only at the beginning or end of a word:

import re

str_file = ['admin', 'premium', 'pro', 'paid', 'pass', 'password', 'api']

# Escape each word, join them into one alternation, and wrap it in word boundaries
re_any_word = r"\b(?:" + r"|".join(re.escape(e) for e in str_file) + r")\b"
find_any_word = re.compile(re_any_word, flags=re.DOTALL).findall

For instance:

>>> find_any_word("Time has passed")
[]
>>> find_any_word("I pass my exam, I'm a pro")
['pass', 'pro']

To extract the content of an XML fragment, you can also use a RegEx (this is not recommended in the general case, but it is worth it here).

The following RegEx (and function) matches a "<string>...</string>" fragment and selects the content in the first group:

re_string = r'<string[^>]*>(.*?)</string>'
match_string = re.compile(re_string, flags=re.DOTALL).match

For instance:

>>> match_string('<string name="AutoConf_5">setup is in progress…</string>').group(1)
'setup is in progress…'

Now, all you have to do is to parse your file, line by line.

For the demo, I used a list of strings:

lines = [
    '<string name="AutoConf_5">setup is in progress…</string>\n',
    '<string name="AutoConf_5">it has passed</string>\n',
    '<string name="AutoConf_5">I pass my exam, I am a pro</string>\n',
]

from xml.sax import saxutils

for line in lines:
    line = line.strip()
    mo = match_string(line)
    if mo:
        # Turn XML entities back into plain characters before searching
        content = saxutils.unescape(mo.group(1))
        words = find_any_word(content)
        if words:
            print(line + " => " + ", ".join(words))

You get:

<string name="AutoConf_5">I pass my exam, I am a pro</string> => pass, pro
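Putting the pieces together for a real file, a minimal sketch could look like the following. The names find_matches and scan_file are hypothetical, and the UTF-8 default encoding is an assumption about the file.

```python
import re
from xml.sax.saxutils import unescape

WORDS = ['admin', 'premium', 'pro', 'paid', 'pass', 'password', 'api']
FIND_ANY_WORD = re.compile(
    r'\b(?:' + '|'.join(re.escape(w) for w in WORDS) + r')\b').findall
MATCH_STRING = re.compile(r'<string[^>]*>(.*?)</string>', flags=re.DOTALL).match


def find_matches(lines):
    """Yield (line, words) for each <string> line containing a listed word."""
    for line in lines:
        line = line.strip()
        mo = MATCH_STRING(line)
        if mo:
            words = FIND_ANY_WORD(unescape(mo.group(1)))
            if words:
                yield line, words


def scan_file(path, encoding='utf-8'):
    """Run find_matches over a file; the encoding is an assumed default."""
    with open(path, encoding=encoding) as sf:
        return list(find_matches(sf))


demo = ['<string name="AutoConf_5">it has passed</string>\n',
        '<string name="AutoConf_5">I pass my exam, I am a pro</string>\n']
for line, words in find_matches(demo):
    print(line + " => " + ", ".join(words))
```

In the code from the question, the loop body would call self.progressBox.AppendText(line) instead of print. The match is case-sensitive as written; passing flags=re.IGNORECASE when compiling the word pattern would relax that.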
