
regular expression dictionary [Google type search and match with regular expressions]

EDIT: One of the main problems with the code below is due to storing regular expression objects in dictionaries, and how to access them to see if they can match another string. But I will still leave my previous question because I think there's probably an easy way to do all of this.

I would like to find a method in Python that returns a boolean indicating whether two strings refer to the same thing. I know that this is difficult, if not completely absurd, in programming, but I am looking into dealing with this problem using a dictionary of alternative strings that refer to the same thing.

Here are some examples, since I know this doesn't make a whole lot of sense without them.

If I give the string:

'breakingBad.Season+01 Episode..02'

Then I would like it to match the string:

'Breaking Bad S01E02'

Or 'three.BuCkets+of H2O' can match '3 buckets of water'

I know this is nearly impossible to do with regard to '3' and 'water' etc. being synonymous, but I am willing to provide these as dictionaries of relevant regular expression synonyms to the function if need be.

I have a feeling that there is a much simpler way to do this in Python, as there always is, but here is what I have so far:

import re

def check_if_match(given_string, string_to_match, alternative_dictionary):
    print('matching: ', given_string, '  against: ', string_to_match)
    # split the string into its parts on pretty much any special character
    list_of_given_strings = re.split(r' |\+|\.|;|,|\*|\n', given_string)
    print('List of words retrieved from given string: ')
    print(list_of_given_strings)
    check = False
    counter = 0
    for i in range(len(list_of_given_strings)):
        m = re.search(list_of_given_strings[i], string_to_match, re.IGNORECASE)
        m_alt = None
        try:
            # look the word up in the synonym dictionary; a missing key means no alternative
            m_alt = re.search(alternative_dictionary[list_of_given_strings[i]], string_to_match, re.IGNORECASE)
        except KeyError:
            pass
        if m or m_alt:
            if counter == len(list_of_given_strings)-1: check = True
            else: counter += 1
            print(list_of_given_strings[i], ' found to match')
        else:
            print(list_of_given_strings[i], ' did not match')
            break
    return check

string1 = 'breaking Bad.Season+01 Episode..02'
other_string_to_check = 'Breaking.Bad.S01+E01'
# make a dictionary of synonyms -  here we should be saying that "S01" is equivalent to "Season 01"
alternative_dict = {re.compile(r'S[0-9]', flags=re.IGNORECASE): re.compile(r'Season [0-9]', flags=re.IGNORECASE),
                    re.compile(r'E[0-9]', flags=re.IGNORECASE): re.compile(r'Episode [0-9]', flags=re.IGNORECASE)}
print(check_if_match(string1, other_string_to_check, alternative_dict))
print()
# another try
string2 = 'three.BuCkets+of H2O'
other_string_to_check2 = '3 buckets of water'
alternative_dict2 = {'H2O':'water', 'three':'3'}
print(check_if_match(string2, other_string_to_check2, alternative_dict2))

This returns:

matching:  breaking Bad.Season+01 Episode..02   against:  Breaking.Bad.S01+E01
List of words retrieved from given string: 
['breaking', 'Bad', 'Season', '01', 'Episode', '', '02']
breaking  found to match
Bad  found to match
Season  did not match
False

matching:  three.BuCkets+of H2O   against:  3 buckets of water
List of words retrieved from given string: 
['three', 'BuCkets', 'of', 'H2O']
three  found to match
BuCkets  found to match
of  found to match
H2O  found to match
True

I realize this probably means I am getting something wrong with the dictionary keys and values, but I feel like I am getting further away from a simple pythonic solution that has probably already been created.

Anyone have any thoughts?

I was tinkering with it and found some interesting things:

  • It might have to do with the way you are breaking up your initial words into lists:

     matching: breaking Bad.Season 1.Episode.1 against: Breaking.Bad.S1+E1
     List of words retrieved from given string:
     ['breaking', 'Bad', 'Season', '1', 'Episode', '1']

    I think you want it to be ..., 'Season 1', ... instead of having 'Season' and '1' be separate entries in the list.

  • You specify S[0-9], but this covers only a single digit, so it would not capture a two-digit number such as S01 or S10 in full; S[0-9]+ would.

  • You are right about your regular expressions being stored in dictionaries; the mapping only applies in one direction. I was fiddling with the code (unfortunately I don't remember exactly what it was) by mapping r'Season [0-9]' to r'S[0-9]' instead of vice versa, and it was able to match Season.
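To make the dictionary point concrete, here is a minimal sketch (Python 3) of why a plain string never finds a compiled-pattern key, plus one possible workaround that scans the keys and tests each pattern. The helper name find_alternative is hypothetical and not part of the original code, and the patterns use [0-9]+ rather than the question's [0-9] as an assumption.

```python
import re

# Keys are compiled patterns, as in the question (with + added as an assumption).
alternative_dict = {
    re.compile(r'S[0-9]+', re.IGNORECASE): re.compile(r'Season [0-9]+', re.IGNORECASE),
    re.compile(r'E[0-9]+', re.IGNORECASE): re.compile(r'Episode [0-9]+', re.IGNORECASE),
}

# A dict finds keys by hash and equality, not by regex matching,
# so looking up a plain string always misses the pattern keys.
print('S01' in alternative_dict)

def find_alternative(word, alternatives):
    """Return the value whose key pattern fully matches `word`, else None.

    Hypothetical helper: it scans every key instead of hashing `word`.
    """
    for key_pattern, value_pattern in alternatives.items():
        if key_pattern.fullmatch(word):
            return value_pattern
    return None

# 'S01' is matched by the first key, so its long-form pattern is returned.
print(find_alternative('S01', alternative_dict).pattern)

# 'Season' matches no key: the table only maps short form -> long form.
print(find_alternative('Season', alternative_dict))
```

This is also why the try/except KeyError in the question silently skips every word: the lookup itself can never succeed with these keys.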

Suggestions

  • Instead of mapping, have an equivalence class for each string type (e.g. title, season, episode) and have some matcher code for each.
  • Separate the parse and compare steps. Parse each string individually into a common format or object and then do a comparison.
  • You might need to implement some sort of state machine to know that you are processing a season and expect to see a number in a particular format right after it.
  • You may want to use a third-party tool instead; I've heard good things about Renamer.
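The parse-then-compare suggestion could be sketched roughly as follows (Python 3). Everything here is an assumption made for illustration: the names normalize, same_thing, WORD_SYNONYMS and EPISODE_RE are hypothetical, the synonym table is the one from the question, and dropping all spaces in the canonical form is just one possible design choice (it lets 'breakingBad' equal 'Breaking Bad').

```python
import re

# Hypothetical synonym table supplied by the caller; keys are lowercase tokens.
WORD_SYNONYMS = {'three': '3', 'h2o': 'water'}

# Matches "season 01 episode 02" as well as "s01e02" or "s01 e02"
# once the text has been lowercased and separators turned into spaces.
EPISODE_RE = re.compile(r'\b(?:season|s)\s*(\d+)\s*(?:episode|e)\s*(\d+)\b')

def _canonical_episode(match):
    # int() drops leading zeros, so "01" and "1" compare equal.
    return 's{}e{}'.format(int(match.group(1)), int(match.group(2)))

def normalize(text, synonyms=WORD_SYNONYMS):
    """Parse a loosely formatted string into one canonical string."""
    # Treat the question's separator characters as spaces, then lowercase.
    text = re.sub(r'[ +.;,*\n]+', ' ', text).strip().lower()
    # Rewrite any season/episode marker to the form s<season>e<episode>.
    text = EPISODE_RE.sub(_canonical_episode, text)
    # Replace whole tokens that have a listed synonym, then drop the spaces.
    return ''.join(synonyms.get(token, token) for token in text.split())

def same_thing(a, b):
    """Compare two strings only after both have been normalized."""
    return normalize(a) == normalize(b)

print(same_thing('breakingBad.Season+01 Episode..02', 'Breaking Bad S01E02'))
print(same_thing('three.BuCkets+of H2O', '3 buckets of water'))
print(same_thing('breaking Bad.Season+01 Episode..02', 'Breaking.Bad.S01+E01'))
```

Because each string is parsed on its own, the comparison is symmetric: it no longer matters which side uses the long form and which the short form. The first two calls should report a match and the third should not, since the episode numbers differ.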

