EDIT: One of the main problems with the code below comes from storing regular expression objects in a dictionary and not knowing how to retrieve them to test against another string. I am leaving my original question in place, because there is probably an easier way to do all of this.
I would like to find a function in Python that returns a boolean indicating whether two strings refer to the same thing. I know this is difficult, if not outright absurd, to do in general, but I am looking into handling it with a dictionary of alternative strings that refer to the same thing.
Here are some examples, since I know this doesn't make a whole lot of sense without them.
If I give the string:
'breakingBad.Season+01 Episode..02'
Then I would like it to match the string:
'Breaking Bad S01E02'
Or 'three.BuCkets+of H2O' can match '3 buckets of water'.
I know this is nearly impossible to do with regard to '3' and 'water' etc. being synonymous, but I am willing to provide these as dictionaries of relevant regular expression synonyms to the function if need be.
I have a feeling that there is a much simpler way to do this in python, as there always is, but here is what I have so far:
import re

def check_if_match(given_string, string_to_match, alternative_dictionary):
    print 'matching: ', given_string, ' against: ', string_to_match
    # split the string into its parts with pretty much any special character
    list_of_given_strings = re.split(' |\+|\.|;|,|\*|\n', given_string)
    print 'List of words retrieved from given string: '
    print list_of_given_strings
    check = False
    counter = 0
    for i in range(len(list_of_given_strings)):
        m = re.search(list_of_given_strings[i], string_to_match, re.IGNORECASE)
        m_alt = None
        try:
            m_alt = re.search(alternative_dictionary[list_of_given_strings[i]], string_to_match, re.IGNORECASE)
        except KeyError:
            pass
        if m or m_alt:
            if counter == len(list_of_given_strings)-1: check = True
            else: counter += 1
            print list_of_given_strings[i], ' found to match'
        else:
            print list_of_given_strings[i], ' did not match'
            break
    return check

string1 = 'breaking Bad.Season+01 Episode..02'
other_string_to_check = 'Breaking.Bad.S01+E01'
# make a dictionary of synonyms - here we should be saying that "S01" is equivalent to "Season 01"
alternative_dict = {re.compile(r'S[0-9]',flags=re.IGNORECASE):re.compile(r'Season [0-9]',flags=re.IGNORECASE),
                    re.compile(r'E[0-9]',flags=re.IGNORECASE):re.compile(r'Episode [0-9]',flags=re.IGNORECASE)}
print check_if_match(string1, other_string_to_check, alternative_dict)
print

# another try
string2 = 'three.BuCkets+of H2O'
other_string_to_check2 = '3 buckets of water'
alternative_dict2 = {'H2O':'water', 'three':'3'}
print check_if_match(string2, other_string_to_check2, alternative_dict2)
This returns:
matching: breaking Bad.Season+01 Episode..02 against: Breaking.Bad.S01+E01
List of words retrieved from given string:
['breaking', 'Bad', 'Season', '01', 'Episode', '', '02']
breaking found to match
Bad found to match
Season did not match
False
matching: three.BuCkets+of H2O against: 3 buckets of water
List of words retrieved from given string:
['three', 'BuCkets', 'of', 'H2O']
three found to match
BuCkets found to match
of found to match
H2O found to match
True
I realize this probably means I am getting something wrong with the dictionary keys and values, but I feel like I am getting further away from a simple pythonic solution that has probably already been created.
Anyone have any thoughts?
I was tinkering with it and found some interesting things:
It might have to do with the way you are breaking up your initial words into lists:
matching: breaking Bad.Season 1.Episode.1 against: Breaking.Bad.S1+E1
List of words retrieved from given string:
['breaking', 'Bad', 'Season', '1', 'Episode', '1']
I think you want it to be ..., 'Season 1', ... instead of having 'Season' and '1' be separate entries in the list.
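One way to keep 'Season 1' together could look like the sketch below. It assumes that season and episode markers are always a word followed by a number; the tokenize helper and its pattern are made up for illustration and are not part of your code.

```python
import re

# Hypothetical tokenizer: a season/episode word plus its number stays one token,
# anything else is split on non-alphanumeric characters.
TOKEN_RE = re.compile(r'(?:season|episode)\W*\d+|[A-Za-z0-9]+', re.IGNORECASE)

def tokenize(text):
    """Return the tokens of text, keeping 'Season 1'-style pairs together."""
    return TOKEN_RE.findall(text)

print(tokenize('breaking Bad.Season 1.Episode.1'))
# ['breaking', 'Bad', 'Season 1', 'Episode.1']
```

The separator between the word and the number is kept inside the token, so it would still need cleaning up before any comparison.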
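You specify S[0-9], but this consumes only a single digit. With re.search it still finds 'S1' inside 'S10', so a double-digit number is never matched in full; S[0-9]+ would cover it.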
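A quick check of the single-digit issue, using a made-up file name with a two-digit season:

```python
import re

single = re.compile(r'S[0-9]', re.IGNORECASE)   # exactly one digit
multi = re.compile(r'S[0-9]+', re.IGNORECASE)   # one or more digits

name = 'Breaking.Bad.S10'  # hypothetical input, not from the question

print(single.search(name).group())  # S1
print(multi.search(name).group())   # S10
```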
I changed the dictionary to map r'Season [0-9]' to r'S[0-9]' instead of vice versa, and it was then able to match Season.
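Putting these observations together, a different approach might be to rewrite both strings into one canonical form and compare the results, rather than searching word by word. The sketch below is only that, a sketch: the SYNONYMS table, normalize and same_thing are names I made up, and the patterns cover just the examples from the question. The synonyms are kept as a list of (pattern, replacement) pairs and iterated over, so a compiled pattern never has to be used as a dictionary key (looking up a plain string such as 'Season' in a dictionary keyed by pattern objects always raises KeyError).

```python
import re

# Illustrative synonym table (an assumption, not the asker's data): ordered
# (compiled pattern, canonical replacement) pairs that are applied in turn.
SYNONYMS = [
    # 'Season+01', 'season 1' and 'S01' all become 's1'
    (re.compile(r'(?<![a-z])s(?:eason)?\W*0*(\d+)', re.IGNORECASE), r's\1'),
    # 'Episode..02', 'episode 2' and 'E02' all become 'e2'
    (re.compile(r'(?<![a-z])e(?:pisode)?\W*0*(\d+)', re.IGNORECASE), r'e\1'),
    (re.compile(r'\bthree\b', re.IGNORECASE), '3'),
    (re.compile(r'\bh2o\b', re.IGNORECASE), 'water'),
]

def normalize(text):
    """Rewrite synonyms to a canonical form, then drop case and separators."""
    for pattern, replacement in SYNONYMS:
        text = pattern.sub(replacement, text)
    return re.sub(r'[^a-z0-9]+', '', text.lower())

def same_thing(a, b):
    """Return True if both strings normalize to the same canonical string."""
    return normalize(a) == normalize(b)

print(same_thing('breakingBad.Season+01 Episode..02', 'Breaking Bad S01E02'))    # True
print(same_thing('three.BuCkets+of H2O', '3 buckets of water'))                  # True
print(same_thing('breaking Bad.Season+01 Episode..02', 'Breaking.Bad.S01+E01'))  # False
```

Dropping every separator means 'breakingBad' and 'Breaking Bad' compare equal without having to split camel case, at the cost of also treating strings like 'a bc' and 'ab c' as the same.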