I am converting IPA symbols to a system inspired by ARPABET,
e.g.:
'oːg' > 'oo g'
In this test example, I can reach the desired result by doing the following:
>>> import re
>>> re.sub(r'(.)ː', r'\1\1 ', 'oːg')
'oo g'
I understand that the 'r' prefix is essential here: it stops Python from interpreting the backslash as an escape character, so the regex engine receives '\1' and can treat it as a backreference.
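To make the difference concrete, here is a minimal sketch comparing the two literals. The variable names `plain` and `raw` are only illustrative and not part of the original set-up.

```python
import re

plain = '\1\1 '   # Python parses each \1 as an octal escape, i.e. chr(1)
raw = r'\1\1 '    # the backslash and the digit stay as two separate characters

print(len(plain), len(raw))           # 3 5
print(plain == '\x01\x01 ')           # True
print(raw == '\\1\\1 ')               # True
print(re.sub(r'(.)ː', raw, 'oːg'))    # oo g
```

The raw prefix only affects how the source code literal is parsed. Once the string exists, there is no "raw" flag attached to it.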
What I want is to be able to iterate through a dict (created from a csv file) that contains many regex rules like this:
mappings = {'(.)ː': '\1\1 ','foo': 'bar', ..}
where I look for patterns stored in the dict keys within each of my IPA words, and do a re.sub using the corresponding values.
Simply put, I want this:
>>> pattern = '(.)ː'
>>> replpattern = '\1\1 '
>>> ipa = 'oːg'
>>> arpa = re.sub(pattern, replpattern, ipa)
>>> print(arpa)
oo g
The tricky part is getting Python to treat all the different patterns as raw strings.
Following suggestion from a similar thread -- casting raw strings python -- I tried "hurr..\\n..durr".encode('unicode-escape').decode().
This mostly works, except for the backreference '\1', as the following code demonstrates (using 'raw_unicode-escape' instead of 'unicode-escape'):
>>> z = '\1\1'
>>> z.encode('raw_unicode-escape').decode()
'\x01\x01'
It also seems to work for other things like '\s', but it raises the following error when I have symbols like "æ":
>>> x = 'æ'
>>> x.encode('raw_unicode-escape').decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 0: unexpected end of data
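A likely explanation for this traceback, sketched under the assumption of a standard Python 3 where `bytes.decode()` defaults to UTF-8: the 'raw_unicode_escape' codec writes code points below 256 as single Latin-1 bytes, so "æ" becomes the byte 0xe6, which is not valid UTF-8 on its own. Decoding with the same codec round-trips cleanly.

```python
encoded = 'æ'.encode('raw_unicode_escape')
print(encoded)  # b'\xe6'

# decode() with no argument uses UTF-8, where 0xe6 starts a multi-byte sequence
try:
    encoded.decode()
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)  # True

# Decoding with the codec that produced the bytes gives the original back
print(encoded.decode('raw_unicode_escape'))  # æ
```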
How do I get Python to handle all of these consistently to give me what I want?
Edit: perhaps I am misinterpreting the problem, so I'm providing my entire set-up to see if it makes sense.
My CSV file looks like this:
from,to
(.)ː,\\1\\1
æ,a
..,..
>>> import re
>>> with open('mappings.csv', 'r') as f:
...     lst = [line.strip('\n').split(',') for line in f]
...     mapping = {line[0]: line[1] for line in lst[1:]}
...
>>> def caphia(word):
...     arpaword = word
...     tmp = []
...     for map in mapping.keys():
...         tmp.append(re.findall(map, arpaword))
...     tmp = sum(tmp, [])
...     tmp.sort(key=lambda s: -len(s))
...     unq = set(tmp)
...     for pattern in unq:
...         arpaword = re.sub(pattern, mapping[pattern], arpaword)
...     print(arpaword)
...
Running the function:
>>> caphia('oːg')
o ːg
I tried the code below. Note that the \1\1 is read from a file, "text.txt". When I read it from the file, the code works without any issues. I think that when you write replpattern = '\1\1 ' in the Python interpreter, the string literal is parsed and its escape sequences are processed. At runtime, however, when the pattern is read from a file, it is already effectively a raw string.
import re
pattern = '(.):'
replpattern = open('text.txt').read()  # reading '\1\1 ' from the file
print(re.sub(pattern, replpattern, 'o:g'))
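The same idea can be extended to a whole rule table. The following is only a sketch: `load_rules` and `convert` are hypothetical helper names, the in-memory `csv_lines` list stands in for a real mappings.csv, and it assumes each row stores the replacement with single backslashes, as typed in a text editor. It also applies the rules in file order, which may or may not suit a real mapping.

```python
import csv
import re

# Stand-in for the lines of mappings.csv. The raw literal keeps the
# backslashes exactly as they would appear in the file on disk.
csv_lines = ['from,to', r'(.)ː,\1\1 ', 'æ,a']

def load_rules(lines):
    """Return (pattern, replacement) pairs, skipping the header row."""
    reader = csv.reader(lines)
    next(reader)  # skip the 'from,to' header
    return [(pattern, repl) for pattern, repl in reader]

def convert(word, rules):
    """Apply every rule to the word, in the order the rules were loaded."""
    for pattern, repl in rules:
        word = re.sub(pattern, repl, word)
    return word

rules = load_rules(csv_lines)
print(convert('oːg', rules))  # oo g
print(convert('kæt', rules))  # kat
```

With a real file, `load_rules` could be given the file object opened with `newline=''` and an explicit encoding such as UTF-8, since `csv.reader` accepts any iterable of strings.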
This turned out to be a non-question. As helpful commenters pointed out, strings read from files don't need any processing.
The problem was with my entire set-up, which I ended up changing.