
How to use a string variable containing the back-reference '\1' as a regex pattern for converting IPA symbols to an ARPABET-like system

I am converting IPA symbols to a system inspired by ARPABET.

For example:

'oːg' > 'oo g'

In this test example, I can reach the desired result by doing the following:

>>> import re
>>> re.sub(r'(.)ː', r'\1\1 ', 'oːg')
'oo g'

I understand that the 'r' prefix is essential here: it stops Python from treating the backslash as the start of an escape sequence, so the regex engine receives '\1' and can treat it as a back-reference.
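To make the difference concrete, here is a small sketch comparing the two literals. The variable names `plain` and `raw` are only illustrative. Without the prefix, Python parses '\1' as an octal escape for the character U+0001 before `re` ever sees it.

```python
import re

plain = '\1\1 '   # '\1' is an octal escape: two U+0001 characters and a space
raw = r'\1\1 '    # backslashes kept literally: '\', '1', '\', '1', ' '

print(len(plain), len(raw))                  # 3 5
print(plain == '\x01\x01 ')                  # True

# Only the raw version reaches re.sub as a back-reference.
print(re.sub(r'(.)ː', raw, 'oːg'))           # oo g
print(repr(re.sub(r'(.)ː', plain, 'oːg')))   # '\x01\x01 g'
```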

What I want is to be able to iterate through a dict (created from a csv file) that contains many regex rules like this:

mappings = {'(.)ː': '\1\1 ','foo': 'bar', ..} 

where I look for patterns stored in the dict keys within each of my IPA words, and do a re.sub using the corresponding values.
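A minimal sketch of that loop, assuming the rules are written as literals in the source code (so the raw-string prefix is needed) and that applying them in dict order is acceptable. The function name `apply_rules` and the two sample rules are illustrative only.

```python
import re

# Hypothetical rule table. Raw strings keep '\1' as a back-reference.
mappings = {r'(.)ː': r'\1\1 ', 'æ': 'a'}

def apply_rules(word, rules):
    """Apply every pattern -> replacement rule to the word, in dict order."""
    for pattern, repl in rules.items():
        word = re.sub(pattern, repl, word)
    return word

print(apply_rules('oːg', mappings))   # oo g
print(apply_rules('kæt', mappings))   # kat
```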

Simply put, I want this:

>>> pattern = '(.)ː'
>>> replpattern = '\1\1 '
>>> ipa = 'oːg'
>>> arpa = re.sub(pattern, replpattern, ipa)
>>> print(arpa)
oo g

The tricky part is getting Python to treat all the different patterns as raw strings.

Following a suggestion from a similar thread ("casting raw strings python"), I tried "hurr..\n..durr".encode('unicode-escape').decode().

  • issue1:

This mostly works, except for the back-reference '\1', as the following code shows (using 'raw_unicode-escape' instead of 'unicode-escape'):

>>> z = '\1\1'
>>> z.encode('raw_unicode-escape').decode()
'\x01\x01'

  • issue2:

It also seems to work for other things like '\s', but it raises the following error when the string contains symbols like "æ":

>>> x = 'æ'
>>> x.encode('raw_unicode-escape').decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 0: unexpected end of data
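As a hedged explanation of this traceback: 'raw_unicode_escape' encodes code points below 256 as single Latin-1 bytes, so 'æ' becomes the byte 0xe6, which is not valid UTF-8 by itself, and a bare `.decode()` defaults to UTF-8. The sketch below only illustrates that mismatch. It does not bring back a back-reference that the string literal has already lost.

```python
x = 'æ'
encoded = x.encode('raw_unicode_escape')
print(encoded)                      # b'\xe6'

# The byte decodes fine as Latin-1 ...
print(encoded.decode('latin-1'))    # æ

# ... but not as UTF-8, which is what a bare .decode() uses.
try:
    encoded.decode()
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)                       # True
```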

How do I get Python to handle all of these consistently to give me what I want?

Edit: perhaps I am misinterpreting the problem, so I'm providing my entire set-up to see if it makes sense.

My CSV file looks like this:

from,to
(.)ː,\\1\\1
æ,a
..,..

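import re

# Load the rules: the header row is skipped, each remaining row is "pattern,replacement".
with open('mappings.csv', 'r') as f:
    lst = [line.strip('\n').split(',') for line in f]
    mapping = {line[0]: line[1] for line in lst[1:]}

def caphia(word):
    arpaword = word

    # Collect the findall() results for every rule pattern. For a pattern
    # with a group, findall() returns the group text, not the whole match.
    tmp = []
    for map in mapping.keys():
        tmp.append(re.findall(map, arpaword))

    # Flatten the list of lists.
    tmp = sum(tmp, [])

    # Sort longest first (the ordering is lost again by set()).
    tmp.sort(key=lambda s: -len(s))
    unq = set(tmp)

    # Use each piece of matched text as both the pattern and the lookup key.
    for pattern in unq:
        arpaword = re.sub(pattern, mapping[pattern], arpaword)

    print(arpaword)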

Running the function:

>>> caphia('oːg')
o ːg

I tried the code below. Here the replacement '\1\1 ' is read from a file, "text.txt", and the code works without any issues. I think that when you write replpattern = '\1\1 ' in the Python interpreter, the escape sequence in the literal is processed by the parser. A string read from a file at runtime is not parsed that way, so it already behaves like a raw string.

import re

pattern = '(.):'
replpattern = open('text.txt').read()  # reading '\1\1 ' from the file

print(re.sub(pattern, replpattern, 'o:g'))
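For a version that can be run without preparing "text.txt" by hand, the sketch below writes the replacement to a temporary file first. The temporary directory and the explicit UTF-8 encoding are assumptions added for the demonstration.

```python
import os
import re
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'text.txt')

    # Write backslash, '1', backslash, '1', space to the file.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(r'\1\1 ')

    # Reading it back performs no escape processing.
    with open(path, encoding='utf-8') as f:
        replpattern = f.read()

print(replpattern == r'\1\1 ')               # True
print(re.sub('(.):', replpattern, 'o:g'))    # oo g
```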

This turned out to be a non-question. As helpful commenters pointed out, strings read from files don't need any processing.

The problem was with my entire set-up, which I ended up changing.
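The post does not say what the changed set-up looked like. Purely as an illustration, one possible arrangement reads the rules with the `csv` module and applies each pattern from the file directly, instead of looking up matched text. The in-memory CSV contents (with single backslashes), the helper name `load_rules`, and the extra `rules` parameter are all assumptions.

```python
import csv
import io
import re

# Stand-in for mappings.csv. Assumed contents, with single backslashes in the file.
csv_text = 'from,to\n' + r'(.)ː,\1\1 ' + '\n' + 'æ,a\n'

def load_rules(fileobj):
    """Return (pattern, replacement) pairs in file order, skipping the header."""
    reader = csv.reader(fileobj)
    next(reader)  # skip the "from,to" header row
    return [(row[0], row[1]) for row in reader if row]

def caphia(word, rules):
    """Apply each rule's pattern to the word, in order."""
    for pattern, repl in rules:
        word = re.sub(pattern, repl, word)
    return word

rules = load_rules(io.StringIO(csv_text))
print(caphia('oːg', rules))   # oo g
print(caphia('kæt', rules))   # kat
```

Keeping the rules as an ordered list rather than a set means longer or more specific patterns can simply be placed earlier in the file.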
