
Creating a parallel corpus from list of words and list of sentences (Python)

I'm trying to create a parallel corpus for supervised machine learning.

Essentially I want to have two files: one with one full sentence per line, and the other with only the specific, manually extracted terms that correspond to the sentence on the same line.

I have already created the file with one sentence per line; now I would like to generate the labels file with the terms for each line. For illustration, this is the code I came up with:

import re

list_of_terms = ["cake", "cola", "water", "stop"]
sentences = ["Let's eat some cake.", "I'd like to have some cola to go with the cake.", "stop eating all this cake, you waterstopper", "I will never eat this again", "cake and cola and water"]
para = []
for line in sentences:
    s = re.findall(r"(?=\b("+'|'.join(list_of_terms)+r")\b)", line)
    para.append(s)
print(*para, sep = "\n")

This results in the output I want:

['cake']
['cola', 'cake']
['stop', 'cake']
[]
['cake', 'cola', 'water']

Unfortunately the code does not work very well for the corpora I'm dealing with. In fact, I'm faced with 3 different kinds of exception.

  1. For one corpus, the re.findall function always outputs an additional '' with each term.

[('criminal', ''), ('liability', ''), ('legal', ''), ('fiscal', ''), ('criminal', ''), ('law', '')]

I solved this thanks to the last comment in this thread: Use of findall and parenthesis in Python

[x if x != '' else y for x, y in re.findall(r"(?=\b("+'|'.join(list_of_terms)+r")\b)", line)]

  2. However, this method throws a ValueError, as regex does not create the '' for two other corpora I'm working with. For those I simply use a try/except block and run the sample code, with satisfactory results. But why is regex not creating the '' in this case?

  3. Finally, one other corpus raises an re.error, "re.error: nothing to repeat at position 4950", and I have found no fix for this yet. I suspect there are special characters in "list_of_terms"; is there any way to filter those beforehand?

Needless to say, I'm still quite new to coding as my background is translation and not computer science. So a graceful answer would be much appreciated! :)

PS: The corpora I am using are all in the ACTER Corpus-Collection: https://github.com/AylaRT/ACTER

You need to re.escape each item in the list_of_terms list, and use unambiguous word boundaries:

re.findall(r"(?=(?<!\w)("+'|'.join(map(re.escape, list_of_terms))+r")(?!\w))", line)
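A minimal sketch of why the escaping matters, using made-up terms (the real ACTER term lists may contain different special characters, so treat "tax(es)" and "+plus" as assumptions). A term with parentheses adds a second capturing group, which is most likely where the extra '' comes from, and a term starting with a quantifier such as + produces the "nothing to repeat" error:

```python
import re

# Hypothetical term lists; the real ACTER terms may differ.
terms_with_parens = ["criminal", "law", "tax(es)"]
terms_with_plus = ["cake", "+plus"]
line = "criminal law"

# Unescaped: "(es)" becomes a second capturing group, so findall returns tuples.
unescaped = r"(?=\b(" + "|".join(terms_with_parens) + r")\b)"
tuples = re.findall(unescaped, line)
print(tuples)  # [('criminal', ''), ('law', '')]

# Unescaped: "+" directly after "|" has nothing to repeat, so compiling fails.
try:
    re.compile(r"(?=\b(" + "|".join(terms_with_plus) + r")\b)")
    compile_error = None
except re.error as exc:
    compile_error = str(exc)
print(compile_error)

# Escaped: every term is matched literally, leaving exactly one group.
escaped = r"(?=(?<!\w)(" + "|".join(map(re.escape, terms_with_parens)) + r")(?!\w))"
strings = re.findall(escaped, line)
print(strings)  # ['criminal', 'law']
```

With every term escaped, the pattern always contains exactly one capturing group, so findall returns plain strings for every corpus and neither the list comprehension workaround nor the try/except block should be needed.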

The (?<!\w) negative lookbehind matches a location that is not immediately preceded by a word char (digit, letter or _).

The (?!\w) negative lookahead matches a location that is not immediately followed by a word char.
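Putting it together with the sample data from the question, here is a rough sketch. The helper name build_pattern, the extra term "c++" and the last sentence are my own additions for illustration; they show a case where \b would fail because the term ends in a non-word character:

```python
import re

# Sample data from the question, plus one hypothetical term ("c++")
# and one extra sentence to show a term ending in a non-word character.
list_of_terms = ["cake", "cola", "water", "stop", "c++"]
sentences = [
    "Let's eat some cake.",
    "I'd like to have some cola to go with the cake.",
    "stop eating all this cake, you waterstopper",
    "I will never eat this again",
    "cake and cola and water",
    "I write c++ daily",
]

def build_pattern(terms):
    # Escape each term so it is matched literally, then require that no
    # word character sits directly before or after the match.
    alternation = "|".join(map(re.escape, terms))
    return re.compile(r"(?=(?<!\w)(" + alternation + r")(?!\w))")

pattern = build_pattern(list_of_terms)
para = [pattern.findall(line) for line in sentences]
print(*para, sep="\n")

# For comparison: \b needs a word character on one side of the boundary,
# so it cannot match after the trailing "+" when a space follows.
with_b = re.findall(r"(?=\b(" + re.escape("c++") + r")\b)", "I write c++ daily")
print(with_b)  # []
```

The first five lines of output are the same as in the question, and the last sentence yields ['c++']. Compiling the pattern once outside the loop also avoids rebuilding a long alternation for every sentence.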
