
How to use multiple corpus files as parallel corpora in Watson Language Translator in Python

The Watson Language Translator documentation says:

"A TMX file with parallel sentences for source and target language. You can upload multiple parallel_corpus files in one request. All uploaded parallel_corpus files combined, your parallel corpus must contain at least 5,000 parallel sentences to train successfully."

I have a number of corpus files that I would like to use to train my translation model. I've looked for ways to do this programmatically, with no success.

The only way I found to do so is by merging them manually into one single file.

Is there any way to send several files as parallel corpus via the API?

Can you provide examples in Python or Curl?

Thanks.

The only thing that has worked so far is aggregating the .TMX files manually and sending just one file. I have not found any way of sending several files as parallel_corpus.
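If you end up combining the files into one, the merge can at least be scripted instead of done by hand. A TMX file is plain XML, so a minimal sketch using only the standard library would copy the translation units (`<tu>` elements) of every file into the body of the first one (the function name and file paths below are hypothetical, and this assumes the files share compatible headers):

```python
import xml.etree.ElementTree as ET

def merge_tmx(paths, out_path):
    """Merge the <tu> translation units of several TMX files into one.

    Keeps the header of the first file and appends the <tu> elements
    found in the <body> of every remaining file.
    """
    base = ET.parse(paths[0])
    body = base.getroot().find("body")
    for path in paths[1:]:
        extra_body = ET.parse(path).getroot().find("body")
        for tu in extra_body.findall("tu"):
            body.append(tu)
    base.write(out_path, encoding="utf-8", xml_declaration=True)

# Example (hypothetical file names):
# merge_tmx(["corpus1.tmx", "corpus2.tmx"], "merged.tmx")
```

The merged file can then be passed as the single `parallel_corpus` argument as in the snippet above.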

with open("./training/my_corpus_SPA.TMX", "rb") as parallel:
    custom_model = language_translation.create_model(
        base_model_id="en-es",
        name="en-es-base1xx",
        # forced_glossary=glossary,
        parallel_corpus=parallel,
    ).get_result()
print(json.dumps(custom_model, indent=2))

I think I found a solution.

I tried this and it seems to work:

with open(corpus_fname1, "rb") as parallel1, open(corpus_fname2, "rb") as parallel2:
    custom_model = language_translation.create_model(
        base_model_id=base_model_es_en,
        name=model01_name,
        parallel_corpus=parallel1,
        parallel_corpus_filename2=parallel2,
        forced_glossary=None,
    ).get_result()
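The documentation quoted above says multiple parallel_corpus files can go in one request, so another option is to bypass the SDK and build the multipart request yourself, attaching every TMX file under the same `parallel_corpus` field name. A sketch with the `requests` library (the URL, API key, and file names are placeholders, and the helper function is my own, not part of the SDK):

```python
import requests

def build_create_model_request(url, api_key, base_model_id, name, corpus_paths):
    """Prepare (but do not send) a multipart POST to the Language Translator
    /v3/models endpoint with every TMX file attached as a 'parallel_corpus'
    form part."""
    files = [
        ("parallel_corpus",
         (path.split("/")[-1], open(path, "rb"), "application/octet-stream"))
        for path in corpus_paths
    ]
    req = requests.Request(
        "POST",
        f"{url}/v3/models",
        params={"version": "2018-05-01",
                "base_model_id": base_model_id,
                "name": name},
        auth=("apikey", api_key),
        files=files,
    )
    return req.prepare()

# Hypothetical usage -- substitute your own service URL and API key:
# prepared = build_create_model_request(
#     "https://example.com", "my-api-key", "en-es", "en-es-custom",
#     ["corpus1.tmx", "corpus2.tmx"])
# response = requests.Session().send(prepared)
```

Passing `files` as a list of tuples (rather than a dict) is what allows several parts to share the `parallel_corpus` name in the encoded body.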
