
Python Google Translate API error: how to translate a large amount of data

My problem

I would like to use a data-augmentation method for NLP that consists of back-translating a dataset.

Basically, I have a large dataset (SNLI) of 1,100,000 English sentences. What I need to do is translate these sentences into another language and then translate them back into English.

I may have to do this for several languages, so I have a lot of translations to do.
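The round trip described above can be sketched as follows. This is only an illustration: `translate` is a hypothetical callable with an assumed signature (texts, source language, target language), and `fake_translate` is a stand-in backend, not a real translation service.

```python
from typing import Callable, List

# Hypothetical signature: translate(texts, src, dest) -> translated texts.
TranslateFn = Callable[[List[str], str, str], List[str]]


def back_translate(sentences: List[str], pivot: str, translate: TranslateFn) -> List[str]:
    """Translate English sentences into `pivot`, then back into English."""
    pivoted = translate(sentences, "en", pivot)
    return translate(pivoted, pivot, "en")


def fake_translate(texts: List[str], src: str, dest: str) -> List[str]:
    # Stand-in backend for illustration only: tags each text with the target language.
    return [f"[{dest}] {t}" for t in texts]


augmented = back_translate(["A man is sleeping."], "fr", fake_translate)
```

Keeping the backend behind a plain callable means the same loop can be reused for each pivot language and for whichever translation service ends up working.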

I need a free solution.

What I did so far

I tried several Python modules for translation, but due to recent changes in the Google Translate API, most of them do not work. googletrans seems to work if we apply this solution.

However, it does not work for a big dataset. Google imposes a limit of 15K characters (as pointed out by this, this and this). The first link shows a supposed workaround.
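One way to stay under such a limit is to group sentences into batches by character count. The sketch below assumes the 15K limit applies per request, which is not confirmed here; `batch_by_chars` and its `max_chars` parameter are hypothetical names, not part of googletrans.

```python
from typing import Iterable, Iterator, List


def batch_by_chars(sentences: Iterable[str], max_chars: int = 15000) -> Iterator[List[str]]:
    """Yield lists of sentences whose combined length stays within max_chars."""
    batch: List[str] = []
    size = 0
    for sentence in sentences:
        if len(sentence) > max_chars:
            raise ValueError("a single sentence exceeds max_chars")
        # Start a new batch when adding this sentence would exceed the budget.
        if batch and size + len(sentence) > max_chars:
            yield batch
            batch, size = [], 0
        batch.append(sentence)
        size += len(sentence)
    if batch:
        yield batch


batches = list(batch_by_chars(["aaaa", "bbbb", "cc"], max_chars=8))
```

The count here ignores any separators or request overhead the service may add, so a real budget should probably be set somewhat below the documented limit.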

Where I am blocked

Even if I apply the workaround (initializing the Translator on every iteration), it does not work, and I get the following error:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I tried using proxies and other Google Translate URLs:

from googletrans import Translator

# alternative Google Translate domains
URLS = ['translate.google.com', 'translate.google.co.kr', 'translate.google.ac', 'translate.google.ad', 'translate.google.ae', ...]
# proxy settings
proxies = {'http': '', 'https': ''}
t = Translator(service_urls=URLS, proxies=proxies)

But it doesn't change anything.


My problem might come from the fact that I am using multi-threading: 100 workers translate the whole dataset. If they work in parallel, maybe together they use more than 15K characters.

But I have to use multi-threading. If I don't, it will take several weeks to translate the whole dataset.
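If throttling is indeed the cause, a smaller worker pool combined with retries and exponential backoff is a common mitigation. The sketch below assumes the failure surfaces as `json.JSONDecodeError`, as in the traceback above; `with_retry`, `translate_all` and the `translate_batch` callable are hypothetical names, and the worker count and delays are arbitrary.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def with_retry(fn: Callable, attempts: int = 5, base_delay: float = 1.0,
               sleep: Callable[[float], None] = time.sleep) -> Callable:
    """Wrap fn so that a JSONDecodeError triggers a retry with exponential backoff."""
    def wrapped(arg):
        for attempt in range(attempts):
            try:
                return fn(arg)
            except json.JSONDecodeError:
                if attempt == attempts - 1:
                    raise  # give up after the last attempt
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return wrapped


def translate_all(batches: Sequence, translate_batch: Callable, workers: int = 4) -> List:
    """Run translate_batch over all batches with a bounded thread pool, keeping order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(with_retry(translate_batch), batches))
```

`pool.map` returns results in input order, so the translated batches line up with the original ones. Whether a small pool stays under the service's limits is something only experimentation can tell.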

My question

How do I fix this error so I can translate all the sentences?

If that's not possible, is there any free alternative for machine translation in Python (it doesn't have to be Google Translate) that can handle such a big dataset?

More than a million sentences is a lot of text to translate.

Currently, Google Cloud Translation V3 offers a free tier that you may want to use (the first 500,000 characters each month are free). Since that doesn't seem to be enough for your use case, you would probably need to create more than one billing account or wait a month each time before translating more text.

Check this link to see how to perform a text translation with Python.
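To get a feel for how far the free tier goes, here is a rough estimate. The average sentence length of 50 characters is purely an assumption for illustration (it is not measured on SNLI), as is the idea that the pivot-language text has about the same length as the English text.

```python
def months_on_free_tier(n_sentences: int, avg_chars: int, n_languages: int,
                        free_chars_per_month: int = 500_000) -> int:
    """Months needed if only the free monthly quota is used (rounded up)."""
    # Each back-translation sends the text twice: English -> pivot, pivot -> English.
    total_chars = n_sentences * avg_chars * 2 * n_languages
    return -(-total_chars // free_chars_per_month)  # ceiling division


# Assumed average of 50 characters per sentence, one pivot language.
months = months_on_free_tier(1_100_000, avg_chars=50, n_languages=1)
```

Under these assumptions a single pivot language would take about 220 months of free quota, so the free tier alone is unlikely to be practical for the full dataset.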
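For reference, a minimal sketch of a batch call with the `google-cloud-translate` client library (V3) might look like this. It assumes the package is installed, that application default credentials are configured, and that `project_id` is your own Google Cloud project; `parent_path` and `translate_texts` are hypothetical helper names.

```python
from typing import List


def parent_path(project_id: str) -> str:
    """Resource name the V3 API expects for the global location."""
    return f"projects/{project_id}/locations/global"


def translate_texts(texts: List[str], target: str, project_id: str,
                    source: str = "en") -> List[str]:
    # Imported here so the module loads even without google-cloud-translate installed.
    from google.cloud import translate_v3 as translate

    client = translate.TranslationServiceClient()
    response = client.translate_text(
        request={
            "parent": parent_path(project_id),
            "contents": texts,
            "mime_type": "text/plain",
            "source_language_code": source,
            "target_language_code": target,
        }
    )
    return [t.translated_text for t in response.translations]
```

Calling `translate_texts(sentences, "fr", project_id)` and then `translate_texts(result, "en", project_id, source="fr")` would give one back-translation round trip; both calls count against the character quota.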
