I would like to use a data-augmentation method for NLP that consists of back-translating a dataset.
Basically, I have a large dataset (SNLI) consisting of 1,100,000 English sentences. What I need to do is translate these sentences into another language, then translate them back into English.
I may have to do this for several languages, so I have a lot of translations to do.
I need a free solution.
I tried several Python modules for translation, but due to recent changes in the Google Translate API, most of them do not work. googletrans seems to work if we apply this solution.
However, it does not work for a big dataset. Google imposes a limit of 15K characters (as pointed out by this, this and this). The first link shows a supposed work-around.
Even when I apply the work-around (initializing the Translator on every iteration), it does not work, and I get the following error:
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I tried using proxies and other Google Translate URLs:
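from googletrans import Translator

# Alternative Google Translate hosts to spread requests across
URLS = ['translate.google.com', 'translate.google.co.kr', 'translate.google.ac', 'translate.google.ad', 'translate.google.ae', ...]
# Proxies passed through to the underlying HTTP client
proxies = { 'http': '1.243.64.63:48730', 'https': '59.11.98.253:42645', }
t = Translator(service_urls=URLS, proxies=proxies)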
But this does not change anything.
My problem might come from the fact that I am using multi-threading: 100 workers translate the whole dataset. If they work in parallel, together they may send more than 15K characters.
But I need multi-threading. Without it, translating the whole dataset would take several weeks...
How do I fix this error so that I can translate all the sentences?
If that is not possible, is there a free alternative for machine translation in Python (it does not have to be Google Translate) that can handle such a big dataset?
More than a million sentences is a lot of text to translate.
Currently, Google Cloud Translation V3 offers a free tier quota that you may want to use (the first 500,000 characters each month are free). Since that is unlikely to be enough for your use case, you would probably need to create more than one billing account, or wait for the next month each time you want to translate more text.
Check this link to see how to perform text translation with Python.
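To get a feel for how far the free tier goes, here is a rough sketch of the arithmetic. The function names are mine, and the `passes=2` default assumes each sentence is billed once on the way to the other language and once on the way back, with the translated text being about as long as the original (which is only an approximation).

```python
import math

# Free tier figure quoted above; adjust if the quota changes.
FREE_CHARS_PER_MONTH = 500_000


def total_characters(sentences):
    """Count the characters that would be sent for translation."""
    return sum(len(s) for s in sentences)


def months_needed(total_chars, passes=2, free_per_month=FREE_CHARS_PER_MONTH):
    """Estimate the months of free quota needed for one pivot language.

    passes=2 is an assumption: one pass into the pivot language and one
    pass back to English, each roughly the same length as the source.
    """
    return math.ceil(total_chars * passes / free_per_month)
```

For example, `months_needed(total_characters(sentences))` gives the estimate for a single pivot language on a single billing account; multiply by the number of languages you plan to use.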
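Whichever service you end up using, it may help to group sentences into requests that stay under a character budget instead of sending them one at a time. Below is a minimal sketch of that idea, not a definitive implementation. `batch_by_chars` and `back_translate` are names I made up, the 15,000 default is just the limit mentioned in the question, and `translate` is a placeholder for whatever callable wraps your translation client (it is assumed to take a list of strings, a source language code and a target language code, and to return a list of translated strings).

```python
def batch_by_chars(sentences, max_chars=15_000):
    """Yield lists of sentences whose combined length stays within max_chars.

    A single sentence longer than max_chars is still yielded on its own,
    so the caller has to handle that case if it can occur.
    """
    batch, size = [], 0
    for sentence in sentences:
        if batch and size + len(sentence) > max_chars:
            yield batch
            batch, size = [], 0
        batch.append(sentence)
        size += len(sentence)
    if batch:
        yield batch


def back_translate(sentences, translate, pivot, source="en", max_chars=15_000):
    """Translate source -> pivot -> source, one batch at a time.

    `translate(batch, src, dst)` is a hypothetical callable supplied by
    the caller; this sketch does not call any real translation API.
    """
    result = []
    for batch in batch_by_chars(sentences, max_chars):
        pivoted = translate(batch, source, pivot)
        # Translated text can be longer than the original, so re-batch it.
        for pivot_batch in batch_by_chars(pivoted, max_chars):
            result.extend(translate(pivot_batch, pivot, source))
    return result
```

Keeping the translation call behind a plain callable also makes it easy to add retries or a delay between requests in one place, and to test the batching with a fake translator before spending any quota.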