
Google Translation API is blocking my IP address for too many requests

I'm setting up a Django view that requests product data from an API, parses it with BeautifulSoup, translates it with the googletrans module, and saves the result to my PostgreSQL database.

Everything was working fine yesterday until Google suddenly blocked my IP address for making too many requests at once.

I switched to my LTE connection to get a different IP address, and it worked again.

But to make sure it doesn't happen with this IP address too, I need a way to call the googletrans API in batches, or some other solution that prevents me from getting blocked again.

This is my view:

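from bs4 import BeautifulSoup
from django.shortcuts import render
from googletrans import Translator
import requests
import json

# Product (the model) and HEADOUT_PRODUCTION_API_KEY are defined elsewhere in the project.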


def api_data(request):
    if request.GET.get('mybtn'):  # to improve, == 'something':
        resp_1 = requests.get(
            "https://www.headout.com/api/public/v1/product/listing/list-by/city?language=fr&cityCode=PARIS&limit=5000&currencyCode=CAD",
            headers={
                "Headout-Auth": HEADOUT_PRODUCTION_API_KEY
            })
        resp_1_data = resp_1.json()
        base_url_2 = "https://www.headout.com/api/public/v1/product/get/"

        translator = Translator()

        for item in resp_1_data['items']:
            print('translating item {}'.format(item['id']))
            # concat ID to the URL string
            url = '{}{}'.format(base_url_2, item['id'] + '?language=fr')

            # make the HTTP request
            resp_2 = requests.get(
                url,
                headers={
                    "Headout-Auth": HEADOUT_PRODUCTION_API_KEY
                })
            resp_2_data = resp_2.json()

            descriptiontxt = resp_2_data['contentListHtml'][0]['html'][0:2040] + ' ...'

            #Parsing work
            soup = BeautifulSoup(descriptiontxt, 'lxml')
            parsed = soup.find('p').text

            #Translation doesn't work
            translation = translator.translate(parsed, dest='fr')

            titlename = item['name']
            titlefr = translator.translate(titlename, dest='fr')

            destinationname = item['city']['name']
            destinationfr = translator.translate(destinationname, dest='fr')

            Product.objects.get_or_create(
                title=titlefr.text,
                destination=destinationfr.text,
                description=translation.text,
                link=item['canonicalUrl'],
                image=item['image']['url']
            )

    return render(request, "form.html")

How can I call the Google translation API in batches? Or is there another solution for this?

Please help.

EDIT

Based on @ddor254's answer, where should I put the time.sleep(2)?

This is what I came up with. Is this okay?

  Product.objects.get_or_create(
      title=titlefr.text,
      destination=destinationfr.text,
      description=translation.text,
      link=item['canonicalUrl'],
      image=item['image']['url']
  )
  time.sleep(2)  # here

or like this:

resp_1 = requests.get(
            "https://www.headout.com/api/public/v1/product/listing/list-by/city?language=fr&cityCode=PARIS&limit=5000&currencyCode=CAD",
            headers={
                "Headout-Auth": HEADOUT_PRODUCTION_API_KEY
            }, time.sleep(2)) #here

I just want to make sure this is the right way to do it before I risk getting this new IP blocked as well.

I suggest you read this article from MDN: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429

If this is the response you are getting, look at the Retry-After header in the response object.

Adding a sleep or another delay that uses the value of that header might fix your problem.
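As a minimal sketch of that idea: per the HTTP specification, Retry-After carries either a number of seconds or an HTTP date, so a small helper can turn it into a delay. The function name retry_after_seconds and the 60-second default are assumptions made for illustration. Whether googletrans exposes the underlying response headers is not covered here; the helper only needs a mapping with a get method, such as the headers attribute of a requests response.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers, default=60.0):
    """Return how many seconds to wait, based on a Retry-After header.

    `default` (an assumed fallback) is used when the header is
    missing or cannot be parsed.
    """
    value = headers.get("Retry-After")
    if value is None:
        return default
    value = value.strip()
    # Form 1: a plain number of seconds, e.g. "120".
    if value.isdigit():
        return float(value)
    # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    remaining = (when - datetime.now(timezone.utc)).total_seconds()
    # A date in the past means there is nothing left to wait for.
    return max(0.0, remaining)
```

In a loop like the one in the question, this could be used as `if resp_2.status_code == 429: time.sleep(retry_after_seconds(resp_2.headers))` before retrying the request.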

Try adding delays between consecutive queries (using sleep) and experiment with the numbers to see what works for you. A 2-second delay after every pair of translations and a 15-second delay after every 10 works fine for me.
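The schedule just described (2 seconds after every second translation, 15 seconds after every tenth) could be sketched roughly as follows. The names pause_after and translate_all are hypothetical, and the numbers are only the ones reported above, not limits documented by Google. The sleep call sits inside the loop, after each translation, rather than inside the arguments of another call.

```python
import time


def pause_after(count):
    """Seconds to wait once `count` translations have been made."""
    if count == 0:
        return 0
    if count % 10 == 0:
        return 15  # longer pause after every 10th translation
    if count % 2 == 0:
        return 2   # short pause after every pair
    return 0


def translate_all(texts, translate, sleep=time.sleep):
    """Translate each item of the list `texts`, pausing between requests.

    `translate` is any callable that takes a string and returns its
    translation; `sleep` is injectable so the schedule can be tested.
    """
    results = []
    for count, text in enumerate(texts, start=1):
        results.append(translate(text))
        delay = pause_after(count)
        # No need to wait after the very last item.
        if delay and count < len(texts):
            sleep(delay)
    return results
```

With the translator from the question, the callable could be something like `lambda s: translator.translate(s, dest='fr').text`.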

I have been blocked too for sending too many requests in a row. I usually get blocked after about 500 requests. What I did was pause for 60 seconds after every 100 requests. It may seem long, but it works. You could probably get away with a 45-second pause, but I set it to 60 just to be safe.

Here's an example

import time

from googletrans import Translator

translator = Translator()


class GoogleAPI:

    def __init__(self):
        self.limit_before_timeout = 100  # requests allowed between pauses
        self.timeout = 60                # pause length in seconds

    def translate(self, source):
        translation = translator.translate(source, dest="ar").text
        # Return None for empty results
        if translation:
            return translation

    def process(self, list_of_data):
        results = []
        print("initiation")
        for i, t in enumerate(list_of_data, start=1):
            results.append(self.translate(t))
            # Pause after each full batch; every item still gets translated
            if i % self.limit_before_timeout == 0:
                print("{} items translated, pausing".format(i))
                time.sleep(self.timeout)
        print("All done")
        return results

My IP gets blocked after about 450 requests. I am using a PHP for loop to translate my array of texts.

So I changed my IP address and changed my code to wait after every x requests.

My code inside the for loop ($i is the loop counter):

if ($i % 100 == 0 && $i != 0) {
    // wait 60 seconds every 100
    usleep(60000000); // 60 seconds
    echo str_pad("XX--> WAITING 60 SECONDS<br>", 4096);
} else if ($i % 10 == 0 && $i != 0) {
    // wait 15 seconds every 10
    usleep(15000000); // 15 seconds
    echo str_pad("XX--> WAITING 15 SECONDS<br>", 4096);
} else if ($i % 2 == 0 && $i != 0) {
    // wait 2 seconds every 2
    usleep(2000000); // 2 seconds
    echo str_pad("XX--> WAITING 2 SECONDS<br>", 4096);
}


 