
Python requests or beautifulsoup4 delay

I wrote code that scrapes the Binance announcement page ( https://www.binance.com/en/support/announcement/c-48?navId=48 ), takes the latest (topmost) <a> tag, and does something with it. The problem is that when Binance publishes a new announcement with a new <a> tag, my code only detects it 3-5 minutes later. I ran the same code against my personal site and it picks up changes with no delay. Why is that, and what might cause this?

import random
import time

import requests_cache
from bs4 import BeautifulSoup

# URL taken from the question text above
siteUrl = "https://www.binance.com/en/support/announcement/c-48?navId=48"

session = requests_cache.CachedSession('demo_cache')

####### first check of <a> ########
def getFirstLink():
    pageForFirstCheck = session.get(siteUrl)
    soupForFirstCheck = BeautifulSoup(pageForFirstCheck.content, "html.parser")
    resultForFirstCheck = soupForFirstCheck.find('div', class_='css-6f91y1')
    firstDiv = resultForFirstCheck.find('div', class_='css-vurnku')
    firstLink = firstDiv.find('a')
    prevLink = firstLink.get_text()  # text of the topmost <a>
    return prevLink

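I call this function inside a while True loop:

while True:
    time.sleep(random.randint(1, 5))
    try:
        stringThatCameFromLink = getFirstLink()
        # ... it does something with that link here ...
    except Exception as exc:
        # The posted snippet ends inside the try block; this handler is a placeholder
        print(f"check failed: {exc}")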

Thank you in advance!

I think the problem is that the Cloudflare server is caching the documents. Or it may have been done deliberately by the Binance developers, so that a narrow circle of people can react to the news faster than everyone else. Either way, it is a big problem if you want fresh data.

If you look at the HTTP response headers, you will notice that the "Date:" header itself is served from the cache, which means the entire document is cached. I managed to get 2 different "Date:" values by adding or removing the gzip header, "accept-encoding: gzip, deflate".

I am using the page https://www.binance.com/bapi/composite/v1/public/cms/article/catalog/list/query?catalogId=48&pageNo=1&pageSize=15 . If you change the "pageSize" parameter, the server returns a freshly cached response. But that still doesn't solve the 5 minute delay, and I still see the old page.

Your link, https://www.binance.com/en/support/announcement/c-48?navId=48 , like mine, https://www.binance.com/bapi/composite/v1/public/cms/article/catalog/list/query?catalogId=48&pageNo=1&pageSize=15 , is also cached for 5 seconds, and my guess is that it will show the 5 minute delay as well. I have not found a solution to this problem.
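
To repeat the header check described in this answer, a minimal sketch along these lines might help. It requests the same endpoint with different "pageSize" values, once with gzip and once without, and prints only the caching-related headers. The helper names (build_url, cache_headers, probe, main) are made up for illustration, the "CF-Cache-Status" header is an assumption about what a Cloudflare-fronted server may send, and "Accept-Encoding: identity" is used here as a stand-in for removing the gzip header.

```python
from urllib.parse import urlencode

# Endpoint quoted in the answer above.
API_URL = "https://www.binance.com/bapi/composite/v1/public/cms/article/catalog/list/query"

# Headers that usually reveal caching. "CF-Cache-Status" is Cloudflare-specific
# and is an assumption here; the server may not send it at all.
CACHE_HEADERS = ("Date", "Age", "Cache-Control", "CF-Cache-Status")


def build_url(page_size, catalog_id=48, page_no=1):
    """Build the query URL; varying page_size yields a different cache entry."""
    query = urlencode({"catalogId": catalog_id, "pageNo": page_no, "pageSize": page_size})
    return f"{API_URL}?{query}"


def cache_headers(headers):
    """Keep only the caching-related entries of a response header mapping."""
    return {name: headers[name] for name in CACHE_HEADERS if name in headers}


def probe(url, accept_encoding):
    """Fetch url with the given Accept-Encoding and return its caching headers."""
    # Imported here so the pure helpers above work without the third-party package.
    import requests

    response = requests.get(url, headers={"Accept-Encoding": accept_encoding}, timeout=10)
    return cache_headers(response.headers)


def main():
    # Compare two pageSize values, each with and without gzip.
    for page_size in (15, 16):
        url = build_url(page_size)
        for encoding in ("gzip, deflate", "identity"):
            print(page_size, repr(encoding), probe(url, encoding))
```

Calling main() a few times, some seconds apart, shows whether the "Date:" value stays frozen for a given URL and encoding. If it does, the response is coming from a server-side cache rather than being generated per request, and nothing in the client code will shorten the delay.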


 