I am learning Beautiful Soup for Python and trying to parse the website https://www.twitteraudit.com/. When I enter a Twitter ID in the search bar, results for some IDs come back in a fraction of a second, but other IDs take about a minute to process. In that case, how can I parse the HTML only after the page has finished loading and the result is ready? I tried looping, but that did not work. What I did notice is that if I open the link in a browser and wait for it to finish, the result appears to be cached, and the next time I run my script for the same ID it works perfectly.
Can anyone help me out with this? I appreciate the help. My code is below:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import re
from re import sub

def HTML(myURL):
    # Download the page and parse it with Beautiful Soup
    uClient = uReq(myURL)
    pageHTML = uClient.read()
    uClient.close()
    pageSoup = soup(pageHTML, "html.parser")
    return pageSoup

def fakecheck(usr):
    myURLfc = "https://www.twitteraudit.com/" + usr
    pgSoup = HTML(myURLfc)
    # foll is an empty list when the page has no audit block,
    # so foll[0] raises IndexError in that case
    foll = pgSoup.findAll("div", {"class": "audit"})
    link = foll[0].div.a["href"]
    real = foll[0].findAll("span", {"class": "real number"})[0]["data-value"]
    fake = foll[0].findAll("span", {"class": "fake number"})[0]["data-value"]
    scr = foll[0].findAll("div", {"class": "score"})[0].div
    scoresent = scr["class"][1]
    score = re.findall(r'\d{1,3}', str(scr))[0]
    return [link, real, fake, scoresent, score]

lis = ["BarackObama","POTUS44","ObamaWhiteHouse","MichelleObama","ObamaFoundation","NSC44","ObamaNews","WhiteHouseCEQ44","IsThatBarrak","obama_barrak","theprezident","barrakubama","BarrakObama","banackkobama","YusssufferObama","barrakisdabomb_","BarrakObmma","fuzzyjellymasta","BarrakObama6","bannalover101","therealbarrak","ObamaBarrak666","barrak_obama"]

for u in lis:
    link, real, fake, scoresent, score = fakecheck(u)
    print("link : " + link)
    print("Real : " + real)
    print("Fake : " + fake)
    print("Result : " + scoresent)
    print("Score : " + score)
    print("=================")

I think the problem is that some of the Twitter IDs have not yet been audited, which is why I was getting an IndexError. However, putting the call to fakecheck(u) in a while True: loop that catches that error will keep checking the website until an audit has been performed on that ID.
I put this code after the lis definition:
def get_fake_check(n):
    return fakecheck(n)

for u in lis:
    while True:
        try:
            link, real, fake, scoresent, score = get_fake_check(u)
            break
        except IndexError:
            # The audit is not available yet, so try again
            pass

I'm not sure whether there is a way to automate the audit request on the website. While a query was waiting, I manually clicked the "Audit" button on the website for that ID; once the audit completed, the script continued as usual until all the IDs had been processed.
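
One caveat with an unbounded while True: loop is that it re-requests the page as fast as it can and never gives up on an ID that is never audited. A minimal sketch of a bounded retry with a pause between attempts might look like the following. The helper name retry_until_ready and the values of max_attempts and delay_seconds are my own assumptions for illustration, not something the site documents; fetch stands in for any function shaped like fakecheck.

```python
import time


def retry_until_ready(fetch, user, max_attempts=10, delay_seconds=5.0):
    """Call fetch(user) until it stops raising IndexError, or give up.

    fetch is any callable shaped like fakecheck from the post.
    max_attempts and delay_seconds are illustrative values only.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(user)
        except IndexError:
            # The audit block was not in the page yet. Wait before the
            # next attempt, unless this was the last one.
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    return None  # the audit never appeared within the attempt budget


if __name__ == "__main__":
    # Stand-in for fakecheck so the sketch runs without network access:
    # it fails twice, then returns a made-up result list.
    state = {"calls": 0}

    def fake_fetch(user):
        state["calls"] += 1
        if state["calls"] < 3:
            raise IndexError("audit not ready")
        return [user, "real", "fake", "sentiment", "score"]

    print(retry_until_ready(fake_fetch, "example_user", delay_seconds=0.1))
```

With the code from the question, the call inside the for loop would become retry_until_ready(fakecheck, u), and a None result means that ID can be skipped or reported instead of blocking the rest of the list.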