I am trying to get information about any website a user tries to access. To block access to malicious websites, I need details such as blacklist status, IP address, server location, etc. I get this from the URLVoid scan page: https://www.urlvoid.com/scan/
URLVoid returns the results in a table, and I am trying to fetch the same fields in Spyder. See the table.
I am using a regex approach to pull the particulars from the table.
######
import httplib2
import re

def urlvoid(urlInput):
    h2 = httplib2.Http(".cache")
    resp, content2 = h2.request("https://www.urlvoid.com/scan/" + urlInput, "GET")
    content2String = str(content2)
    # Check whether URLVoid returned an error page
    rpderr = re.compile(r'<div\sclass="error">', re.IGNORECASE)
    rpdFinderr = re.findall(rpderr, content2String)
    if "error" in str(rpdFinderr):
        ipvoidErr = True
    else:
        ipvoidErr = False
    if ipvoidErr == False:
        # Extract the "Server Location" cell from the results table
        rpd2 = re.compile(r'(?<=Server Location</span></td><td>)[a-zA-Z0-9.]+(?=</td></tr>)')
        rpdFind2 = re.findall(rpd2, content2String)
        rpdSorted2 = sorted(rpdFind2)
        return rpdSorted2

urlvoid("google.com")
######
However, it is not very efficient, and this regex does not work for all websites. Is there a simpler way to get all this information?
I do not suggest scraping this data with regex, because it can be done with bs4 (BeautifulSoup); building a regex that handles every case takes a long time and ends up with complex conditions.
import requests
from bs4 import BeautifulSoup

def urlvoid(urlInput):
    url = "https://www.urlvoid.com/scan/" + urlInput
    res = requests.get(url)
    text = res.text
    # Locate the report table by its CSS classes
    soup = BeautifulSoup(text, "lxml").find("table", class_="table table-custom table-striped")
    all_tr = soup.find_all("tr")
    # Map each row's first cell (the label) to its second cell (the value),
    # stripping non-breaking spaces
    value = {tr.find_all("td")[0].text:
             tr.find_all("td")[1].text.replace("\xa0", "")
             for tr in all_tr}
    print(value)

urlvoid("google.com")
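The row-to-dict technique above can be demonstrated offline on a small inline snippet that mimics the structure of the URLVoid report table (the HTML and its values below are hypothetical, for illustration only):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet shaped like the URLVoid report table
html = """
<table class="table table-custom table-striped">
  <tr><td>IP Address</td><td>142.250.74.46</td></tr>
  <tr><td>Server Location</td><td>(US)\xa0United States</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find(
    "table", class_="table table-custom table-striped")

# Label -> value, stripping non-breaking spaces as in the answer above
value = {tr.find_all("td")[0].text:
         tr.find_all("td")[1].text.replace("\xa0", "")
         for tr in table.find_all("tr")}

print(value["IP Address"])        # each field becomes a plain dict lookup
print(value["Server Location"])
```

Once the table is a dict, each field (blacklist status, IP address, server location) is a simple key lookup instead of a per-field regex.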