Extracting table data from a website using Python

Question

I am trying to get information about any website that a user tries to access. To block access to malicious websites, I need details such as blacklist status, IP address, and server location. I currently get these from the URLVoid website: <https://www.urlvoid.com/scan/>

The results come back in a table, and I am trying to fetch the same data in Spyder. See the table.

I am using a regex approach to extract the details from the table.

######

import re

import httplib2


def urlvoid(urlInput):
    # Fetch the URLVoid scan page for the given domain (responses are cached in ".cache").
    h2 = httplib2.Http(".cache")
    resp, content2 = h2.request("https://www.urlvoid.com/scan/" + urlInput, "GET")
    # content2 is bytes; decode it, because str() on bytes would yield "b'...'".
    content2String = content2.decode("utf-8", errors="replace")

    # The original check treats a <div class="error"> element as a failed scan.
    rpderr = re.compile(r'<div\s+class="error">', re.IGNORECASE)
    if rpderr.search(content2String):
        # Return an empty list instead of referencing an undefined variable.
        return []

    # Capture the cell that follows the "Server Location" label.
    rpd2 = re.compile(r'(?<=Server Location</span></td><td>)[a-zA-Z0-9.]+(?=</td></tr>)')
    return sorted(rpd2.findall(content2String))


print(urlvoid("google.com"))
######

However, it is not very efficient, and this regex does not work for all websites. Is there a simpler way to get all this information?

Answer 1

I would not recommend scraping this data with regex. BeautifulSoup (bs4) can do it directly, whereas a regex that covers every case takes a long time to build and needs complex conditions.
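
To illustrate how brittle the regex is, here is a small sketch that runs the pattern from the question against two made-up table rows. The HTML strings are assumptions for illustration only, not actual URLVoid output. The character class `[a-zA-Z0-9.]+` cannot match a value that contains spaces or parentheses, so the second row is silently missed.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(
    r'(?<=Server Location</span></td><td>)[a-zA-Z0-9.]+(?=</td></tr>)'
)

# Hypothetical rows for illustration only; real URLVoid markup may differ.
row_simple = '<tr><td><span>Server Location</span></td><td>US</td></tr>'
row_spaces = '<tr><td><span>Server Location</span></td><td>(US) United States</td></tr>'

# The first row matches; the second does not, because of the "(" and the spaces.
matches_simple = pattern.findall(row_simple)
matches_spaces = pattern.findall(row_spaces)

print(matches_simple)
print(matches_spaces)
```

Widening the character class would fix this one case, but every new variation in the markup (extra attributes, nested tags, whitespace) would need another adjustment.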

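import requests
from bs4 import BeautifulSoup


def urlvoid(urlInput):
    url = "https://www.urlvoid.com/scan/" + urlInput
    res = requests.get(url)
    # Locate the report table by its exact class string.
    table = BeautifulSoup(res.text, "lxml").find("table", class_="table table-custom table-striped")
    if table is None:
        # No report table found (for example, an error page was returned).
        return {}
    # Map each row's first cell (label) to its second cell (value),
    # skipping rows that do not have two cells.
    value = {}
    for tr in table.find_all("tr"):
        tds = tr.find_all("td")
        if len(tds) >= 2:
            value[tds[0].text] = tds[1].text.replace("\xa0", "")
    return value


print(urlvoid("google.com"))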
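
If installing bs4 and lxml is not an option, the same label-to-value mapping can be sketched with the standard library's `html.parser`. This is a swapped-in technique, not what the answer above uses. The names `TableRowParser` and `table_to_dict` and the sample HTML are hypothetical, chosen only so the example runs without network access.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td> cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join the fragments and drop non-breaking spaces, as the answer does.
            self._row.append("".join(self._cell).replace("\xa0", "").strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_dict(html):
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    # Keep only rows that have both a label cell and a value cell.
    return {row[0]: row[1] for row in parser.rows if len(row) >= 2}


# Hypothetical sample; not actual URLVoid output.
sample_html = """
<table class="table table-custom table-striped">
  <tr><td><span>IP Address</span></td><td><strong>203.0.113.7</strong></td></tr>
  <tr><td><span>Server Location</span></td><td>(US) United States</td></tr>
</table>
"""

result = table_to_dict(sample_html)
print(result)
```

On a real page, the downloaded HTML text would take the place of `sample_html`. Note that this parser collects rows from every table on the page, not just the report table, so it may need an extra filter.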

Solution 1
0 2018-11-15 03:54:15
