Extracting table data from a website using Python

Question

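I'm trying to gather information about any website a user attempts to visit. To block access to malicious sites, I need details such as blacklist status, IP address, and server location. I get these from the URLVoid website: https://www.urlvoid.com/scan/

The site returns the following results as a table, and I'm trying to retrieve the same results in Spyder. See the table.

I'm using regular expressions to extract the details from the table.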

######

import re

import httplib2


def urlvoid(urlInput):
    h2 = httplib2.Http(".cache")
    resp, content2 = h2.request("https://www.urlvoid.com/scan/" + urlInput, "GET")
    # The body arrives as bytes; decode it rather than calling str() on it.
    content2String = content2.decode("utf-8", errors="replace")

    # Same check as before: a <div class="error"> means the lookup failed.
    # Return an empty list so the function always returns something.
    if re.search(r'<div\s+class="error">', content2String, re.IGNORECASE):
        return []

    # Capture the text of the cell that follows the "Server Location" label.
    rpd2 = re.compile(r'(?<=Server Location</span></td><td>)[a-zA-Z0-9.]+(?=</td></tr>)')
    return sorted(rpd2.findall(content2String))


print(urlvoid("google.com"))
######

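However, this is not very efficient, and the regular expression does not work for every website. Is there a simpler way to get all of this information?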

Answer 1

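I don't recommend scraping the data with regular expressions, because bs4 can do the job. A regex that handles this would take a long time to build and would need complicated conditions.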

######
import requests
from bs4 import BeautifulSoup


def urlvoid(urlInput):
    url = "https://www.urlvoid.com/scan/" + urlInput
    res = requests.get(url, timeout=10)
    res.raise_for_status()
    soup = BeautifulSoup(res.text, "lxml")
    # Locate the report table by its exact class attribute.
    table = soup.find("table", class_="table table-custom table-striped")
    if table is None:
        return {}
    value = {}
    for tr in table.find_all("tr"):
        tds = tr.find_all("td")
        # Skip rows that lack a label cell and a value cell.
        if len(tds) >= 2:
            value[tds[0].text] = tds[1].text.replace("\xa0", "")
    return value


print(urlvoid("google.com"))
######
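To try the row-to-dictionary idea without network access or third-party packages, here is a minimal sketch that uses `html.parser` from the standard library instead of bs4. The sample markup, the `TableRowParser` class, and the `table_to_dict` helper are all made up for illustration; the real URLVoid page may be structured differently, and this sketch replaces non-breaking spaces with a regular space rather than removing them.

```python
from html.parser import HTMLParser

# Invented sample markup for illustration; not actual URLVoid output.
SAMPLE_HTML = """
<table class="table table-custom table-striped">
  <tr><td><span>Server Location</span></td><td>(US)&nbsp;United States</td></tr>
  <tr><td><span>IP Address</span></td><td>192.0.2.10</td></tr>
  <tr><th>Header only row</th></tr>
</table>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            text = "".join(self._cell).replace("\xa0", " ").strip()
            self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_dict(html):
    """Map the first cell of each row to the second cell."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    # Keep only rows that have both a label cell and a value cell.
    return {row[0]: row[1] for row in parser.rows if len(row) >= 2}


result = table_to_dict(SAMPLE_HTML)
print(result)
```

Because the parser walks the tag structure instead of matching one exact string of markup, small changes such as extra attributes or whitespace between cells do not break it the way they would break a lookbehind regex.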

Solution 1
0 2018-11-15 03:54:15
