
Unable to extract data using BeautifulSoup

I am trying to get the list of servers from this sample.

https://pastebin.com/eHGwhVmz

from bs4 import BeautifulSoup as bs

with open('html.txt', 'r') as html:
    soup = bs(html, 'html.parser')
    div = soup.find('div', class_='grid_8')
    for tag in div:
        tag = div.find_all('td', class_='StatTDLabel')[2].text
        print(tag)

I can get the first server in the list, but I can't iterate over the rest. When I try a for loop, I get the same result on every pass.

Is this what you are looking for?

from bs4 import BeautifulSoup
from tabulate import tabulate

sample_html = """The contents of your pastebin"""

soup = BeautifulSoup(sample_html, "html.parser").find_all("tr")
servers = [
    [i.getText(strip=True) for i in row.find_all("td")] for row in soup[1:]
]
print(tabulate(servers, headers=["Country", "Location", "Address", "Status"]))

Output:

Country    Location      Address               Status
---------  ------------  --------------------  -------------
ZA         Johannesburg  jnb-c17.ipvanish.com  15 % capacity
ZA         Johannesburg  jnb-c18.ipvanish.com  15 % capacity
ZA         Johannesburg  jnb-c19.ipvanish.com  31 % capacity
ZA         Johannesburg  jnb-c20.ipvanish.com  12 % capacity
ZA         Johannesburg  jnb-c21.ipvanish.com  9 % capacity
ZA         Johannesburg  jnb-c22.ipvanish.com  10 % capacity
AL         Tirana        tia-c02.ipvanish.com  17 % capacity
AL         Tirana        tia-c03.ipvanish.com  23 % capacity
AL         Tirana        tia-c04.ipvanish.com  19 % capacity
AL         Tirana        tia-c05.ipvanish.com  15 % capacity
AE         Dubai         dxb-c01.ipvanish.com  30 % capacity
AE         Dubai         dxb-c02.ipvanish.com  26 % capacity

To get only the server addresses, select the third column, which has index 2.

For example:

servers = [
    [i.getText(strip=True) for i in row.find_all("td")][2] for row in soup[1:]
]
print("\n".join(servers))

Output:

jnb-c17.ipvanish.com
jnb-c18.ipvanish.com
jnb-c19.ipvanish.com
jnb-c20.ipvanish.com
jnb-c21.ipvanish.com
jnb-c22.ipvanish.com
tia-c02.ipvanish.com
tia-c03.ipvanish.com
tia-c04.ipvanish.com
tia-c05.ipvanish.com
dxb-c01.ipvanish.com
dxb-c02.ipvanish.com
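
The column selection above can also be written as a small standalone function. This is only a sketch: the pastebin markup is not reproduced here, so `sample_html` is a made-up stand-in (a plain table with a header row), and `extract_addresses` is a hypothetical helper name. Instead of slicing off the first row, it skips any row that has too few `td` cells, which may be a little more forgiving if the real page contains header or spacer rows.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the pastebin markup; the real page may differ.
sample_html = """
<table>
  <tr><th>Country</th><th>Location</th><th>Address</th><th>Status</th></tr>
  <tr><td>ZA</td><td>Johannesburg</td><td>jnb-c17.ipvanish.com</td><td>15 % capacity</td></tr>
  <tr><td>AL</td><td>Tirana</td><td>tia-c02.ipvanish.com</td><td>17 % capacity</td></tr>
</table>
"""


def extract_addresses(html, column=2):
    """Return the text of one column from every row that has enough cells."""
    soup = BeautifulSoup(html, "html.parser")
    addresses = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        # Rows without enough <td> cells (e.g. a header row of <th>) are skipped.
        if len(cells) > column:
            addresses.append(cells[column].get_text(strip=True))
    return addresses


print("\n".join(extract_addresses(sample_html)))
```

Passing a different `column` value would pick another field, for example `column=3` for the status text in this sample.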

Try this:

from bs4 import BeautifulSoup as bs

with open('html.txt', 'r') as html:
    soup = bs(html, 'html.parser')
    div = soup.find('div', class_='grid_8')  # container div used in the question
    tags = div.find_all('td', class_='StatTDLabel')
    for tag in tags:
        # Take only the element's immediate text and ignore text inside child elements
        tagtext = tag.find(string=True, recursive=False)
        if tagtext:
            print(tagtext)
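
As for why the loop in the question prints the same value every time: the loop body always indexes `[2]` into the full result list and never uses the loop variable. A simplified illustration follows, assuming made-up markup in which each `td.StatTDLabel` holds one address (the real page may be laid out differently); the names `repeated` and `distinct` are only for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only meant to mirror the class names from the question.
sample_html = """
<div class="grid_8">
  <table>
    <tr><td class="StatTDLabel">jnb-c17.ipvanish.com</td></tr>
    <tr><td class="StatTDLabel">jnb-c18.ipvanish.com</td></tr>
    <tr><td class="StatTDLabel">jnb-c19.ipvanish.com</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
div = soup.find("div", class_="grid_8")
labels = div.find_all("td", class_="StatTDLabel")

# Pattern from the question (simplified): the index is fixed at 2,
# so every pass reads the same cell regardless of the loop variable.
repeated = [labels[2].get_text(strip=True) for _ in labels]

# Iterating over the matches themselves reads each cell once.
distinct = [td.get_text(strip=True) for td in labels]

print(repeated)
print(distinct)
```

The first list contains the third address three times, while the second contains each address once.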


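Notice: Technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.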

 