
How would I use Beautiful Soup to web-scrape data from this website?

[screenshot of the page and the asker's code]

HTML

Above is the HTML, what the website looks like, and my code. I am trying to extract this information into a dictionary, for example {"Official Symbol": "ELF4"} and so on. I have already watched a few tutorials but I'm still confused. Can anyone help me out?

import requests
from bs4 import BeautifulSoup

url = "https://www.ncbi.nlm.nih.gov/gene/2000"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
#text_found = soup.find("dd",attrs={"class":"noline"}).text

# Print the text of every <dd> (the values)
dd_data = soup.find_all("dd")
for dditem in dd_data:
    if dditem.string is not None:
        print(dditem.string)

# Print the text of every <dt> (the labels)
dt_data = soup.find_all("dt")
for dtitem in dt_data:
    if dtitem.string is not None:
        print(dtitem.string)

To scrape the data as a dict, see the following example:

import requests
from bs4 import BeautifulSoup


URL = "https://www.ncbi.nlm.nih.gov/gene/2000"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

result = {
    # normalize whitespace in the <dt> label; take the first text node of the <dd>
    " ".join(k.text.split()): v.find_next(string=True)
    # zip pairs each <dt> with its matching <dd> (two nested loops would
    # produce the cross product instead)
    for k, v in zip(soup.select("dt.noline"), soup.select("dd.noline"))
}


print(result)

Output:

{'Official Symbol': 'ELF4'}

I think you can just create two lists and fill them in a single loop, like this:

# Restrict the search to the summary definition list
summary = soup.find("dl", { "id" : "summaryDl" })

labels = []   # text from <dt> tags (e.g. "Official Symbol")
values = []   # text from <dd> tags (e.g. "ELF4")

# Append them sequentially, assuming <dt>/<dd> pairs appear in order
for item in summary.find_all():
    if item.name == "dt":
        labels.append(item.text.strip())
    if item.name == "dd":
        values.append(item.text.strip())

# zip the two lists together creating a list of pairs, then make a dictionary out of the list of pairs
result = dict(zip(labels, values))
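The zip-and-dict step above can be checked without any scraping. Here is a minimal sketch using hard-coded lists standing in for the extracted `<dt>`/`<dd>` text (the sample values are illustrative, not taken from the NCBI page):

```python
# Hypothetical label/value lists standing in for the scraped <dt>/<dd> text
labels = ["Official Symbol", "Official Full Name", "Primary source"]
values = ["ELF4", "E74 like ETS transcription factor 4", "HGNC:3319"]

# zip pairs the i-th label with the i-th value; dict() turns the pairs into a mapping
result = dict(zip(labels, values))
print(result["Official Symbol"])  # ELF4
```

Note that if the two lists differ in length, `zip` silently stops at the shorter one, so a mismatched `<dt>`/`<dd>` count drops trailing entries rather than raising an error.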
