
How would I use Beautiful Soup to web-scrape data from this website?

[screenshot of the page and the asker's code]

HTML

Above is the HTML, what the website looks like, and my code. I am trying to extract this information into a dictionary, for example {"Official Symbol": "ELF4"} and so on. I have already watched a few tutorials but I'm still confused. Can anyone help me out?

import requests
from bs4 import BeautifulSoup

url = "https://www.ncbi.nlm.nih.gov/gene/2000"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
#text_found = soup.find("dd",attrs={"class":"noline"}).text

# Print the text of every <dd> (the values)
dd_data = soup.find_all("dd")
for dditem in dd_data:
    if dditem.string is not None:
        print(dditem.string)

# Print the text of every <dt> (the labels)
dt_data = soup.find_all("dt")
for dtitem in dt_data:
    if dtitem.string is not None:
        print(dtitem.string)

To scrape the data as a dict, see the following example:

import requests
from bs4 import BeautifulSoup


URL = "https://www.ncbi.nlm.nih.gov/gene/2000"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

result = {
    # normalize whitespace in the <dt> label; take the first text node of the <dd>
    " ".join(k.text.split()): v.find_next(string=True)
    # zip pairs each <dt> with its matching <dd> (two nested loops would
    # produce the cross product instead)
    for k, v in zip(soup.select("dt.noline"), soup.select("dd.noline"))
}


print(result)

Output:

{'Official Symbol': 'ELF4'}

I think you can just create two lists and fill them in a single loop, like this:

# Restrict the search to the summary definition list
summary = soup.find("dl", { "id" : "summaryDl" })

labels = []   # text from <dt> tags (e.g. "Official Symbol")
values = []   # text from <dd> tags (e.g. "ELF4")

# Append them sequentially, assuming <dt>/<dd> pairs appear in order
for item in summary.find_all():
    if item.name == "dt":
        labels.append(item.text.strip())
    if item.name == "dd":
        values.append(item.text.strip())

# zip the two lists together creating a list of pairs, then make a dictionary out of the list of pairs
result = dict(zip(labels, values))
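The zip-and-dict step above can be checked without any scraping. Here is a minimal sketch using hard-coded lists standing in for the extracted `<dt>`/`<dd>` text (the sample values are illustrative, not taken from the NCBI page):

```python
# Hypothetical label/value lists standing in for the scraped <dt>/<dd> text
labels = ["Official Symbol", "Official Full Name", "Primary source"]
values = ["ELF4", "E74 like ETS transcription factor 4", "HGNC:3319"]

# zip pairs the i-th label with the i-th value; dict() turns the pairs into a mapping
result = dict(zip(labels, values))
print(result["Official Symbol"])  # ELF4
```

Note that if the two lists differ in length, `zip` silently stops at the shorter one, so a mismatched `<dt>`/`<dd>` count drops trailing entries rather than raising an error.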
