How to extract web page text with BeautifulSoup (Python) from a paragraph element under a specific heading in a div

Question

Basically the title. I am trying to pull the paragraph text from the area underneath "GeneCards Summary for name_of_gene Gene" at https://www.genecards.org/cgi-bin/carddisp.pl?gene=IL6&keywords=il6, using the IL-6 gene as an example. What I would like to pull is just "IL6 (Interleukin 6) is a Protein Coding gene. Diseases associated with IL6 include Kaposi Sarcoma and Rheumatoid Arthritis, Systemic Juvenile. Among its related pathways are IL-1 Family Signaling Pathways and Immune response IFN alpha/beta signaling pathway. Gene Ontology (GO) annotations related to this gene include signaling receptor binding and growth factor activity."

I have been trying to use Beautiful Soup 4 with Python. The specific issue is that I don't know how to specify which text I want to pull from the page.

from bs4 import BeautifulSoup

from urllib.request import Request, urlopen

baseURL = "https://www.genecards.org/cgi-bin/carddisp.pl?gene="
GeneToSearch = input("Gene of Interest: ")
updatedURL = baseURL + GeneToSearch
print(updatedURL)

req = Request(updatedURL, headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(req).read()

soup = BeautifulSoup(response, 'lxml')

for tag in soup.find_all(['script', 'style']):
    tag.decompose()
soup.get_text(strip=True)
VALID_TAGS = ['div', 'p']

for tag in soup.findAll('GeneCards Summary for '+ GeneToSearch +    'Gene'):
    if tag.name not in VALID_TAGS:
        tag.replaceWith(tag.renderContents())

print(soup.text)

This just ends up giving me the text of every element on the page.

Answer 1

Try navigating between tags, something like this:

soup.select('.gc-subsection-header')[1].next_sibling.next_sibling.text

Ref.: Beautiful Soup
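
The one-liner above uses `.next_sibling` twice because the whitespace between two tags is itself a text node. Here is a minimal sketch on a made-up HTML fragment. The markup is an assumption modelled on the selector in this answer, not copied from GeneCards, and it uses index `[0]` because the fragment has only one header. It also shows `find_next_sibling`, which skips text nodes and may be less fragile:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real GeneCards markup may differ.
html = """
<div class="gc-subsection">
  <div class="gc-subsection-header"><h3>GeneCards Summary for IL6 Gene</h3></div>
  <p>IL6 (Interleukin 6) is a Protein Coding gene.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
header = soup.select(".gc-subsection-header")[0]

# The first sibling is the newline/indent text node between the two tags.
whitespace = header.next_sibling
# The second sibling is the <p> tag itself.
via_siblings = header.next_sibling.next_sibling.text

# find_next_sibling("p") ignores text nodes and returns the next <p> sibling.
via_find = header.find_next_sibling("p").get_text(strip=True)

print(repr(whitespace))
print(via_siblings)
print(via_find)
```

Both approaches depend on the page layout, so either can break if the site changes its markup.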

Answer 2

Using the latest version of BeautifulSoup, you can use a pseudo CSS selector (`:contains`) to search for a tag containing specific text. You can then navigate to the next `p` tag and extract the corresponding text:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

baseURL = "https://www.genecards.org/cgi-bin/carddisp.pl?gene="
GeneToSearch = input("Gene of Interest: ")
updatedURL = baseURL + GeneToSearch
print(updatedURL)

req = Request(updatedURL, headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(req).read()

soup = BeautifulSoup(response, 'lxml')

text_find = 'GeneCards Summary for ' + GeneToSearch + ' Gene'

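# Find the <h3> heading whose text contains the summary title
el = soup.select_one('h3:contains("' + text_find + '")')
# The summary is the first <p> that follows the heading's parent element
summary = el.parent.find_next('p').text.strip()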

print(summary)

Outputs:

IL6 (Interleukin 6) is a Protein Coding gene.
Diseases associated with IL6 include Kaposi Sarcoma and Rheumatoid Arthritis, Systemic Juvenile.
Among its related pathways are IL-1 Family Signaling Pathways and Immune response IFN alpha/beta signaling pathway.
Gene Ontology (GO) annotations related to this gene include signaling receptor binding and growth factor activity.
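
In newer releases of Soup Sieve (the selector library behind `select` and `select_one`), `:contains` is deprecated in favor of `:-soup-contains`. The sketch below assumes Soup Sieve 2.1 or later and wraps the same lookup in a function that returns `None` instead of raising when the heading is missing. The function name `extract_summary` and the sample markup are illustrative assumptions, not the actual GeneCards page:

```python
from bs4 import BeautifulSoup


def extract_summary(html, gene):
    """Return the text of the first <p> after the summary heading, or None."""
    soup = BeautifulSoup(html, "html.parser")
    heading_text = "GeneCards Summary for " + gene + " Gene"
    # :-soup-contains matches elements whose text contains the given string.
    heading = soup.select_one('h3:-soup-contains("' + heading_text + '")')
    if heading is None:
        return None  # heading not found (wrong gene symbol or changed markup)
    paragraph = heading.parent.find_next("p")
    return paragraph.get_text(strip=True) if paragraph else None


# Hypothetical fragment standing in for the downloaded page.
sample = """
<div class="gc-subsection">
  <div class="gc-subsection-header"><h3>GeneCards Summary for IL6 Gene</h3></div>
  <p>IL6 (Interleukin 6) is a Protein Coding gene.</p>
</div>
"""

print(extract_summary(sample, "IL6"))
print(extract_summary(sample, "TP53"))
```

With the live site, `response` from `urlopen(req).read()` would be passed in place of `sample`.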

Answer 1
0 2019-08-24 01:19:20

Answer 2
0 Accepted 2019-08-24 01:28:39
