
How to pull text from a paragraph element under a specific header inside a div using BeautifulSoup (Python)

Basically the title. I am trying to pull the paragraph text from the section under "GeneCards Summary for name_of_gene Gene" on https://www.genecards.org/cgi-bin/carddisp.pl?gene=IL6&keywords=il6, using the IL-6 gene as an example. I would like to pull just this text: "IL6 (Interleukin 6) is a Protein Coding gene. Diseases associated with IL6 include Kaposi Sarcoma and Rheumatoid Arthritis, Systemic Juvenile. Among its related pathways are IL-1 Family Signaling Pathways and Immune response IFN alpha/beta signaling pathway. Gene Ontology (GO) annotations related to this gene include signaling receptor binding and growth factor activity."

I have been trying to use Beautiful Soup 4 with Python. My specific problem is that I don't know how to tell it which text I want to pull from the page.

from bs4 import BeautifulSoup

from urllib.request import Request, urlopen

baseURL = "https://www.genecards.org/cgi-bin/carddisp.pl?gene="
GeneToSearch = input("Gene of Interest: ")
updatedURL = baseURL + GeneToSearch
print(updatedURL)

req = Request(updatedURL, headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(req).read()

soup = BeautifulSoup(response, 'lxml')

for tag in soup.find_all(['script', 'style']):
    tag.decompose()
soup.get_text(strip=True)
VALID_TAGS = ['div', 'p']

for tag in soup.findAll('GeneCards Summary for '+ GeneToSearch +    'Gene'):
    if tag.name not in VALID_TAGS:
        tag.replaceWith(tag.renderContents())

print(soup.text)

This just ends up giving me the text of every element on the page.

Try navigating between tags, something like this:

soup.select('.gc-subsection-header')[1].next_sibling.next_sibling.text

Ref.: Beautiful Soup
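
The double next_sibling hop in the line above is needed because the first sibling after a tag is usually a whitespace text node, not the next element. The following is a minimal sketch of that behaviour on a made-up HTML fragment. The markup is an assumption loosely modelled on the class name used in the answer and may not match the real GeneCards page. It also shows find_next_sibling as an alternative that skips text nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page layout may differ.
html = """
<div class="gc-subsection">
  <div class="gc-subsection-header"><h3>Entrez Gene Summary for IL6 Gene</h3></div>
  <p>Entrez summary text.</p>
</div>
<div class="gc-subsection">
  <div class="gc-subsection-header"><h3>GeneCards Summary for IL6 Gene</h3></div>
  <p>IL6 (Interleukin 6) is a Protein Coding gene.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Second header on the page, as in the answer's one-liner.
header = soup.select(".gc-subsection-header")[1]

# The first next_sibling is the whitespace between the tags,
# so a second hop is required to land on the <p> element.
chained = header.next_sibling.next_sibling.text

# find_next_sibling ignores text nodes and jumps to the next matching tag.
direct = header.find_next_sibling("p").get_text(strip=True)

print(chained)
print(direct)
```

Indexing with [1] depends on the order of sections on the page, so this approach may break if the site adds or reorders sections.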

With a recent version of BeautifulSoup you can use a CSS pseudo-class selector (:contains) to search for a tag containing specific text. Newer releases of Soup Sieve, the selector library BeautifulSoup uses, spell this selector :-soup-contains and deprecate the :contains alias. You can then navigate to the next p tag and extract its text:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

baseURL = "https://www.genecards.org/cgi-bin/carddisp.pl?gene="
GeneToSearch = input("Gene of Interest: ")
updatedURL = baseURL + GeneToSearch
print(updatedURL)

req = Request(updatedURL, headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(req).read()

soup = BeautifulSoup(response, 'lxml')

text_find = 'GeneCards Summary for ' + GeneToSearch + ' Gene'

# Find the <h3> whose text contains the heading, then take the next <p> after its parent.
el = soup.select_one('h3:contains("' + text_find + '")')
summary = el.parent.find_next('p').text.strip()

print(summary)

Outputs:

IL6 (Interleukin 6) is a Protein Coding gene.
Diseases associated with IL6 include Kaposi Sarcoma and Rheumatoid Arthritis, Systemic Juvenile.
Among its related pathways are IL-1 Family Signaling Pathways and Immune response IFN alpha/beta signaling pathway.
Gene Ontology (GO) annotations related to this gene include signaling receptor binding and growth factor activity.
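
As a variation on the approach above, here is a minimal sketch that matches the heading with find and a string filter instead of the CSS pseudo-class, and returns None when the heading is missing rather than raising an AttributeError. The function name extract_summary and the sample HTML are hypothetical and only for illustration. The real page structure is assumed to resemble the one the answer relies on.

```python
from bs4 import BeautifulSoup


def extract_summary(html, gene):
    """Return the first paragraph after the 'GeneCards Summary' heading, or None."""
    soup = BeautifulSoup(html, "html.parser")
    wanted = f"GeneCards Summary for {gene} Gene"

    # Match an <h3> by its text; the filter receives the tag's string (or None).
    heading = soup.find("h3", string=lambda s: s is not None and wanted in s)
    if heading is None:
        return None

    # find_next walks forward in document order to the first <p> after the heading.
    paragraph = heading.find_next("p")
    return paragraph.get_text(" ", strip=True) if paragraph else None


# Hypothetical fragment standing in for the downloaded page.
sample_html = """
<section>
  <div class="gc-subsection-header"><h3>GeneCards Summary for IL6 Gene</h3></div>
  <p>IL6 (Interleukin 6) is a Protein Coding gene.</p>
</section>
"""

print(extract_summary(sample_html, "IL6"))
print(extract_summary(sample_html, "TP53"))
```

Keeping the parsing in a function that takes HTML as a string makes it possible to test the selector logic without hitting the website on every run.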
