BeautifulSoup - extracting particular element using filtering text

I'm relatively new to web crawling, and I need to extract a particular element from a page, in this case the text 'Research Project – Cooperative Agreements' that appears right after the hyperlink in the data column.

I've been searching for 'Search_Type=Activity' in the hyperlink's href using the following code:

for elem in soup(href=lambda href: href and "Search_Type=Activity" in href):
    print(elem.parent)

I'm crawling a bunch of NIH grant pages and need the content of the "Activity Code" field. On every page it appears right after the hyperlink whose href contains 'Search_Type=Activity'.

Here is the HTML content that I've narrowed it down to using that code:

<div class="col-md-8 datacolumn"> <a href="//grants.nih.gov/grants/funding/ac_search_results.htm?text_curr=u01&amp;Search.x=0&amp;Search.y=0&amp;Search_Type=Activity">U01</a> Research Project – Cooperative Agreements
        <!--</div>
                </div> end row -->
<!-- If it is not the first row we close the previous row div tags -->
</div>

FYI, the original page used is just an NIH grant page.
Could someone point out what that element is and how to get it out of there?

Try:

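import requests
from bs4 import BeautifulSoup

url = "https://grants.nih.gov/grants/guide/rfa-files/RFA-DK-19-501.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Find the link whose href contains "Search_Type=Activity", then take the
# text node that follows it inside the same parent element.
name = (
    soup.select_one('[href*="Search_Type=Activity"]')
    .find_next_sibling(string=True)  # `string=` is the newer name for `text=`
    .strip()
)
print(name)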

Prints:

Research Project – Cooperative Agreements
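
As for what that element is: the text after the link is not a tag but a bare text node, which BeautifulSoup represents as a NavigableString sitting next to the `<a>` tag. Below is a minimal offline sketch of the same idea, run against the HTML fragment from the question rather than the live page. The helper name `activity_description` is hypothetical (it is not part of BeautifulSoup), and the sketch assumes the description is always the node immediately after the link.

```python
from bs4 import BeautifulSoup, NavigableString

# HTML fragment copied from the question, so no network access is needed.
HTML = """
<div class="col-md-8 datacolumn"> <a href="//grants.nih.gov/grants/funding/ac_search_results.htm?text_curr=u01&amp;Search.x=0&amp;Search.y=0&amp;Search_Type=Activity">U01</a> Research Project – Cooperative Agreements
        <!--</div>
                </div> end row -->
<!-- If it is not the first row we close the previous row div tags -->
</div>
"""


def activity_description(html):
    """Return the text right after the Activity Code link, or None.

    Hypothetical helper for illustration; assumes the description is the
    node that immediately follows the link.
    """
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one('a[href*="Search_Type=Activity"]')
    if link is None:
        return None  # page has no Activity Code link

    node = link.next_sibling
    # HTML comments are a NavigableString subclass, so compare the exact
    # type to accept only plain text nodes.
    if type(node) is NavigableString:
        return node.strip() or None
    return None


print(activity_description(HTML))
```

Returning None instead of raising means a page without the link is skipped rather than crashing the crawl, which may matter when looping over many grant pages.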
