BeautifulSoup - Extracting a specific element using filtered text

Question

I'm relatively new to web crawling, and I need to extract a particular piece of text from a page, in this case 'Research Project – Cooperative Agreements', which appears right after the hyperlink in the data column.

I've been searching for 'Search_Type=Activity' in the hyperlink using the following code:

for elem in soup(href=lambda href: href and "Search_Type=Activity" in href):
    print(elem.parent)

I'm crawling a bunch of NIH grant pages and need the content of "Activity Code" from each one; it always appears right after the hyperlink whose href contains 'Search_Type=Activity'.

Here is the HTML content that I've narrowed it down to using that code:

<div class="col-md-8 datacolumn"> <a href="//grants.nih.gov/grants/funding/ac_search_results.htm?text_curr=u01&amp;Search.x=0&amp;Search.y=0&amp;Search_Type=Activity">U01</a> Research Project – Cooperative Agreements
        <!--</div>
                </div> end row -->
<!-- If it is not the first row we close the previous row div tags -->
</div>

FYI, the original page used is just an NIH grant page, here.
Could someone point out what that element is and how to get it out of there?

Answer 1

Try:

import requests
from bs4 import BeautifulSoup

url = "https://grants.nih.gov/grants/guide/rfa-files/RFA-DK-19-501.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

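# Find the link whose href contains the marker, then take the text node
# that follows it (string=True replaces the older, deprecated text=True).
name = (
    soup.select_one('[href*="Search_Type=Activity"]')
    .find_next_sibling(string=True)
    .strip()
)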
print(name)

Prints:

Research Project – Cooperative Agreements
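
As for what "that element" is: it is not an element at all but a bare text node (a NavigableString in BeautifulSoup terms) sitting next to the <a> tag inside the div, which is why searching by tag does not find it. Below is a minimal offline sketch of the same idea, run against the HTML fragment from the question instead of the live page. The helper name activity_code_and_name is hypothetical, and the sketch assumes the description is always the text node immediately following the link.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# HTML fragment copied from the question.
HTML = """
<div class="col-md-8 datacolumn"> <a href="//grants.nih.gov/grants/funding/ac_search_results.htm?text_curr=u01&amp;Search.x=0&amp;Search.y=0&amp;Search_Type=Activity">U01</a> Research Project – Cooperative Agreements
        <!--</div>
                </div> end row -->
<!-- If it is not the first row we close the previous row div tags -->
</div>
"""


def activity_code_and_name(html):
    """Hypothetical helper: return (code, name), or None if no link matches."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one('a[href*="Search_Type=Activity"]')
    if link is None:
        return None

    sibling = link.next_sibling
    # HTML comments are also NavigableString subclasses, so exclude them.
    is_text = isinstance(sibling, NavigableString) and not isinstance(sibling, Comment)
    name = sibling.strip() if is_text else ""
    return link.get_text(strip=True), name


print(activity_code_and_name(HTML))
# ('U01', 'Research Project – Cooperative Agreements')
```

Returning None when the link is missing makes it easier to loop over many grant pages without one odd page raising an AttributeError.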

Solution 1
1 2022-01-31 18:27:06
