Extracting a specific heading from HTML with Beautiful Soup

Question

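This is the patent example I am using: https://patents.google.com/patent/EP1208209A1/en?oq=medicinal+chemistry . Below is the code I am using. I want the code to show only the Cited By (3) count, so that I know how many times the patent has been cited. How can I make the output show only the Cited By count, that is, just 3? Please help!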

 
soup = BeautifulSoup(patent, 'html.parser')
cited_section = soup.findAll({"h2":"Cited By"})

print(cited_section)

The output I get is:

[<h2>Info</h2>, <h2>Links</h2>, <h2>Images</h2>, <h2>Classifications</h2>, <h2>Abstract</h2>, <h2>Description</h2>, <h2>Claims (<span itemprop="count">57</span>)</h2>, <h2>Priority Applications (5)</h2>, <h2>Applications Claiming Priority (1)</h2>, <h2>Related Parent Applications (1)</h2>, <h2>Publications (2)</h2>, <h2>ID=38925605</h2>, <h2>Family Applications (1)</h2>, <h2>Country Status (1)</h2>, <h2>Cited By (3)</h2>, <h2>Families Citing this family (12)</h2>, <h2>Citations (306)</h2>, <h2>Patent Citations (348)</h2>, <h2>Non-Patent Citations (23)</h2>, <h2>Cited By (4)</h2>, <h2>Also Published As</h2>, <h2>Similar Documents</h2>, <h2>Legal Events</h2>]

Answer 1

The citation count is created dynamically via JavaScript. However, you can count the number of elements with itemprop="forwardReferencesFamily" to get the count. For example:

import requests
from bs4 import BeautifulSoup


url = 'https://patents.google.com/patent/EP1208209A1/en?oq=medicinal+chemistry'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

print(len(soup.select('tr[itemprop="forwardReferencesFamily"]')))

This prints the number of matching rows.
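As a complementary sketch: judging from the output in the question, the original call matched every h2 regardless of its text, while the headings in that output already carry the count in parentheses. Assuming the parsed HTML contains headings such as `<h2>Cited By (3)</h2>`, the count could be read from the heading text with a regular expression. The sample HTML and the helper name `cited_by_counts` below are illustrative assumptions, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample that mirrors a few headings from the question's output.
html = """
<h2>Claims (<span itemprop="count">57</span>)</h2>
<h2>Cited By (3)</h2>
<h2>Families Citing this family (12)</h2>
<h2>Cited By (4)</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# Match only headings whose full text looks like "Cited By (<number>)".
pattern = re.compile(r"^\s*Cited By \((\d+)\)\s*$")


def cited_by_counts(soup):
    """Return the number in every 'Cited By (N)' h2 heading, in document order."""
    counts = []
    for h2 in soup.find_all("h2"):
        match = pattern.match(h2.get_text())
        if match:
            counts.append(int(match.group(1)))
    return counts


counts = cited_by_counts(soup)
print(counts)  # [3, 4]
```

The question's output contains two "Cited By" headings (3 and 4), so the caller still has to decide which one is wanted; `counts[0]` would give the 3 asked about.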

Answer 2

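Hi, for this link, https://patents.google.com/patent/WO2012061469A3/en?oq=medicinal+chemistry , I want the code to print the patent citations, giving the publication number and title. Then I want to use Pandas to put the publication numbers in one column and the titles in another. So far I have used Beautiful Soup to convert the HTML file into a readable format. I selected the backward-references HTML tag and printed the cited publication numbers and titles under it. I am only giving one example here, but I have a folder full of HTML files that I will process later.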

x = soup.select('tr[itemprop="backwardReferences"]')
y = soup.select('td[itemprop="title"]')  # this line returns every title in the document, not only those under the patent citations
print(y)
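One way to keep the titles tied to the patent citations is to search inside each backwardReferences row instead of searching the whole document. The following is a minimal sketch under stated assumptions: the sample HTML is made up, the nested `itemprop="publicationNumber"` and `itemprop="title"` names are assumptions about the page markup, and `extract_citations` is a hypothetical helper name.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample; the nested itemprop names are assumptions about the page.
html = """
<table>
<tr itemprop="backwardReferences">
  <td><span itemprop="publicationNumber">US1234567A</span></td>
  <td itemprop="title">First cited patent</td>
</tr>
<tr itemprop="backwardReferences">
  <td><span itemprop="publicationNumber">EP7654321B1</span></td>
  <td itemprop="title">Second cited patent</td>
</tr>
<tr itemprop="forwardReferencesFamily">
  <td><span itemprop="publicationNumber">WO0000001A1</span></td>
  <td itemprop="title">A citing patent, not a citation</td>
</tr>
</table>
"""


def extract_citations(html):
    """Build a two-column DataFrame from the backwardReferences rows only."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select('tr[itemprop="backwardReferences"]'):
        # Search within this row, so titles from other sections are ignored.
        number = tr.select_one('[itemprop="publicationNumber"]')
        title = tr.select_one('[itemprop="title"]')
        rows.append({
            "publication_number": number.get_text(strip=True) if number else None,
            "title": title.get_text(strip=True) if title else None,
        })
    return pd.DataFrame(rows, columns=["publication_number", "title"])


df = extract_citations(html)
print(df)
```

For a folder of saved pages, the same function could be called on the text of each file (for example by iterating over `pathlib.Path.glob("*.html")`) and the resulting frames combined with `pd.concat`.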

Answer 1
1 accepted 2020-09-07 08:05:07

Answer 2
0 2021-04-24 01:24:08