
Parse Wiley Online Library

I want to extract the DOIs of all chapters of Ullmann's Encyclopedia of Industrial Chemistry using Python and BeautifulSoup.

So from

<h2 class="meta__title meta__title__margin"><span class="hlFld-Title"><a href="/doi/10.1002/14356007.c01_c01.pub2">Aerogels</a></span></h2>

I want to get "Aerogels" and "/doi/full/10.1002/14356007.c01_c01.pub2"

A larger sample:

     <ul class="chapter_meta meta__authors rlist--inline comma">
        <li><span class="hlFld-ContribAuthor"><a href="/action/doSearch?ContribAuthorStored=H%C3%BCsing%2C+Nicola"><span>Nicola Hüsing</span></a></span></li>
        <li><span class="hlFld-ContribAuthor"><a href="/action/doSearch?ContribAuthorStored=Schubert%2C+Ulrich"><span>Ulrich Schubert</span></a></span></li>
     </ul><span class="meta__epubDate"><span>First published: </span>15 December 2006</span><div class="content-item-format-links">
        <ul class="rlist--inline separator">
           <li><a title="Abstract" href="/doi/abs/10.1002/14356007.c01_c01.pub2">Abstract</a></li>
           <li><a title="Full text" href="/doi/full/10.1002/14356007.c01_c01.pub2">
                 Full text
                 </a></li>

For the title I tried:

span['hlFld-Title'].a

For the DOI I tried:

for link in soup.find_all('a'.title):
    print(link.get('href'))

But sadly I am a noob and it does not work.

The URLs are https://onlinelibrary.wiley.com/browse/book/10.1002/14356007/title?startPage={1..59}

Thanks for your help.

Here is a quick solution that prints "DOI;title" pairs to the command line:

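import requests
from bs4 import BeautifulSoup

# Listing URL from the question; startPage selects the page of results.
BASE_URL = "https://onlinelibrary.wiley.com/browse/book/10.1002/14356007/title?startPage={}"

# range(59) requests startPage=0..58; use range(1, 60) if the site numbers pages from 1.
for i in range(59):
    page = requests.get(BASE_URL.format(i))

    soup = BeautifulSoup(page.content, 'lxml')

    # Each chapter title sits in a <span class="hlFld-Title"> that wraps the DOI link.
    content = soup.find_all("span", class_="hlFld-Title")

    for c in content:
        # Print the link path and the chapter title, separated by a semicolon.
        print(c.a.get('href') + ";" + c.get_text())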
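
Note that the loop above prints the raw href (for example "/doi/10.1002/14356007.c01_c01.pub2"), not the bare DOI or the "/doi/full/..." path asked for in the question. A minimal sketch of how each href could be post-processed is shown here. The helper names doi_from_href and full_text_path are hypothetical, and the assumption that an optional lowercase segment such as "abs" or "full" may follow "/doi/" is based only on the sample HTML in the question.

```python
import re

# Matches paths such as /doi/10.1002/..., /doi/abs/10.1002/... and /doi/full/10.1002/...
# Assumption: at most one lowercase segment (abs, full, ...) sits between /doi/ and the DOI.
_DOI_PATH = re.compile(r"^/doi/(?:[a-z]+/)?(10\.\d+/.+)$")


def doi_from_href(href):
    """Return the bare DOI from a /doi/... path, or None if the path does not match."""
    match = _DOI_PATH.match(href)
    return match.group(1) if match else None


def full_text_path(href):
    """Return the /doi/full/... path for a /doi/... href, or None if no DOI is found."""
    doi = doi_from_href(href)
    return "/doi/full/" + doi if doi else None


if __name__ == "__main__":
    href = "/doi/10.1002/14356007.c01_c01.pub2"
    print(doi_from_href(href))
    print(full_text_path(href))
```

In the loop above, c.a.get('href') could be passed through either helper before printing. Links that are not DOI paths, such as the author search links in the sample, yield None.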
