BeautifulSoup, trying to extract text from anchor tags that contain author names

I am trying to scrape some data from this books site. I need to extract the title and the author(s). I was able to extract the titles without much trouble. However, I am having trouble extracting the authors when there is more than one, since they appear on the same line and belong to separate anchor tags within an h4 header.

 <h4> "5 . " <a href="/items/705">The Elements of Style</a> " by " <a href="/authors/5107">William Strunk, Jr</a> ", " <a href="/authors/5108">E. B. White</a> </h4>

This is what I tried:

titles = []  # collected book titles
book_container = soup.find_all('li', class_='item pb-3 pt-3 border-bottom')

for container in book_container:
    # title: the first anchor inside the h4
    title = container.h4.a.text
    titles.append(title)

    # author(s): every anchor in the h4 whose href points to /authors/
    author_s = container.h4.find_all('a')
    print('### SECOND FOR LOOP ###')
    for a in author_s:
        if a['href'].startswith('/authors/'):
            print(a.text)

I'd like to have the two authors in a tuple.

This might not be the most Pythonic way, but it's a workaround.

newlist = []
for a in author_s:
    if a['href'].startswith('/authors/'):
        # author_s also holds the title anchor, so > 2 means several authors
        if len(author_s) > 2:
            newlist.append(a.text)
            print(tuple(newlist))
        else:
            print(a.text)

I'm relying on the fact that the variable author_s holds a list whose length tells us whether there are more names. Because that list also contains the title anchor, more than 2 entries means more than one author. (Alternatively, you could check for a newline in the printed output.)

You will also notice that the printed output contains two tuples; always take the second one. Entries with a single author are printed unchanged. Since this page does not have several entries with two authors, I couldn't check for complications.

Output:

[<a href="/items/705">The Elements of Style</a>, <a href="/authors/5107">William Strunk, Jr</a>, <a href="/authors/5108">E. B. White</a>]
### SECOND FOR LOOP ###
('William Strunk, Jr',)
('William Strunk, Jr', 'E. B. White')
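
To avoid having to pick the second printed tuple, the same href filter can build the tuple in one pass. The following is only a sketch: the HTML string is sample markup modeled on the snippet in the question, and authors_of is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question's snippet (assumption: the real
# page wraps each h4 in an <li> with these classes).
html = '''
<li class="item pb-3 pt-3 border-bottom">
<h4>5 . <a href="/items/705">The Elements of Style</a> by
<a href="/authors/5107">William Strunk, Jr</a>,
<a href="/authors/5108">E. B. White</a></h4>
</li>
'''

soup = BeautifulSoup(html, 'html.parser')


def authors_of(h4):
    """Return the author names under one <h4> as a tuple (hypothetical helper)."""
    return tuple(
        a.get_text(strip=True)
        for a in h4.find_all('a', href=True)  # skip anchors without an href
        if a['href'].startswith('/authors/')  # keep author links only
    )


authors = authors_of(soup.h4)
print(authors)
```

With a single author the helper still returns a tuple, just with one element, so the calling code does not need a separate branch.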

You can extract all <a> links under <h4> (h4 is the tag that holds the title and authors). The first <a> tag is the title; the remaining <a> tags are the authors:

import requests
from bs4 import BeautifulSoup


url = 'https://thegreatestbooks.org/the-greatest-nonfiction-since/1900'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for item in soup.select('h4:has(>a)'):
    elements = [i.get_text(strip=True) for i in item.select('a')]
    title = elements[0]
    authors = elements[1:]
    print('{:<40} {}'.format(title, authors))

Prints:

The Diary of a Young Girl                ['Anne Frank']
The Autobiography of Malcolm X           ['Alex Haley']
Silent Spring                            ['Rachel Carson']
In Cold Blood                            ['Truman Capote']
The Elements of Style                    ['William Strunk, Jr', 'E. B. White']
The Double Helix: A Personal Account of the Discovery of the Structure of DNA ['James D. Watson']
Relativity                               ['Albert Einstein']
Look Homeward, Angel                     ['Thomas Wolfe']
Homage to Catalonia                      ['George Orwell']
Speak, Memory                            ['Vladimir Nabokov']
The General Theory of Employment, Interest and Money ['John Maynard Keynes']
The Second World War                     ['Winston Churchill']
The Education of Henry Adams             ['Henry Adams']
Out of Africa                            ['Isak Dinesen']
The Structure of Scientific Revolutions  ['Thomas Kuhn']
Dispatches                               ['Michael Herr']
The Gulag Archipelago                    ['Aleksandr Solzhenitsyn']
I Know Why the Caged Bird Sings          ['Maya Angelou']
The Civil War                            ['Shelby Foote']
If This Is a Man                         ['Primo Levi']
Collected Essays of George Orwell        ['George Orwell']
The Electric Kool-Aid Acid Test          ['Tom Wolfe']
Civilization and Its Discontents         ['Sigmund Freud']
The Death and Life of Great American Cities ['Jane Jacobs']
Selected Essays of T. S. Eliot           ['T. S. Eliot']
A Room of One's Own                      ['Virginia Woolf']
The Right Stuff                          ['Tom Wolfe']
The Road to Serfdom                      ['Friedrich von Hayek']
R. E. Lee                                ['Douglas Southall Freeman']
The Varieties of Religious Experience    ['Will James']
The Liberal Imagination                  ['Lionel Trilling']
Angela's Ashes: A Memoir                 ['Frank McCourt']
The Second Sex                           ['Simone de Beauvoir']
Mere Christianity                        ['C. S. Lewis']
Moveable Feast                           ['Ernest Hemingway']
The Autobiography of Alice B. Toklas     ['Gertrude Stein']
The Origins of Totalitarianism           ['Hannah Arendt']
Black Lamb and Grey Falcon               ['Rebecca West']
Orthodoxy                                ['G. K. Chesterton']
Philosophical Investigations             ['Ludwig Wittgenstein']
Night                                    ['Elie Wiesel']
The Affluent Society                     ['John Kenneth Galbraith']
Mythology                                ['Edith Hamilton']
The Open Society                         ['Karl Popper']
The Color of Water: A Black Man's Tribute to His White Mother ['James McBride']
The Seven Storey Mountain                ['Thomas Merton']
Hiroshima                                ['John Hersey']
Let Us Now Praise Famous Men             ['James Agee']
Pragmatism                               ['Will James']
The Making of the Atomic Bomb            ['Richard Rhodes']
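
If relying on the position of the anchors feels fragile, a variant is to select them by href prefix and store the authors as a tuple, as the question asked. This is a sketch under assumptions: the inline HTML stands in for the fetched page, the /items/ and /authors/ prefixes are taken from the snippet in the question, and the ids in the first entry are made-up placeholders.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page; the first entry's ids are placeholders.
html = '''
<ul>
<li><h4>1 . <a href="/items/1">The Diary of a Young Girl</a> by
<a href="/authors/10">Anne Frank</a></h4></li>
<li><h4>5 . <a href="/items/705">The Elements of Style</a> by
<a href="/authors/5107">William Strunk, Jr</a>,
<a href="/authors/5108">E. B. White</a></h4></li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')

books = []
for h4 in soup.select('h4:has(> a)'):
    # The title anchor links to /items/...; skip headers without one.
    title_tag = h4.select_one('a[href^="/items/"]')
    if title_tag is None:
        continue
    # Every anchor linking to /authors/... is an author of this book.
    authors = tuple(
        a.get_text(strip=True) for a in h4.select('a[href^="/authors/"]')
    )
    books.append((title_tag.get_text(strip=True), authors))

for title, authors in books:
    print('{:<40} {}'.format(title, authors))
```

Selecting by href prefix means an extra link inside the h4 would not be mistaken for an author, at the cost of depending on the site's URL scheme.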
