
AttributeError when using soup.find_all

I am trying to build a web scraper to collect data for a university research project. However, I cannot scrape the whole website, because something seems to go wrong after the call to soup.find_all ...

This is what I've come up with so far:

from bs4 import BeautifulSoup 
import requests
from csv import writer

url= "https://pubmed.ncbi.nlm.nih.gov/?term=(%22spontaneous%20intracranial%20hypotension%22%5BAll%20Fields%5D%20OR%20%22spontaneous%20cerebrospinal%20fluid%20leak%22%5BAll%20Fields%5D%20OR%20%22cerebrospinal%20fluid%20hypovolemia%22%5BAll%20Fields%5D%20OR%20%22cerebrospinal%20fluid%20hypovolemia%20syndrome%22%5BAll%20Fields%5D%20OR%20%22Hypoliquorrhea%22%5BAll%20Fields%5D%20OR%20%22Spontaneous%20spinal%20cerebrospinal%20fluid%20leak%22%5BAll%20Fields%5D)%20NOT%20%22letter%20to%20the%20editor%22%5BAll%20Fields%5D&filter=dates.1000%2F1%2F1-2022%2F3%2F31&filter=lang.english&ac=no&format=abstract&sort=date&size=200"
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')
lists = soup.find_all('article', class_="article-overview")

with open('disstest.csv', 'w', encoding= 'utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Herkunftsland', 'Journal', 'Anzahl Zitationen']
    thewriter.writerow(header)

    for list in lists:
        herkunftsland = lists.find('ul', class_="item-list").text.replace('\n','')
        journal = lists.find('div', class_="article-source").text.replace('\n', '')
        zitationen = lists.find('li', class_="references-count").text.replace('\n', '')
        info = [herkunftsland, journal, zitationen]
        thewriter.writerow(info)

I am getting the following messages:

Traceback (most recent call last):
  File "/Users/***/Documents/Test/scrape.py", line 17, in <module>
    herkunftsland = lists.find('ul', class_="item-list").text.replace('\n','')
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/bs4/element.py", line 2289, in __getattr__
    raise AttributeError(
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

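Inside the loop you call find() on lists, which is the whole ResultSet returned by find_all(), instead of on the current element. Call find() on the loop variable instead (renamed to _list here):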

for _list in lists:
    herkunftsland = _list.find('ul', class_="item-list").text.replace('\n', '')
    journal = _list.find('div', class_="article-source").text.replace('\n', '')
    zitationen = _list.find('li', class_="references-count").text.replace('\n', '')
    info = [herkunftsland, journal, zitationen]
    thewriter.writerow(info)
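
The difference between the ResultSet and its items can be checked on a small inline document. This is only a sketch: the HTML below is made up and merely reuses class names from the question, and the variable names articles and journals are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the class names used in the question.
html = """
<article class="article-overview"><div class="article-source">Journal A</div></article>
<article class="article-overview"><div class="article-source">Journal B</div></article>
"""

soup = BeautifulSoup(html, "html.parser")
articles = soup.find_all("article", class_="article-overview")  # list-like ResultSet

# Calling find() on the ResultSet itself raises AttributeError.
try:
    articles.find("div", class_="article-source")
    resultset_has_find = True
except AttributeError:
    resultset_has_find = False

# Each item in the ResultSet is a Tag, which does support find().
journals = [a.find("div", class_="article-source").text for a in articles]
print(resultset_has_find, journals)
```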

As mentioned by @Charls Ken, you used the wrong variable (lists) to extract your data. You should also avoid shadowing built-in names such as list with your own variables.
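
A small sketch of why reusing the name list is risky: Python accepts it as a loop variable because list is a built-in name rather than a keyword, but any later call to list(...) in the same scope then fails. The function name shadowing_demo is made up for this illustration.

```python
import builtins
import keyword

# "list" is not a keyword, so Python lets you rebind it...
is_keyword = keyword.iskeyword("list")
# ...but it is a built-in name that other code expects to be callable.
is_builtin = hasattr(builtins, "list")


def shadowing_demo(rows):
    """Use 'list' as a loop variable, then try to call the built-in."""
    for list in rows:  # shadows the built-in inside this function
        pass
    try:
        return list("abc")  # now refers to the last row, not the type
    except TypeError as exc:
        return str(exc)


print(is_keyword, is_builtin)
print(shadowing_demo([[1], [2]]))
```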

I would also recommend checking that an element was actually found before calling methods on it. find() returns None when nothing matches, and accessing .text on None raises an AttributeError.

for _list in lists:
    herkunftsland = e.text.replace('\n','') if (e:= _list.find('ul', class_="item-list")) else None
    journal = e.text.replace('\n','').strip() if (e:= _list.find('div', class_="article-source")) else None
    zitationen = e.text.replace('\n','').strip() if (e:= _list.find('li', class_="references-count")) else None
    info = [herkunftsland, journal, zitationen]
    thewriter.writerow(info)

Note: This uses the walrus operator (:=), which requires Python 3.8 or later.

Without the walrus operator:

journal = _list.find('div', class_="article-source").text.replace('\n','').strip() if _list.find('div', class_="article-source") else None
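
If the same check is needed for several fields, the lookup could also be wrapped in a small helper so that find() runs only once per field. This is a minimal sketch, not part of BeautifulSoup: text_or_none is a hypothetical name, and the inline HTML is made up for illustration.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, name, class_):
    """Return the cleaned text of the first matching child, or None if absent."""
    element = parent.find(name, class_=class_)
    if element is None:
        return None
    return element.get_text().replace("\n", "").strip()


# Hypothetical markup: it has an article-source div but no references-count item.
html = """
<article class="article-overview">
  <div class="article-source">Journal A
  </div>
</article>
"""
article = BeautifulSoup(html, "html.parser").find("article")

journal = text_or_none(article, "div", "article-source")
zitationen = text_or_none(article, "li", "references-count")
print(journal, zitationen)
```

Inside the loop, each field would then become a single call such as text_or_none(_list, 'div', 'article-source').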
