I am trying to build a web scraper to collect data for a research project at university. However, I am not able to scrape the whole website, as there seems to be a problem with soup.find_all
...
This is what I've come up with so far:
from bs4 import BeautifulSoup
import requests
from csv import writer
url= "https://pubmed.ncbi.nlm.nih.gov/?term=(%22spontaneous%20intracranial%20hypotension%22%5BAll%20Fields%5D%20OR%20%22spontaneous%20cerebrospinal%20fluid%20leak%22%5BAll%20Fields%5D%20OR%20%22cerebrospinal%20fluid%20hypovolemia%22%5BAll%20Fields%5D%20OR%20%22cerebrospinal%20fluid%20hypovolemia%20syndrome%22%5BAll%20Fields%5D%20OR%20%22Hypoliquorrhea%22%5BAll%20Fields%5D%20OR%20%22Spontaneous%20spinal%20cerebrospinal%20fluid%20leak%22%5BAll%20Fields%5D)%20NOT%20%22letter%20to%20the%20editor%22%5BAll%20Fields%5D&filter=dates.1000%2F1%2F1-2022%2F3%2F31&filter=lang.english&ac=no&format=abstract&sort=date&size=200"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
lists = soup.find_all('article', class_="article-overview")
with open('disstest.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Herkunftsland', 'Journal', 'Anzahl Zitationen']
    thewriter.writerow(header)
    for list in lists:
        herkunftsland = lists.find('ul', class_="item-list").text.replace('\n','')
        journal = lists.find('div', class_="article-source").text.replace('\n', '')
        zitationen = lists.find('li', class_="references-count").text.replace('\n', '')
        info = [herkunftsland, journal, zitationen]
        thewriter.writerow(info)
I am getting the following error:
Traceback (most recent call last):
  File "/Users/***/Documents/Test/scrape.py", line 17, in <module>
    herkunftsland = lists.find('ul', class_="item-list").text.replace('\n','')
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/bs4/element.py", line 2289, in __getattr__
    raise AttributeError(
AttributeError: ResultSet object has no attribute 'find'.
You're probably treating a list of elements like a single element.
Did you call find_all() when you meant to call find()?
It looks like you are calling find on lists (the whole ResultSet returned by find_all) inside the loop. You should call it on the loop variable instead, renamed to _list here:
for _list in lists:
    herkunftsland = _list.find('ul', class_="item-list").text.replace('\n', '')
    journal = _list.find('div', class_="article-source").text.replace('\n', '')
    zitationen = _list.find('li', class_="references-count").text.replace('\n', '')
    info = [herkunftsland, journal, zitationen]
    thewriter.writerow(info)
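To see the difference between the ResultSet and its items in isolation, here is a minimal sketch. The HTML fragment is made up for illustration and only borrows the class names from the question; it is not PubMed's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the class names in the question.
html = """
<article class="article-overview"><div class="article-source">Journal A</div></article>
<article class="article-overview"><div class="article-source">Journal B</div></article>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a ResultSet, which behaves like a list of Tag objects.
articles = soup.find_all("article", class_="article-overview")

# Each item is a Tag, and a Tag does have .find().
journals = [a.find("div", class_="article-source").text for a in articles]
print(journals)

# Calling .find() on the ResultSet itself raises the AttributeError
# shown in the traceback.
try:
    articles.find("div", class_="article-source")
    raised = False
except AttributeError:
    raised = True
print(raised)
```

So inside the loop, every lookup has to go through the loop variable, never through the collection.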
As mentioned by @Charls Ken, you used the wrong variable (lists) to extract your data. You should also avoid shadowing built-in names like list; it is not a reserved keyword, but reusing it as a variable name hides the built-in type.
I would also recommend checking that an element exists before calling methods on it, because find returns None when nothing matches and accessing .text on None raises an AttributeError.
for _list in lists:
    # find returns None when nothing matches, so fall back to None instead of failing
    herkunftsland = e.text.replace('\n','') if (e:= _list.find('ul', class_="item-list")) else None
    journal = e.text.replace('\n','').strip() if (e:= _list.find('div', class_="article-source")) else None
    zitationen = e.text.replace('\n','').strip() if (e:= _list.find('li', class_="references-count")) else None
    info = [herkunftsland, journal, zitationen]
    thewriter.writerow(info)
Note: This uses the walrus operator (:=), which requires Python 3.8 or later.
Without the walrus operator, the same check looks like this (at the cost of calling find twice):
journal = _list.find('div', class_="article-source").text.replace('\n','').strip() if _list.find('div', class_="article-source") else None
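Another option that avoids both the walrus operator and the duplicated find call is a small helper. This is only a sketch: the name text_or_none and the sample HTML are made up for illustration and are not part of BeautifulSoup or of PubMed's markup.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, name, class_):
    """Return the cleaned text of the first matching child, or None if absent.

    Hypothetical helper, not a BeautifulSoup API.
    """
    element = parent.find(name, class_=class_)
    if element is None:
        return None
    return element.text.replace("\n", "").strip()


# Made-up fragment: it has a journal but no citation count.
html = """
<article class="article-overview">
  <div class="article-source">
    Journal A
  </div>
</article>
"""

article = BeautifulSoup(html, "html.parser").find("article", class_="article-overview")

journal = text_or_none(article, "div", "article-source")
zitationen = text_or_none(article, "li", "references-count")
print(journal, zitationen)
```

In the loop from the answer, each field would then become a single call such as text_or_none(_list, 'div', 'article-source').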