
Need help scraping images from a slideshow with bs4 & Python

I'm trying to scrape listing information from Craigslist, but I can't seem to get the images because they are in a slideshow.

import requests
from bs4 import BeautifulSoup as soup

url = "https://newyork.craigslist.org/search/sss"
r = requests.get(url)
souped = soup(r.content, 'lxml')

Since the image URLs aren't in the HTML that the request returns, do I need to load the page dynamically somehow? If so, can I do it in Python alone? I don't want any other dependencies. Thanks in advance. I'm pretty new to this, so any help is appreciated.

Look for the <a> tags with the classes result-image gallery. Each of those tags has a data-ids attribute, which holds part of the image file names.

<a href="https://newyork.craigslist.org/mnh/fuo/d/new-york-city-3-piece-shaped-ikea-couch/6812749499.html" class="result-image gallery" data-ids="1:00707_iRUU5VKwkWi,1:00H0H_6AIBqK2iQDU">
           ....
</a>
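
As a rough sketch of that lookup, the snippet below parses a copy of the sample tag and reads its data-ids attribute. The function name collect_data_ids and the second link (one without the gallery class) are made up for illustration, and "html.parser" is used only so the example runs without lxml. The question's 'lxml' parser should behave the same way here.

```python
from bs4 import BeautifulSoup

# Sample markup: the first tag is copied from the answer (its inner content
# is elided there and omitted here); the second tag is a hypothetical link
# without the "gallery" class, added only to show that it is skipped.
html = '''
<a href="https://newyork.craigslist.org/mnh/fuo/d/new-york-city-3-piece-shaped-ikea-couch/6812749499.html"
   class="result-image gallery"
   data-ids="1:00707_iRUU5VKwkWi,1:00H0H_6AIBqK2iQDU"></a>
<a href="#" class="result-image"></a>
'''

def collect_data_ids(markup):
    """Return {listing href: raw data-ids string} for every gallery link."""
    parsed = BeautifulSoup(markup, "html.parser")
    # The CSS selector requires both classes and a data-ids attribute.
    links = parsed.select("a.result-image.gallery[data-ids]")
    return {a["href"]: a["data-ids"] for a in links}

raw_ids = collect_data_ids(html)
for href, ids in raw_ids.items():
    print(href, ids)
```

In the question's script, the same selector could be applied to souped directly, assuming the search page still serves markup of this shape.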

Now, if you want to get the URLs, first read that attribute and extract the partial image names from it (in this example, 00707_iRUU5VKwkWi and 00H0H_6AIBqK2iQDU).

You can then build each URL from the host, the partial name, the size suffix (_300x300), and the extension:

https://images.craigslist.org/00707_iRUU5VKwkWi_300x300.jpg
https://images.craigslist.org/00H0H_6AIBqK2iQDU_300x300.jpg
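
The parsing and URL-building steps can be sketched in plain Python, with no extra dependencies. The helper name build_image_urls and its size and extension parameters are assumptions for illustration. The code also assumes that each comma-separated entry has the form prefix:name, as in the sample attribute, and that only the part after the colon is needed. Only the _300x300 suffix and the .jpg extension come from the example above, and other sizes are not confirmed.

```python
IMAGE_HOST = "https://images.craigslist.org"

def build_image_urls(data_ids, size="300x300", extension="jpg"):
    """Turn a raw data-ids value into a list of full image URLs."""
    urls = []
    for entry in data_ids.split(","):
        entry = entry.strip()
        if not entry:
            continue  # skip empty pieces, e.g. from an empty attribute
        # Each entry looks like "1:00707_iRUU5VKwkWi"; keep the part after the colon.
        image_id = entry.split(":", 1)[-1]
        urls.append(f"{IMAGE_HOST}/{image_id}_{size}.{extension}")
    return urls

urls = build_image_urls("1:00707_iRUU5VKwkWi,1:00H0H_6AIBqK2iQDU")
print("\n".join(urls))
```

Each resulting URL can then be downloaded with requests.get, as in the question's code.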
