Extract JPG From Image Tag Src Attribute With BeautifulSoup

I am scraping this webpage for personal use, https://asheville.craigslist.org/search/fua, and running into issues extracting the thumbnail of each item on the page. When I use "inspect" to view the HTML DOM, I can see the img tags that contain the .jpg files I need, but when I use "view page source", the img tags don't show up. At first I thought this might be an asynchronous JavaScript loading issue, but I was told by a credible source that I should be able to scrape the thumbnails directly with BeautifulSoup.

import lxml
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()

r = requests.get(
    "https://asheville.craigslist.org/search/fua",
    params=dict(postal=28804),
    headers={"user-agent": ua.chrome},
)
soup = BeautifulSoup(r.content, "lxml")
for post in soup.find_all('li', "result-row"):
    for post_content in post.findAll("a", "result-image gallery"):
        print(post_content['href'])  # prints as expected
        for pic in post_content.findAll("img", {'alt class': 'thumb'}):
            print(pic['src'])  # never prints anything

Can someone clarify what I'm misunderstanding here? The value of the href attribute of the "a" tag prints, but I can't get the src attribute of the "img" tag to print. Thanks in advance!

I'm able to read the img tags with the following code:

for post in soup.find_all('li', "result-row"):
    for post_content in post.find_all("a", "result-image gallery"):
        print(post_content['href'])
        for pic in post_content.find_all("img"):
            print(pic['src'])
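
One possible reason the filter in the question matched nothing, shown as a small offline sketch. It assumes the thumbnail appears in the inspector as something like `<img alt class="thumb" src="...">`; the sample markup and the example.com URLs below are made up for illustration and are not copied from craigslist. A dict passed to find_all maps attribute names to values, so `{'alt class': 'thumb'}` asks for a single attribute literally named "alt class".

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the question; not the real craigslist page.
SAMPLE_HTML = """
<ul>
  <li class="result-row">
    <a class="result-image gallery" href="https://example.com/post/1.html">
      <img alt class="thumb" src="https://images.example.com/1_300x300.jpg">
    </a>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
link = soup.find("a", "result-image gallery")

# Looks for one attribute literally named "alt class"; no tag has it.
bad = [img["src"] for img in link.find_all("img", {"alt class": "thumb"})]

# Filters on the class attribute, which is what the sample markup carries.
good = [img["src"] for img in link.find_all("img", class_="thumb")]

print(bad)
print(good)
```

If the real page does use that markup, either `find_all("img")` as above or `find_all("img", class_="thumb")` should pick up the thumbnails, provided the img tags are present in the HTML that was actually downloaded.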

Just a few thoughts about scraping from craigslist:

  • Limit your requests per second. I have heard that craigslist will put a temporary block on your IP address if you exceed a certain frequency of requests.

  • Each post seemed to load between one and two images. On closer inspection, the carousel images are not loaded unless you click on the arrows. If you need every photo for each post, you should find a different way to write the script, possibly by visiting the link for each post that has multiple images.
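
The rate-limiting advice above could be sketched roughly like this. The helper name `throttled`, its parameters, and the 2-second default are all assumptions for illustration; I don't know what frequency craigslist actually tolerates. The clock and sleep functions are injectable only so the helper can be tried without real waiting.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def throttled(
    urls: Iterable[str],
    fetch: Callable[[str], T],
    min_interval: float = 2.0,  # assumed value, not a documented limit
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, T]]:
    """Yield (url, fetch(url)), starting calls at least min_interval apart."""
    last_start = None
    for url in urls:
        if last_start is not None:
            # Wait out whatever is left of the interval since the last call.
            remaining = min_interval - (clock() - last_start)
            if remaining > 0:
                sleep(remaining)
        last_start = clock()
        yield url, fetch(url)
```

With requests, this might be used as `for url, resp in throttled(post_urls, requests.get): ...`, where `post_urls` would be the href values collected from the result rows, so that visiting each post for its full set of images does not hammer the site.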

Also, I think Selenium is great for web scraping. You may not need it for this project, but it will allow you to do a lot more, such as clicking on buttons, entering form data, etc. Here's the quick script I used to scrape the data using Selenium:

from bs4 import BeautifulSoup
from selenium import webdriver


def test():
    url = "https://asheville.craigslist.org/search/fua"
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Page source as reported by the browser after it has loaded the page
        html = driver.page_source
    finally:
        driver.quit()  # close the browser even if loading fails
    soup = BeautifulSoup(html, "lxml")
    for post in soup.find_all('li', "result-row"):
        for post_content in post.find_all("a", "result-image gallery"):
            print(post_content['href'])
            for pic in post_content.find_all("img"):
                print(pic['src'])


if __name__ == "__main__":
    test()
