
Scraping all results from page with BeautifulSoup

**Update**
===========================================================================

OK, so far so good. I have code that lets me scrape the images, but it stores them in a strange way. It first downloads 40+ images, then creates another "kittens" folder inside the previously created "kittens" folder and starts over (downloading the same images as in the first folder). How can I change that? Here is the code:

from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.common.exceptions import WebDriverException
from bs4 import BeautifulSoup as soup
import requests
import time
import os

image_tags = []

driver = webdriver.Chrome()
driver.get(url='https://www.pexels.com/search/kittens/')
last_height = driver.execute_script('return document.body.scrollHeight')

while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(1)
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height

sp = soup(driver.page_source, 'html.parser')

for img_tag in sp.find_all('img'):
    image_tags.append(img_tag)


if not os.path.exists('kittens'):
    os.makedirs('kittens')

os.chdir('kittens')

x = 0

for image in image_tags:
    try:
        url = image['src']
        source = requests.get(url)
        with open('kitten-{}.jpg'.format(x), 'wb') as f:
            # Reuse the response already fetched above instead of
            # downloading the same image a second time.
            f.write(source.content)
        x += 1
    except Exception:
        pass
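One way to avoid the nested "kittens" folder is to stop calling `os.chdir` entirely. A minimal sketch of that idea follows, assuming a hypothetical `image_urls` list in place of the scraped tags (the `example.com` URL is a placeholder): the output folder is resolved to an absolute path once, and every file is written via `os.path.join`, so repeating the download step can never create "kittens" inside "kittens".

```python
import os
import requests

# Hypothetical list of image URLs gathered by the scraper above.
image_urls = ['https://example.com/kitten-1.jpg']

# Resolve the output folder to an absolute path once; exist_ok lets
# re-runs reuse the same folder instead of failing or nesting.
out_dir = os.path.abspath('kittens')
os.makedirs(out_dir, exist_ok=True)

for x, url in enumerate(image_urls):
    try:
        source = requests.get(url, timeout=10)
        source.raise_for_status()
        # Join against the absolute path; the current working
        # directory never changes, so there is nothing to nest.
        with open(os.path.join(out_dir, 'kitten-{}.jpg'.format(x)), 'wb') as f:
            f.write(source.content)
    except requests.RequestException:
        pass
```

Since `os.makedirs(..., exist_ok=True)` is idempotent, the folder-creation step is also safe to run more than once.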

===========================================================================

I'm trying to write a spider that scrapes images of kittens from a certain page. I have a small problem, because my spider only gets the first 15 images. I know this is probably because the page loads more images after scrolling down. How can I solve this? Here is the code:

import requests
from bs4 import BeautifulSoup as bs
import os


url = 'https://www.pexels.com/search/cute%20kittens/'

page = requests.get(url)
soup = bs(page.text, 'html.parser')

image_tags = soup.findAll('img')

if not os.path.exists('kittens'):
    os.makedirs('kittens')

os.chdir('kittens')

x = 0

for image in image_tags:
    try:
        url = image['src']
        source = requests.get(url)
        if source.status_code == 200:
            with open('kitten-' + str(x) + '.jpg', 'wb') as f:
                # Reuse the response fetched above; the with block
                # closes the file, so no explicit close() is needed.
                f.write(source.content)
            x += 1
    except Exception:
        pass

Since the site is dynamic, you will need a browser automation tool such as selenium:

from selenium import webdriver
from bs4 import BeautifulSoup as soup
import time
import os
driver = webdriver.Chrome()
driver.get('https://www.pexels.com/search/cute%20kittens/')
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
  driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
  time.sleep(0.5)
  new_height = driver.execute_script("return document.body.scrollHeight")
  if new_height == last_height:
     break
  last_height = new_height

image_urls = [i['src'] for i in soup(driver.page_source, 'html.parser').find_all('img') if i.get('src')]
if not os.path.exists('kittens'):
  os.makedirs('kittens')
os.chdir('kittens')
with open('kittens.txt', 'w') as f:
  for url in image_urls:
    f.write('{}\n'.format(url))
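The snippet above only records the URLs in a text file. A hedged follow-up sketch, assuming the same kind of `image_urls` list (the URLs below are placeholders), downloads each entry; the `data:` check skips inline placeholder images that lazy-loading pages often emit in `src` before the real image is loaded:

```python
import os
import requests

# Hypothetical URL list as produced by the scraper above.
image_urls = ['https://example.com/a.jpg', 'data:image/gif;base64,R0lGOD']

os.makedirs('kittens', exist_ok=True)

saved = 0
for url in image_urls:
    # Skip inline data: URIs; they are placeholders, not real files.
    if url.startswith('data:'):
        continue
    try:
        resp = requests.get(url, timeout=10)
        if resp.status_code == 200:
            with open(os.path.join('kittens', 'kitten-{}.jpg'.format(saved)), 'wb') as f:
                f.write(resp.content)
            saved += 1
    except requests.RequestException:
        pass
```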
