Why is my web scraper not working? Python 3 - requests, BeautifulSoup

Question

I have been following this Python tutorial for a while, and I made a web crawler similar to the one in the video.

Language: Python

import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.aliexpress.com/category/7/computer-office.html?trafficChannel=main&catName=computer-office&CatId=7&ltype=wholesale&SortType=default&g=n&page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text,  'html.parser')
        for link in soup.findAll('a', {'class':'item-title'}):
            href = link.get('href')
            title = link.string
            print(href)
        page += 1

spider(1)

This is the output the program gives:

PS D:\development> & C:/Users/hirusha/AppData/Local/Programs/Python/Python38/python.exe "d:/development/Python/TheNewBoston/Python/one/web scrawler.py"
PS D:\development>

What can I do?

Before this, I was getting an error when the code was:

soup = BeautifulSoup(plain_text)

I changed it to

soup = BeautifulSoup(plain_text,  'html.parser')

and the error went away.

The error I was getting was:

d:/development/Python/TheNewBoston/Python/one/web scrawler.py:10: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 10 of the file d:/development/Python/TheNewBoston/Python/one/web scrawler.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.

  soup = BeautifulSoup(plain_text)

Any help is appreciated, thanks!

Answer 1

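You get no results because the class you are targeting does not exist until the page has been rendered by a browser, and requests does not render the page; it only downloads the raw HTML.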
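One way to confirm this is to look for the target class in the raw HTML before parsing it any further. The following is only a minimal sketch, not code from the question or the answer: it uses the standard library's HTMLParser in place of BeautifulSoup so that it runs offline, and the two HTML strings are made-up stand-ins for a server response and a browser-rendered page.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count <a> tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; the value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_links_with_class(html, class_name):
    """Return how many <a class="..."> tags in html carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical sample: what the server sends (the data sits inside a script).
raw_html = '<html><body><script>window.runParams = {"items": []};</script></body></html>'
# Hypothetical sample: what a browser shows after JavaScript has run.
rendered_html = '<html><body><a class="item-title" href="//example.com/item/1.html">Item</a></body></html>'

print(count_links_with_class(raw_html, "item-title"))       # 0
print(count_links_with_class(rendered_html, "item-title"))  # 1
```

With the real page you would pass requests.get(url).text as html. A count of 0 means the for loop in the question never runs its body, which matches the empty output.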

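The data is retrieved dynamically from a script tag. You can use a regular expression to extract the JavaScript object that holds the data, then parse it with json to get that information.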

The error you showed was caused by not specifying a parser at first; you have already fixed that.

import re, json, requests
import pandas as pd

# Fetch the raw HTML; the product data is embedded in a script tag.
r = requests.get('https://www.aliexpress.com/category/7/computer-office.html?trafficChannel=main&catName=computer-office&CatId=7&ltype=wholesale&SortType=default&g=n&page=1')
# Capture the JavaScript object assigned to window.runParams and parse it as JSON.
data = json.loads(re.search(r'window\.runParams = (\{".*?\});', r.text, re.S).group(1))
# Build a table of (title, absolute product URL) pairs.
df = pd.DataFrame([(item['title'], 'https:' + item['productDetailUrl']) for item in data['items']])
print(df)
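Note that re.search returns None when the pattern is not found (for example, if the site changes its markup or serves a different page), and calling .group(1) on None raises an AttributeError. Below is a hedged sketch of a more defensive variant. The helper name extract_items and the sample HTML are made up for illustration; the names window.runParams, items, title and productDetailUrl are taken from the code above and have not been checked against the live site.

```python
import json
import re

# Same pattern as above: capture the object assigned to window.runParams.
RUN_PARAMS_RE = re.compile(r'window\.runParams = (\{".*?\});', re.S)


def extract_items(html):
    """Return (title, url) pairs from the embedded object, or [] if absent."""
    match = RUN_PARAMS_RE.search(html)
    if match is None:
        return []  # pattern not found: layout changed or a different page came back
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []  # the captured text was not valid JSON
    return [
        (item["title"], "https:" + item["productDetailUrl"])
        for item in data.get("items", [])
    ]


# Hypothetical stand-in for r.text, so the sketch runs without network access.
sample_html = """
<script>
window.runParams = {"items": [{"title": "USB hub", "productDetailUrl": "//www.example.com/item/1.html"}]};
</script>
"""

print(extract_items(sample_html))
print(extract_items("<html><body>No data here</body></html>"))
```

Returning an empty list keeps the caller simple; if you would rather fail loudly, raise an exception in the two early-return branches instead.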

Answer 1
2 · Accepted · 2021-05-01 10:30:06
