
Why doesn't my web scraper work? Python 3 - requests, BeautifulSoup

I have been following this Python tutorial for a while, and I made a web crawler similar to the one in the video.

Language: Python

import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.aliexpress.com/category/7/computer-office.html?trafficChannel=main&catName=computer-office&CatId=7&ltype=wholesale&SortType=default&g=n&page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text,  'html.parser')
        for link in soup.findAll('a', {'class':'item-title'}):
            href = link.get('href')
            title = link.string
            print(href)
        page += 1

spider(1)

This is the output the program gives:

PS D:\development> & C:/Users/hirusha/AppData/Local/Programs/Python/Python38/python.exe "d:/development/Python/TheNewBoston/Python/one/web scrawler.py"
PS D:\development>

What can I do?


Before this, I was getting an error when the code was:

soup = BeautifulSoup(plain_text)

I changed it to

soup = BeautifulSoup(plain_text,  'html.parser')

and the error went away.

The error I was getting was:

d:/development/Python/TheNewBoston/Python/one/web scrawler.py:10: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 10 of the file d:/development/Python/TheNewBoston/Python/one/web scrawler.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.

  soup = BeautifulSoup(plain_text)

Any help is appreciated, thanks!

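You get no results because the class you are targeting does not exist until the page has been rendered in a browser, and requests does not render pages; it only returns the raw HTML.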
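One way to confirm this is to count the matching elements in the HTML that requests actually receives. The sketch below is only an illustration: STATIC_HTML is a made-up stand-in for the real response, and count_matches is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw HTML returned by requests: the product
# data sits inside a <script> tag and no <a class="item-title"> exists yet.
STATIC_HTML = """
<html><body>
<div id="root"></div>
<script>window.runParams = {"items": [{"title": "Example mouse"}]};</script>
</body></html>
"""

def count_matches(html, tag, css_class):
    """Return how many elements with the given tag and class appear in html."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all(tag, class_=css_class))

# Number of links the original spider would have found in the raw HTML.
n_links = count_matches(STATIC_HTML, "a", "item-title")
# Whether the embedded JavaScript payload is present in the raw HTML.
has_payload = "window.runParams" in STATIC_HTML
print(n_links, has_payload)
```

If the count is zero while the script payload is present, the loop in the question never runs its body, which matches the empty output shown there.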

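The data is loaded dynamically from a script tag. You can use a regular expression to extract the JavaScript object that holds the data, then parse it with json to get that information.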

The warning you show was caused by not specifying a parser initially, which you have since corrected.

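import re, json, requests
import pandas as pd

# Fetch the raw page source; the product data is embedded in a script tag.
r = requests.get('https://www.aliexpress.com/category/7/computer-office.html?trafficChannel=main&catName=computer-office&CatId=7&ltype=wholesale&SortType=default&g=n&page=1')
# Capture the JavaScript object assigned to window.runParams and parse it as JSON.
data = json.loads(re.search(r'window\.runParams = (\{".*?\});', r.text, re.S).group(1))
# Build (title, URL) rows, prefixing each productDetailUrl with the scheme.
df = pd.DataFrame([(item['title'], 'https:' + item['productDetailUrl']) for item in data['items']])
print(df)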
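The same extraction can be tried offline and with a guard for the case where the pattern is not found, since re.search returns None when the page layout differs. This is a minimal sketch under assumptions: SAMPLE_HTML is invented test data whose field names mirror those used above, and extract_items is a hypothetical helper name.

```python
import json
import re

# Hypothetical sample of page source; the field names mirror the answer's code.
SAMPLE_HTML = """
<script>
window.runParams = {"items": [
  {"title": "USB hub", "productDetailUrl": "//www.aliexpress.com/item/1.html"},
  {"title": "Keyboard", "productDetailUrl": "//www.aliexpress.com/item/2.html"}
]};
</script>
"""

# Same pattern as in the answer: capture the object assigned to window.runParams.
PATTERN = re.compile(r'window\.runParams = (\{".*?\});', re.S)

def extract_items(html):
    """Return (title, url) pairs from the embedded object, or [] if it is absent."""
    match = PATTERN.search(html)
    if match is None:
        # The page did not contain the expected script payload.
        return []
    data = json.loads(match.group(1))
    return [
        (item["title"], "https:" + item["productDetailUrl"])
        for item in data.get("items", [])
    ]

for title, url in extract_items(SAMPLE_HTML):
    print(title, url)
```

Note that the lazy match stops at the first `};` it meets, so a payload containing that sequence inside a string would need a more careful parser.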

