

Web Scraping find not moving on to next item

from bs4 import BeautifulSoup
import requests


def kijiji():
    source = requests.get('https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274').text
    soup = BeautifulSoup(source, 'lxml')
    b = soup.find('div', class_='price')
    for link in soup.find_all('a', class_='title'):
        a = link.get('href')
        fulllink = 'http://kijiji.ca' + a
        print(fulllink)
        # This always matches the first price on the page again,
        # so every listing is printed with the same price.
        b = soup.find('div', class_='price')
        print(b.prettify())
kijiji()

The goal is to collect all of the different items for sale on Kijiji and pair each one with its price. But I can't find any way to advance what Beautiful Soup matches for the price class, so I'm stuck on the first price. find_all doesn't work either, since it just prints out the whole blob of prices instead of grouping each one with its item.
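For reference, here is a minimal sketch (not part of the original question) of what is going on: soup.find() always returns the first match in the whole document, which is why the same price keeps printing. One quick, if fragile, workaround is to zip the two parallel find_all() lists; the class names title and price are the ones used in the code above, and the zip approach assumes every listing actually has a div.price element:

from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274').text
soup = BeautifulSoup(source, 'lxml')

# Zipping the two parallel lists pairs titles with prices, but it will
# mis-pair as soon as any listing is missing a price element.
titles = soup.find_all('a', class_='title')
prices = soup.find_all('div', class_='price')
for link, price in zip(titles, prices):
    print('http://kijiji.ca' + link.get('href'))
    print(price.get_text(strip=True))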

If you have Beautiful Soup 4.7.1 or above, you can use the CSS selector method select(), which is much faster.

Code:

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274").text
soup = BeautifulSoup(res, 'html.parser')
# Each listing lives in its own .info-container, so searching from it
# keeps the title and the price of the same listing together.
for item in soup.select('.info-container'):
    fulllink = 'http://kijiji.ca' + item.find_next('a', class_='title')['href']
    print(fulllink)
    price = item.select_one('.price').text.strip()
    print(price)
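The key point is the scoping: because each lookup starts from one listing's .info-container rather than from the whole document, the link and price printed together always come from the same listing.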

Or, to use find_all(), use the code block below:

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.find_all('div', class_='info-container'):
    # find_next() starts from this listing's container, so it picks up
    # that listing's own title link and price.
    fulllink = 'http://kijiji.ca' + item.find_next('a', class_='title')['href']
    print(fulllink)
    price = item.find_next(class_='price').text.strip()
    print(price)

Congratulations on finding the answer. I'll give you another solution, for reference only.

import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc
def kijiji():
  url = 'https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274'
  source = requests.get(url).text
  doc = SimplifiedDoc(source)
  infos = doc.getElements('div',attr='class',value='info-container')
  for info in infos:
    price = info.select('div.price>text()')
    a = info.select('a.title')
    link = doc.absoluteUrl(url,a.href)
    title = a.text
    print (price)
    print (link)
    print (title)
kijiji()

Result:

$310.00
https://www.kijiji.ca/v-mens-shoes/markham-york-region/jordan-4-oreo-2015/1485391828
Jordan 4 Oreo (2015)
$560.00
https://www.kijiji.ca/v-mens-shoes/markham-york-region/yeezy-boost-350-yecheil-reflectives/1486296645
Yeezy Boost 350 Yecheil Reflectives
...

Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples

from bs4 import BeautifulSoup
import requests


def kijiji():
    source = requests.get('https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274').text
    soup = BeautifulSoup(source, 'lxml')
    # Start from the first price on the page...
    b = soup.find('div', class_='price')
    for link in soup.find_all('a', class_='title'):
        a = link.get('href')
        fulllink = 'http://kijiji.ca' + a
        print(fulllink)
        print(b.prettify())
        # ...then step forward to the next price element for the next listing.
        b = b.find_next('div', class_='price')
kijiji()

I was stuck on this for an hour, but as soon as I posted it on Stack Overflow I immediately came up with an idea. Messy code, but it works!
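Note that this approach relies on titles and prices alternating one-to-one in document order: b.find_next('div', class_='price') simply walks forward to the next price element, so if any listing has no price the pairing will drift by one. Scoping each lookup to the listing's container, as in the answers above, avoids that.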
