
Python bs4 not returning text from elements

I am trying to scrape reverb.com to get the names of different instruments. I have found the element that holds the instrument name text, but for some reason the tags come back blank. My code is below. Any ideas as to why this might be happening?

import requests
from bs4 import BeautifulSoup as Soup

url = 'https://reverb.com/marketplace?query=jackson+guitars'

response = requests.get(url).text
soup = Soup(url, 'html.parser')
for item in soup.find_all('h4', class_='grid-card__title'):
    print(item.text)

If you go to the website, there are many results for the search. The most I've been able to return is four results, and I usually get back an empty list. I checked, and they all appear to have that h4 with the same class. Any ideas as to why it is not returning all results? Thanks for your time.

In line 6, you use the requests library to fetch the content at the given URL. In the next step, you should pass the response of this request to Beautiful Soup, not the URL string itself. Just update this single line:

soup = Soup(response, 'html.parser')
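To see why the original line returned nothing, here is a minimal offline sketch. It assumes a hypothetical one-line HTML snippet in place of the real page, so it needs no network access. When the URL string itself is parsed, Beautiful Soup treats it as plain text with no tags (and may emit a warning that the input looks like a URL), so the search comes back empty:

```python
from bs4 import BeautifulSoup as Soup

url = 'https://reverb.com/marketplace?query=jackson+guitars'
# Hypothetical stand-in for the HTML that requests.get(url).text would return.
html = '<h4 class="grid-card__title">Example Guitar</h4>'

# Parsing the URL string: there are no tags in it, so nothing matches.
from_url = Soup(url, 'html.parser').find_all('h4', class_='grid-card__title')

# Parsing actual markup: the h4 element is found.
from_html = Soup(html, 'html.parser').find_all('h4', class_='grid-card__title')
titles = [item.text for item in from_html]

print(from_url)
print(titles)
```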

I did some research and noticed that the site does not load all of its content immediately.

To extract all the product titles, you need to scroll the page to the end and wait until the data has loaded (scroll pagination). To accomplish this, I decided to use playwright.

The evaluate() method takes as a parameter the JavaScript code needed to scroll the page. After scrolling, you need to wait a while for the data to load:

page.evaluate('window.scrollTo(0, document.querySelector("body").scrollHeight)')
time.sleep(3)
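A single scroll with a fixed sleep may not be enough if the page keeps loading more items. One possible variation, sketched here under assumptions, is to keep scrolling until the page height stops growing. The helper name scroll_until_stable and its parameters pause_ms and max_rounds are hypothetical; it relies only on the page's evaluate() and wait_for_timeout() methods, and whether the site really loads in several rounds has not been verified:

```python
def scroll_until_stable(page, pause_ms=1000, max_rounds=20):
    """Scroll to the bottom until the page height stops growing.

    `page` is expected to be a Playwright sync Page (or any object with
    evaluate() and wait_for_timeout()). Returns the number of scrolls done.
    """
    last_height = page.evaluate('document.body.scrollHeight')
    for round_number in range(1, max_rounds + 1):
        # Jump to the current bottom of the page.
        page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
        # Give lazily loaded content some time to arrive.
        page.wait_for_timeout(pause_ms)
        new_height = page.evaluate('document.body.scrollHeight')
        if new_height == last_height:
            # Nothing new was appended, so we assume the end was reached.
            return round_number
        last_height = new_height
    # Give up after max_rounds to avoid looping forever on endless feeds.
    return max_rounds
```

It would be called as scroll_until_stable(page) in place of the single evaluate() and time.sleep(3) pair. The max_rounds cap is there so that an endless feed cannot keep the script running indefinitely.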

We pass the content of the loaded page to the soup object, extract the product titles, and close the page:

soup = BeautifulSoup(page.content(), 'lxml')

for item in soup.select('.grid-card__title'):
    print(item.text)

page.close()
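The extraction step can be tried without a browser. The sketch below assumes a hypothetical HTML fragment that only mimics the listing markup, and uses the built-in 'html.parser' so that lxml is not required. The helper name extract_titles is made up for illustration. It collects the titles into a list and uses get_text(strip=True) to drop surrounding whitespace:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for page.content(); not the real markup.
sample_html = '''
<ul>
  <li><h4 class="grid-card__title">  Jackson JS Series JS32T  </h4></li>
  <li><h4 class="grid-card__title">Jackson X Series SL3X DX</h4></li>
  <li><h4 class="other">Not a listing title</h4></li>
</ul>
'''


def extract_titles(html):
    """Return the stripped text of every element with class grid-card__title."""
    soup = BeautifulSoup(html, 'html.parser')
    return [tag.get_text(strip=True) for tag in soup.select('.grid-card__title')]


titles = extract_titles(sample_html)
print(titles)
```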

Full code:

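import time

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

# Note: the 'lxml' parser used below requires the lxml package to be installed.


def run(playwright):
    URL = 'https://reverb.com/marketplace?query=jackson+guitars'

    # Start a headless Chromium browser and open the search page.
    browser = playwright.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(URL)

    # Scroll to the bottom and wait for the lazily loaded listings.
    page.evaluate('window.scrollTo(0, document.querySelector("body").scrollHeight)')
    time.sleep(3)

    # Parse the rendered HTML and print every listing title.
    soup = BeautifulSoup(page.content(), 'lxml')

    for item in soup.select('.grid-card__title'):
        print(item.text)

    # Close the page, then the browser itself.
    page.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)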

Output:

JACKSON Jackson Pro Series Rhoads RR24 Maul Crackle (S/N:CYJ21000010) (11/03)
Final Reduction!  2005 Jackson USA Soloist SL-1 in Gloss Black Finish! OHSC
Jackson X Series SL3X DX Soloist Frost Byte Crackle
Jackson JS Series JS32T Jackson Trem Black/White
Jackson Jackson RRX24  Black with Yellow Bevels
... other results
