Python Beautiful Soup find_all()

I am trying to use find_all() on the HTML of this page:

http://www.simon.com/mall

Based on advice in other threads, I ran the link through the validator below and it found errors, but I am not sure how those errors might be affecting what I am trying to do in Beautiful Soup.

https://validator.w3.org/

Here is my code:

from requests import get

url = 'http://www.simon.com/mall'
response = get(url)

from bs4 import BeautifulSoup

html = BeautifulSoup(response.text, 'html5lib')
mall_list = html.find_all('div', class_ = 'col-xl-4 col-md-6 ')

print(type(mall_list))
print(len(mall_list))

The result is:

"C:\Program Files\Anaconda3\python.exe" C:/Users/Chris/PycharmProjects/IT485/src/GetMalls.py
<class 'bs4.element.ResultSet'>
0

Process finished with exit code 0

I know there are hundreds of these divs in the HTML. Why am I not getting any matches?

I sometimes use BeautifulSoup too. The problem lies in the way you select on the attributes. The full working code is below:

import requests
from bs4 import BeautifulSoup

url = 'http://www.simon.com/mall'
response = requests.get(url)
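# Name the parser explicitly so the result does not depend on which parsers are installed
html = BeautifulSoup(response.text, 'html.parser')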
mall_list = html.find_all('div', attrs={'class': 'col-lg-4 col-md-6'})[1].find_all('option')
malls = []

for mall in mall_list:
    if mall.get('value') == '':
        continue
    malls.append(mall.text)

print(malls)
print(type(malls))
print(len(malls))

Your code looks fine. However, when I visit the simon.com/mall link and check Chrome Dev Tools, there don't seem to be any instances of the class 'col-xl-4 col-md-6 '.

Try testing your code with 'col-xl-2' and you should see some results.
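Separately from what the live page contains, it may help to see how Beautiful Soup compares a multi-class string. The sketch below uses a made-up HTML snippet (not markup from simon.com) to show that a `class_` string containing spaces is compared against the whole class attribute, so a trailing space or a different class order prevents a match, while a CSS selector matches on the individual classes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; it is not taken from simon.com.
sample = """
<div class="col-xl-4 col-md-6 ">A</div>
<div class="col-md-6 col-xl-4">B</div>
<div class="col-xl-2">C</div>
"""

soup = BeautifulSoup(sample, "html.parser")

# A multi-class string is compared with the full attribute value,
# so the trailing space here means nothing matches.
with_space = soup.find_all("div", class_="col-xl-4 col-md-6 ")

# Without the trailing space, only the div with the same class order matches.
exact = soup.find_all("div", class_="col-xl-4 col-md-6")

# A CSS selector matches every div carrying both classes, in any order.
by_selector = soup.select("div.col-xl-4.col-md-6")

# A single class name matches any div that has that class.
single = soup.find_all("div", class_="col-xl-2")

print(len(with_space), [d.text for d in exact],
      [d.text for d in by_selector], [d.text for d in single])
```

If the class really is absent from the downloaded HTML, none of these forms will find it, which is the case the other answers address.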

I assume you are trying to parse the title and location of the different items on that page (the URL in your script). The content of that page is generated dynamically, so you can't get it with requests alone; instead, you need a browser automation tool such as Selenium, which is what I used in the code below. Give this a try:

from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get('http://www.simon.com/mall')
time.sleep(3)

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

for item in soup.find_all(class_="mall-list-item-text"):
    name = item.find_all(class_='mall-list-item-name')[0].text
    location = item.find_all(class_='mall-list-item-location')[0].text
    print(name,location)

Results:

ABQ Uptown Albuquerque, NM
Albertville Premium Outlets® Albertville, MN
Allen Premium Outlets® Allen, TX
Anchorage 5th Avenue Mall Anchorage, AK
Apple Blossom Mall Winchester, VA
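One way to check whether the target elements are missing from the raw download, as this answer suggests, is to count them in each version of the HTML. The following is a minimal sketch: `count_class` is a hypothetical helper, and the two HTML strings are made-up stand-ins for `response.text` (from requests) and `driver.page_source` (from Selenium).

```python
from bs4 import BeautifulSoup


def count_class(html_text, class_name):
    """Count the elements in already-downloaded HTML that carry class_name."""
    soup = BeautifulSoup(html_text, "html.parser")
    return len(soup.find_all(class_=class_name))


# Hypothetical stand-in for the HTML returned by requests: the list is empty
# because a script would fill it in later inside a real browser.
static_html = '<div id="mall-list"></div><script>/* fills #mall-list */</script>'

# Hypothetical stand-in for the HTML after the browser has run the script.
rendered_html = (
    '<div id="mall-list"><div class="mall-list-item-text">'
    '<span class="mall-list-item-name">ABQ Uptown</span></div></div>'
)

static_count = count_class(static_html, "mall-list-item-text")
rendered_count = count_class(rendered_html, "mall-list-item-text")
print(static_count, rendered_count)
```

A count of zero for the requests download together with a non-zero count for the browser-rendered source would suggest the elements are added by JavaScript, so no change to the find_all() arguments would recover them from the raw response.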
