Python BeautifulSoup Web Scraping

Question

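Hi, I'm new to Python and web scraping. Below is my script for getting URLs from a website. When I inspect the site in the browser I can see the URLs, but I'm stuck getting them from the class attribute: in my script's output they show up as javascript. Here is the link; any help is appreciated. Thanks in advance.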

from bs4 import BeautifulSoup
import urllib.request
import pandas as pd
url = "https://www.northcoastelectric.com/Products"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
something = soup.find(class_="clearAfter")
print(something)
for i in something:
    new_url = i.a["href"]
    print(new_url)

Answer 1

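You should call find_all with the class cimm_categoryItemBlock instead of clearAfter, because that is the class name of the li elements that contain the product links.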

something = soup.find_all(class_="cimm_categoryItemBlock")
for i in something:
    new_url = i.a.get("href")
    print(new_url)
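
To try this without hitting the network, here is a minimal sketch that runs the same find_all lookup against a small inline document. The SAMPLE_HTML markup and the extract_links helper are assumptions made up for illustration; only the class names come from the thread, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real category page.
SAMPLE_HTML = """
<ul class="clearAfter">
  <li class="cimm_categoryItemBlock"><a href="/Products/wire">Wire</a></li>
  <li class="cimm_categoryItemBlock"><a href="/Products/lighting">Lighting</a></li>
  <li class="cimm_categoryItemBlock"><span>No link here</span></li>
</ul>
"""

def extract_links(html, class_name="cimm_categoryItemBlock"):
    """Return the href of the first <a> inside every element with class_name."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for item in soup.find_all(class_=class_name):
        anchor = item.find("a")
        # Skip items that have no anchor or an anchor without an href.
        if anchor is not None and anchor.get("href"):
            links.append(anchor["href"])
    return links

links = extract_links(SAMPLE_HTML)
print(links)
```

Checking for a missing anchor avoids the TypeError that i.a["href"] raises when an item has no link.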

Answer 2

You just need to go one level deeper. Try this:

something = soup.find(class_="clearAfter").findNext(class_="clearAfter")

Just keep adding "findNext" calls on the "something" variable as above (assuming every link has the same class name) and you will get the links.
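
Repeating the call by hand gets tedious, so one way to sketch the same idea is a loop that keeps asking for the next matching element until none is left. The SAMPLE_HTML markup and the links_via_find_next helper below are hypothetical; find_next is the current name of findNext in bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three blocks share the class, one does not.
SAMPLE_HTML = """
<div class="clearAfter"><a href="/a">A</a></div>
<div class="other"><a href="/skip">skip</a></div>
<div class="clearAfter"><a href="/b">B</a></div>
<div class="clearAfter"><a href="/c">C</a></div>
"""

def links_via_find_next(html, class_name="clearAfter"):
    """Walk the document in order, hopping between elements that share class_name."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    node = soup.find(class_=class_name)
    while node is not None:
        anchor = node.find("a")
        if anchor is not None and anchor.get("href"):
            hrefs.append(anchor["href"])
        # find_next returns None once no later element matches, ending the loop.
        node = node.find_next(class_=class_name)
    return hrefs

hrefs = links_via_find_next(SAMPLE_HTML)
print(hrefs)
```

Note that find_next searches everything that comes later in document order, including descendants, so nested elements with the same class would also be visited.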

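Keep in mind: BeautifulSoup (and HTML) can have many branches. When you create a BeautifulSoup instance, people commonly say you are creating a new "tree". So what if everything else fails? Just create another instance and try a different branch or a different approach (you probably won't need that here), and you'll be golden. HTML can be very deeply nested.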
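
As one example of a "different way" into the same tree, a CSS selector can name the whole branch in a single expression. This is only a sketch on made-up markup (SAMPLE_HTML is hypothetical), and it assumes a bs4 install with CSS selector support via select.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: product links sit under ul.clearAfter, another link sits outside it.
SAMPLE_HTML = """
<div id="content">
  <ul class="clearAfter">
    <li class="cimm_categoryItemBlock"><a href="/Products/wire">Wire</a></li>
    <li class="cimm_categoryItemBlock"><a href="/Products/tools">Tools</a></li>
  </ul>
  <a href="/About">About</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Follow one branch only: ul.clearAfter -> li.cimm_categoryItemBlock -> a with an href.
selector = "ul.clearAfter li.cimm_categoryItemBlock a[href]"
branch_hrefs = [a["href"] for a in soup.select(selector)]
print(branch_hrefs)
```

The link outside the list is ignored because it is not on the selected branch.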

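Otherwise, you can use Selenium. Super simple:

Just use the Selenium command that collects all elements on the page by class name (clearAfter in your case), iterate over them, append to a list, and read each href through the "get_attribute" method. Here is an example of how I did this with Selenium.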

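    def get_results(self):
        cv = []
        # self.driver is my Chromedriver webdriver used to manipulate the browser. Let me know if you have Qs!
        # Note: newer Selenium 4 releases drop the find_element(s)_by_* helpers;
        # there, use find_elements(By.CLASS_NAME, ...) and find_element(By.CSS_SELECTOR, ...).
        bbb = self.driver.find_elements_by_class_name('user-name')

        for plink in bbb:
            cv.append(plink.find_element_by_css_selector('a').get_attribute('href'))
        return cv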

Hope this helps.

Answer 1
0, accepted, 2017-09-20 13:14:20

Answer 2
0
