
Python BeautifulSoup web scraping

Hi, I am new to both Python and web scraping. Below is my script for getting the URLs from the website, but I am stuck: I can't get the URLs from the elements with the class I am targeting. When I inspect the website in the browser I can see the URLs, but in my script's output they show up as javascript. The page is the one linked in the script below. Any help would be appreciated, thanks in advance.

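from bs4 import BeautifulSoup
import urllib.request
import pandas as pd

url = "https://www.northcoastelectric.com/Products"
html = urllib.request.urlopen(url).read()
# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")
something = soup.find(class_="clearAfter")
print(something)
for i in something:
    # this is where the hrefs come out as javascript instead of product URLs
    new_url = i.a["href"]
    print(new_url)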

You should call find_all with the class cimm_categoryItemBlock instead of clearAfter, because that is the class of the li elements that contain the product links:

something = soup.find_all(class_="cimm_categoryItemBlock")
for i in something:
    new_url = i.a.get("href")
    print(new_url)
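To make the idea above easier to try offline, here is a minimal sketch that runs against a small inline snippet instead of the live site. The markup in SAMPLE_HTML is made up for illustration (only the class names come from this thread), and extract_links is a hypothetical helper name. It also skips items without a link, drops javascript: hrefs, and resolves relative paths against the page URL, which may or may not be needed on the real page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.northcoastelectric.com/Products"

# Hypothetical markup standing in for the real page (assumption, not the site's HTML).
SAMPLE_HTML = """
<ul class="clearAfter">
  <li class="cimm_categoryItemBlock"><a href="/Products/Lighting">Lighting</a></li>
  <li class="cimm_categoryItemBlock"><a href="javascript:void(0)">More</a></li>
  <li class="cimm_categoryItemBlock"><span>No link here</span></li>
  <li class="cimm_categoryItemBlock"><a href="/Products/Wire">Wire</a></li>
</ul>
"""


def extract_links(html, base_url):
    """Return absolute URLs for every category item that has a real link."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for item in soup.find_all(class_="cimm_categoryItemBlock"):
        anchor = item.find("a", href=True)
        if anchor is None:
            continue  # item has no <a href=...> at all
        href = anchor["href"]
        if href.lower().startswith("javascript:"):
            continue  # placeholder link, not a page URL
        links.append(urljoin(base_url, href))
    return links


links = extract_links(SAMPLE_HTML, BASE_URL)
print(links)
```

Using item.find("a", href=True) rather than item.a["href"] avoids an exception when an li has no anchor or the anchor has no href.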

You just need to go another layer deep. Try this:

something = soup.find(class_="clearAfter").find_next(class_="clearAfter")

Keep chaining find_next calls like the one above onto the 'something' variable (assuming the class name is the same at each level) and you'll get to the links. Note that the class has to be passed as class_=...; a bare string is treated as a tag name.
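Here is a small sketch of what "going another layer deep" looks like, again on invented markup (the nesting and the id attributes in SAMPLE_HTML are assumptions, added only so the two elements can be told apart). It also shows a CSS selector that reaches the same anchors in one step, which may be easier to read than a chain of find_next calls.

```python
from bs4 import BeautifulSoup

# Hypothetical nesting: an outer and an inner element share the same class.
SAMPLE_HTML = """
<div class="clearAfter" id="outer">
  <p>intro</p>
  <ul class="clearAfter" id="inner">
    <li><a href="/Products/Lighting">Lighting</a></li>
    <li><a href="/Products/Wire">Wire</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first match in document order: the outer div.
outer = soup.find(class_="clearAfter")

# find_next() continues through the document after that element,
# so it lands on the nested ul with the same class.
inner = outer.find_next(class_="clearAfter")

hrefs = [a["href"] for a in inner.find_all("a", href=True)]

# Alternative: a descendant CSS selector that targets the same anchors.
hrefs_css = [a["href"] for a in soup.select(".clearAfter .clearAfter a[href]")]

print(outer["id"], inner["id"], hrefs)
```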

Remember: BeautifulSoup (and HTML) can have many branches. When you create a BeautifulSoup instance, you are creating what is commonly called a "tree". So if all else fails, create another instance and try a different branch or a different approach (you likely will not need that here) and you'll be golden. HTML can get deeply nested.

Otherwise, you could use Selenium. Super easy:

Use Selenium to collect all the elements on the page with a given class name (in your case, clearAfter), iterate over them, and append each link's href to a list via the get_attribute method. Here's an example of how I used Selenium to do this.

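    from selenium.webdriver.common.by import By

    def get_results(self):
        # self.driver is my Chromedriver webdriver used to manipulate the browser. Let me know if you have Qs!
        cv = []
        # find_elements(By...) replaces the find_elements_by_* helpers removed in Selenium 4
        bbb = self.driver.find_elements(By.CLASS_NAME, 'user-name')
        for plink in bbb:
            # grab the href of the first <a> inside each matched element
            cv.append(plink.find_element(By.CSS_SELECTOR, 'a').get_attribute('href'))
        return cv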

Hope I helped.


 