
Scraping With Selenium, PhantomJS & BS4

I'm currently using Windows 10 and Python 3.7, and I've been reading about how to scrape without opening a Firefox browser window for each URL in the urls list. The code below throws an error, and I'm sure it has to do with how PhantomJS is implemented; I just don't know what specifically.

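I've read that PhantomJS is the solution when used together with Selenium. I installed PhantomJS, set up the path on my computer, and it appears to be running, but I'm not entirely sure how to implement it in the code itself.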
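One way to confirm that the PATH setup described above is actually visible to Python is to ask the standard library where it would find the binary. This is a minimal sketch, not part of the original post: `find_phantomjs` is a hypothetical helper name, and it assumes the executable is called `phantomjs` (on Windows the `.exe` extension is added through PATHEXT).

```python
import shutil


def find_phantomjs(name="phantomjs"):
    """Return the full path of the PhantomJS binary found on PATH, or None."""
    # shutil.which performs the same PATH lookup the operating system does,
    # including the PATHEXT extensions (.exe, .bat, ...) on Windows.
    return shutil.which(name)


if __name__ == "__main__":
    location = find_phantomjs()
    if location is None:
        print("phantomjs is not on PATH; pass its full path to Selenium instead")
    else:
        print("phantomjs found at", location)
```

If this prints a path, Selenium 3's `webdriver.PhantomJS()` can be called without `executable_path`, because its default value `"phantomjs"` is resolved through PATH.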

The line driver = webdriver.PhantomJS(executable_path=r"C:\phantomjs") is the one that tries to run PhantomJS. The code worked fine before, when it used driver = webdriver.Firefox().

import csv

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent  # assumption: UserAgent() below comes from the fake_useragent package
from selenium import webdriver

# The 12 result pages differ only in the Nao offset (0, 90, ..., 990).
base_url = "https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao={}&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD"
urls = [base_url.format(offset) for offset in range(0, 991, 90)]
#url = "https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=180&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD"

user_agent = UserAgent()

#make csv file
csv_file = open("gcscrape.csv", "w", newline='') #added the newline thing on 5.17.20 to try to stop blank lines from writing
csv_writer = csv.writer(csv_file)
csv_writer.writerow(["bass_name","bass_price"])

for url in urls:
    web_r = requests.get(url)
    web_soup = BeautifulSoup(web_r.text,"html.parser")

    #print(web_soup.findAll("li", class_="product-container")) #finding all of the grid items on the url above - price, photo, image, details and all
    #print(len(web_soup.findAll("li", class_="product-container"))) #printing out the length of the

    #driver = webdriver.Firefox()
    driver = webdriver.PhantomJS(executable_path=r"C:\phantomjs")
    driver.get(url)
    html = driver.execute_script("return document.documentElement.outerHTML") #whats inside of this is a javascript call to get the outer html content of the page
    sel_soup = BeautifulSoup(html, "html.parser")

    for content in sel_soup.findAll("li", class_="product-container"):
        #print(content)

        bass_name = content.find("div", class_="productTitle").text.strip() #pulls the bass guitar name
        print(bass_name)

        prices_new = []
        for i in content.find("span", class_="productPrice").text.split("$"):
            prices_new.append(i.strip())
        bp = prices_new[1]
        print(bp)

        #write row to new csv file
        csv_writer.writerow([bass_name, bp])
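
The price lookup in the loop above takes the second piece of the text after splitting on "$", so a listing whose price text has no dollar sign would raise an IndexError. As a hedged sketch (the helper name `extract_price` is hypothetical and not from the original code), the same split can be wrapped with a guard:

```python
def extract_price(price_text):
    """Return the text after the first "$" in price_text, or None if there is none."""
    # Same idea as the loop above: split on "$" and keep the second piece.
    pieces = [piece.strip() for piece in price_text.split("$")]
    if len(pieces) < 2 or not pieces[1]:
        return None  # no dollar sign, or nothing after it
    return pieces[1]


if __name__ == "__main__":
    print(extract_price(" $499.99 "))       # 499.99
    print(extract_price("Call for price"))  # None
```

Inside the loop this would replace the prices_new list, for example bp = extract_price(content.find("span", class_="productPrice").text), and rows with bp equal to None could be skipped or written as empty.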

Make sure you download the correct PhantomJS distribution for your operating system.

On Windows, the following lines of code should work:

driver = webdriver.PhantomJS("C://phantomjs.exe")
driver.get(url)
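
The question passed the folder C:\phantomjs, while Selenium needs the path of the executable file itself, which is likely the cause of the error. Below is a small sketch of a check that accepts either form and fails early with a readable message. The names `resolve_phantomjs` and `open_phantomjs`, and the folder layouts searched (the binary directly in the folder or under `bin`), are assumptions for illustration. Note that `webdriver.PhantomJS` exists only in Selenium 3.x and earlier; it was deprecated and then removed in Selenium 4.

```python
import os


def resolve_phantomjs(path):
    """Return the path of the PhantomJS executable, given the file or its folder."""
    if os.path.isfile(path):
        return path
    if os.path.isdir(path):
        # Assumed layouts: the binary sits in the folder itself or in bin/.
        for relative in ("phantomjs.exe", os.path.join("bin", "phantomjs.exe"),
                         "phantomjs", os.path.join("bin", "phantomjs")):
            candidate = os.path.join(path, relative)
            if os.path.isfile(candidate):
                return candidate
    raise FileNotFoundError("no PhantomJS executable found at " + repr(path))


def open_phantomjs(path):
    """Start a PhantomJS-backed Selenium driver (Selenium 3.x only)."""
    # Imported lazily so resolve_phantomjs stays usable without Selenium installed.
    from selenium import webdriver

    return webdriver.PhantomJS(executable_path=resolve_phantomjs(path))
```

With this, the line in the question could become driver = open_phantomjs(r"C:\phantomjs"), created once before the loop, with driver.quit() called after the loop to release the PhantomJS process.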

