
How can I scrape this site using Selenium in headless mode?

I want to scrape information from this site ( https://www.monotaro.com/p/8928/5682/ ) using Selenium on Ubuntu in Docker. For that I need to run chromedriver in headless mode, but my script cannot find the information I am looking for when headless mode is enabled.

When I run the same test script without headless mode on a Mac, I get the expected information.

Please help me.

from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.monotaro.com/p/8928/5682/"
options = webdriver.chrome.options.Options()
# With the next two lines enabled (headless mode), the brand tag below is not found.
#options.add_argument('--headless')
#options.add_argument('--disable-gpu')
self.browser = webdriver.Chrome("/.../chromedriver", chrome_options=options)
self.browser.get(url)
self.browser.implicitly_wait(10)  # only affects later find_element calls, not page_source

# Parse a snapshot of the page as the browser currently sees it.
self.html = self.browser.page_source
self.soup = BeautifulSoup(self.html, "html.parser")

brand = self.soup.find("span", class_="itd_brand")  # None when the tag is absent
print(brand)
brand = brand.get_text().replace('\n','')
print(brand)

When I run this program without headless mode, I get the tag I want and its text.

<span class="itd_brand">
<a href="/brand/907/"> <strong class="st itd_all_size">TRUSCO</strong>
</a> </span>
 TRUSCO 

However, in headless mode the tag is not found.

None
Traceback (most recent call last):
  File "/Users/plugins/webScraper.py", line 82, in <module>
    print(monotaro.GetBrand())
  File "/Users/plugins/webScraper.py", line 59, in GetBrand
    brand = brand.get_text().replace('\n','')
AttributeError: 'NoneType' object has no attribute 'get_text'

I tried adding a delay with "implicitly_wait" to give the tag time to appear, but I still could not get it.
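
For reference, implicitly_wait only applies to find_element lookups, so it does not delay the page_source snapshot taken above. Below is a minimal sketch of an explicit wait for the same tag that returns None instead of raising when the tag never shows up. The helper names get_brand and normalize_text are hypothetical and not from the original script. It may not fix the headless case if the page served to headless Chrome simply lacks the tag, but it makes a timing problem easy to tell apart from a missing element.

```python
def normalize_text(text):
    """Collapse newlines and runs of whitespace into single spaces."""
    return " ".join(text.split())


def get_brand(browser, timeout=10):
    """Return the brand name, or None if span.itd_brand never appears.

    `browser` is an already created Selenium WebDriver that has loaded the page.
    """
    # Imported here so normalize_text() stays usable without Selenium installed.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    try:
        # Poll the DOM until the tag is present or the timeout expires.
        element = WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "span.itd_brand"))
        )
    except TimeoutException:
        return None  # the tag never appeared; the caller decides what to do
    return normalize_text(element.text)
```

If get_brand returns None in headless mode even with a generous timeout, the tag is probably not in the page that the headless browser received at all, and waiting longer will not help.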

I solved this problem by using the xvfb package.
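
The post does not say how xvfb was used, so the following is only one possible setup, sketched under assumptions: the xvfb package (which provides the xvfb-run wrapper) is installed in the container, and the scraper is started as a separate script with the --headless option left commented out, so that Chrome draws into the virtual display. The function names and the screen size are illustrative.

```python
import subprocess
import sys


def xvfb_command(script, width=1920, height=1080, depth=24):
    """Build an xvfb-run command line that runs `script` with this interpreter."""
    screen = f"-screen 0 {width}x{height}x{depth}"
    return [
        "xvfb-run",
        "--auto-servernum",          # pick a free X display number automatically
        f"--server-args={screen}",   # size and colour depth of the virtual screen
        sys.executable,
        script,
    ]


def run_under_xvfb(script):
    """Run the scraper inside a virtual display; raises if it exits non-zero."""
    return subprocess.run(xvfb_command(script), check=True)
```

Calling run_under_xvfb("webScraper.py") would then start the script from the traceback above with a regular (non-headless) Chrome window rendered off-screen.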
