
Scraping advice on Crawling and info from Javascript onclick() function

I've finally found a thread offering newbie help on this subject, but I am no further forward in resolving this issue, partly because I'm a newbie at programming :)

The thread is: Newbie: How to overcome Javascript "onclick" button to scrape web page?

I have a similar issue. The site I would like to scrape holds a lot of information on a lot of parts, but I only want to scrape certain details for each part (company, part number, etc.). I have two issues:

  1. How can I grab this information from the site without having to enter search terms? Should I use a crawler?

  2. Most of the information for a part number is on its page, but the page also has a JavaScript 'onclick()' function that, when clicked, opens a small window displaying additional information I would also like to scrape. How can I scrape the information in this additional window?

I'm using import.io but have been advised to switch to Selenium and PhantomJS. I would welcome suggestions of other tools, as long as they are not too complicated (or come with instructions, which would be awesome!). I would really appreciate it if someone could help me overcome this issue or provide instructions. Thank you.

If you are a newbie and want to create a web crawler for data extraction, I would recommend Selenium. Note, however, that Selenium WebDriver is slower than Scrapy (a Python framework for writing web crawlers).

As you have been advised to use Selenium, I will focus only on Selenium with Python.

For your first issue: "How to grab such information from this site"

Suppose the website you want to extract data from is www.fundsupermart.co.in (I chose it to show how to handle new-window pop-ups).

Using Selenium, you can start crawling by writing:

from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.fundsupermart.co.in/main/fundinfo/mutualfund-AXIS-BANKING-DEBT-FUND--GROWTH-AXS0077.html')

This opens a Firefox browser through WebDriver and loads the page at the URL passed to the get() method.

Now suppose you want to extract a table. You can locate it by its tag name, XPath, or class name using the functions Selenium provides. For example, to extract the table under "Investment Objective", I would:

right click -> Inspect Element -> find the appropriate tag in the inspector -> right click -> Copy XPath

Here I found that the <tbody> tag was the one containing the table, so I right-clicked it and chose Copy XPath, which gave me the XPath of that tag, i.e.:

xpath = '/html/body/table/tbody/tr[2]/td/table/tbody/tr[3]/td/table[2]/tbody/tr/td/table/tbody/tr[1]/td/font/table/tbody/tr[1]/td/table/tbody/tr[5]/td/table/tbody'

Then add this line to the code:

driver.find_element("xpath", xpath).text  # find_element_by_xpath() was removed in Selenium 4; "xpath" is the value of By.XPATH

You can extract other data from any website in the same way. See also Selenium's documentation.
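Reading .text on the <tbody> returns the whole table as one string. If you need the individual cells, a helper along these lines may work. It is only a sketch: table_rows is a hypothetical name, it assumes the element passed in is the <tbody> located with the copied XPath, and it uses the locator string "xpath", which is the value of Selenium's By.XPATH constant.

```python
def table_rows(tbody):
    """Return the table body as a list of rows, each a list of cell texts.

    `tbody` is assumed to be the WebElement found with the copied XPath.
    """
    rows = []
    # "./tr" selects only direct child rows, so rows of nested tables are skipped.
    for tr in tbody.find_elements("xpath", "./tr"):
        # "./td" selects only the cells that belong to this row.
        cells = tr.find_elements("xpath", "./td")
        rows.append([cell.text.strip() for cell in cells])
    return rows
```

For example, table_rows(driver.find_element("xpath", xpath)) would return one list per <tr>. Header cells written as <th> are not included in this sketch.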

For your second issue: "How can I scrape the information in this additional window?"

To click the link, you can use the click() function provided by Selenium. Suppose I want to click the link "Click here for price history". I get its XPath (as before) and add the line:

driver.find_element("xpath", xpath).click()  # find_element_by_xpath() was removed in Selenium 4

This opens a new window.

To extract data from the new window, you first have to switch to it, which you can do by adding these lines:

windows = driver.window_handles
driver.switch_to.window(windows[1])  # switch_to_window() was removed in Selenium 4
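
Indexing windows[1] assumes the pop-up is the second handle in the list, but WebDriver does not promise any particular order for window handles. A slightly more defensive sketch compares the handles before and after the click. new_window_handle is a hypothetical helper, not part of Selenium.

```python
def new_window_handle(handles_before, handles_after):
    """Return the one handle that appeared after the click."""
    known = set(handles_before)
    # Keep only the handles that were not open before the click.
    new_handles = [h for h in handles_after if h not in known]
    if len(new_handles) != 1:
        raise RuntimeError(
            "expected exactly one new window, found %d" % len(new_handles)
        )
    return new_handles[0]
```

To use it, save before = driver.window_handles ahead of the click() call and read driver.window_handles again afterwards; the value returned by new_window_handle(before, driver.window_handles) is the handle to switch to. If the pop-up opens slowly, the new handle may not be listed yet, so a short wait before the second read may be needed.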

The webdriver is now switched to the new window, so I can extract data as before. To close this window and switch back to the original one, add:

driver.close()
driver.switch_to.window(windows[0])  # switch_to_window() was removed in Selenium 4
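
The switch, scrape, close, and switch-back steps can also be wrapped in one function so the driver always returns to the original window, even if the extraction fails. This is a sketch under assumptions: scrape_popup and the extract callback are hypothetical names, and popup_handle is the handle of the new window.

```python
def scrape_popup(driver, popup_handle, extract):
    """Run extract(driver) inside the pop-up, then return to the original window."""
    original = driver.current_window_handle  # remember where we started
    driver.switch_to.window(popup_handle)
    try:
        return extract(driver)
    finally:
        # Runs even if extract() raises: close the pop-up and go back.
        driver.close()
        driver.switch_to.window(original)
```

For instance, scrape_popup(driver, windows[1], lambda d: d.find_element("xpath", "/html/body").text) would return the visible text of the pop-up and leave the driver on the original page.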

This was a very basic and naive approach to web crawling with Selenium. The linked tutorial is really good and will help you a lot.

