
Web scraping dynamic content with Python (dynamic HTML/JavaScript table)

I would like to scrape data from a dynamic HTML table where some of the data are loaded (with JavaScript) only after a button is clicked. The data I am interested in are on the page https://www.investing.com/indices/stoxx-600-components, and so far I have only managed to scrape the data loaded by default.

On that page, I am trying to extract the data contained in the table named "Fundamental".

So far, I coded this:

import pandas as pd
import requests as rq
from bs4 import BeautifulSoup

headers = {"user-agent": "chrome"}
url = "https://www.investing.com/indices/stoxx-600-components"
htmlcontent = rq.get(url, headers=headers).text
soup = BeautifulSoup(htmlcontent, "lxml")

table_price = soup.find("table", {"id": "cr1"})

indexcomponents = []

rows = table_price.find_all("tr")

for row in rows[1:]:
    columns = row.find_all("td")
    indexcomponents.append([
        columns[1].text,
        columns[2].text,
        columns[6].text,
        columns[7].text,
        columns[8].text])

for component in indexcomponents:
    print(component)

I am well aware that similar questions have already been asked, but I am a beginner in Python and know nothing about JavaScript, so I haven't succeeded in implementing the proposed solutions. Thanks in advance for the help!

Here is a working solution using pandas.read_html. It reads the table that is loaded by default:

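from io import StringIO

import pandas as pd
import requests

url = "https://www.investing.com/indices/stoxx-600-components"
# Send a browser-like User-Agent header with the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
r = requests.get(url, headers=headers)

# read_html returns a list of DataFrames, one per table matching id="cr1".
# StringIO is used because newer pandas versions deprecate passing a raw HTML string.
data = pd.read_html(StringIO(r.text), attrs={"id": "cr1"})

for i in data:
    print(i)

# i.to_csv('investing.csv', index=False)  # To store the data as CSV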

OUTPUT:

  Unnamed: 0                    Name      Last  ...     Vol.      Time  Unnamed: 9
0           NaN                3I Group  1287.000  ...    1.32M  11:35:00         NaN
1           NaN                     A2A     1.815  ...   14.14M  11:35:34         NaN
2           NaN                     AAK   203.000  ...  272.67K  11:29:34         NaN
3           NaN     Aalberts Industries    50.620  ...  230.09K  11:39:08         NaN
4           NaN                     ABB    33.500  ...    1.90M  11:31:00         NaN
..          ...                     ...       ...  ...      ...       ...         ...
584         NaN            Worldline SA    78.960  ...  766.23K  11:37:33         NaN
585         NaN                     WPP   933.000  ...    1.98M  11:38:00         NaN
586         NaN      Yara International   467.800  ...  322.33K  10:25:01         NaN   
587         NaN              Zalando SE    97.900  ...  466.31K  11:35:00         NaN   
588         NaN  Zurich Insurance Group   367.400  ...  207.07K  11:34:00         NaN   

[589 rows x 10 columns]
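
The Vol. column in the output above comes back as text such as 1.32M or 272.67K, so it cannot be sorted or summed directly. Below is a minimal sketch of a converter. The name `parse_abbreviated` is hypothetical, and the code assumes that K, M and B stand for thousand, million and billion, which should be checked against the site.

```python
# Hypothetical helper: turn strings such as "1.32M" into plain floats.
# Assumption: K = thousand, M = million, B = billion.
SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9}


def parse_abbreviated(text):
    """Convert strings like '1.5K' to a float; return None if that fails."""
    if text is None:
        return None
    cleaned = str(text).strip().replace(",", "")
    if not cleaned:
        return None
    # Look at the last character to see whether it is a known suffix
    multiplier = SUFFIXES.get(cleaned[-1].upper())
    number = cleaned[:-1] if multiplier else cleaned
    try:
        return float(number) * (multiplier or 1.0)
    except ValueError:
        return None


print(parse_abbreviated("1.5K"))  # 1500.0
```

With the DataFrame from the answer above, this could be applied as `i["Vol."].map(parse_abbreviated)`, assuming the column is named "Vol." as in the printed output.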

To get the "Fundamental" data you actually want, here is a solution using Selenium with Scrapy:

import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from time import sleep


class TableSpider(scrapy.Spider):
    name = 'table'

    allowed_domains = ['www.investing.com']
    start_urls = [
        'https://www.investing.com/indices/stoxx-600-components'
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        chrome_options = Options()
        # chrome_options.add_argument("--headless")

        # chromedriver is expected to be available on PATH
        self.driver = webdriver.Chrome(options=chrome_options)
        self.driver.set_window_size(1920, 1080)
        self.driver.get("https://www.investing.com/indices/stoxx-600-components")
        sleep(5)  # give the page time to load
        # Click the "Fundamental" tab so that its data is loaded by JavaScript
        rur_tab = self.driver.find_element(By.ID, "filter_fundamental")
        rur_tab.click()
        sleep(5)  # give the table time to update

        # Keep the rendered HTML, then shut the browser down
        self.html = self.driver.page_source
        self.driver.quit()

    def parse(self, response):
        # Parse the HTML captured by Selenium rather than Scrapy's own response
        resp = Selector(text=self.html)
        for tr in resp.xpath('(//tbody)[2]/tr'):
            yield {
                'Average Vol': tr.xpath(".//td[3]/text()").get(),
                'Market Cap': tr.xpath(".//td[4]/text()").get(),
            }
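
The two sleep(5) calls in the spider wait a fixed time whether or not the page is ready. A common alternative is to poll for a condition and continue as soon as it holds. Selenium provides WebDriverWait for this purpose; the sketch below is a library-free stand-in that only illustrates the idea, and the name `wait_until` and its parameters are hypothetical.

```python
import time


def wait_until(predicate, timeout=10.0, poll_interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Example with a stand-in condition that becomes true on the third check.
attempts = []


def ready():
    attempts.append(1)
    return len(attempts) >= 3


wait_until(ready, timeout=5.0, poll_interval=0.01)
print(len(attempts))  # 3
```

In the spider, the predicate could be something like `lambda: self.driver.find_elements(By.XPATH, "(//tbody)[2]/tr")` (with `By` imported from `selenium.webdriver.common.by`), reusing the XPath from `parse`. Whether the presence of those rows is a reliable sign that the Fundamental data has finished loading is an assumption that would need checking against the page.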

OUTPUT (a portion of the total output):

{'Average Vol': '1.31M', 'Market Cap': '7.19B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '950.47K', 'Market Cap': '18.44B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '921.90K', 'Market Cap': '5.82B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '375.59K', 'Market Cap': '5.39B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '191.61K', 'Market Cap': '5.76B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '62.44K', 'Market Cap': '10.52B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '1.31M', 'Market Cap': '15.13B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '163.85K', 'Market Cap': '29.76B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '2.79M', 'Market Cap': '233.86B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '146.01K', 'Market Cap': '2.30B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '201.49K', 'Market Cap': '8.18B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '911.90K', 'Market Cap': '50.36B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '1.92M', 'Market Cap': '2.91B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '2.28M', 'Market Cap': '28.28B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '81.20M', 'Market Cap': '32.06B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '313.25K', 'Market Cap': '6.59B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '1.04M', 'Market Cap': '102.35B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '4.09M', 'Market Cap': '414.52B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '6.21K', 'Market Cap': '32.33B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '303.02K', 'Market Cap': '4.57B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '190.13K', 'Market Cap': '6.61B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '1.40M', 'Market Cap': '7.49B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '495.52K', 'Market Cap': '4.93B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '40.93K', 'Market Cap': '4.94B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '665.41K', 'Market Cap': '9.98B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '459.73K', 'Market Cap': '2.18B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '522.84K', 'Market Cap': '6.19B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '237.73K', 'Market Cap': '3.80B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '465.56K', 'Market Cap': '24.44B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '495.88K', 'Market Cap': '22.04B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '2.13M', 'Market Cap': '11.15B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '478.85K', 'Market Cap': '119.16B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '825.97K', 'Market Cap': '25.40B'}
2021-07-29 11:23:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.investing.com/indices/stoxx-600-components>
{'Average Vol': '371.43K', 'Market Cap': '54.56B'}
2021-07-29 11:23:02 [scrapy.core.engine] INFO: Closing spider (finished)
2021-07-29 11:23:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 326,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 134730,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 5.182406,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 7, 29, 5, 23, 2, 200527),
 'httpcompression/response_bytes': 911212,
 'httpcompression/response_count': 1,
 'item_scraped_count': 589
