
Puppeteer / Excel VBA: how to scrape table data from website

I have used Excel VBA & IE to access website financial data tables for 15 years, but this is now obsolete. So I am trying to use Puppeteer on a Raspberry Pi to dump the website data to disk, which Excel VBA on Win10 can then pick up and process. Getting the page data is easy enough, but how do I convert the page data, retrieved via page.content(), into a useful format? I'm very new to Puppeteer, HTML etc.

 const browser = await puppeteer.launch({headless: false});
 const page = await browser.newPage();
 await page.goto('https://www.morningstar.co.uk/uk');
 await page.setViewport({width: 1000, height: 1000});
 await page.goto(portfolioUrl); // portfolioUrl: placeholder for the portfolio page URL
 const fs = require('fs');
 fs.writeFileSync('ms.txt', await page.content());

The relevant data in ms.txt looks like this:

<table cellspacing="0" border="0" id="ctl00_ctl00_MainContent_PM_MainContent_gv_Portfolio" style="width:100%;border-collapse:collapse;">
    <tbody><tr class="gridHeader">
        <th class="gridHeaderText" scope="col"><img class="GridArrow" src="../../includes/images/arrow_asc_small.gif" align="middle" style="border-width:0px;"> <a href="javascript:__doPostBack('ctl00$ctl00$MainContent$PM_MainContent$gv_Portfolio','Sort$SecurityName')">Holding</a></th><th class="gridHeaderText" scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$MainContent$PM_MainContent$gv_Portfolio','Sort$StarRatingM255')">Morningstar<br>Rating</a></th><th class="gridHeaderNumeric" scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$MainContent$PM_MainContent$gv_Portfolio','Sort$ClosePrice')">Current<br>Price</a></th><th class="gridHeaderText" scope="col">&nbsp;</th><th class="gridHeaderNumeric" scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$MainContent$PM_MainContent$gv_Portfolio','Sort$ReturnD1')">Price<br>Change<br>%</a></th><th class="gridHeaderNumeric" scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$MainContent$PM_MainContent$gv_Portfolio','Sort$Weight')">Weight<br>%</a></th><th class="gridHeaderNumeric" scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$MainContent$PM_MainContent$gv_Portfolio','Sort$ClosePriceDate')">Date</a></th>

And the table elements:

</tr><tr class="gridItem">
<td class="msDataText" style="width:375px;"><a href="/uk/stockquicktake/default.aspx?id=0P0000M5OM">Burford Capital Ltd</a></td><td class="msDataText" style="width:95px;"><span>Not Rated</span></td><td title="09/12/2022" class="msDataNumeric" style="width:75px;">7.1300</td><td class="msDataText" style="width:20px;">GBP</td><td class="msDataNumeric" style="width:60px;">0.56</td><td class="msDataNumeric" style="width:70px;">0.0000</td><td class="msDataNumeric" style="width:80px;">09/12/2022</td>
</tr><tr class="gridAlternateItem">
<td class="msDataText" style="width:375px;"><a href="/uk/stockquicktake/default.aspx?id=0P00007YPZ">Centrica PLC</a></td><td class="msDataText" style="width:95px;"><img src="../../includes/images/5stars.gif" style="border-width:0px;"></td><td title="09/12/2022" class="msDataNumeric" style="width:75px;">0.9224</td><td class="msDataText" style="width:20px;">GBP</td><td class="msDataNumeric" style="color:Red;width:60px;">-0.30</td><td class="msDataNumeric" style="width:70px;">0.0000</td><td class="msDataNumeric" style="width:80px;">09/12/2022</td>
</tr><tr class="gridItem">

I have a slight preference for deciphering the web data in Puppeteer, but doing it in VBA is also fine if it is simpler. I do want it to be simple and reliable, and to work for another 15 years, so I want to avoid unusual 3rd party add-ins. Using Selenium seems complicated and over the top for my purposes, although I am not wedded to Puppeteer if there is a simpler method.

I agree that Puppeteer is simple and reliable.

This is the demo page [screenshot].

Find the selector

If you press F12 in Chrome, you can see the HTML source code.

Then select the Elements tab and click the element-selection (inspect) icon.

Then hover your mouse over the table. You can see the matching tag together with its class name, id string and XPath.

I will scrape the Interim Results and Income Statement tables using querySelectorAll with a selector.

First table

<div id="FinancialsInterimPrelimResultsHemscott" class="box financialsInterimPrelimResultsHemscott">
    <table class="right years2">
        <caption class="sectionHeader">Interim Results </caption>
        <colgroup>
...
        </colgroup>
        <thead>
            <tr>
                <th class="colDataPoint" id="MsStockReportFiprhDp" scope="col"></th>
                <th scope="col" class="number" id="MsStockReportFiprhY1">30/06/2021</th>
                <th scope="col" class="number" id="MsStockReportFiprhY2">30/06/2022</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <th headers="MsStockReportFiprhDp" scope="row">Turnover</th>
                <td class="number" headers="MsStockReportFiprhY1">64.23</td>
                <td class="number" headers="MsStockReportFiprhY2">65.41</td>
            </tr>
...
        </tbody>
    </table>
</div>

Its tag tree is

<div id="FinancialsInterimPrelimResultsHemscott">
  <table>
     <tbody>
        <tr>
          <td>64.23</td>

Combining the id with the tag hierarchy gives a selector that finds the 64.23 cell:

[id='FinancialsInterimPrelimResultsHemscott'] table tbody tr td

If you call

document.querySelectorAll("[id='FinancialsInterimPrelimResultsHemscott'] table tbody tr td")

it will return all of the table's data cells. That is the main idea.

This is the demo code

In this code, getData() takes two parameters: the first is the URL of the web page to load, and the second is the ID used to find the table.

const puppeteer = require("puppeteer");

// Return the trimmed text of every <td> in the table inside the element with the given id.
async function getData(url, table_id) {
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        await page.goto(url);

        // The callback runs in the page context; table_id is passed in as an argument.
        return await page.evaluate((table_id) => {
            const selector = `[id='${table_id}'] table tbody tr td`;
            return Array.from(document.querySelectorAll(selector),
                cell => cell.innerText.trim());
        }, table_id);
    } finally {
        // Close the browser even if navigation or evaluation fails.
        await browser.close();
    }
}

const url = 'https://tools.morningstar.co.uk/uk/stockreport/default.aspx?tab=10&vw=fs&SecurityToken=0P0000M5OM%5D3%5D0%5DE0WWE%24%24ALL&Id=0P0000M5OM&ClientFund=0&CurrencyId=BAS'

getData(url,'FinancialsInterimPrelimResultsHemscott')
    .then((titles) => {
        console.log("FinancialsInterimPrelimResultsHemscott");
        console.log(titles);
    })

getData(url,'FinancialsIncomeStatementSummaryHemscott')
    .then((titles) => {
        console.log("FinancialsIncomeStatementSummaryHemscott");
        console.log(titles);
    })

Running and result

The first result printed is the Income Statement and the second is the Interim Results. The order is flipped because the two getData() calls run concurrently, and asynchronous calls are not guaranteed to finish in the order they were started. That is why I log the table ID in the code.

$ node get-data.js
FinancialsIncomeStatementSummaryHemscott
[
  '319.11', '386.44', '326.42', '316.68',
  '-12.76', '-',      '-',      '-',
  '-',      '-',      '-',      '-',
  '-',      '-',      '-',      '-',
  '-',      '-',      '-',      '-',
  '249.18', '305.11', '194.21', '201.72',
  '-59.44', '249.30', '317.58', '180.80',
  '164.78', '-56.43', '249.30', '317.58',
  '180.80', '164.78', '-72.07', '229.46',
  '293.00', '152.37', '164.78', '-113.12',
  '1.20',   '1.50',   '0.82',   '0.75',
  '0.00'
]
FinancialsInterimPrelimResultsHemscott
[
  '64.23',  '65.41',
  '-20.41', '11.22',
  '-0.13',  '-0.10',
  '0.05',   '0.06'
]
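
The order of the two blocks above can vary between runs because both getData() calls are started at once. If a fixed order matters, one option is to collect the results with Promise.all, which keeps results in the order of its input regardless of which call finishes first. The following is only a sketch: fetchAll is a hypothetical helper, and fakeGetData is a stand-in for getData() so the sketch runs without a browser.

```javascript
// Hypothetical helper: run getData-style calls for several table ids and
// return [id, data] pairs in the same order as the ids were given.
async function fetchAll(getDataFn, url, tableIds) {
    const results = await Promise.all(tableIds.map(id => getDataFn(url, id)));
    return tableIds.map((id, i) => [id, results[i]]);
}

// Stand-in for getData(): resolves after a delay chosen so that the
// second call finishes before the first one.
function fakeGetData(url, tableId) {
    const delay = tableId === "first" ? 50 : 10;
    return new Promise(resolve =>
        setTimeout(() => resolve([tableId + "-data"]), delay));
}

fetchAll(fakeGetData, "http://example.invalid", ["first", "second"])
    .then(pairs => {
        // Pairs arrive in input order, so logging is deterministic.
        for (const [id, data] of pairs) {
            console.log(id, data);
        }
    });
```

With the real getData() passed in place of fakeGetData, each call would still launch its own browser, as in the demo code.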

Next steps

You may need the following items

#1 Which pages' tables you want to scrape - this is a hard question; I recommend Scrapy in Python.

#2 Add the table column and row titles - you can see this answer

#4 You may need to convert the table from JSON to CSV format for VBA - this will help

#5 You need to save a file with the table (or page) name and the table data - you will need to investigate or search for this.
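
As a first step towards item #2, the flat list of cells returned by the first demo can be regrouped into rows. This sketch assumes every row of the table has the same number of data cells (two for Interim Results, whose header shows two date columns); chunk is a hypothetical helper, not part of Puppeteer.

```javascript
// Hypothetical helper: split a flat array into consecutive rows of `size` cells.
function chunk(cells, size) {
    const rows = [];
    for (let i = 0; i < cells.length; i += size) {
        rows.push(cells.slice(i, i + size));
    }
    return rows;
}

// Interim Results cells copied from the first demo's output (two date columns).
const interim = ['64.23', '65.41', '-20.41', '11.22', '-0.13', '-0.10', '0.05', '0.06'];
console.log(chunk(interim, 2));
```

Row and column titles would still have to be scraped separately, for example from the th cells, and attached to these rows.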

  • Note - how to copy the HTML to the clipboard and then paste it into your code editor [screenshot]

Real demo with the questioner's HTML data.

Code

const puppeteer = require("puppeteer");

// Return the header row and the data rows as an array of arrays of cell text.
async function getData(url, header_class, row_class) {
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        await page.goto(url);

        return await page.evaluate((header_class, row_class) => {
            // table header
            // <tr class="gridHeader">
            // 'Holding\tMorningstar\nRating\tCurrent\nPrice\t \tPrice\nChange\n%\tWeight\n%\tDate'
            const headerSelector = `tr[class='${header_class}']`;
            const headers = Array.from(document.querySelectorAll(headerSelector), row => row.innerText.trim());

            // table rows data (matches rows whose class attribute is exactly row_class)
            // <tr class="gridItem">
            //   '3i Infrastructure Ord\tNot Rated\t3.2900\tGBP\t-0.30\t0.0000\t14/12/2022',
            const rowSelector = `tr[class='${row_class}']`;
            const rows = Array.from(document.querySelectorAll(rowSelector), row => row.innerText.trim());

            // put the header row first, then split every row into cells on tabs
            return headers.concat(rows).map(row => row.split('\t'));
        }, header_class, row_class);
    } finally {
        // Close the browser even if navigation or evaluation fails.
        await browser.close();
    }
}

// change this to your target URL
const url = 'http://127.0.0.1:5500/MS-Portfolio.html'

getData(url,'gridHeader', 'gridItem')
    .then((rows) => {
        console.log(rows);
        console.log(rows.length);
    })
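
One thing worth checking: tr[class='gridItem'] matches only rows whose class is exactly gridItem. In the HTML excerpt from the question the rows alternate between gridItem and gridAlternateItem (Centrica PLC sits in an alternate row), so a selector list covering both classes may be needed. A minimal sketch follows; buildRowSelector is a hypothetical helper, and passing its result to document.querySelectorAll inside page.evaluate is an assumption about how you would wire it in.

```javascript
// Hypothetical helper: build a selector list that matches <tr> elements
// carrying any of the given class names.
function buildRowSelector(classNames) {
    return classNames.map(name => `tr.${name}`).join(", ");
}

console.log(buildRowSelector(["gridItem", "gridAlternateItem"]));
```

querySelectorAll returns matches for a selector list in document order, so the rows would keep the order they have on the page.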

Result

$ node get-table.js
[
  [
    'Holding',
    'Morningstar\nRating',
    'Current\nPrice',
    ' ',
    'Price\nChange\n%',
    'Weight\n%',
    'Date'
  ],
  [
    '3i Infrastructure Ord',
    'Not Rated',
    '3.2900',
    'GBP',
    '-0.30',
    '0.0000',
    '14/12/2022'
  ],
  [
    'abrdn Asia Pacific Equity I Acc',
    '',
    '3.5291',
    'GBP',
    '0.28',
    '16.6667',
    '14/12/2022'
  ],
  [
    'Amati AIM VCT Ord',
    'Not Rated',
    '1.2650',
    'GBP',
    '0.00',
    '0.0000',
    '14/12/2022'
  ],
  [ 'Aviva PLC', '', '4.4740', 'GBP', '0.16', '0.0000', '14/12/2022' ],
  [
    'Baronsmead Second Venture Trust Ord',
    'Not Rated',
    '0.6300',
    'GBP',
    '0.00',
    '0.0000',
    '14/12/2022'
  ],
  [ 'Basf SE', '', '47.2300', 'EUR', '-0.63', '0.0000', '14/12/2022' ],
  [
    'BH Macro USD Ord',
    '',
    '46.7000',
    'USD',
    '-0.29',
    '0.0000',
    '14/12/2022'
  ],
  [
    'BlackRock World Mining Trust Ord',
    '',
    '6.8100',
    'GBP',
    '-1.16',
    '0.0000',
    '14/12/2022'
  ],
  [
    'British American Tobacco PLC',
    '',
    '32.6800',
    'GBP',
    '0.48',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Burford Capital Ltd',
    'Not Rated',
    '7.0400',
    'GBP',
    '-2.43',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Diageo PLC',
    '',
    '37.6450',
    'GBP',
    '0.13',
    '0.0000',
    '14/12/2022'
  ],
  [
    'E.ON SE',
    'Not Rated',
    '9.2860',
    'EUR',
    '1.38',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Fidelity Special Values Ord',
    'Not Rated',
    '2.7200',
    'GBP',
    '-0.55',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Gaming Realms PLC',
    'Not Rated',
    '0.2580',
    'GBP',
    '1.47',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Gold Bullion Securities',
    'Not Rated',
    '167.8900',
    'USD',
    '-0.09',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Haleon PLC',
    'Not Rated',
    '3.2110',
    'GBP',
    '0.60',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Henderson Diversified Income Ord',
    '',
    '0.6820',
    'GBP',
    '-0.87',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Herald Ord',
    '',
    '17.7600',
    'GBP',
    '-0.78',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Hotel Chocolat Group PLC',
    'Not Rated',
    '1.4000',
    'GBP',
    '-1.41',
    '0.0000',
    '14/12/2022'
  ],
  [
    'HSBC S&P 500 ETF GBP',
    '',
    '32.9555',
    'GBP',
    '-0.67',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Imperial Brands PLC',
    '',
    '20.3400',
    'GBP',
    '0.69',
    '0.0000',
    '14/12/2022'
  ],
  [
    'IP Group PLC',
    'Not Rated',
    '0.5995',
    'GBP',
    '-1.32',
    '0.0000',
    '14/12/2022'
  ],
  [
    'iShares $ Treasury Bd 20+y ETF USD Di...',
    '',
    '3.2425',
    'GBP',
    '0.42',
    '16.6667',
    '14/12/2022'
  ],
  [
    'iShares China Large Cap ETF USD Dist GBP',
    '',
    '66.2450',
    'GBP',
    '0.91',
    '0.0000',
    '14/12/2022'
  ],
  [
    'iShares Core FTSE 100 ETF GBP Dist',
    '',
    '7.3510',
    'GBP',
    '-0.09',
    '0.0000',
    '14/12/2022'
  ],
  [
    'iShares MSCI Brazil ETF',
    '',
    '26.4700',
    'USD',
    '-0.19',
    '0.0000',
    '14/12/2022'
  ],
  [
    'iShares MSCI EM ETF USD Dist GBP',
    '',
    '30.4413',
    'GBP',
    '0.80',
    '0.0000',
    '14/12/2022'
  ],
  [
    'iShares Physical Gold ETC',
    'Not Rated',
    '35.2850',
    'USD',
    '-0.27',
    '0.0000',
    '14/12/2022'
  ],
  [
    'ITV PLC',
    'Not Rated',
    '0.7668',
    'GBP',
    '0.08',
    '0.0000',
    '14/12/2022'
  ],
  [
    'JPMorgan Indian Ord',
    '',
    '8.3600',
    'GBP',
    '-0.24',
    '0.0000',
    '14/12/2022'
  ],
  [
    'MannKind Corp',
    'Not Rated',
    '4.9500',
    'USD',
    '1.15',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Merck & Co Inc',
    '',
    '111.5500',
    'USD',
    '1.16',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Mobeus Income & Growth 4 VCT Ord',
    'Not Rated',
    '0.7650',
    'GBP',
    '0.00',
    '0.0000',
    '14/12/2022'
  ],
  [
    'National Grid PLC',
    '',
    '10.2750',
    'GBP',
    '1.18',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Pantheon Infrastructure Ord',
    'Not Rated',
    '0.9440',
    'GBP',
    '0.85',
    '16.6667',
    '14/12/2022'
  ],
  [
    'PepsiCo Inc',
    '',
    '183.3600',
    'USD',
    '-0.35',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Picton Property Income Ltd',
    'Not Rated',
    '0.8230',
    'GBP',
    '-0.60',
    '0.0000',
    '14/12/2022'
  ],
  [
    'RWE AG Class A',
    '',
    '42.7300',
    'EUR',
    '1.05',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Samsung Electronics Co Ltd GDR',
    '',
    '1,163.0000',
    'USD',
    '-0.29',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Schroder UK Mid Cap Ord',
    '',
    '5.4500',
    'GBP',
    '0.18',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Shell PLC',
    '',
    '22.9600',
    'GBP',
    '-0.95',
    '0.0000',
    '14/12/2022'
  ],
  [ 'SSE PLC', '', '17.2550', 'GBP', '1.00', '0.0000', '14/12/2022' ],
  [
    'Standard Chartered PLC',
    '',
    '6.0660',
    'GBP',
    '-0.23',
    '0.0000',
    '14/12/2022'
  ],
  [
    'The Income & Growth VCT Ord',
    'Not Rated',
    '0.7600',
    'GBP',
    '0.00',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Tritax Big Box Ord',
    'Not Rated',
    '1.5180',
    'GBP',
    '1.81',
    '0.0000',
    '14/12/2022'
  ],
  [
    'UBS(Lux)FS MSCI EMU GBPH Adis',
    'Not Rated',
    '11.1290',
    'GBP',
    '-0.13',
    '0.0000',
    '14/12/2022'
  ],
  [
    'United Utilities Group PLC',
    '',
    '10.3650',
    'GBP',
    '0.24',
    '0.0000',
    '14/12/2022'
  ],
  [
    'VanEck JPM EM LC Bd ETF A USD',
    '',
    '53.7600',
    'USD',
    '-0.18',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Vanguard FTSE 100 UCITS ETF',
    '',
    '32.8900',
    'GBP',
    '-0.09',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Vanguard S&P 500 ETF USD Acc GBP',
    '',
    '60.4600',
    'GBP',
    '-0.67',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Viatris Inc',
    'Not Rated',
    '11.3200',
    'USD',
    '-0.25',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Wells Fargo & Co',
    '',
    '42.1800',
    'USD',
    '-1.11',
    '0.0000',
    '14/12/2022'
  ]
]
53
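
To hand this array of rows to Excel VBA, one option is to write it to disk as CSV. The sketch below is an assumption about a suitable format, not part of the original demo: toCsvField, toCsv and saveCsv are hypothetical helpers. It collapses the line breaks inside header cells such as 'Morningstar\nRating' and quotes fields that contain a comma, such as '1,163.0000', so they are not split into two columns.

```javascript
const fs = require("fs");

// Convert one cell to a CSV field: collapse line breaks and repeated
// whitespace, then quote the field if it contains a comma or a quote.
function toCsvField(value) {
    const text = String(value).replace(/\s+/g, " ").trim();
    return /[",]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Convert an array of rows (each an array of cells) to CSV text
// with Windows line endings.
function toCsv(rows) {
    return rows.map(row => row.map(toCsvField).join(",")).join("\r\n") + "\r\n";
}

// Hypothetical helper: write the rows to a file that Excel can open.
function saveCsv(path, rows) {
    fs.writeFileSync(path, toCsv(rows), "utf8");
}

// Sample rows copied from the result above.
const sample = [
    ["Holding", "Morningstar\nRating", "Current\nPrice", " ", "Price\nChange\n%", "Weight\n%", "Date"],
    ["Samsung Electronics Co Ltd GDR", "", "1,163.0000", "USD", "-0.29", "0.0000", "14/12/2022"],
];
console.log(toCsv(sample));
```

In the demo, saveCsv could be called from the .then() handler in place of console.log(rows); the file name is up to you.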
