简体   繁体   中英

Scraping data from website using vba

我试图从网站上抓取数据: http : //uk.investing.com/rates-bonds/financial-futures通过 vba,比如实时价格,即德国 5 年波布尔,美国 30 年国债,我尝试过 excel网页查询但它只抓取整个网站,但我只想抓取速率,有没有办法做到这一点?

There are several ways of doing this. This is an answer that I write hoping that all the basics of Internet Explorer automation will be found when browsing for the keywords "scraping data from website", but remember that nothing's worth as your own research (if you don't want to stick to pre-written codes that you're not able to customize).

Please note that this is one way , that I don't prefer in terms of performance (since it depends on the browser speed) but that is good to understand the rationale behind Internet automation.

1) If I need to browse the web, I need a browser! So I create an Internet Explorer browser:

Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")

2) I ask the browser to browse the target webpage. Through the use of the property ".Visible", I decide if I want to see the browser doing its job or not. When building the code is nice to have Visible = True , but when the code is working for scraping data is nice not to see it everytime so Visible = False .

With appIE
    .Navigate "http://uk.investing.com/rates-bonds/financial-futures"
    .Visible = True
End With

3) The webpage will need some time to load. So, I will wait meanwhile it's busy...

Do While appIE.Busy
    DoEvents
Loop

4) Well, now the page is loaded. Let's say that I want to scrape the change of the US30Y T-Bond: What I will do is just clicking F12 on Internet Explorer to see the webpage's code, and hence using the pointer (in red circle) I will click on the element that I want to scrape to see how can I reach my purpose.

在此处输入图片说明

5) What I should do is straight-forward. First of all, I will get by the ID property the tr element which is containing the value:

Set allRowOfData = appIE.document.getElementById("pair_8907")

Here I will get a collection of td elements (specifically, tr is a row of data, and the td are its cells. We are looking for the 8th, so I will write:

Dim myValue As String: myValue = allRowOfData.Cells(7).innerHTML

Why did I write 7 instead of 8? Because the collections of cells starts from 0, so the index of the 8th element is 7 (8-1). Shortly analysing this line of code:

  • .Cells() makes me access the td elements;
  • innerHTML is the property of the cell containing the value we look for.

Once we have our value, which is now stored into the myValue variable, we can just close the IE browser and releasing the memory by setting it to Nothing:

appIE.Quit
Set appIE = Nothing

Well, now you have your value and you can do whatever you want with it: put it into a cell ( Range("A1").Value = myValue ), or into a label of a form ( Me.label1.Text = myValue ).

I'd just like to point you out that this is not how StackOverflow works: here you post questions about specific coding problems, but you should make your own search first. The reason why I'm answering a question which is not showing too much research effort is just that I see it asked several times and, back to the time when I learned how to do this, I remember that I would have liked having some better support to get started with. So I hope that this answer, which is just a "study input" and not at all the best/most complete solution, can be a support for next user having your same problem. Because I have learned how to program thanks to this community, and I like to think that you and other beginners might use my input to discover the beautiful world of programming.

Enjoy your practice ;)

Other methods were mentioned so let us please acknowledge that, at the time of writing, we are in the 21st century. Let's park the local bus browser opening, and fly with an XMLHTTP GET request (XHR GET for short).

Wiki moment:

XHR is an API in the form of an object whose methods transfer data between a web browser and a web server. The object is provided by the browser's JavaScript environment

It's a fast method for retrieving data that doesn't require opening a browser. The server response can be read into an HTMLDocument and the process of grabbing the table continued from there.

Note that javascript rendered/dynamically added content will not be retrieved as there is no javascript engine running (which there is in a browser).

In the below code, the table is grabbed by its id cr1 .

桌子

In the helper sub, WriteTable , we loop the columns ( td tags) and then the table rows ( tr tags), and finally traverse the length of each table row, table cell by table cell. As we only want data from columns 1 and 8, a Select Case statement is used specify what is written out to the sheet.


Sample webpage view:

示例页面视图


Sample code output:

代码输出


VBA:

Option Explicit
Public Sub GetRates()
    Dim html As HTMLDocument, hTable As HTMLTable '<== Tools > References > Microsoft HTML Object Library
    
    Set html = New HTMLDocument
      
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://uk.investing.com/rates-bonds/financial-futures", False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT" 'to deal with potential caching
        .send
        html.body.innerHTML = .responseText
    End With
    
    Application.ScreenUpdating = False
    
    Set hTable = html.getElementById("cr1")
    WriteTable hTable, 1, ThisWorkbook.Worksheets("Sheet1")
    
    Application.ScreenUpdating = True
End Sub

Public Sub WriteTable(ByVal hTable As HTMLTable, Optional ByVal startRow As Long = 1, Optional ByVal ws As Worksheet)
    Dim tSection As Object, tRow As Object, tCell As Object, tr As Object, td As Object, r As Long, C As Long, tBody As Object
    r = startRow: If ws Is Nothing Then Set ws = ActiveSheet
    With ws
        Dim headers As Object, header As Object, columnCounter As Long
        Set headers = hTable.getElementsByTagName("th")
        For Each header In headers
            columnCounter = columnCounter + 1
            Select Case columnCounter
            Case 2
                .Cells(startRow, 1) = header.innerText
            Case 8
                .Cells(startRow, 2) = header.innerText
            End Select
        Next header
        startRow = startRow + 1
        Set tBody = hTable.getElementsByTagName("tbody")
        For Each tSection In tBody
            Set tRow = tSection.getElementsByTagName("tr")
            For Each tr In tRow
                r = r + 1
                Set tCell = tr.getElementsByTagName("td")
                C = 1
                For Each td In tCell
                    Select Case C
                    Case 2
                        .Cells(r, 1).Value = td.innerText
                    Case 8
                        .Cells(r, 2).Value = td.innerText
                    End Select
                    C = C + 1
                Next td
            Next tr
        Next tSection
    End With
End Sub

您可以使用 winhttprequest 对象而不是 Internet Explorer,因为加载不包括图片和广告的数据是很好的,而不是下载包括广告和图片在内的完整网页,这些图片使 Internet Explorer 对象与 winhttpRequest 对象相比很重。

This question asked long before. But I thought following information will useful for newbies. Actually you can easily get the values from class name like this.

Sub ExtractLastValue()

Set objIE = CreateObject("InternetExplorer.Application")

objIE.Top = 0
objIE.Left = 0
objIE.Width = 800
objIE.Height = 600

objIE.Visible = True

objIE.Navigate ("https://uk.investing.com/rates-bonds/financial-futures/")

Do
DoEvents
Loop Until objIE.readystate = 4

MsgBox objIE.document.getElementsByClassName("pid-8907-last")(0).innerText

End Sub

And if you are new to web scraping please read this blog post.

Web Scraping - Basics

And also there are various techniques to extract data from web pages. This article explain few of them with examples.

Web Scraping - Collecting Data From a Webpage

I modified some thing that were poping up error for me and end up with this which worked great to extract the data as I needed:

Sub get_data_web()

Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")

With appIE
    .navigate "https://finance.yahoo.com/quote/NQ%3DF/futures?p=NQ%3DF"
    .Visible = True
End With

Do While appIE.Busy
    DoEvents
Loop

Set allRowofData = appIE.document.getElementsByClassName("Ta(end) BdT Bdc($c-fuji-grey-c) H(36px)")

Dim i As Long
Dim myValue As String

Count = 1

    For Each itm In allRowofData

        For i = 0 To 4

        myValue = itm.Cells(i).innerText
        ActiveSheet.Cells(Count, i + 1).Value = myValue

        Next

        Count = Count + 1

    Next

appIE.Quit
Set appIE = Nothing


End Sub

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM