
I'm having trouble scraping this

I'm trying to understand why my element references aren't working when I scrape this data.

Here is the site as an example:
http://quote.morningstar.ca/Quicktakes/Financials/is.aspx?t=GNTX&region=USA&culture=en-CA&ops=clear

And as a target:

<div id="data_i6" class="rf_crow"><div id="Y_1" class="pos column6Width_noChart116px" style="overflow:hidden;white-space: nowrap;" rawvalue="741131269">741</div><div id="Y_2" class="pos column6Width_noChart116px" style="overflow:hidden;white-space: nowrap;" rawvalue="836611464">837</div><div id="Y_3" class="pos column6Width_noChart116px" style="overflow:hidden;white-space: nowrap;" rawvalue="939841654">940</div><div id="Y_4" class="pos column6Width_noChart116px" style="overflow:hidden;white-space: nowrap;" rawvalue="1010472512">1,010</div><div id="Y_5" class="pos column6Width_noChart116px" style="overflow:hidden;white-space: nowrap;" rawvalue="1100344312">1,100</div><div id="Y_6" class="pos column6Width_noChart116px" style="overflow:hidden;white-space: nowrap;" rawvalue="1115401551">1,115</div></div>

What I need to extract is the actual value in rawvalue="741131269" and the following is what I've gotten to work so far.

'Cells(1, 1) = Document.getElementsByClassName("rf_crow") 'returns the rows of data into one cell
'Cells(1, 1) = Document.getElementById("Y_1").innerText 'returns the text for the year
'Cells(1, 1) = Document.getElementById("data_i1").innerText 'returns the first row of data

I know the above doesn't return what I want, because the comment tells you what it extracts into Excel. The sub-element doesn't seem to work as it does in other macros I've built. I thought something like this would work:

Cells(1, 1) = Document.getElementById("Y_1").getAttribute("rawvalue")

but that doesn't work. I also tried:

Cells(1, 1) = Document.getElementById("data_i6").getElementById("Y_1").innertext

and that doesn't work either.

The solution is very easy. Just read the value from its attribute, which is rawvalue.

This is how you can go:

Using a hardcoded delay and a For loop to check for the desired value:

' Requires references to Microsoft Internet Controls and Microsoft HTML Object Library
Sub GetValue()
    Dim IE As New InternetExplorer, HTML As HTMLDocument, post As Object, elem As Object

    With IE
        .Visible = True
        .Navigate "http://quote.morningstar.ca/Quicktakes/Financials/is.aspx?t=GNTX&region=USA&culture=en-CA&ops=clear"
        While .Busy = True Or .ReadyState < 4: DoEvents: Wend
        Set HTML = .Document
    End With

    ''using hardcoded delay
    Application.Wait Now + TimeValue("00:00:05")

    For Each elem In HTML.getElementsByTagName("div")
        If elem.innerText = "741" Then MsgBox elem.getAttribute("rawvalue"): Exit For
    Next elem
End Sub

Using an explicit wait:

Sub GetValue()
    Dim IE As New InternetExplorer, HTML As HTMLDocument, post As Object

    With IE
        .Visible = True
        .Navigate "http://quote.morningstar.ca/Quicktakes/Financials/is.aspx?t=GNTX&region=USA&culture=en-CA&ops=clear"
        While .Busy = True Or .ReadyState < 4: DoEvents: Wend
        Set HTML = .Document
    End With

    Do: Set post = HTML.querySelector("#data_i6 #Y_1"): DoEvents: Loop While post Is Nothing

    MsgBox post.getAttribute("rawvalue")
End Sub

Output at this moment:

741131269
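The explicit wait above loops forever if the element never appears. As a minimal sketch of one way to bound it, the function below polls querySelector until a deadline passes. WaitForElement, its parameters, and the 10-second default are hypothetical names and values chosen for illustration; they are not part of the original answer.

```vba
' Hypothetical helper: returns the first element matching cssSelector,
' or Nothing if no match appears within timeoutSeconds.
Public Function WaitForElement(ByVal doc As Object, ByVal cssSelector As String, _
                               Optional ByVal timeoutSeconds As Long = 10) As Object
    Dim deadline As Date, found As Object

    deadline = DateAdd("s", timeoutSeconds, Now)
    Do
        DoEvents                                  ' keep Excel responsive while polling
        Set found = doc.querySelector(cssSelector)
        If Not found Is Nothing Then Exit Do      ' element has been rendered
    Loop While Now < deadline

    Set WaitForElement = found
End Function
```

It could replace the Do ... Loop line with something like Set post = WaitForElement(HTML, "#data_i6 #Y_1"), followed by a check for post Is Nothing before calling getAttribute.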

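Try declaring objCollection, objElement and objCell as Object and strValue as String, and in the first line of the code below, replace YourHTTPRequest with the name of the HTTP request object you declared:

Document.body.innerHTML = YourHTTPRequest.responseText
Set objCollection = Document.getElementsByClassName("rf_crow")

For Each objElement In objCollection
    ' Each rf_crow element is a whole row (e.g. data_i6); the Y_n cells are its children
    If objElement.ID = "data_i6" Then
        For Each objCell In objElement.Children
            If objCell.ID = "Y_1" Then
                strValue = objCell.getAttribute("rawvalue")
                Exit For
            End If
        Next objCell
        Exit For
    End If
Next objElement

Cells(1, 1) = strValue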

The following should illuminate some of the problems you were having.


.querySelectorAll

The exact element you mention is at index 2 of the nodeList returned by the .querySelectorAll method of .document when using the CSS selector #Y_1. The # means Id.

From that webpage it returns the following (sample shown, not all):

[Image: CSS query]

From the above you can see the string you want is at index 2 of the result.

querySelectorAll with Id? Isn't Id a unique identifier for a single element?

This Id, unexpectedly, is not unique to a single element on the page. It occurs a whopping 27 times:

[Image: ID]

This means you can use the .querySelectorAll method to return a nodeList of all matching items and take the item at index 2 to get your result.

Note:

If you want the long number next to rawvalue, 741131269, then parse the outerHTML of the returned element.

Debug.Print Replace(Split(Split(a.item(2).outerHTML, "rawvalue=")(1), ">")(0), Chr$(34), vbNullString)

.querySelector

Alternatively, you can target the row-specific Id, data_i6, with

.document.querySelector("#data_i6")

This CSS selector (#data_i6) returns the entire row, which contains every year. With .querySelector you only get the first match in any case, which here is year 1.

[Image: Ids for all years]

You can be more specific with the CSS selector and add the additional year Id to get just the year of interest:

#data_i6 #Y_1

[Image: year of interest]


Code: (querySelector method commented out next to querySelectorAll)

Option Explicit
Public Sub Get_Information()
    Dim IE As New InternetExplorer

    With IE
        .Visible = True
        .navigate "http://quote.morningstar.ca/Quicktakes/Financials/is.aspx?t=GNTX&region=USA&culture=en-CA&ops=clear"
        While .Busy = True Or .readyState < 4: DoEvents: Wend

        Dim a As Object, exitTime As Date
        exitTime = Now + TimeSerial(0, 0, 2)

        Do
            DoEvents
            On Error Resume Next
            Set a = .document.querySelectorAll("#Y_1") ' .document.querySelector("#data_i6 #Y_1")
            On Error GoTo 0
            If Now > exitTime Then Exit Do
        Loop While a Is Nothing

        If a Is Nothing Then Exit Sub

        Debug.Print Split(Split(a.item(2).innerText, "rawvalue=")(0), ">")(0) 'Split(Split(a.innerText, "rawvalue=")(0), ">")(0)
        Debug.Print Replace(Split(Split(a.item(2).outerHTML, "rawvalue=")(1), ">")(0), Chr$(34), vbNullString) 'Replace(Split(Split(a.outerHTML, "rawvalue=")(1), ">")(0), Chr$(34), vbNullString)

        .Quit
    End With
End Sub
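The same selectors can also pull every year in the row at once. The following is only a sketch under a few assumptions: the row markup still matches the target shown in the question (six div children inside #data_i6, each carrying a rawvalue attribute), a reference to Microsoft Internet Controls is set, and the results should go into row 1 of the active sheet. The name Get_Row_RawValues and the 5-second limit are hypothetical.

```vba
Option Explicit

' Hypothetical example, not part of the original answer.
' Requires a reference to Microsoft Internet Controls.
Public Sub Get_Row_RawValues()
    Dim IE As New InternetExplorer
    Dim nodes As Object, i As Long, exitTime As Date

    With IE
        .Visible = True
        .navigate "http://quote.morningstar.ca/Quicktakes/Financials/is.aspx?t=GNTX&region=USA&culture=en-CA&ops=clear"
        While .Busy = True Or .readyState < 4: DoEvents: Wend

        ' Poll until the script-rendered row appears, giving up after 5 seconds
        exitTime = Now + TimeSerial(0, 0, 5)
        Do
            DoEvents
            Set nodes = .document.querySelectorAll("#data_i6 div")
            If Now > exitTime Then Exit Do
        Loop While nodes.Length = 0

        ' One rawvalue per column: Y_1 in column A, Y_2 in column B, and so on
        For i = 0 To nodes.Length - 1
            ActiveSheet.Cells(1, i + 1).Value = nodes.item(i).getAttribute("rawvalue")
        Next i

        .Quit
    End With
End Sub
```

Reading the attribute with getAttribute, as the first answer does, avoids splitting outerHTML by hand. Scoping the selector to #data_i6 sidesteps the duplicated Y_1 ids elsewhere on the page.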

Does this work for you?

Sub web_table_option_two()
    Dim HTMLDoc As New HTMLDocument
    Dim objTable As Object
    Dim lRow As Long
    Dim lngTable As Long
    Dim lngRow As Long
    Dim lngCol As Long
    Dim ActRw As Long
    Dim objIE As InternetExplorer
    Set objIE = New InternetExplorer
    objIE.Navigate "http://quote.morningstar.ca/Quicktakes/Financials/is.aspx?t=GNTX&region=USA&culture=en-CA&ops=clear"

    Do Until objIE.ReadyState = 4 And Not objIE.Busy
        DoEvents
    Loop
    Application.Wait (Now + TimeValue("0:00:03")) 'wait for the page's JavaScript to finish rendering
    HTMLDoc.body.innerHTML = objIE.Document.body.innerHTML
    With HTMLDoc.body
        Set objTable = .getElementsByTagName("table")
        For lngTable = 0 To objTable.Length - 1
            For lngRow = 0 To objTable(lngTable).Rows.Length - 1
                For lngCol = 0 To objTable(lngTable).Rows(lngRow).Cells.Length - 1
                    ThisWorkbook.Sheets("Sheet1").Cells(ActRw + lngRow + 1, lngCol + 1) = objTable(lngTable).Rows(lngRow).Cells(lngCol).innerText
                Next lngCol
            Next lngRow
            ActRw = ActRw + objTable(lngTable).Rows.Length + 1
        Next lngTable
    End With
    objIE.Quit
End Sub

