简体   繁体   中英

How to use Excel macro scrape the first column of a Table class nested within Div id

Ok, this is the target webpage(s): http://dnd.arkalseif.info/items/index.html_page=27

Here's my current code:

Sub GetItemsList()
' This macro uses manually entered links to scrap the content of the target page.
' It does not (yet) capture hyperlinks, it only grabs text.
Dim ie As Object
Dim retStr As String
Dim sht As Worksheet
Dim LastRow As Long
Dim rCell As Range
Dim rRng As Range
Dim Count As Long
Dim Status As String
Dim BadCount As Long


Set sht = ThisWorkbook.Worksheets("List")
BadCount = 0

LastRow = sht.Cells(sht.Rows.Count, "A").End(xlUp).Row
    Set ie = CreateObject("internetexplorer.application")
    Set rRng = sht.Range("b1:b" & LastRow)
    Status = "Starting at row "
    For Each rCell In rRng.Cells
        Count = rCell.Row
        Application.StatusBar = BadCount & " dead links so far. " & Status & Count & "of " & LastRow & "."
        Wait 1
        If rCell = "" Then
            With ie
                .Navigate rCell.Offset(0, -1).Value
                .Visible = False
            End With
            Do While ie.Busy
                DoEvents
            Loop
            Wait 1

            On Error GoTo ErrHandler
'            rCell.Value = ie.Document.getElementById("content").innerText
            rCell.Value = ie.Document.getElementsByClassName("common").innerText
            rCell.WrapText = False
            Status = "This row successfully scraped. Moving on to row "
            Application.StatusBar = BadCount & " dead links so far. " & Status & Count + 1 & "of " & LastRow & "."
            Status = "Previous row succeded. Now at row "
98            Wait 1
        End If
    Next rCell
    If BadCount > 0 Then
        Application.StatusBar = "Macro finshed running with " & BadCount & " errors."
        Else
        Application.StatusBar = "Finished."
    End If
    Exit Sub
ErrHandler:
    rCell.Value = ""
    Status = "Previous row failed. Moving on to row "
    BadCount = BadCount + 1
    Application.StatusBar = "This row is a dead link. " & BadCount & " dead links so far. Moving on to row " & Count + 1 & "of " & LastRow & "."
    Resume 98
End Sub

(try to ignore all my StatusBar updates, this code was originally meant for a looooong list of hyperlinks, and I needed (at the time) to know when things buggered up)

Now, the commented out line works, in that it grabs the entire body of text from the div id Content . But I want to grab the hyperlinks nestled inside the first column of the table which is nested inside the div id (which is what the following line was for). But it just fails. Excel does nothing, treats it like an error, and proceeds to the next link.

I presume that I need to tell Excel to look for the Table class inside the Div id . But I don't know how to do that, and I haven't been able to figure it out.

Thanks everyone.

I would use CSS selectors to target the links and XMLHTTP as a faster retrieval method than launching a browser.


CSS selectors:

The following:

td:first-child [href]

The td:first-child is a :first-child CSS pseudo-class selector of td tagged element; " " is a descendant combinator selector , the [] is an attribute selector . Basically, it selects for the first td element in each row in this case ie the first column, and then to the href attribute element within.

The :first-child CSS pseudo-class represents the first element among a group of sibling elements.

Sadly VBA implementation doesn't support the :not selector as the exact elements could also be matched with .common tr + tr td :not([href*='rule'],br) . Support for pseudo selectors is very limited. In this case using a :nth-child() CSS pseudo-class selector of td:nth-child(1) would have retrieved specific items if supported in descendant combination as td:nth-child(1) [href] . I keep meaning to do a write up on what is and isn't supported in case anyone wants as a reference. It is useful to be aware of even non VBA supported methods in case you then chose to switch to a language that does support.

The selector is applied via the querySelectorAll method of, in this case, HTMLDocument . It returns all matches as a nodeList whose .Length can be traversed to access individual matched elements by index.

nodeList items:

在此处输入图片说明


Option Explicit
Public Sub GetLinks()
    Dim sResponse As String, html As HTMLDocument, nodeList As Object, i As Long
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "http://dnd.arkalseif.info/items/index.html_page=27", False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With
    Set html = New HTMLDocument
    With html
        .body.innerHTML = sResponse
        Set nodeList = .querySelectorAll("td:first-child [href]")
        For i = 0 To nodeList.Length - 1
            Debug.Print Replace$(nodeList.item(i), "about:", "http://dnd.arkalseif.info/items/")
        Next
    End With
End Sub

References (VBE > Tools > References):

  1. Microsoft HTML Object Library

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM