简体   繁体   中英

#N/A value when scraping HTML with Excel VBA

I'm having an issue with extracting HTML-code from .HTM game reports. As you can see I'm splitting the HTML file using Split() and it usually creates around 60,000 rows for this particular play-by-play report.

Now, for 99 games out of 100 this works perfectly fine, but every now and then (such as with this exact GameID) it for whatever reason starts putting out #N/A after just a few thousand rows - very early into the document.

I've checked the .HTM in question and it looks just like any other document that I've extracted.

Here's the code snippet:

Dim ie As Object, doc As HTMLDocument
Dim Y1 As String, Y2 As String, GameID As String
Dim SourceCode As Worksheet, c As Range

Set ie = New InternetExplorer
Set SourceCode = Worksheets("Source Code")    
Y1 = "2017"
Y2 = "2018"
GameID = "0003"

ie.navigate "http://www.nhl.com/scores/htmlreports/" & Y1 & Y2 & "/PL02" & GameID & ".HTM"

Do
    DoEvents
Loop Until ie.readyState = READYSTATE_COMPLETE

Set doc = ie.document

arr = Split(doc.DocumentElement.innerHTML, vbLf)

SourceCode.Range("A1").Resize(UBound(arr) + 1, 1).Value = Application.Transpose(arr)

And this is what happens (occasionally):

NHL Game Play By Play HTML Extract

Does anyone have a suggestion for how to error handle this? I haven't got the slightest clue what could be causing it.

There's an undocumented limit on the Transpose function at 65,536 items. There's some discussion here on MSDN and you can google 'transpose vba limit' and see some other threads.

So your files are coming in at around 60,000 rows and some may go over the 65,536 limit and it looks like you hit that limit with that GameID.

One workaround, is to create a second array eg

Dim arr2() As String <-- we will make this a 2d array and fill it from arr

'... later in the code

arr = Split(doc.DocumentElement.innerHTML, vbLf) '<-- get the array with your original code
ReDim Preserve arr2(1 To UBound(arr), 1 To 1) '<-- redim arr2 to be suitable for writing to a range

' now fill the array from arr - basically doing a "manual" transpose
Dim i As Long
For i = 1 To UBound(arr)
    arr2(i, 1) = arr(i)
Next i

Full working code:

Option Explicit

Sub Foo()

    Dim ie As Object, doc As HTMLDocument
    Dim Y1 As String, Y2 As String, GameID As String
    Dim SourceCode As Worksheet, c As Range
    Dim arr As Variant, arr2() As String

    Set ie = New InternetExplorer
    Set SourceCode = Worksheets("Source Code")
    Y1 = "2017"
    Y2 = "2018"
    GameID = "0003"

    ie.navigate "http://www.nhl.com/scores/htmlreports/" & Y1 & Y2 & "/PL02" & GameID & ".HTM"

    Do
        DoEvents
    Loop Until ie.readyState = READYSTATE_COMPLETE

    Set doc = ie.document

    arr = Split(doc.DocumentElement.innerHTML, vbLf)
    ReDim Preserve arr2(1 To UBound(arr), 1 To 1)
    Dim i As Long
    For i = 1 To UBound(arr)
        arr2(i, 1) = arr(i)
    Next i

    SourceCode.Range("A1").Resize(UBound(arr), 1).Value = arr2

    ' see comment by ashleedawg on the other answer
    Set ie = nothing

End Sub

It might be easier to use Excel's built-in web query tools. (If you haven't used the feature before it might seem more complicated than it is.) Click the DATA menu, then "From Web", enter the site, and Excel will download all the organized data it can find there. The attached image shows the steps and results, and here is a link to more about Excel web queries .

示例图像

To automate, you could record a macro of the steps to import a file, and then edit the macro as VBA to enter your custom dates, etc.

If you're somewhat comfortable with MS Access, it is better suited to scraping; I have a file I could pass on which I've used for scraping everything from Google to eBay (before I learned about API's!)


EDIT:

Oh! One of your problems: Your code doesn't set ie = nothing when it's done, so each time you run the code , Excel opens another copy of Internet Explorer (hidden in the background). CTRL-ALT-DEL to see the Task Manager and I bet you have multiple copies of iexplore.exe running, sucking up all your memory, andcausing the crashes. Reboot to fix that and try again, or try another method (after rebooting).

This is another way to do it, more memory-friendly:

Sub nhl_test()

    Dim ie As Object, doc As HTMLDocument
    Dim Y1 As String, Y2 As String, GameID As String, html As String
    Dim SourceCode As Worksheet, c As Range, y As Long, x As Long

    Set ie = New InternetExplorer
    Set SourceCode = Worksheets("Source Code")
    Y1 = "2017"
    Y2 = "2018"
    GameID = "0003"

    ie.navigate "http://www.nhl.com/scores/htmlreports/" & Y1 & Y2 & "/PL02" & GameID & ".HTM"

    Do
        DoEvents
    Loop Until ie.readyState = READYSTATE_COMPLETE

    Set doc = ie.document
    html = doc.DocumentElement.innerHTML 'put HTML into a variable
    Set doc = Nothing 'free up memory
    Set ie = Nothing 'free up memory, close IE

    arr = Split(html, vbLf) 'now split the html into lines

    For x = 1 To UBound(arr) ' x counts the line # in the file
        If Trim(arr(x)) <> "" Then
            y = y + 1 ' y counts the line # that isn't blank
            SourceCode.Cells(y, 1).Value = arr(x) 'put line x into row y
        End If
    Next x
    MsgBox "done (" & y & "rows)"
End Sub

...but it's still not a very organized set of data. What's your plan once you have the html loaded into a worksheet?

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM