简体   繁体   中英

All data not getting imported from website to excel

I want to import restaurant data like Restaurant name, phone number & website to excel but unfortunately I am getting only one page (first page) however I want data from any range which I define like page-1 to page-3 or page-2 to page-5 in separate sheets for each page. Sample output file is attached of what output I am getting for the time being. 在此处输入图片说明

    Sub Webscraping()
        'Declaration
     Dim ie As InternetExplorer
     Dim ht As HTMLDocument
    'Initialization
     Set ie = New InternetExplorer
     ie.Visible = True

'Open a url
ie.navigate ("https://www.yellowpages.com/atlanta-ga/restaurants")

'Set ht = ie.document
'MsgBox ht.getElementsByClassName("ot_lrp_bname_free_center")

'Alternative Approach for wait

Do Until ie.readyState = READYSTATE_COMPLETE
    DoEvents
Loop

'Initialize the document

Set ht = ie.document

'Set elems = ht.getElementsByClassName("list-title")
Set elems = ht.getElementsByClassName("business-name")
'Set elems = ht.getElementsByClassName("website-lrp icon-link ot_lrp_website_text_free_center")

i = 1
For Each elem In elems
    Sheet1.Cells(i, 1).Value = elem.innerText
    i = i + 1

    'Debug.Print (elem.innerText)
Next

Set elems = ht.getElementsByClassName("phone primary")

i = 1
For Each elem In elems
    Sheet1.Cells(i, 2).Value = elem.innerText
    i = i + 1

   'Debug.Print (elem.innerText)
Next
Set elems = ht.getElementsByClassName("links")
i = 1
For Each elem In elems

    Set link = elem.ChildNodes.Item(0)
    Sheet1.Cells(i, 3).Value = link.href
    i = i + 1    
Next

'Set internetdata = ie.document
'Set div_result = internetdata.getElementById("ctl00_gvMain_ctl03_hlTitle")
'Set header_links = div_result.getElementsByTagName("a")
'For Each h In header_links
'Set link = h.ChildNodes.Item(0)
'Worksheets("Stocks").Cells(Range("L" & Rows.Count).End(xlUp).Row + 1, 12) = link.href
 End Sub

This is the work which have been done but struggling to get the required ouput

The pages are concatenated onto end of url. I would use xhr issues requests in a loop over given page range and regex out the json which contains the required info (it is in one of the script tags). This method is very fast and more than offsets use of regex. I also re-use objects where possible.

I use jsonconverter.bas to handle the json and parse out required info (there is a lot more info in the json including the reviews). After downloading the .bas and adding to a module called JsonConverter in your project, you need to go VBE > Tools > References > Add reference to Microsoft Scripting Runtime.

Helper functions are used to test whether page to write out to already exists or needs creating, as well as one to write out json results to an array and dump the array out in one go to sheet (efficiency gain). The structure is left so it is easy to extend the retrieved info if more info is desired eg the review.

There may be some work to do on ensuring works for pages that don't exist. I have simply used status code of response at present to filter these out.


NOTES:

As a sanity check I would use InternetExplorer to go to page 1 and extract the total results count. I would divide that by results per page (currently 30) to calculate total pages. This would give me lbound and ubound values (min and max for possible pages). Then switch to xmlhttp to actually retrieve. See additional helper function at end.


Code:

Option Explicit
Public Sub GetRestuarantInfo()
    Dim s As String, re As Object, p As String, page As Long, r As String, json As Object
    Const START_PAGE As Long = 2
    Const END_PAGE As Long = 4
    Const RESULTS_PER_PAGE As Long = 30

    p = "\[{""@context"".*?\]"
    Set re = CreateObject("VBScript.RegExp")

    Application.ScreenUpdating = False

    With CreateObject("MSXML2.XMLHTTP")

        For page = START_PAGE To END_PAGE
            .Open "GET", "https://www.yellowpages.com/atlanta-ga/restaurants?page=" & page, False
            .send
            If .Status = 200 Then
                s = .responseText
                r = GetValue(re, s, p)
                If r <> "Not Found" Then
                    Set json = JsonConverter.ParseJson(r)
                    WriteOutResults page, RESULTS_PER_PAGE, json
                End If
            End If
        Next
    End With
    Application.ScreenUpdating = True
End Sub
Public Sub WriteOutResults(ByVal page As Long, ByVal RESULTS_PER_PAGE As Long, ByVal json As Object)
    Dim sheetName As String, results(), r As Long, headers(), ws As Worksheet
    ReDim results(1 To RESULTS_PER_PAGE, 1 To 3)

    sheetName = "page" & page
    headers = Array("Name", "Website", "Tel")
    If Not WorksheetExists(sheetName) Then
        Set ws = ThisWorkbook.Worksheets.Add
        ws.Name = sheetName
    Else
        ThisWorkbook.Worksheets(sheetName).Cells.ClearContents
    End If
    With ws
        Dim review As Object
        For Each review In json  'collection of dictionaries
            r = r + 1
            results(r, 1) = review("name")
            results(r, 2) = review("url")
            results(r, 3) = review("telephone")
        Next
        .Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        .Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With
End Sub

Public Function GetValue(ByVal re As Object, inputString As String, ByVal pattern As String) As String
'https://regex101.com/r/M9oRON/1
    With re
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .pattern = pattern
        If .Test(inputString) Then
            GetValue = .Execute(inputString)(0)
        Else
            GetValue = "Not found"
        End If
    End With
End Function

Public Function WorksheetExists(ByVal sName As String) As Boolean  '@Rory https://stackoverflow.com/a/28473714/6241235
    WorksheetExists = Evaluate("ISREF('" & sName & "'!A1)")
End Function

Helper function to return number of pages

'VBE > Tools > References: Microsoft Internet Controls
Public Function GetNumberOfPages(ByVal RESULTS_PER_PAGE As Long) As Variant
    Dim ie As Object, totalResults As Long
    On Error GoTo errhand
    Set ie = CreateObject("InternetExplorer.Application")
    With ie
        .Visible = False
        .Navigate2 "https://www.yellowpages.com/atlanta-ga/restaurants?page=1"

        While .Busy Or .readyState < 4: DoEvents: Wend

        With .document
            totalResults = Replace$(Replace$(.querySelector(".pagination  p").innerText, "We found", vbNullString), "results", vbNullString)
            GetNumberOfPages = totalResults / RESULTS_PER_PAGE
            ie.Quit
            Exit Function
        End With
    End With
errhand:
    If Err.Number <> 0 Then
        GetNumberOfPages = CVErr(xlErrNA)
    End If
End Function

Regex explanation:

Try it here .

在此处输入图片说明

The only way to do this with VBA is to check for the existence of a "Next" button and to click it, if it is there:

在此处输入图片说明

This is the HTML for it:

<a class="next ajax-page" href="/atlanta-ga/restaurants?page=2" data-page="2" data-analytics="{&quot;click_id&quot;:132}" data-remote="true" data-impressed="1">Next</a>

This is not "science fiction" to be done with VBA, however, there are commercial RPA solutions, which give "out of the box" functionalities exactly for this task - UiPath, AutomationAnywhere, BluePrism. Python's "beautiful soup" would do a rather good job as well.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM