繁体   English   中英

使用 VBA 抓取网页

[英]Web scraping using VBA

我想从这个URL 中提取数据。

我想从 10 张名片中的每一张中提取标题、移动电话号码和地址。

在此处输入图片说明

这是我尝试过但没有成功的一些代码。

Public Sub GetValueFromBrowser()
    On Error Resume Next
    Dim Sn As Integer
    Dim ie As Object
    Dim url As String
    Dim Doc As HTMLDocument
    Dim element As IHTMLElement
    Dim elements As IHTMLElementCollection

    For Sn = 1 To 1

        url = Sheets("Infos").Range("C" & Sn).Value

        Set ie = CreateObject("InternetExplorer.Application")

        With ie
            .Visible = 0
            .navigate url
            While .Busy Or .readyState <> 4
                DoEvents
            Wend
        End With    

        Set Doc = ie.document
        Set elements = Doc.getElementsByClassName(" col-sm-5 col-xs-8 store-details sp-detail paddingR0")

        Dim count As Long
        Dim erow As Long
        count = 0
        For Each element In elements
            If element.className = "lng_cont_name" Then
                erow = Sheet1.Cells(Rows.count, 1).End(xlUp).Offset(1, 0).Row
                Cells(erow, 1) = Doc.getElementsByClassName("Store-Name")(count).innerText
                Cells(erow, 2) = Doc.getElementsByClassName("cont_fl_addr")(count).innerText
                count = count + 1
            End If
        Next element

        If Val(Left(Sn, 2)) = 99 Then
            ActiveWorkbook.Save
        End If

    Next Sn
End Sub

电话号码并不容易,因为我认为它们是故意使难以刮掉的。 我找到了一种从 CSS 伪 ::before 元素内容中破译值的方法。 地址和标题是直接的 CSS 选择。


从那以后,我在这里用 python 编写了一个更清晰的脚本。


那么,代码的各个部分是如何工作的呢?

标题:

Set titles = .querySelectorAll(".jcn [title]")

我将标题定位为具有title属性和父jcn类属性的元素。 "." 表示一个类选择器, "[]"是一个属性选择器,中间的" "是一个后代组合器。

在此处输入图片说明

document querySelectorAll方法返回页面上所有匹配元素的nodeList ,即 10 个标题。


地址:

Set addresses = .querySelectorAll(".desk-add.jaddt")

地址由它们的类属性desk-add jaddt 由于不允许使用复合类名,因此需要额外的"." 必须替换名称中的空格。

在此处输入图片说明


电话号码(通过在storesTextToDecipher解密内容):

Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")

这就是魔法发生的地方。 这些数字不能直接通过 DOM 获得,因为它们是伪元素内容。

如果您检查相关的 HTML,您会发现一系列伪::before 元素 VBA 没有公开任何机制来应用伪选择器来尝试在页面的 CSS 中获取此信息。

您看到的实际上是一系列 span 元素,每个元素都有一个以mobilesv开头的 class 属性。 这些元素位于类col-sm-5 col-xs-8 store-details sp-detail paddingR0的单个父元素中(再次注意复合类名称)。

我最初收集所有父元素的nodeList

返回元素示例:

这些父元素中的每一个都包含构成电话号码字符串字符的类名(以mobilesv开头)元素。 一些字符是字符串中的数字,其他字符代表+()- 注意:类名中的 2|3 个字母字符串,在icon-之后 - 例如dc , fe

例如,页面上的第一个搜索结果,对于电话号码中的初始数字9

在此处输入图片说明

这个伪元素 /telephone 字符的实际 CSS 内容可以在 CSS 样式中观察到:

在此处输入图片说明

注意类名和伪元素选择器之前: .icon-ji:before并且内容是\\9d010

长话短说....您可以提取icon-后的 2 或 3 个字母 - 在这种情况下为ji ,在\\9d0\\9d0数字字符串,在这种情况下为10 ,并使用这两位信息来解密电话号码. 此信息在响应中可用:

请参阅与左侧电话字符串的类名相关联的相同 2/3 字母字符串,以及右侧的内容说明。 一点数学推导出右边的数字比电话号码大1,对于那个班级,如网页图片所示。 我只是创建了一个字典,然后通过解析 html 的这一部分将 2/3 字母缩写映射到电话号码。

在遍历storesTextToDecipher ,我使用这本字典从类名中匹配的 2/3 字母缩写中解密实际电话号码。


VBA:

Option Explicit
Public Sub GetInfo()
    Dim sResponse As String, html As HTMLDocument
    Set html = New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.justdial.com/Ahmedabad/Beauty-Parlours/page-3", False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With
  
    Dim cipherKey As String, cipherDict As Object
    Set cipherDict = CreateObject("Scripting.Dictionary")
    cipherKey = Split(Split(sResponse, "smoothing:grayscale}.icon-")(1), ".mobilesv")(0)
    cipherKey = Replace$(cipherKey, ":before{content:""\9d", Chr$(32))
    
    Dim arr() As String, tempArr() As String, i As Long, j As Long
    arr = Split(cipherKey, """}.icon-")
    For i = LBound(arr) To UBound(arr)
        tempArr = Split(arr(i), Chr$(32))
        cipherDict(tempArr(0)) = Replace$(tempArr(1), """}", vbNullString) - 1 'needs adjustment
    Next

    html.body.innerHTML = sResponse
    Dim titles As Object, addresses As Object, storesTextToDecipher As Object
    With html
        Set titles = .querySelectorAll(".jcn [title]")
        Set addresses = .querySelectorAll(".desk-add.jaddt")
        Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")
    End With
    
    For i = 0 To titles.Length - 1
        Debug.Print "title: " & titles.item(i).innerText
        Debug.Print "address: " & addresses.item(i).innerText
        Debug.Print GetStoreNumber(storesTextToDecipher.item(i), cipherDict)
    Next
End Sub
Public Function GetStoreNumber(ByVal storeInfo As Object, cipherDict As Object) As String
    Dim html2 As HTMLDocument, elems As Object, telNumber As String, j As Long
    Set html2 = New HTMLDocument
    html2.body.innerHTML = storeInfo.innerHTML
    Set elems = html2.querySelectorAll("b span")
    For j = 0 To elems.Length - 1
        On Error Resume Next
        If cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString)) < 10 Then
            telNumber = telNumber & cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString))
        End If
        On Error GoTo 0
    Next
   GetStoreNumber = telNumber
End Function

示例输出:

在此处输入图片说明


编辑:所有页面结果

由于您现在需要超过 10 个,因此以下使用预期的页面结果计数 ( NUMBER_RESULTS_ON_PAGE)从页面收集信息。 它滚动页面,直到找到预期数量的电话号码(应该是唯一的),或者命中MAX_WAIT_SEC 这意味着您可以避免无限循环,如果您期望不同的数字,可以设置您的预期结果计数。 它确实依赖于每个商店都列出了电话号码——这似乎是一个相当合理的假设。

Option Explicit
Public Sub GetInfo()
    Dim IE As New InternetExplorer, resultCountDict As Object, cipherDict As Object, t As Date
    Const MAX_WAIT_SEC As Long = 300 'wait 5 minutes max before exiting loop to get all results
    Const NUMBER_RESULTS_ON_PAGE As Long = 80
    Const URL = "https://www.justdial.com/Ahmedabad/Beauty-Parlours/page-3"
    
    Dim titles As Object, addresses As Object, storesTextToDecipher As Object
    
    Application.ScreenUpdating = True
    
    Set resultCountDict = CreateObject("Scripting.Dictionary")
    Set cipherDict = GetCipherDict(URL)
    
    With IE
        .Visible = True
        .Navigate2 URL
       
        While .Busy Or .readyState < 4: DoEvents: Wend
        
        With .document
            t = Timer
            Do
                DoEvents
                Set titles = .querySelectorAll(".jcn [title]")
                Set addresses = .querySelectorAll(".desk-add.jaddt")
                Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")
                Dim telNumber As String, i As Long
                       
                For i = 0 To titles.Length - 1
                    telNumber = GetStoreNumber(storesTextToDecipher.item(i), cipherDict)
                    If Not resultCountDict.Exists(telNumber) Then
                        resultCountDict.Add telNumber, Array(titles.item(i).innerText, addresses.item(i).innerText, telNumber)
                    End If
                Next
            
                .parentWindow.execScript "window.scrollBy(0, window.innerHeight);", "javascript"
                
                While IE.Busy Or IE.readyState < 4: DoEvents: Wend
                If Timer - t > MAX_WAIT_SEC Then Exit Do
            Loop Until resultCountDict.Count = NUMBER_RESULTS_ON_PAGE

        End With
        .Quit
    End With
    
    Dim key As Variant, rowCounter As Long
    rowCounter = 1
    With ThisWorkbook.Worksheets("Sheet1")
        For Each key In resultCountDict.keys
            .Cells(rowCounter, 1).Resize(1, 3) = resultCountDict(key)
            rowCounter = rowCounter + 1
        Next
    End With
    Application.ScreenUpdating = True
End Sub

Public Function GetStoreNumber(ByVal storeInfo As Object, cipherDict As Object) As String
    Dim html2 As HTMLDocument, elems As Object, telNumber As String, j As Long
    Set html2 = New HTMLDocument
    html2.body.innerHTML = storeInfo.innerHTML
    Set elems = html2.querySelectorAll("b span")
    For j = 0 To elems.Length - 1
        On Error Resume Next
        If cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString)) < 10 Then
            telNumber = telNumber & cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString))
        End If
        On Error GoTo 0
    Next
   GetStoreNumber = telNumber
End Function

Public Function GetCipherDict(ByVal URL As String) As Object
    Dim sResponse As String, html As HTMLDocument
    Set html = New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With

    Dim cipherKey As String, cipherDict As Object
    Set cipherDict = CreateObject("Scripting.Dictionary")
    cipherKey = Split(Split(sResponse, "smoothing:grayscale}.icon-")(1), ".mobilesv")(0)
    cipherKey = Replace$(cipherKey, ":before{content:""\9d", Chr$(32))

    Dim arr() As String, tempArr() As String, i As Long, j As Long
    arr = Split(cipherKey, """}.icon-")
    For i = LBound(arr) To UBound(arr)
        tempArr = Split(arr(i), Chr$(32))
        cipherDict(tempArr(0)) = Replace$(tempArr(1), """}", vbNullString) - 1 'needs adjustment
    Next
    Set GetCipherDict = cipherDict
End Function

编辑:

顶部存在多个数字的版本(请注意,如果您发出太多请求或太快,服务器将为您提供随机页面):

Option Explicit

Public Sub GetDetails()
    Dim re As Object, decodeDict As Object, i As Long
    Dim html As MSHTML.htmlDocument, responseText As String, keys(), values()
    
    Set decodeDict = CreateObject("Scripting.Dictionary")
    Set re = CreateObject("vbscript.regexp")
    Set html = New MSHTML.htmlDocument
    
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.justdial.com/chengalpattu/Oasis-Pharma-Near-Saraswathi-Children-School-Revathypuram-Urapakkam/9999PXX44-XX44-181123145524-X8G7_BZDET", False
        .setRequestHeader "User-Agent", "Mozilla/4.0"
        .send
        responseText = .responseText
        html.body.innerHTML = responseText
    End With
    
    keys = GetMatches(re, responseText, "-(\w+):before")

    If UBound(keys) = 0 Then Exit Sub
    
    values = GetMatches(re, responseText, "9d0(\d+)", True)
   
    For i = LBound(values) To UBound(values)
        decodeDict(keys(i)) = values(i)
    Next
    
    Dim itemsToDecode()
    
    decodeDict(keys(UBound(keys))) = "+"

    itemsToDecode = GetValuesToDecode(html)
    
    PrintNumbers re, html, itemsToDecode, decodeDict
End Sub

Public Function GetMatches(ByVal re As Object, ByVal inputString As String, ByVal sPattern As String, Optional ByVal numeric = False, Optional ByVal spanSearch = False) As Variant
    Dim matches As Object, iMatch As Object, s As String, arrMatches(), i As Long

    With re
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        .Pattern = sPattern

        If .Test(inputString) Then
            Set matches = .Execute(inputString)
            ReDim arrMatches(0 To matches.Count - 1)
            For Each iMatch In matches
                If numeric Then
                    arrMatches(i) = iMatch.SubMatches.item(0) - 1
                Else
                    If spanSearch Then
                        arrMatches(i) = iMatch
                    Else
                        arrMatches(i) = iMatch.SubMatches.item(0)
                    End If
                End If
                i = i + 1
            Next iMatch
        Else
            ReDim arrMatches(0)
            arrMatches(0) = vbNullString
        End If
    End With
    GetMatches = arrMatches
End Function

Public Function GetValuesToDecode(ByVal html As MSHTML.htmlDocument) As Variant
    Dim i As Long, elements As Object, results(), Class As String

    Set elements = html.querySelectorAll(".telCntct span[class*='icon']")
    
    ReDim results(elements.Length - 1)
    For i = 0 To elements.Length - 1
        Class = elements.item(i).className
        results(i) = Right$(Class, Len(Class) - InStrRev(Class, "-"))
    Next
    GetValuesToDecode = results
End Function

Public Sub PrintNumbers(ByVal re As Object, ByVal html As htmlDocument, ByVal itemsToDecode As Variant, ByVal decodeDict As Object)
    Dim output As String, i As Long

    For i = LBound(itemsToDecode) To UBound(itemsToDecode)
        output = output & decodeDict(itemsToDecode(i))
    Next
    
    Dim htmlToSearch As String, groups As Variant, startPos As Long, oldStartPos As Long
    
    htmlToSearch = html.querySelector(".telCntct").outerHTML

    groups = GetMatches(re, htmlToSearch, "mobilesv|,", False, True)
 
    startPos = 1
    
    Dim totalNumbers As Long
    
    For i = LBound(groups) To UBound(groups)
        If InStr(groups(i), ",") > 0 Then
            totalNumbers = totalNumbers + 1
            Debug.Print Mid$(output, startPos, IIf(startPos = 1, i, i - startPos))
            startPos = i + 1
        End If
    Next
    If totalNumbers = 1 Then Debug.Print Right$(output, Len(output) - startPos - 1)
End Sub

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM