VBA Macro that scrapes HTML pulls some wrong elements

I'm running into a problem while scraping some HTML.

My macro scrapes a page of search results from a URL; below is an excerpt of the code:

Set els = IE.Document.getElementsByTagName("a")
For Each el In els
    If Trim(el.innerText) = "Documents" Then
        colDocLinks.Add el.href
    End If
Next el

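As you can see if you open that URL, you land on a set of search results; the macro then finds all the links in the search table and puts them in a Collection named colDocLinks.

However, the search-results table contains the 10-Q documents I want to include, but it also contains a different kind of animal that I do not want to include, such as 10-Q/A documents...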


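How can I modify the loop so that it explicitly adds only the plain 10-Q's (with nothing appended to them) to the collection, and not others such as 10-Q/A's?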

' Requires references to Microsoft Internet Controls and Microsoft HTML Object Library.
' WithEvents is only valid in an object module (e.g. ThisWorkbook or a class module).
Public WithEvents objIE As InternetExplorer

Sub LaunchIE()
    Set objIE = New InternetExplorer

    objIE.Visible = True
    objIE.Navigate "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=icld&type=10-Q%20&dateb=&owner=exclude&count=20"
End Sub

Private Sub objIE_DocumentComplete(ByVal pDisp As Object, URL As Variant)
    Dim localIE As InternetExplorer
    Set localIE = pDisp

    Dim doc As MSHTML.IHTMLDocument3
    Set doc = localIE.Document

    Dim tdElements As MSHTML.IHTMLElementCollection
    Dim td As MSHTML.IHTMLElement
    Dim tr As MSHTML.IHTMLElement
    Dim childrenElements As MSHTML.IHTMLElementCollection
    Dim child As MSHTML.IHTMLElement

    Set tdElements = doc.getElementsByTagName("td")
    For Each td In tdElements
        ' Exact comparison, so cells reading "10-Q/A" are skipped
        If Trim(td.innerText) = "10-Q" Then
            ' The table row that contains this filing-type cell
            Set tr = td.parentElement
            Set childrenElements = tr.Children
            For Each child In childrenElements
                ' InStr tolerates leading whitespace around the cell text
                If InStr(child.innerText, "Documents") > 0 Then
                    'Handle found element
                End If
            Next child
        End If
    Next td
End Sub
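
The 'Handle found element' placeholder above could be filled in along these lines. This is only a sketch: FindDocumentsLink is a hypothetical helper name, it takes the row late-bound (As Object), and it assumes the "Documents" cell holds an ordinary <a> link, which has not been checked against the live page.

```vba
' Hypothetical helper: returns the href of the first link in a table row
' whose text contains "Documents", or "" if the row has no such link.
Function FindDocumentsLink(ByVal tr As Object) As String
    Dim anchors As Object
    Dim a As Object

    ' Assumption: the row is an HTML element that exposes getElementsByTagName
    Set anchors = tr.getElementsByTagName("a")
    For Each a In anchors
        If InStr(a.innerText, "Documents") > 0 Then
            FindDocumentsLink = a.href
            Exit Function
        End If
    Next a
End Function
```

Inside the 10-Q branch, once tr has been set, the result could be stored in the asker's collection with something like: link = FindDocumentsLink(tr), followed by If Len(link) > 0 Then colDocLinks.Add link.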

I would use a regular expression to find and extract the exact links I'm looking for. Something like this:

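Dim RegEx As RegExp
Set RegEx = New RegExp
Dim match As match

With RegEx
    .IgnoreCase = True
    .Global = True
    .MultiLine = True
End With

' Quotes inside a VBA string literal must be doubled.
' [\s\S] is used because "." does not match line breaks in this engine.
RegEx.Pattern = "<td nowrap=""nowrap"">10-Q</td>[\s\S]+?<a href=""(.+?\.htm)"">"

For Each match In RegEx.Execute(Selection)
    ' SubMatches(0) is the captured href, not the whole matched text
    colDocLinks.Add match.SubMatches(0)
Next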

I haven't tested the regular expression above, so it may need some tweaking. You need to add a reference to Microsoft VBScript Regular Expressions 5.5 for it to work.
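
To see how an exact "10-Q" pattern leaves "10-Q/A" rows out, here is a small self-contained sketch that runs the idea against a made-up HTML string. The sample rows, the /sample/... paths, and the name DemoExactTenQMatch are all assumptions for illustration; the real page markup may differ. It uses late binding via CreateObject, so no library reference is needed for this demo.

```vba
Sub DemoExactTenQMatch()
    Dim html As String
    Dim re As Object
    Dim m As Object
    Dim links As New Collection

    ' Made-up sample rows: one amended filing (10-Q/A) and one plain 10-Q
    html = "<tr><td nowrap=""nowrap"">10-Q/A</td>" & vbLf & _
           "<td><a href=""/sample/amended-index.htm"">Documents</a></td></tr>" & vbLf & _
           "<tr><td nowrap=""nowrap"">10-Q</td>" & vbLf & _
           "<td><a href=""/sample/original-index.htm"">Documents</a></td></tr>"

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    ' "10-Q</td>" cannot match "10-Q/A</td>", so amended rows are skipped.
    ' [\s\S]+? lazily crosses line breaks up to the next link.
    re.Pattern = "<td nowrap=""nowrap"">10-Q</td>[\s\S]+?<a href=""(.+?\.htm)"">"

    For Each m In re.Execute(html)
        links.Add m.SubMatches(0)   ' captured href only
    Next m

    Debug.Print links.Count
    If links.Count > 0 Then Debug.Print links(1)
End Sub
```

One caveat of this approach: because the lazy [\s\S]+? keeps scanning until it finds a link, a 10-Q row with no link of its own would pick up the link from a later row, so walking the DOM may be the safer choice on irregular tables.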
