Pulling text from a website into Excel using VBA

I am exploring whether I can use VBA to write a macro that searches a website for each keyword/code in a list in column A and extracts the data. Currently, the code below searches the desired website using only the value in Range("A1"), but it does reach the right page with the data I wish to extract. In this case, the code in A1 is 100-52-7.

Sub BrowseToSite()

Dim IE As New SHDocVw.InternetExplorer
Dim HTMLDoc As MSHTML.HTMLDocument

IE.Visible = True
IE.Navigate "https://apps.who.int/food-additives-contaminants-jecfa-database/Search.aspx"

Do While IE.ReadyState <> READYSTATE_COMPLETE
Loop

IE.Document.forms("form1").Elements("ctl00$ContentPlaceHolder1$txtSearch").Value = Range("a1").Value
IE.Document.forms("form1").Elements("ctl00$ContentPlaceHolder1$btnSearch").Click

Set HTMLDoc = IE.Document
'Sheet1.Range("B1").Value = HTMLDoc.getElementsByClassName("sectionHead1")(0).innerText


End Sub

Now I wish to pull the phrase "0-5 mg/kg bw (1996)" from this page into Excel. I planned to do this by retrieving the inner text of the element with that class name, but I run into the error "Object variable or With block variable not set" on the following line:

Sheet1.Range("B1").Value = HTMLDoc.getElementsByClassName("sectionHead1")(0).innerText

You can get rid of IE altogether and use XMLHTTP requests to make the script more robust. The following script first sends a GET request to scrape the values of the hidden form fields that the POST request needs, then issues a POST request with those values and parses the desired content from the response.

This is one efficient way to do it:

Option Explicit
Public Sub GetContent()
    Const Url = "https://apps.who.int/food-additives-contaminants-jecfa-database/Search.aspx"
    Dim oHttp As Object, oHtml As HTMLDocument, MyDict As Object
    Dim DictKey As Variant, payload$, searchKeyword$
    
    Set oHtml = New HTMLDocument
    Set oHttp = CreateObject("MSXML2.XMLHTTP")
    Set MyDict = CreateObject("Scripting.Dictionary")
    
    'send get requests first to parse the value of "__VIEWSTATE", "__VIEWSTATEGENERATOR" e.t.c., as in oHtml.getElementById("__VIEWSTATE").Value
    
    With oHttp
        .Open "GET", Url, False
        .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36"
        .send
        oHtml.body.innerHTML = .responseText
    End With
    
    searchKeyword = "100-52-7" 'this is the search keyword you wanna use from your predefined search terms
    
    'MyDict stores keys and values within dictionary, as in __VIEWSTATE = "some value" and so on
    
    MyDict("__VIEWSTATE") = oHtml.getElementById("__VIEWSTATE").Value
    MyDict("__VIEWSTATEGENERATOR") = oHtml.getElementById("__VIEWSTATEGENERATOR").Value
    MyDict("__EVENTVALIDATION") = oHtml.getElementById("__EVENTVALIDATION").Value
    MyDict("ctl00$ContentPlaceHolder1$txtSearch") = searchKeyword
    MyDict("ctl00$ContentPlaceHolder1$btnSearch") = "Search"
    MyDict("ctl00$ContentPlaceHolder1$txtSearchFEMA") = ""

    'joining each set of key and value with ampersand to make it a string so that you can use it as a parameter while issuing post requests, which is what payload is doing
    
    payload = ""
    For Each DictKey In MyDict
        'start the payload with the first pair, then append the remaining pairs separated by "&"
        payload = IIf(Len(payload) = 0, WorksheetFunction.EncodeURL(DictKey) & "=" & WorksheetFunction.EncodeURL(MyDict(DictKey)), _
        payload & "&" & WorksheetFunction.EncodeURL(DictKey) & "=" & WorksheetFunction.EncodeURL(MyDict(DictKey)))
    Next DictKey
    
    With oHttp
        .Open "POST", Url, False
        .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36"
        .setRequestHeader "Content-type", "application/x-www-form-urlencoded"
        .send (payload)
        oHtml.body.innerHTML = .responseText
    End With
    
    MsgBox oHtml.querySelector("#SearchResultItem > a").NextSibling.NodeValue
    
End Sub

Make sure to add the following libraries to execute the above script:

Microsoft XML, v6.0
Microsoft Scripting Runtime
Microsoft HTML Object Library
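
The question asks about a whole list of codes in column A, while the script above handles a single hard-coded keyword. The loop below is a minimal sketch of one way to bridge that gap. The names LookUpKeyword and LookUpColumnA are hypothetical, and LookUpKeyword is only a placeholder: it assumes you move the request logic of GetContent into it, use its searchKeyword argument instead of the hard-coded value, and return the parsed text rather than showing it in a MsgBox.

```vba
Option Explicit

' Hypothetical stand-in for the request logic: paste the body of GetContent here,
' use the searchKeyword argument in place of the hard-coded "100-52-7", and
' return the parsed text instead of passing it to MsgBox.
Private Function LookUpKeyword(ByVal searchKeyword As String) As String
    LookUpKeyword = "(result for " & searchKeyword & ")"   ' placeholder only
End Function

' Runs the lookup for every non-empty cell in column A of the active sheet
' and writes each result next to it in column B.
Public Sub LookUpColumnA()
    Dim ws As Worksheet
    Dim lastRow As Long, r As Long
    Dim keyword As String

    Set ws = ActiveSheet
    ' Last used row in column A
    lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row

    For r = 1 To lastRow
        keyword = Trim$(ws.Cells(r, "A").Text)
        If Len(keyword) > 0 Then
            ws.Cells(r, "B").Value = LookUpKeyword(keyword)
        End If
    Next r
End Sub
```

Keeping the per-keyword work in its own function means the loop stays the same whichever approach (XMLHTTP or IE) does the actual lookup.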

You click on an element with this line of code:

IE.Document.forms("form1").Elements("ctl00$ContentPlaceHolder1$btnSearch").Click

for which IE makes a POST request to retrieve your results, as can be seen here:

[Image: screenshot of the POST request] The screenshot is from Edge's dev tools, but the concept is the same.

During this request, the element in question is not there yet, so you need to wait for it to load.

Your prior method of

Do While IE.ReadyState <> READYSTATE_COMPLETE
Loop

would probably work, but I find it inconsistent at times, so I would also check the .Busy property.

Try using this after your click:

IE.Document.forms("form1").Elements("ctl00$ContentPlaceHolder1$btnSearch").Click

'~~WAIT FOR SEARCH RESULTS TO LOAD~~
Do While IE.ReadyState < READYSTATE_COMPLETE Or IE.Busy
Loop

Set HTMLDoc = IE.Document
Sheet1.Range("B1").Value = HTMLDoc.getElementsByClassName("sectionHead1")(0).innerText

If you're still having issues, you can force the macro to wait for the element in question to become available by doing this:

On Error Resume Next
'Query IE.Document directly so each pass checks the page that is currently loaded
Do While IE.Document.getElementsByClassName("sectionHead1")(0) Is Nothing
    DoEvents
Loop
On Error GoTo 0

Set HTMLDoc = IE.Document
Sheet1.Range("B1").Value = HTMLDoc.getElementsByClassName("sectionHead1")(0).innerText

This is a simple loop that checks for the object and keeps looping until the object is no longer Nothing (which essentially means it has loaded).

I would also recommend adding some sort of timeout that raises an error, just in case the webpage is having issues, so you don't end up in an infinite loop.
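
Such a timeout could look like the following sketch. WaitForClass and its 15-second default are hypothetical choices, not part of the original answer. It assumes IE is the InternetExplorer object from the question and that the element is looked up by class name.

```vba
Option Explicit

' Hypothetical helper: polls the page until an element with the given class
' name exists, or gives up after timeoutSeconds.
' Returns True if the element was found, False on timeout.
Public Function WaitForClass(ByVal IE As Object, ByVal className As String, _
                             Optional ByVal timeoutSeconds As Long = 15) As Boolean
    Dim deadline As Date
    Dim elem As Object

    ' Date arithmetic is in days, so convert seconds to a fraction of a day
    deadline = Now + timeoutSeconds / 86400#

    Do
        DoEvents                      ' keep Excel responsive while polling
        Set elem = Nothing
        On Error Resume Next          ' the document may not be available yet
        Set elem = IE.Document.getElementsByClassName(className)(0)
        On Error GoTo 0
        If Not elem Is Nothing Then
            WaitForClass = True
            Exit Function
        End If
    Loop While Now < deadline
End Function
```

After the click, the caller can then decide what a timeout means, for example: If Not WaitForClass(IE, "sectionHead1") Then Err.Raise vbObjectError + 513, , "Timed out waiting for search results". Using Now rather than Timer avoids a wrong comparison if the wait happens to span midnight, at the cost of one-second resolution.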

Pro Tip:

If you are clicking the search button many times and waiting for many objects to load, instead of duplicating the above code you can turn it into its own sub and do something like:

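Sub WaitForElement(IE As InternetExplorer, className As String)
    'Wait for the page itself to finish loading
    Do While IE.ReadyState < 4 Or IE.Busy: DoEvents: Loop
    'Then wait for the element. The lookup is repeated inside the loop because
    'an object passed in as an argument would only be evaluated once, at call time
    On Error Resume Next
    Do While IE.Document.getElementsByClassName(className)(0) Is Nothing: DoEvents: Loop
    On Error GoTo 0
End Sub

Then you would just need to use the following line after each click:

WaitForElement IE, "sectionHead1"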

Not only would this cut down on the number of lines in your code, it could greatly improve readability as well.
