I would like to extract content from a webpage. However, the response text I get back contains JavaScript, which is not executed the way it would be when the page is opened in a browser.
Can XMLHTTP be used to get the HTML content, or is browser emulation the only option? Are there other ways to retrieve this content?
Dim oXMLHTTP As New MSXML2.XMLHTTP
Dim htmlObj As New HTMLDocument
With oXMLHTTP
    .Open "GET", "http://www.manta.com/ic/mtqyfk0/ca/riverbend-holdings-inc", False
    .send
    If .ReadyState = 4 And .Status = 200 Then
        htmlObj.body.innerHTML = .responseText
        'do things
    End If
End With
Response text:
<!DOCTYPE html>
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=/distil_r_blocked.html?Ref=/ic/mtq599v/ca/45th-street-limited-partnership&distil_RID=2115B138-A1BF-11E6-A957-C0595F6B962F&distil_TID=20161103121454" />
<script type="text/javascript">
(function(window){
try {
if (typeof sessionStorage !== 'undefined'){
sessionStorage.setItem('distil_referrer', document.referrer);
}
} catch (e){}
})(window);
</script>
<script type="text/javascript" src="/ser-yrbwqfedrrwwvctvyavy.js" defer></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#verxvaxcuczwcwecuxsx{display:none!important}</style></head>
<body>
<div id="distil_ident_block"> </div>
</body>
</html>
No. The JavaScript is part of the HTML itself, inside <script> tags, so XMLHTTP returns it along with everything else. You will have to post-process the response to remove those tags yourself.
You can use a function to remove the <script> nodes from the DOM after you have received the HTML for the page:
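Function RemoveScriptTags(ByVal objHTML As Object) As String
    Dim objScripts As Object
    Dim lngIndex As Long
    'getElementsByTagName returns a live collection, so walk it backwards;
    'removing nodes inside a For Each over objHTML.all can skip elements
    Set objScripts = objHTML.getElementsByTagName("script")
    For lngIndex = objScripts.Length - 1 To 0 Step -1
        With objScripts.Item(lngIndex)
            .parentNode.removeChild objScripts.Item(lngIndex)
        End With
    Next lngIndex
    'return the markup of the whole document without its <script> nodes
    RemoveScriptTags = objHTML.documentElement.outerHTML
End Function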
This can be included in your sample code like so:
Option Explicit

Sub Test()
    Dim objXMLHTTP As New MSXML2.XMLHTTP
    Dim objHTML As Object
    Dim strUrl As String
    Dim strHtmlNoScriptTags As String

    strUrl = "http://www.manta.com/ic/mtqyfk0/ca/riverbend-holdings-inc"
    With objXMLHTTP
        .Open "GET", strUrl, False
        .send
        If .ReadyState = 4 And .Status = 200 Then
            Set objHTML = CreateObject("htmlfile")
            objHTML.Open
            objHTML.write .responseText
            objHTML.Close
            'do things
            strHtmlNoScriptTags = RemoveScriptTags(objHTML)
            Debug.Print strHtmlNoScriptTags
            'update html document with script-less document
            Set objHTML = CreateObject("htmlfile")
            objHTML.Open
            objHTML.write strHtmlNoScriptTags
            objHTML.Close
            'you can now operate on DOM of objHTML
        End If
    End With
End Sub
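To illustrate the "you can now operate on DOM of objHTML" step, here is a minimal sketch of reading values out of the script-free document. It assumes a late-bound htmlfile document like objHTML in Test; ListMetaTags is a hypothetical helper name used only for illustration and is not part of the original answer.

```vba
'Hypothetical helper: prints the http-equiv and content values of each <meta> tag
Sub ListMetaTags(ByVal objHTML As Object)
    Dim objMetaTags As Object
    Dim lngIndex As Long

    'collect every <meta> element in the document
    Set objMetaTags = objHTML.getElementsByTagName("meta")
    For lngIndex = 0 To objMetaTags.Length - 1
        With objMetaTags.Item(lngIndex)
            'httpEquiv and content are properties of the MSHTML meta element
            Debug.Print .httpEquiv, .content
        End With
    Next lngIndex
End Sub
```

It could be called as ListMetaTags objHTML right after the second objHTML.Close in Test. The same pattern should work for other tag names passed to getElementsByTagName.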