Excel VBA: Extract Image Src attribute from HTML as string
I am trying to scrape my employer's website to extract images from their blog posts en masse. I have started creating a scraping tool in Excel using VBA.
(We don't have access to the SQL database.)
I have set up a worksheet that contains a list of post identifiers in column A and the URL of each post in column B.
My VBA script so far runs through the list of URLs in column B, extracts the HTML from a tag on the page by ID using getElementById, and pastes the resulting output as a string into column C.
I am now at the point where I am trying to figure out how to extract the src attribute from every image in the resulting output and paste it into the relevant columns. I can't for the life of me come up with an easy solution. I am not very familiar with RegEx and am struggling with Excel's built-in string functions.
The end game is to get the macro to run through each image URL and save the image to disk with a filename format like "{Event No.}-{Image Number}.jpg".
Any help would be much appreciated.
Sub Get_Image_SRC()
    Dim sht As Worksheet
    Dim LastRow As Long
    Dim i As Long
    Dim url As String
    Dim IE As Object

    Set sht = ThisWorkbook.Worksheets("Sheet1")
    ' last used row in column A
    LastRow = sht.Cells(sht.Rows.Count, "A").End(xlUp).Row

    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = True

    For i = 2 To LastRow
        url = sht.Cells(i, "C").Value
        ' log the URL instead of blocking on a message box for every row
        Debug.Print url
        IE.navigate url
        Application.StatusBar = url & " is loading..."
        ' wait for the previous page to be left, then for the new page to finish loading
        Do While IE.readyState = 4: DoEvents: Loop
        Do Until IE.readyState = 4: DoEvents: Loop
        Application.StatusBar = url & " Loaded"
        ' the container ID depends on the post type
        If sht.Cells(i, "B").Value = "WEBNEWS" Then
            sht.Cells(i, "D").Value = IE.document.getElementById("NewsDetail").outerHTML
        Else
            sht.Cells(i, "D").Value = IE.document.getElementById("ReviewContainer").outerHTML
        End If
    Next i

    ' close the browser and hand the status bar back to Excel
    IE.Quit
    Set IE = Nothing
    Application.StatusBar = False
End Sub
Example resulting HTML:
<div id=""NewsDetail""><div class=""NewsDetailTitle"">Video: Race Face Behind the Scenes Tour</div><div class=""NewsDetailImage""><img alt=""HeadlinesThumbnail.jpg"" src=""/ImageHandler/6190/515/1000/0/""></div> <div class=""NewsDetailBody"">Pinkbike posted this video a while ago, if you missed it, its' definitely worth a watch.
Ken from Camp of Champions took a look at their New Westminster factory last year which gives a look at the production, people and culture of Race Face. The staff at Race Face are truly their greatest asset they had, best wishes to everyone!
<p><center><object width=""500"" height=""281""><param name=""allowFullScreen"" value=""true""><param name=""AllowScriptAccess"" value=""always""><param name=""movie"" value=""http://www.pinkbike.com/v/188244""><embed width=""500"" height=""281"" src=""http://www.pinkbike.com/v/188244"" type=""application/x-shockwave-flash"" allowscriptaccess=""always"" allowfullscreen=""true""></object></center><p></p>
</div><div class=""NewsDate"">Published Friday, 25 November 2011</div></div>"
Using VBA seems very complicated when you can easily do this with Wget: How do I use Wget to download all Images into a single Folder.
For the regular expression method you should check out these two links:
Which basically boils down to this:
The regular expression to get the src attribute value from img is src\s*=\s*"(.+?)"
Use the VBScript.RegExp library to use regular expressions in VBA
I've used late binding but you can include the reference if you want.
Then the VBA goes like this:
Option Explicit

Sub Test()
    Dim strHtml As String

    ' sample html, note the three img tags
    strHtml = ""
    strHtml = strHtml & "<div id=""foo"">"
    strHtml = strHtml & "<bar class=""baz"">"
    strHtml = strHtml & "<img alt=""fred"" src=""\\server\path\picture1.png"" />"
    strHtml = strHtml & "</bar>"
    strHtml = strHtml & "<bar class=""baz"">"
    strHtml = strHtml & "<img alt=""ned"" src=""\\server\path\picture2.png"" />"
    strHtml = strHtml & "</bar>"
    strHtml = strHtml & "<bar class=""baz"">"
    strHtml = strHtml & "<img alt=""teddy"" src=""\\server\path\picture3.png"" />"
    strHtml = strHtml & "</bar>"
    strHtml = strHtml & "</div>"

    Dim strSrc As String
    Dim objRegex As Object
    Dim objMatches As Object
    Dim lngMatchCount As Long, lngCounter As Long

    ' create regex (late binding, no library reference needed)
    Set objRegex = CreateObject("VBScript.RegExp")

    ' set pattern and execute
    With objRegex
        .IgnoreCase = True
        .Pattern = "src\s*=\s*""(.+?)"""
        .Global = True
        If .Test(strHtml) Then
            Set objMatches = .Execute(strHtml)
            lngMatchCount = objMatches.Count
            For lngCounter = 0 To lngMatchCount - 1
                ' SubMatches(0) is the first capture group: the bare src value
                strSrc = objMatches(lngCounter).SubMatches(0)
                ' you've successfully captured the img src value
                Debug.Print strSrc
            Next
        Else
            strSrc = "Not found"
        End If
    End With
End Sub
Note that I am getting the first item of the SubMatches collection in order to get the value of the src attribute. The difference between objMatches(0) and objMatches(0).SubMatches(0) in this code is:
src="\\server\path\picture1.png"
Versus:
\\server\path\picture1.png
You probably want to wrap this up as a function and call it once you have the value of IE.document.getElementById("NewsDetail").outerHTML in the If..End If block of your code.
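Wrapping the regex up as a function might look like the sketch below. The names GetImageSources and DemoGetImageSources are made up for illustration, and the pattern is a variation on the one above: it is anchored on <img so that the src of the <embed> tag in the question's sample HTML is skipped. It assumes every attribute value is wrapped in double quotes, as in that sample.

```vba
Option Explicit

' Hypothetical helper: returns a Collection holding the src value of every
' <img> tag found in strHtml. Assumes double-quoted attribute values.
Function GetImageSources(ByVal strHtml As String) As Collection
    Dim objRegex As Object
    Dim objMatch As Object
    Dim colSources As New Collection

    Set objRegex = CreateObject("VBScript.RegExp") ' late binding
    With objRegex
        .IgnoreCase = True
        .Global = True
        ' start at <img, stay inside the tag, then capture the quoted src value
        .Pattern = "<img\b[^>]*?\bsrc\s*=\s*""([^""]*)"""
        For Each objMatch In .Execute(strHtml)
            colSources.Add objMatch.SubMatches(0)
        Next objMatch
    End With

    Set GetImageSources = colSources
End Function

' Hypothetical usage: builds file names in the
' "{Event No.}-{Image Number}.jpg" format from the question.
Sub DemoGetImageSources()
    Dim colSources As Collection
    Dim lngImage As Long
    Dim strEventNo As String

    strEventNo = "6190" ' made-up post identifier, e.g. read from column A
    Set colSources = GetImageSources( _
        "<div><img alt=""a"" src=""/ImageHandler/6190/515/1000/0/"">" & _
        "<embed src=""http://www.pinkbike.com/v/188244""></div>")

    For lngImage = 1 To colSources.Count
        Debug.Print strEventNo & "-" & lngImage & ".jpg", colSources(lngImage)
    Next lngImage
End Sub
```

In the question's loop you would call it as GetImageSources(IE.document.getElementById("NewsDetail").outerHTML). Note that the src in the sample HTML is relative (/ImageHandler/...), so it would need the site's base address prepended before the image can be downloaded.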