
Excel VBA: Extract Image Src attribute from HTML as string

I am trying to scrape my employer's website to extract images from their blog posts en masse. I have started creating a scraping tool in Excel using VBA.

(We don't have access to the SQL database.)

I have set up a worksheet that contains a list of post identifiers in column A and the URL of each post in column B.

My VBA script so far runs through the list of URLs in column B, extracts the HTML from a tag on the page by ID using getElementById, and pastes the resulting output as a string into column C.

I am now trying to figure out how to extract the src attribute from every image in the resulting output and paste it into the relevant columns. I can't for the life of me come up with an easy solution. I am not very familiar with RegEx and am struggling with Excel's built-in string functions.

The end goal is to get the macro to run through each image URL and save the image to disk with a filename format like "{Event No.}-{Image Number}.jpg".

Any help would be much appreciated.

[Image: worksheet setup]

Sub Get_Image_SRC()

    Dim sht As Worksheet
    Dim LastRow As Long
    Dim i As Long                   ' Long avoids overflow on sheets with more than 32,767 rows
    Dim url As String
    Dim IE As Object
    Dim objElement As Object
    Dim objCollection As Object
    ' These two types require a reference to the Microsoft HTML Object Library
    Dim Elements As IHTMLElementCollection
    Dim Element As IHTMLElement

    Set sht = ThisWorkbook.Worksheets("Sheet1")
    ' Last used row in column A (like Ctrl + Up from the bottom of the sheet)
    LastRow = sht.Cells(sht.Rows.Count, "A").End(xlUp).Row
    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = True

    For i = 2 To LastRow
        ' Qualify Cells with sht so the macro does not depend on the active sheet
        url = sht.Cells(i, "C").Value
        MsgBox url
        IE.navigate url
        Application.StatusBar = url & " is loading..."
        ' Wait until navigation has finished (readyState 4 = complete)
        Do While IE.Busy Or IE.readyState <> 4: DoEvents: Loop
        Application.StatusBar = url & " Loaded"
        If sht.Cells(i, "B").Value = "WEBNEWS" Then
            sht.Cells(i, "D").Value = IE.document.getElementById("NewsDetail").outerHTML
        Else
            sht.Cells(i, "D").Value = IE.document.getElementById("ReviewContainer").outerHTML
        End If
    Next i

    ' Close the browser and hand the status bar back to Excel
    IE.Quit
    Application.StatusBar = False
    Set IE = Nothing
    Set objElement = Nothing
    Set objCollection = Nothing

End Sub

Example resulting HTML:

<div id=""NewsDetail""><div class=""NewsDetailTitle"">Video: Race Face Behind the Scenes Tour</div><div class=""NewsDetailImage""><img alt=""HeadlinesThumbnail.jpg"" src=""/ImageHandler/6190/515/1000/0/""></div>    <div class=""NewsDetailBody"">Pinkbike posted this video a while ago, if you missed it, its' definitely worth a watch. 

Ken from Camp of Champions took a look at their New Westminster factory last year which gives a look at the production, people and culture of Race Face. The staff at Race Face are truly their greatest asset they had, best wishes to everyone!

<p><center><object width=""500"" height=""281""><param name=""allowFullScreen"" value=""true""><param name=""AllowScriptAccess"" value=""always""><param name=""movie"" value=""http://www.pinkbike.com/v/188244""><embed width=""500"" height=""281"" src=""http://www.pinkbike.com/v/188244"" type=""application/x-shockwave-flash"" allowscriptaccess=""always"" allowfullscreen=""true""></object></center><p></p>


</div><div class=""NewsDate"">Published Friday, 25 November 2011</div></div>"

[Image: my current references]

Using VBA seems very complicated when you can easily do this with Wget: How do I use Wget to download all Images into a single Folder

For the regular expression method you should check out these two links:

Which basically boils down to this:

  • A regular expression to get the src attribute value from an img tag is src\s*=\s*"(.+?)"
  • Use the VBScript.RegExp library to use regular expressions in VBA

I've used late binding, but you can add the reference (Microsoft VBScript Regular Expressions 5.5) if you want.

Then the VBA goes like this:

Option Explicit

Sub Test()

    Dim strHtml As String

    ' sample HTML, note the three img tags
    strHtml = ""
    strHtml = strHtml & "<div id=""foo"">"
    strHtml = strHtml & "<bar class=""baz"">"
    strHtml = strHtml & "<img alt=""fred"" src=""\\server\path\picture1.png"" />"
    strHtml = strHtml & "</bar>"
    strHtml = strHtml & "<bar class=""baz"">"
    strHtml = strHtml & "<img alt=""ned"" src=""\\server\path\picture2.png"" />"
    strHtml = strHtml & "</bar>"
    strHtml = strHtml & "<bar class=""baz"">"
    strHtml = strHtml & "<img alt=""teddy"" src=""\\server\path\picture3.png"" />"
    strHtml = strHtml & "</bar>"
    strHtml = strHtml & "</div>"

    Dim strSrc As String
    Dim objRegex As Object
    Dim objMatches As Object
    Dim lngMatchCount As Long, lngCounter As Long

    ' create regex (late bound, so no library reference is needed)
    Set objRegex = CreateObject("VBScript.RegExp")

    ' set pattern and execute
    With objRegex
        .IgnoreCase = True
        .Pattern = "src\s*=\s*""(.+?)"""
        .Global = True      ' find every match, not just the first

        If .Test(strHtml) Then
            Set objMatches = .Execute(strHtml)
            lngMatchCount = objMatches.Count
            For lngCounter = 0 To lngMatchCount - 1
                ' SubMatches(0) is the first capture group: the src value itself
                strSrc = objMatches(lngCounter).SubMatches(0)
                ' you've successfully captured the img src value
                Debug.Print strSrc
            Next
        Else
            strSrc = "Not found"
        End If
    End With

End Sub

Note that I am getting the first item of the SubMatches collection in order to get the value of the src attribute. The difference between objMatches(0) and objMatches(0).SubMatches(0) in this code is:

src="\\server\path\picture1.png"

Versus:

\\server\path\picture1.png

You probably want to wrap this up as a function and call it once you have the value of IE.document.getElementById("NewsDetail").outerHTML in the If...End If block of your code.
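As a rough illustration of that suggestion, the regex code could be wrapped along these lines. This is only a sketch: the names GetImageSrcs and Demo_GetImageSrcs are hypothetical, and the pattern is a variation on the one above that only looks inside img tags, on the assumption that you don't want the src of the embed tag in your sample HTML.

```vba
Option Explicit

' Hypothetical helper (not part of the original code): returns a Collection
' holding the src value of every <img> tag found in strHtml.
Function GetImageSrcs(ByVal strHtml As String) As Collection

    Dim objRegex As Object
    Dim objMatch As Object
    Dim colSrcs As Collection

    Set colSrcs = New Collection
    Set objRegex = CreateObject("VBScript.RegExp")

    With objRegex
        .IgnoreCase = True
        .Global = True
        ' Assumption: restrict matches to img tags so that src attributes
        ' on other tags (embed, script, ...) are skipped.
        .Pattern = "<img\b[^>]*?\bsrc\s*=\s*""(.+?)"""
    End With

    ' Execute returns an empty collection when nothing matches,
    ' so no separate Test call is needed here.
    For Each objMatch In objRegex.Execute(strHtml)
        colSrcs.Add objMatch.SubMatches(0)
    Next objMatch

    Set GetImageSrcs = colSrcs

End Function

' Small demo with made-up HTML: one img tag and one embed tag.
Sub Demo_GetImageSrcs()

    Dim colSrcs As Collection
    Dim n As Long

    Set colSrcs = GetImageSrcs("<div><img alt=""a"" src=""/ImageHandler/1/"">" & _
                               "<embed src=""http://example.com/v/1""></div>")

    ' Collections are 1-based
    For n = 1 To colSrcs.Count
        Debug.Print colSrcs(n)
    Next n

End Sub
```

In your loop you could then call GetImageSrcs on the HTML you write to column D and put each item into the next free column of row i. Keep in mind that the src in your example HTML is relative (/ImageHandler/...), so you would presumably need to prefix the site's base URL before downloading anything.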
