
Stripping HTML From A String

I've tried a number of things, but nothing seems to work properly. I have an Access DB and am writing code in VBA. I have a string of HTML source code, and I want to strip all of the HTML code and tags out of it so that I am left with a plain text string. What is the best way to do this?

Thanks

One way that's as resilient as possible to bad markup:

with createobject("htmlfile")
    .open
    .write "<p>foo <i>bar</i> <u class='farp'>argle </zzzz> hello </p>"
    .close
    msgbox "text=" & .body.outerText
end with

' Note: this version takes an Excel Range; in Access, pass a String instead.
Function StripHTML(cell As Range) As String
    Dim RegEx As Object
    Set RegEx = CreateObject("vbscript.regexp")

    Dim sInput As String
    Dim sOut As String
    sInput = cell.Text

    With RegEx
        .Global = True
        .IgnoreCase = True
        .MultiLine = True
        .Pattern = "<[^>]+>" 'Regular Expression for HTML Tags.
    End With

    sOut = RegEx.Replace(sInput, "")
    StripHTML = sOut
    Set RegEx = Nothing
End Function

This might help you. Good luck.

It depends on how complex the HTML structure is and how much data you want out of it.

Depending on the complexity, you might get away with regular expressions, but for complex markup, trying to parse data from HTML with regex is like trying to eat soup with a fork.
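To make the soup-and-fork point concrete, here is a minimal sketch of one case where the simple tag pattern used elsewhere in this thread goes wrong. The procedure name DemoRegexLimit and the sample markup are made up for illustration; the pattern is the same "<[^>]+>" used in the regex answers.

```vba
' Illustrative only: DemoRegexLimit is a hypothetical name, not from the answers.
Sub DemoRegexLimit()
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.Pattern = "<[^>]+>"

    ' The attribute value contains ">", so the first match ends too early.
    Dim html As String
    html = "<a title=""x > y"">link</a>"

    ' Prints:  y">link   (with a leading space) rather than just: link
    Debug.Print re.Replace(html, "")
End Sub
```

For markup you control and know to be simple, the regex is often good enough; for anything scraped from the wild, a real parser is the safer bet.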

You can use the htmlfile object to turn the flat file into objects that you can interact with, for example:

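Function ParseATable(url As String) As Variant

    Dim htm As Object, table As Object
    Dim data() As String, x As Long, y As Long, maxCols As Long
    Set htm = CreateObject("HTMLfile")

    ' Fetch the page synchronously and load the response into the document
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False
        .send
        htm.body.innerHTML = .responseText
    End With

    With htm
        ' Work with the first <table> on the page
        Set table = .getElementsByTagName("table")(0)

        ' Size the array to the widest row instead of a fixed 10 columns
        For x = 0 To table.Rows.Length - 1
            If table.Rows(x).Cells.Length > maxCols Then maxCols = table.Rows(x).Cells.Length
        Next x
        ReDim data(1 To table.Rows.Length, 1 To maxCols)

        ' Copy each cell's text into the 1-based array
        For x = 0 To table.Rows.Length - 1
            For y = 0 To table.Rows(x).Cells.Length - 1
                data(x + 1, y + 1) = table.Rows(x).Cells(y).innerText
            Next y
        Next x

        ParseATable = data

    End With
End Function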

Using early binding:

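' Requires a reference to "Microsoft HTML Object Library" (Tools > References)
Public Function GetText(inputHtml As String) As String
    With New HTMLDocument
        .Open
        .write inputHtml            ' parse the HTML passed in by the caller
        .Close
        GetText = .body.outerText   ' return value must use the function's own name
    End With
End Function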

An improvement over the regex function above: it finds quotes and line feeds and replaces them with their non-HTML equivalents. Also, the original function had a problem with embedded UNC references (e.g. <\\server\share\folder\file.ext>). It would remove the entire UNC string because of the < at the beginning and the > at the end. This function fixes that so the UNC is kept in the string:

Function StripHTML(strString As String) As String
    Dim RegEx As Object
    Set RegEx = CreateObject("vbscript.regexp")

    Dim sInput As String
    Dim sOut As String
    sInput = Replace(strString, "<\\", "\\")

    With RegEx
        .Global = True
        .IgnoreCase = True
        .MultiLine = True
        .Pattern = "<[^>]+>" 'Regular Expression for HTML Tags.
    End With

    sOut = RegEx.Replace(sInput, "")
    StripHTML = Replace(Replace(Replace(sOut, "&nbsp;", vbCrLf, 1, -1), "&quot;", "'", 1, -1), "\\", "<\\", 1, -1)
    Set RegEx = Nothing
End Function

I found a really simple solution to this. I currently run an Access database and use Excel forms to update it because of system restrictions and shared drive privileges. When I call the data from Access I use PlainText(YourStringHere). This removes all the HTML parts and leaves only the text.

Hope this works.
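For anyone wanting to try this, the PlainText call is, as far as I can tell, Access's built-in Application.PlainText function (Access 2007 and later), which is meant for Rich Text fields. A minimal sketch, assuming it is run from VBA inside Access; the procedure name DemoPlainText and the sample string are made up for illustration:

```vba
' Run inside Access (2007 or later). DemoPlainText is a hypothetical name.
Sub DemoPlainText()
    Dim html As String
    html = "<div>foo <b>bar</b></div>"

    ' PlainText is aimed at the HTML subset that Access rich text uses,
    ' so how well it copes with arbitrary web markup is something to verify.
    Debug.Print Application.PlainText(html)
End Sub
```

The same function can also be used in a query expression, for example PlainText([Notes]), where Notes is a hypothetical Rich Text field.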


 