
Extract text from html files and export as csv

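I have 5109 html files from an old website. I only want to extract the text from <title>Title 1</title> and <span class="mtr_message"> Text exemple 1</span>, then export the results to a csv file like this: Title 1 in the first cell and Text exemple 1 in the second cell.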

Try the WSH VBScript code below. Paste your paths, save it as a .vbs file and run it.

Option Explicit

Dim sSourceFolder, sResultFile, sRes, oFile, sCont

sSourceFolder = "C:\Users\DELL\Desktop\tmp" ' source files folder path
sResultFile = "C:\Users\DELL\Desktop\tmp\result.csv" ' result csv file path
sRes = ""
With CreateObject("Scripting.FileSystemObject") 
    For Each oFile In .GetFolder(sSourceFolder).Files
        If LCase(.GetExtensionName(oFile.Name)) = "htm" And oFile.Size > 0 Then
            With .OpenTextFile(oFile.Path, 1, False, -2)
                If .AtEndOfStream Then sCont = "" Else sCont = .ReadAll
                .Close
            End With
            With CreateObject("VBScript.RegExp")
                .Global = True
                .IgnoreCase = True
                .Multiline = True
                .Pattern = "<title>(.*?)</title>[\s\S]*?<span class=""mtr_message"">(.*?)</span>"
                With .Execute(sCont)
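                    ' Quote both fields, double any embedded quotes, and separate with a bare comma
                    If .Count = 1 Then sRes = sRes & """" & Replace(.Item(0).SubMatches(0), """", """""") & """,""" & Replace(.Item(0).SubMatches(1), """", """""") & """" & vbCrLf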
                End With
            End With
        End If
    Next
    With .OpenTextFile(sResultFile, 2, True, 0)
        .Write sRes
        .Close
    End With
End With
MsgBox "Completed"

You might need to change the file extension and encoding settings in the code. Currently only files with the htm extension are processed, and they are read by .OpenTextFile(oFile.Path, 1, False, -2) with the system default encoding -2 (Unicode is -1, ASCII is 0).
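OpenTextFile only offers those three format values, so it cannot decode UTF-8 pages. The following is a minimal sketch of one possible workaround, assuming the files are UTF-8 and that the ADODB.Stream COM object is available on the machine. The helper name ReadTextFile and the sample file name page1.htm are hypothetical and not part of the original answer.

```vbscript
Option Explicit

' Hypothetical helper (not in the original answer): reads a whole text file
' with an explicit charset such as "utf-8", using ADODB.Stream.
Function ReadTextFile(sPath, sCharset)
    With CreateObject("ADODB.Stream")
        .Type = 2                  ' 2 = adTypeText
        .Charset = sCharset        ' assumed encoding of the source files
        .Open
        .LoadFromFile sPath
        ReadTextFile = .ReadText   ' no argument: read to the end of the stream
        .Close
    End With
End Function

' Example call; page1.htm is a placeholder, replace it with one of your files
WScript.Echo ReadTextFile("C:\Users\DELL\Desktop\tmp\page1.htm", "utf-8")
```

With such a helper in the script, the With .OpenTextFile(oFile.Path, 1, False, -2) ... End With block that fills sCont could be replaced by sCont = ReadTextFile(oFile.Path, "utf-8"). The result file is still written with format 0 (ASCII), so if the extracted text contains characters outside the system code page, the last argument of .OpenTextFile(sResultFile, 2, True, 0) may need to be changed to -1 (Unicode).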
