
VB.Net Search System Directory

I am trying to add a search box to my application that searches a shared drive based on the criteria entered. The code I currently have is:

Public Sub searchProcedure()

    Dim startFolder As String = "C:\Documents and Settings\Practice Search"

    Dim dir As New System.IO.DirectoryInfo(startFolder)
    Dim fileList = dir.GetFiles("*.*", System.IO.SearchOption.AllDirectories)

    Dim searchTerm = "test string"

    Dim queryMatchingFiles = From file In fileList _
                             Let fileText = GetFileText(file.FullName) _
                             Where fileText.Contains(searchTerm) _
                             Select file.FullName

    'Where file.Extension = "." _ (removed so searches all files)

    For Each filename In queryMatchingFiles
        ListBox1.Items.Add(filename)
    Next

End Sub


Function GetFileText(ByVal Name As String) As String

    Dim fileContents = String.Empty

    If System.IO.File.Exists(Name) Then

        fileContents = System.IO.File.ReadAllText(Name)

    End If

    Return fileContents

End Function

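The problem I am having is with Microsoft Office documents. The content is read into my fileContents string, but it comes back as XML(?).

Any ideas on how to get the actual text content into the string used for searching?

Thanks!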

When the content is XML or HTML, you can strip the tags out completely using Regex:

Regex.Replace(text, "<.*?>", "")

Like this:

Dim fileContents = String.Empty

If System.IO.File.Exists(Name) Then

    fileContents = System.IO.File.ReadAllText(Name)
    fileContents = Regex.Replace(fileContents, "<.*?>", "")
End If

Return fileContents
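
To see what that pattern does, here is a minimal sketch run on an in-memory string. The `StripTags` name and the sample markup are made up for illustration. The pattern only helps for files stored as plain XML or HTML; it does not unpack compressed formats, and because `.` does not match a line break by default, a tag split across lines would be left in place.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StripTagsSketch

    ' Hypothetical helper: removes every non-greedy <...> run from the input.
    Public Function StripTags(ByVal text As String) As String
        Return Regex.Replace(text, "<.*?>", "")
    End Function

    Sub Main()
        ' Prints: Hello world
        Console.WriteLine(StripTags("<p>Hello <b>world</b></p>"))
    End Sub

End Module
```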

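.docx files are actually ZIP files containing XML files. Two solutions come to mind, neither of them easy:

  1. If MS Word is installed, use the Word object model to open the docx file programmatically and extract the text. This is easier with the MS Office Primary Interop Assemblies (PIA), but it ties you to a specific version of Office. In the end, I prefer to develop against the PIA and then switch to late binding (i.e. change everything to "Object" and drop the PIA reference).

  2. Use #ZipLib to open the .docx file, then use the System.Xml namespace to pick the XML apart.

I think option 1 would be easier for you.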
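
For reference, option 2 could look roughly like the sketch below. It swaps in the framework's own `System.IO.Compression` classes (.NET 4.5 and later; on .NET Framework this needs a reference to System.IO.Compression.FileSystem) for #ZipLib, and LINQ to XML for the classic System.Xml readers. The `ExtractDocxText` name is hypothetical, and the sketch assumes the main document part sits at the conventional `word/document.xml` path, which a package is not strictly required to use.

```vbnet
Imports System
Imports System.IO
Imports System.IO.Compression
Imports System.Linq
Imports System.Xml.Linq

Module DocxZipSketch

    ' Hypothetical helper: returns the paragraph text of a .docx, one paragraph per line.
    Public Function ExtractDocxText(ByVal docxPath As String) As String
        Dim w As XNamespace = XNamespace.Get("http://schemas.openxmlformats.org/wordprocessingml/2006/main")

        Using archive As ZipArchive = ZipFile.OpenRead(docxPath)
            ' Assumption: the main part is stored at the conventional location.
            Dim entry As ZipArchiveEntry = archive.GetEntry("word/document.xml")
            If entry Is Nothing Then Return String.Empty

            Using entryStream As Stream = entry.Open()
                Dim doc As XDocument = XDocument.Load(entryStream)

                ' Each w:p is a paragraph; its visible text lives in the w:t runs.
                Dim paragraphs = doc.Descendants(w.GetName("p")).
                    Select(Function(p) String.Concat(p.Descendants(w.GetName("t")).
                                                     Select(Function(t) t.Value)))

                Return String.Join(Environment.NewLine, paragraphs)
            End Using
        End Using
    End Function

End Module
```

The returned string can then be searched with `IndexOf` in the same way as the text from `ReadAllText`.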

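I came to the conclusion that there is no "out of the box" solution, so I am handling each document type separately. Using the OpenXML SDK, the code to extract text from Word is: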

Imports System.Xml.XmlReader
Imports System.IO
Imports DocumentFormat.OpenXml.Packaging
Imports DocumentFormat.OpenXml.Wordprocessing
Imports DocumentFormat.OpenXml.Spreadsheet
Imports DocumentFormat.OpenXml
Imports System.Linq

Public Sub WordProcessing()

    Dim strDoc As String = "C:\Documents and Settings\Practice.docx"
    Dim txt As String = String.Empty

    ' Open read-only; the Using block closes the stream even if extraction fails.
    Using stream As Stream = File.OpenRead(strDoc)

        OpenAndAddtoWordProcessingStream(stream, txt)

    End Using

    MessageBox.Show(txt)

End Sub

Public Sub OpenAndAddtoWordProcessingStream(ByVal stream As Stream, ByRef txt As String)

    ' False = not editable: the document is only read, never modified.
    Using wordDoc As WordprocessingDocument = WordprocessingDocument.Open(stream, False)

        Dim body As Body = wordDoc.MainDocumentPart.Document.Body

        txt = body.InnerText

    End Using

End Sub

The code to extract text from Excel is:

    Dim strDoc As String = "C:\Documents and Settings\Practice.xlsx"

    ' Open read-only; Using disposes the package when done.
    Using spreadsheetDoc As SpreadsheetDocument = SpreadsheetDocument.Open(strDoc, False)

        Dim workbookPart As WorkbookPart = spreadsheetDoc.WorkbookPart
        Dim shareStringPart As SharedStringTablePart = workbookPart.SharedStringTablePart

        ' A workbook without any text cells has no shared string table.
        If shareStringPart IsNot Nothing Then

            For Each Item As SharedStringItem In shareStringPart.SharedStringTable.Elements(Of SharedStringItem)()

                MessageBox.Show(Item.InnerText)

            Next

        End If

    End Using

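Next I will look into .PDF, Access and Powerpoint.

I am adding this so the question is fully answered, as SSS suggested. This is the complete code for searching Office documents, Office documents (x), PDFs and other generic file formats for a text string.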

Imports System.IO
Imports System.Xml.XmlReader
Imports DocumentFormat.OpenXml.Packaging
Imports DocumentFormat.OpenXml.Wordprocessing
Imports DocumentFormat.OpenXml.Spreadsheet
Imports DocumentFormat.OpenXml
Imports System.Linq
Imports System
Imports System.Collections.Generic
Imports A = DocumentFormat.OpenXml.Drawing
Imports DocumentFormat.OpenXml.Presentation
Imports System.Text
Imports iTextSharp.text
Imports iTextSharp.text.pdf

Module searchFiles

Public readAllText As String

Public Sub startSearch(ByVal searchText As String)

    MainForm.marketIntelligencelboxsearch.Items.Clear()

    Dim dir_info As New DirectoryInfo("\\Max1\dept\")

    ListFiles(MainForm.marketIntelligencelboxsearch, dir_info, searchText)

End Sub


Private Sub ListFiles(ByVal lst As ListView, ByVal dir_info As DirectoryInfo, ByVal target As String)
    ' Get the files in this directory.
    Dim fs_infos() As FileInfo = dir_info.GetFiles("*.*")
    For Each fs_info As FileInfo In fs_infos
        If target = "ALL" Or fs_info.ToString().IndexOf(target, StringComparison.OrdinalIgnoreCase) >= 0 Then
            MainForm.marketIntelligencelboxsearch.Items.Add(System.IO.Path.GetFileName(fs_info.FullName), MainForm.sourceFileImageIndex(fs_info.FullName))
        Else

            readAllText = File.ReadAllText(fs_info.FullName)

            If fileExtention(fs_info.FullName, target) <> 0 Then
                MainForm.marketIntelligencelboxsearch.Items.Add(System.IO.Path.GetFileName(fs_info.FullName), MainForm.sourceFileImageIndex(fs_info.FullName))
            End If
        End If
    Next fs_info
    fs_infos = Nothing

    ' Search subdirectories.
    Dim subdirs() As DirectoryInfo = dir_info.GetDirectories()
    For Each subdir As DirectoryInfo In subdirs
        ListFiles(lst, subdir, target)
    Next subdir
End Sub


Public Function fileExtention(ByVal sourcePath As String, ByVal target As String) As Integer

    Dim searchResult As Integer

    Select Case True

        Case InStr(LCase(sourcePath), ".docx") <> 0 Or InStr(LCase(sourcePath), ".docm") <> 0
            searchResult = WordProcessing(sourcePath, target)
            Return searchResult

        Case InStr(LCase(sourcePath), ".xlsx") <> 0 Or InStr(LCase(sourcePath), ".xlsm") <> 0
            searchResult = ExcelProcessing(sourcePath, target)
            Return searchResult

        Case InStr(LCase(sourcePath), ".pptx") <> 0 Or InStr(LCase(sourcePath), ".pptm") <> 0
            'will read slide text and notes
            searchResult = PowerpointProcessing(sourcePath, target)
            Return searchResult

        Case InStr(LCase(sourcePath), ".pdf") <> 0
            'will search text in pdf
            searchResult = pdfProcesssing(sourcePath, target)
            Return searchResult

        Case Else
            'looks at office docs before 2007 and all other generic  extensions, includes Access 2007 and lower
            searchResult = catchallProcessing(readAllText, target)
            Return searchResult
    End Select


End Function

#Region "Search Index"

Public Function catchallProcessing(ByVal strDoc As String, ByVal target As String) As Integer

    ' Returns 1 when target occurs anywhere in strDoc (case-insensitive), otherwise 0.
    If strDoc IsNot Nothing AndAlso strDoc.IndexOf(target, StringComparison.OrdinalIgnoreCase) >= 0 Then

        Return 1

    End If

    Return 0

End Function

#End Region

#Region "Word 2007 Processing"

Public Function WordProcessing(ByVal strDoc As String, ByVal target As String) As Integer  ' Word 2007 and higher

    ' The Using blocks close the document and the stream even though we Return from inside them.
    Using stream As Stream = File.OpenRead(strDoc)

        Using wordDoc As WordprocessingDocument = WordprocessingDocument.Open(stream, False)

            Dim body As Body = wordDoc.MainDocumentPart.Document.Body

            Return catchallProcessing(body.InnerText, target) 'returns 0 or 1

        End Using

    End Using

End Function

#End Region

#Region "Excel 2007 Processing"

Public Function ExcelProcessing(ByVal strDoc As String, ByVal target As String) As Integer 'Excel 2007 and higher

    Dim paragraphText As New StringBuilder()

    ' Open read-only; Using disposes the package when done.
    Using spreadsheetDoc As SpreadsheetDocument = SpreadsheetDocument.Open(strDoc, False)

        Dim workbookPart As WorkbookPart = spreadsheetDoc.WorkbookPart
        Dim shareStringPart As SharedStringTablePart = workbookPart.SharedStringTablePart

        ' A workbook without any text cells has no shared string table.
        If shareStringPart IsNot Nothing Then

            For Each Item As SharedStringItem In shareStringPart.SharedStringTable.Elements(Of SharedStringItem)()

                paragraphText.Append(Item.InnerText) 'collects every shared string

            Next

        End If

    End Using

    Return catchallProcessing(paragraphText.ToString(), target)

End Function

#End Region

#Region "Powerpoint 2007 Processing"

Public Function PowerpointProcessing(ByVal file As String, ByVal target As String) As Integer

    Dim numberOfSlides As Integer = CountSlides(file)

    Dim slideText As String = Nothing
    Dim totalText As New StringBuilder()

    For i As Integer = 0 To numberOfSlides - 1
        GetSlideIdandText(slideText, file, i)
        totalText.Append(slideText)
    Next

    Return catchallProcessing(totalText.ToString(), target)

End Function

Public Function CountSlides(ByVal presentationFile As String) As Integer

    Using powerpointDocument As PresentationDocument = PresentationDocument.Open(presentationFile, False)

        Return CountSlides(powerpointDocument)

    End Using

End Function

Public Function CountSlides(ByVal powerpointDocument As PresentationDocument) As Integer

    If powerpointDocument Is Nothing Then

        Throw New ArgumentNullException("powerpointDocument")

    End If

    Dim slidesCount As Integer = 0

    Dim presentationPart As PresentationPart = powerpointDocument.PresentationPart

    If presentationPart IsNot Nothing Then

        slidesCount = presentationPart.SlideParts.Count()

    End If

    Return slidesCount

End Function

' Returns the slide text plus its notes through sldText; declared as a Sub because nothing else is returned.
Public Sub GetSlideIdandText(ByRef sldText As String, ByVal docName As String, ByVal index As Integer)

    Using ppt As PresentationDocument = PresentationDocument.Open(docName, False)

        Dim part As PresentationPart = ppt.PresentationPart
        Dim slideIDs As OpenXmlElementList = part.Presentation.SlideIdList.ChildElements
        Dim relID As String = TryCast(slideIDs(index), SlideId).RelationshipId

        Dim slide As SlidePart = DirectCast(part.GetPartById(relID), SlidePart)

        ' A slide without speaker notes has no NotesSlidePart, so guard against Nothing.
        Dim notesText As New StringBuilder()
        Dim notesSlide As NotesSlidePart = slide.NotesSlidePart

        If notesSlide IsNot Nothing Then

            For Each text As A.Text In notesSlide.NotesSlide.Descendants(Of A.Text)()
                notesText.Append(text.Text)
            Next

        End If

        Dim paragraphText As New StringBuilder()

        For Each text As A.Text In slide.Slide.Descendants(Of A.Text)()
            paragraphText.Append(text.Text)
        Next

        sldText = paragraphText.ToString() & notesText.ToString() 'concatenates the slide text and notes for searching

    End Using

End Sub

#End Region

#Region "PDF Processing"

Public Function pdfProcesssing(ByVal strDoc As String, ByVal target As String) As Integer

    Dim stringOut As StringBuilder = New StringBuilder()

    ' Check the file first: constructing a PdfReader on a missing path throws.
    If File.Exists(strDoc) Then

        Dim oReader As New iTextSharp.text.pdf.PdfReader(strDoc)

        Try

            For i As Integer = 1 To oReader.NumberOfPages

                Dim itsText As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
                stringOut.Append(iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, itsText))

            Next

        Finally

            oReader.Close() 'release the file handle

        End Try

    End If

    Return catchallProcessing(stringOut.ToString(), target)

End Function

#End Region

End Module
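
One thing to watch in `fileExtention`: `InStr` matches the extension text anywhere in the path, so a file such as `\\Max1\dept\reports.pdf.old\notes.txt` would be sent to the PDF branch. A possible alternative, sketched below, is to switch on `Path.GetExtension` instead. The `HandlerFor` name and the returned labels are hypothetical and only stand in for the calls to the processing functions above.

```vbnet
Imports System
Imports System.IO

Module ExtensionSketch

    ' Hypothetical helper: picks a handler label from the file's real extension.
    Public Function HandlerFor(ByVal sourcePath As String) As String

        Select Case Path.GetExtension(sourcePath).ToLowerInvariant()
            Case ".docx", ".docm"
                Return "Word"
            Case ".xlsx", ".xlsm"
                Return "Excel"
            Case ".pptx", ".pptm"
                Return "PowerPoint"
            Case ".pdf"
                Return "PDF"
            Case Else
                ' Pre-2007 Office files and every other extension.
                Return "CatchAll"
        End Select

    End Function

End Module
```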
