VB.Net Search System Directory
I'm trying to add a search box to my application that searches a shared drive based on the criteria entered. The code I have so far is:
Public Sub searchProcedure()
    Dim startFolder As String = "C:\Documents and Settings\Practice Search"
    Dim dir As New System.IO.DirectoryInfo(startFolder)
    Dim fileList = dir.GetFiles("*.*", System.IO.SearchOption.AllDirectories)
    Dim searchTerm = "test string"
    ' Keep only the files whose text contains the search term.
    Dim queryMatchingFiles = From file In fileList _
                             Let fileText = GetFileText(file.FullName) _
                             Where fileText.Contains(searchTerm) _
                             Select file.FullName
    'Where file.Extension = "." _ (removed so searches all files)
    For Each filename In queryMatchingFiles
        ListBox1.Items.Add(filename)
    Next
End Sub
' Returns the file's contents as text, or an empty string if the file does not exist.
Function GetFileText(ByVal Name As String) As String
    Dim fileContents = String.Empty
    If System.IO.File.Exists(Name) Then
        fileContents = System.IO.File.ReadAllText(Name)
    End If
    Return fileContents
End Function
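The problem I'm having is with Microsoft Office documents. The contents are read into my fileContents string, but they are represented as XML(?).
Any ideas on how to get the actual text content into the string I use for searching?
Thanks!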
When the content is XML or HTML, you can strip the tags out completely using Regex:
Regex.Replace(text, "<.*?>", "")
Like this:
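' Requires: Imports System.Text.RegularExpressions
Dim fileContents = String.Empty
If System.IO.File.Exists(Name) Then
    fileContents = System.IO.File.ReadAllText(Name)
    ' Remove anything that looks like a tag, leaving only the text between tags.
    fileContents = Regex.Replace(fileContents, "<.*?>", "")
End If
Return fileContents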
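To make the effect of that pattern concrete, here is a minimal, self-contained sketch. The module name StripTagsDemo and the helper StripTags are hypothetical names used only for illustration, and RegexOptions.Singleline is an addition not in the answer above (it lets a tag that spans a line break still match). This only helps when the file really is plain XML or HTML text.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StripTagsDemo
    ' Hypothetical helper: removes anything between "<" and the next ">".
    ' Singleline is an assumption added here so tags broken across lines are matched too.
    Function StripTags(ByVal markup As String) As String
        Return Regex.Replace(markup, "<.*?>", "", RegexOptions.Singleline)
    End Function

    Sub Main()
        ' prints: Hello world
        Console.WriteLine(StripTags("<p>Hello <b>world</b></p>"))
    End Sub
End Module
```

Note that text from adjacent elements is joined with no separator, so a search term could match across what were originally two separate elements.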
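A .docx file is actually a ZIP file containing XML files. Two solutions come to mind, neither of them easy:
1. If MS Word is installed, use the Word object model to open the docx file programmatically and extract the text. This is easier with the MS Office Primary Interop Assemblies (PIA), but that ties you to a specific version of Office. In the end, I prefer to develop with the PIA and then switch to late binding (i.e. change everything to "Object" and drop the PIA reference).
2. Use #ZipLib (SharpZipLib) to open the .docx file, then use the System.Xml namespace to pick the XML apart.
I think option 1 would be easier for you.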
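As a rough sketch of option 2, the following reads the main document part straight out of the ZIP container. It swaps in the framework's own System.IO.Compression.ZipFile (available from .NET Framework 4.5, with a reference to System.IO.Compression.FileSystem) instead of #ZipLib, and LINQ to XML instead of the classic System.Xml reader; both substitutions are assumptions made to keep the example dependency-free. The names DocxZipSketch and ExtractDocxText are hypothetical.

```vbnet
Imports System
Imports System.IO
Imports System.IO.Compression
Imports System.Linq
Imports System.Xml.Linq

Module DocxZipSketch
    ' Main WordprocessingML namespace; text runs live in <w:t> elements.
    Private ReadOnly W As XNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    ' Hypothetical helper: returns the concatenated text runs of a .docx file,
    ' or an empty string when the archive has no main document part.
    Function ExtractDocxText(ByVal docxPath As String) As String
        Using archive As ZipArchive = ZipFile.OpenRead(docxPath)
            Dim entry As ZipArchiveEntry = archive.GetEntry("word/document.xml")
            If entry Is Nothing Then Return String.Empty
            Using entryStream As Stream = entry.Open()
                Dim xml As XDocument = XDocument.Load(entryStream)
                Return String.Concat(xml.Descendants(W + "t").Select(Function(t) t.Value))
            End Using
        End Using
    End Function
End Module
```

This only looks at the body of the document; headers, footers, footnotes and comments are stored in separate parts of the archive and would need the same treatment. Paragraphs are also joined without separators, which is fine for a simple substring search but not for display.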
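I've come to the conclusion that there is no "out of the box" solution; I'm handling each document type individually. Using the OpenXML SDK, the code to extract the text from Word is: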
Imports System.Xml.XmlReader
Imports System.IO
Imports DocumentFormat.OpenXml.Packaging
Imports DocumentFormat.OpenXml.Wordprocessing
Imports DocumentFormat.OpenXml.Spreadsheet
Imports DocumentFormat.OpenXml
Imports System.Linq
Public Sub WordProcessing()
    Dim strDoc As String = "C:\Documents and Settings\Practice.docx"
    Dim txt As String = String.Empty
    ' Open read-only; extracting text never needs write access.
    Using stream As Stream = File.Open(strDoc, FileMode.Open, FileAccess.Read)
        OpenAndAddtoWordProcessingStream(stream, txt)
    End Using
    MessageBox.Show(txt)
End Sub
Public Sub OpenAndAddtoWordProcessingStream(ByVal stream As Stream, ByRef txt As String)
    ' Using disposes the package even if reading the body throws.
    Using wordDoc As WordprocessingDocument = WordprocessingDocument.Open(stream, False)
        Dim body As Body = wordDoc.MainDocumentPart.Document.Body
        txt = body.InnerText
    End Using
End Sub
The code to extract the text from Excel is:
Dim strDoc As String = "C:\Documents and Settings\Practice.xlsx"
' Open read-only and dispose the package when done.
Using xlsxDoc As SpreadsheetDocument = SpreadsheetDocument.Open(strDoc, False)
    Dim workbookPart As WorkbookPart = xlsxDoc.WorkbookPart
    Dim shareStringPart As SharedStringTablePart = workbookPart.SharedStringTablePart
    ' The shared string table can be missing when the workbook holds no text cells.
    If shareStringPart IsNot Nothing Then
        For Each Item As SharedStringItem In shareStringPart.SharedStringTable.Elements(Of SharedStringItem)()
            MessageBox.Show(Item.InnerText)
        Next
    End If
End Using
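Next I'll be looking into .PDF, Access and Powerpoint.
I'm adding this so that, as SSS suggested, the question is fully answered. This is the complete code for searching Office documents, Office (x) documents, PDFs and other generic file formats for a text string.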
Imports System.IO
Imports System.Xml.XmlReader
Imports DocumentFormat.OpenXml.Packaging
Imports DocumentFormat.OpenXml.Wordprocessing
Imports DocumentFormat.OpenXml.Spreadsheet
Imports DocumentFormat.OpenXml
Imports System.Linq
Imports System
Imports System.Collections.Generic
Imports A = DocumentFormat.OpenXml.Drawing
Imports DocumentFormat.OpenXml.Presentation
Imports System.Text
Imports iTextSharp.text
Imports iTextSharp.text.pdf
Module searchFiles
    Public readAllText As String
    Public Sub startSearch(ByVal searchText As String)
        MainForm.marketIntelligencelboxsearch.Items.Clear()
        Dim dir_info As New DirectoryInfo("\\Max1\dept\")
        ListFiles(MainForm.marketIntelligencelboxsearch, dir_info, searchText)
    End Sub
    Private Sub ListFiles(ByVal lst As ListView, ByVal dir_info As DirectoryInfo, ByVal target As String)
        ' Get the files in this directory.
        Dim fs_infos() As FileInfo = dir_info.GetFiles("*.*")
        For Each fs_info As FileInfo In fs_infos
            ' A match on the file name (or the special target "ALL") lists the file without opening it.
            If target = "ALL" OrElse fs_info.ToString().IndexOf(target, StringComparison.OrdinalIgnoreCase) >= 0 Then
                MainForm.marketIntelligencelboxsearch.Items.Add(System.IO.Path.GetFileName(fs_info.FullName), MainForm.sourceFileImageIndex(fs_info.FullName))
            Else
                ' Otherwise search inside the file; the raw text is used by the catch-all case.
                readAllText = File.ReadAllText(fs_info.FullName)
                If fileExtention(fs_info.FullName, target) <> 0 Then
                    MainForm.marketIntelligencelboxsearch.Items.Add(System.IO.Path.GetFileName(fs_info.FullName), MainForm.sourceFileImageIndex(fs_info.FullName))
                End If
            End If
        Next fs_info
        fs_infos = Nothing
        ' Search subdirectories.
        Dim subdirs() As DirectoryInfo = dir_info.GetDirectories()
        For Each subdir As DirectoryInfo In subdirs
            ListFiles(lst, subdir, target)
        Next subdir
    End Sub
    Public Function fileExtention(ByVal sourcePath As String, ByVal target As String) As Integer
        ' Compare against a lower-cased path so .DOCX, .Xlsx etc. are matched too.
        Dim lowerPath As String = LCase(sourcePath)
        Select Case True
            Case InStr(lowerPath, ".docx") <> 0 OrElse InStr(lowerPath, ".docm") <> 0
                Return WordProcessing(sourcePath, target)
            Case InStr(lowerPath, ".xlsx") <> 0 OrElse InStr(lowerPath, ".xlsm") <> 0
                Return ExcelProcessing(sourcePath, target)
            Case InStr(lowerPath, ".pptx") <> 0 OrElse InStr(lowerPath, ".pptm") <> 0
                'will read slide text and notes
                Return PowerpointProcessing(sourcePath, target)
            Case InStr(lowerPath, ".pdf") <> 0
                'will search text in pdf
                Return pdfProcesssing(sourcePath, target)
            Case Else
                'looks at office docs before 2007 and all other generic extensions, includes Access 2007 and lower
                Return catchallProcessing(readAllText, target)
        End Select
    End Function
    Public Function catchallProcessing(ByVal strDoc As String, ByVal target As String) As Integer
        ' Returns 1 when target occurs anywhere in strDoc (case-insensitive), otherwise 0.
        If strDoc IsNot Nothing AndAlso strDoc.IndexOf(target, StringComparison.OrdinalIgnoreCase) >= 0 Then
            Return 1
        End If
        Return 0
    End Function
    Public Function WordProcessing(ByVal strDoc As String, ByVal target As String) As Integer ' Word 2007 and higher
        Dim txt As String
        ' Open read-only; Using closes the package before the function returns.
        Using wordDoc As WordprocessingDocument = WordprocessingDocument.Open(strDoc, False)
            txt = wordDoc.MainDocumentPart.Document.Body.InnerText
        End Using
        Return catchallProcessing(txt, target) ' returns 0 or 1
    End Function
    Public Function ExcelProcessing(ByVal strDoc As String, ByVal target As String) As Integer ' Excel 2007 and higher
        Dim paragraphText As New StringBuilder()
        Using xlsxDoc As SpreadsheetDocument = SpreadsheetDocument.Open(strDoc, False)
            Dim shareStringPart As SharedStringTablePart = xlsxDoc.WorkbookPart.SharedStringTablePart
            ' The shared string table can be missing when the workbook holds no text cells.
            If shareStringPart IsNot Nothing Then
                For Each Item As SharedStringItem In shareStringPart.SharedStringTable.Elements(Of SharedStringItem)()
                    paragraphText.Append(Item.InnerText) ' collect every shared string
                Next
            End If
        End Using
        Return catchallProcessing(paragraphText.ToString(), target)
    End Function
    Public Function PowerpointProcessing(ByVal file As String, ByVal target As String) As Integer
        Dim numberOfSlides As Integer = CountSlides(file)
        Dim slideText As String = Nothing
        Dim totalText As String = Nothing
        For i As Integer = 0 To numberOfSlides - 1
            GetSlideIdandText(slideText, file, i)
            totalText = totalText & slideText
            'System.Console.WriteLine("Slide #{0} contains: {1}", i + 1, slideText)
        Next
        Return catchallProcessing(totalText, target)
    End Function
    Public Function CountSlides(ByVal presentationFile As String) As Integer
        Using powerpointDocument As PresentationDocument = PresentationDocument.Open(presentationFile, False)
            Return CountSlides(powerpointDocument)
        End Using
    End Function
    Public Function CountSlides(ByVal powerpointDocument As PresentationDocument) As Integer
        If powerpointDocument Is Nothing Then
            Throw New ArgumentNullException("powerpointDocument")
        End If
        Dim slidesCount As Integer = 0
        Dim presentationPart As PresentationPart = powerpointDocument.PresentationPart
        If presentationPart IsNot Nothing Then
            slidesCount = presentationPart.SlideParts.Count()
        End If
        Return slidesCount
    End Function
    ' Declared as a Sub because the result is handed back through the ByRef sldText parameter.
    Public Sub GetSlideIdandText(ByRef sldText As String, ByVal docName As String, ByVal index As Integer)
        Using ppt As PresentationDocument = PresentationDocument.Open(docName, False)
            Dim part As PresentationPart = ppt.PresentationPart
            Dim slideIDs As OpenXmlElementList = part.Presentation.SlideIdList.ChildElements
            Dim relID As String = DirectCast(slideIDs(index), SlideId).RelationshipId
            Dim slide As SlidePart = DirectCast(part.GetPartById(relID), SlidePart)
            ' Speaker notes; NotesSlidePart is Nothing when the slide has no notes.
            Dim notesText As New StringBuilder()
            Dim notesSlide As NotesSlidePart = slide.NotesSlidePart
            If notesSlide IsNot Nothing Then
                For Each text As A.Text In notesSlide.NotesSlide.Descendants(Of A.Text)()
                    notesText.Append(text.Text)
                Next
            End If
            ' Text on the slide itself.
            Dim paragraphText As New StringBuilder()
            For Each text As A.Text In slide.Slide.Descendants(Of A.Text)()
                paragraphText.Append(text.Text)
            Next
            sldText = paragraphText.ToString() & notesText.ToString() 'concatenates the notes and slide text for searching
        End Using
    End Sub
    Public Function pdfProcesssing(ByVal strDoc As String, ByVal target As String) As Integer
        Dim stringOut As New StringBuilder()
        ' Check that the file exists before constructing the reader, not after.
        If File.Exists(strDoc) Then
            Dim oReader As New iTextSharp.text.pdf.PdfReader(strDoc)
            Try
                For i As Integer = 1 To oReader.NumberOfPages
                    Dim itsText As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy()
                    stringOut.Append(iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, itsText))
                Next
            Finally
                ' Release the file handle even if extraction fails on some page.
                oReader.Close()
            End Try
        End If
        Return catchallProcessing(stringOut.ToString(), target)
    End Function
End Module