
Indexing .PDF, .XLS, .DOC, .PPT using Lucene.NET

I've heard of Lucene.Net and I've heard of Apache Tika. The question is - how do I index these documents using C# vs Java? I think the issue is that there is no .Net equivalent of Tika which extracts relevant text from these document types.

UPDATE - Feb 05 2011

Based on the responses given, it seems that there is not currently a native .Net equivalent of Tika. Two projects were mentioned that are each interesting in their own right:

  1. Xapian Project ( http://xapian.org/ ) - An alternative to Lucene written in unmanaged code. The project claims to support SWIG, which allows for C# bindings. Within the Xapian Project there is an out-of-the-box search engine called Omega. Omega uses a variety of open source components to extract text from various document types.
  2. IKVM.NET ( http://www.ikvm.net/ ) - Allows Java to be run from .Net. An example of using IKVM to run Tika can be found here.

Given the above 2 projects, I see a couple of options. To extract the text, I could either a) use the same components that Omega is using or b) use IKVM to run Tika. To me, option b) seems cleaner as there are only 2 dependencies.
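For option b), here is a minimal sketch of what calling Tika through IKVM might look like. Everything here is an assumption for illustration: the jar version, output assembly name, and referenced IKVM runtime DLLs will vary with your setup, and the `Tika` facade with `parseToString` is the Tika API as of the 0.7+ releases.

```csharp
// Assumes the Tika jar was converted to a .NET assembly first, e.g.:
//   ikvmc -target:library tika-app-0.9.jar
// and that the resulting tika-app-0.9.dll plus the IKVM.OpenJDK.*.dll
// runtime assemblies are referenced by this project.
using org.apache.tika;

class TikaViaIkvm
{
    static void Main(string[] args)
    {
        var tika = new Tika();
        // parseToString auto-detects the document type (PDF, DOC, XLS, ...)
        // and returns the extracted plain text.
        string text = tika.parseToString(new java.io.File(args[0]));
        System.Console.WriteLine(text);
    }
}
```

The extracted string can then be handed to Lucene.Net (or any other indexer) like text from any other source.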

The interesting part is that there are now several search engines that could probably be used from .Net: Xapian, Lucene.Net, or even Lucene itself (using IKVM).

UPDATE - Feb 07 2011

Another answer came in recommending that I check out IFilters. As it turns out, this is what MS uses for Windows Search, so Office IFilters are readily available. There are also some PDF IFilters out there. The downside is that they are implemented in unmanaged code, so COM interop is necessary to use them. I found the code snippet below in a DotLucene.NET archive (no longer an active project):

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Text;

namespace IFilter
{
    [Flags]
    public enum IFILTER_INIT : uint
    {
        NONE = 0,
        CANON_PARAGRAPHS = 1,
        HARD_LINE_BREAKS = 2,
        CANON_HYPHENS = 4,
        CANON_SPACES = 8,
        APPLY_INDEX_ATTRIBUTES = 16,
        APPLY_CRAWL_ATTRIBUTES = 256,
        APPLY_OTHER_ATTRIBUTES = 32,
        INDEXING_ONLY = 64,
        SEARCH_LINKS = 128,
        FILTER_OWNED_VALUE_OK = 512
    }

    public enum CHUNK_BREAKTYPE
    {
        CHUNK_NO_BREAK = 0,
        CHUNK_EOW = 1,
        CHUNK_EOS = 2,
        CHUNK_EOP = 3,
        CHUNK_EOC = 4
    }

    [Flags]
    public enum CHUNKSTATE
    {
        CHUNK_TEXT = 0x1,
        CHUNK_VALUE = 0x2,
        CHUNK_FILTER_OWNED_VALUE = 0x4
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct PROPSPEC
    {
        public uint ulKind;
        public uint propid;
        public IntPtr lpwstr;
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct FULLPROPSPEC
    {
        public Guid guidPropSet;
        public PROPSPEC psProperty;
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct STAT_CHUNK
    {
        public uint idChunk;
        [MarshalAs(UnmanagedType.U4)] public CHUNK_BREAKTYPE breakType;
        [MarshalAs(UnmanagedType.U4)] public CHUNKSTATE flags;
        public uint locale;
        [MarshalAs(UnmanagedType.Struct)] public FULLPROPSPEC attribute;
        public uint idChunkSource;
        public uint cwcStartSource;
        public uint cwcLenSource;
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct FILTERREGION
    {
        public uint idChunk;
        public uint cwcStart;
        public uint cwcExtent;
    }

    [ComImport]
    [Guid("89BCB740-6119-101A-BCB7-00DD010655AF")]
    [InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
    public interface IFilter
    {
        [PreserveSig]
        int Init([MarshalAs(UnmanagedType.U4)] IFILTER_INIT grfFlags, uint cAttributes, [MarshalAs(UnmanagedType.LPArray, SizeParamIndex=1)] FULLPROPSPEC[] aAttributes, ref uint pdwFlags);

        [PreserveSig]
        int GetChunk(out STAT_CHUNK pStat);

        [PreserveSig]
        int GetText(ref uint pcwcBuffer, [MarshalAs(UnmanagedType.LPWStr)] StringBuilder buffer);

        void GetValue(ref UIntPtr ppPropValue);
        void BindRegion([MarshalAs(UnmanagedType.Struct)] FILTERREGION origPos, ref Guid riid, ref UIntPtr ppunk);
    }

    [ComImport]
    [Guid("f07f3920-7b8c-11cf-9be8-00aa004b9986")]
    public class CFilter
    {
    }

    public class IFilterConstants
    {
        public const uint PID_STG_DIRECTORY = 0x00000002;
        public const uint PID_STG_CLASSID = 0x00000003;
        public const uint PID_STG_STORAGETYPE = 0x00000004;
        public const uint PID_STG_VOLUME_ID = 0x00000005;
        public const uint PID_STG_PARENT_WORKID = 0x00000006;
        public const uint PID_STG_SECONDARYSTORE = 0x00000007;
        public const uint PID_STG_FILEINDEX = 0x00000008;
        public const uint PID_STG_LASTCHANGEUSN = 0x00000009;
        public const uint PID_STG_NAME = 0x0000000a;
        public const uint PID_STG_PATH = 0x0000000b;
        public const uint PID_STG_SIZE = 0x0000000c;
        public const uint PID_STG_ATTRIBUTES = 0x0000000d;
        public const uint PID_STG_WRITETIME = 0x0000000e;
        public const uint PID_STG_CREATETIME = 0x0000000f;
        public const uint PID_STG_ACCESSTIME = 0x00000010;
        public const uint PID_STG_CHANGETIME = 0x00000011;
        public const uint PID_STG_CONTENTS = 0x00000013;
        public const uint PID_STG_SHORTNAME = 0x00000014;
        public const int FILTER_E_END_OF_CHUNKS = (unchecked((int) 0x80041700));
        public const int FILTER_E_NO_MORE_TEXT = (unchecked((int) 0x80041701));
        public const int FILTER_E_NO_MORE_VALUES = (unchecked((int) 0x80041702));
        public const int FILTER_E_NO_TEXT = (unchecked((int) 0x80041705));
        public const int FILTER_E_NO_VALUES = (unchecked((int) 0x80041706));
        public const int FILTER_S_LAST_TEXT = (unchecked((int) 0x00041709));
    }

    /// <summary>
    /// IFilter return codes
    /// </summary>
    public enum IFilterReturnCodes : uint
    {
        /// <summary>
        /// Success
        /// </summary>
        S_OK = 0,
        /// <summary>
        /// The function was denied access to the filter file.
        /// </summary>
        E_ACCESSDENIED = 0x80070005,
        /// <summary>
        /// The function encountered an invalid handle, probably due to a low-memory situation.
        /// </summary>
        E_HANDLE = 0x80070006,
        /// <summary>
        /// The function received an invalid parameter.
        /// </summary>
        E_INVALIDARG = 0x80070057,
        /// <summary>
        /// Out of memory
        /// </summary>
        E_OUTOFMEMORY = 0x8007000E,
        /// <summary>
        /// Not implemented
        /// </summary>
        E_NOTIMPL = 0x80004001,
        /// <summary>
        /// Unknown error
        /// </summary>
        E_FAIL = 0x80000008,
        /// <summary>
        /// File not filtered due to password protection
        /// </summary>
        FILTER_E_PASSWORD = 0x8004170B,
        /// <summary>
        /// The document format is not recognised by the filter
        /// </summary>
        FILTER_E_UNKNOWNFORMAT = 0x8004170C,
        /// <summary>
        /// No text in current chunk
        /// </summary>
        FILTER_E_NO_TEXT = 0x80041705,
        /// <summary>
        /// No more chunks of text available in object
        /// </summary>
        FILTER_E_END_OF_CHUNKS = 0x80041700,
        /// <summary>
        /// No more text available in chunk
        /// </summary>
        FILTER_E_NO_MORE_TEXT = 0x80041701,
        /// <summary>
        /// No more property values available in chunk
        /// </summary>
        FILTER_E_NO_MORE_VALUES = 0x80041702,
        /// <summary>
        /// Unable to access object
        /// </summary>
        FILTER_E_ACCESS = 0x80041703,
        /// <summary>
        /// Moniker doesn't cover entire region
        /// </summary>
        FILTER_W_MONIKER_CLIPPED = 0x00041704,
        /// <summary>
        /// Unable to bind IFilter for embedded object
        /// </summary>
        FILTER_E_EMBEDDING_UNAVAILABLE = 0x80041707,
        /// <summary>
        /// Unable to bind IFilter for linked object
        /// </summary>
        FILTER_E_LINK_UNAVAILABLE = 0x80041708,
        /// <summary>
        /// This is the last text in the current chunk
        /// </summary>
        FILTER_S_LAST_TEXT = 0x00041709,
        /// <summary>
        /// This is the last value in the current chunk
        /// </summary>
        FILTER_S_LAST_VALUES = 0x0004170A
    }

    /// <summary>
    /// Convenience class which provides static methods to extract text from files using installed IFilters
    /// </summary>
    public class DefaultParser
    {
        public DefaultParser()
        {
        }

        [DllImport("query.dll", CharSet = CharSet.Unicode)]
        private extern static int LoadIFilter(string pwcsPath, [MarshalAs(UnmanagedType.IUnknown)] object pUnkOuter, ref IFilter ppIUnk);

        private static IFilter loadIFilter(string filename)
        {
            object outer = null;
            IFilter filter = null;

            // Try to load the corresponding IFilter
            int resultLoad = LoadIFilter(filename,  outer, ref filter);
            if (resultLoad != (int) IFilterReturnCodes.S_OK)
            {
                return null;
            }
            return filter;
        }

        public static bool IsParseable(string filename)
        {
            return loadIFilter(filename) != null;
        }

        public static string Extract(string path)
        {
            StringBuilder sb = new StringBuilder();
            IFilter filter = null;

            try
            {
                filter = loadIFilter(path);

                if (filter == null)
                    return String.Empty;

                uint i = 0;
                STAT_CHUNK ps = new STAT_CHUNK();

                IFILTER_INIT iflags =
                    IFILTER_INIT.CANON_HYPHENS |
                    IFILTER_INIT.CANON_PARAGRAPHS |
                    IFILTER_INIT.CANON_SPACES |
                    IFILTER_INIT.APPLY_CRAWL_ATTRIBUTES |
                    IFILTER_INIT.APPLY_INDEX_ATTRIBUTES |
                    IFILTER_INIT.APPLY_OTHER_ATTRIBUTES |
                    IFILTER_INIT.HARD_LINE_BREAKS |
                    IFILTER_INIT.SEARCH_LINKS |
                    IFILTER_INIT.FILTER_OWNED_VALUE_OK;

                if (filter.Init(iflags, 0, null, ref i) != (int) IFilterReturnCodes.S_OK)
                    throw new Exception("Problem initializing an IFilter for:\n" + path + " \n\n");

                while (filter.GetChunk(out ps) == (int) (IFilterReturnCodes.S_OK))
                {
                    if (ps.flags == CHUNKSTATE.CHUNK_TEXT)
                    {
                        IFilterReturnCodes scode = 0;
                        while (scode == IFilterReturnCodes.S_OK || scode == IFilterReturnCodes.FILTER_S_LAST_TEXT)
                        {
                            uint pcwcBuffer = 65536;
                            System.Text.StringBuilder sbBuffer = new System.Text.StringBuilder((int)pcwcBuffer);

                            scode = (IFilterReturnCodes) filter.GetText(ref pcwcBuffer, sbBuffer);

                            if (pcwcBuffer > 0 && sbBuffer.Length > 0)
                            {
                                if (sbBuffer.Length < pcwcBuffer) // Should never happen, but it happens !
                                    pcwcBuffer = (uint)sbBuffer.Length;

                                sb.Append(sbBuffer.ToString(0, (int) pcwcBuffer));
                                sb.Append(" "); // "\r\n"
                            }

                        }
                    }

                }
            }
            finally
            {
                if (filter != null) {
                    Marshal.ReleaseComObject (filter);
                    System.GC.Collect();
                    System.GC.WaitForPendingFinalizers();
                }
            }

            return sb.ToString();
        }
    }
}
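To connect this back to the original question, a sketch of feeding the IFilter-extracted text into a Lucene.Net index might look like the following. This is illustrative glue code, not part of the DotLucene.NET snippet: the directory paths and field names are placeholders, and the API shown is the Lucene.Net 2.9-era one.

```csharp
// Assumes a reference to Lucene.Net 2.9.x and the IFilter.DefaultParser
// class from the snippet above.
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

class Indexer
{
    static void Main()
    {
        var writer = new IndexWriter(
            FSDirectory.Open(new System.IO.DirectoryInfo("index")),
            new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29),
            true, IndexWriter.MaxFieldLength.UNLIMITED);

        foreach (string path in System.IO.Directory.GetFiles("docs"))
        {
            // Skip files for which no IFilter is registered.
            if (!IFilter.DefaultParser.IsParseable(path))
                continue;

            var doc = new Document();
            // Store the path so hits can be mapped back to files.
            doc.Add(new Field("path", path,
                Field.Store.YES, Field.Index.NOT_ANALYZED));
            // Index (but don't store) the extracted body text.
            doc.Add(new Field("contents", IFilter.DefaultParser.Extract(path),
                Field.Store.NO, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }
        writer.Close();
    }
}
```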

At the moment, this seems like the best way to extract text from documents using the .NET platform on a Windows server. Thanks everybody for your help.

UPDATE - Mar 08 2011

While I still think that IFilters are a good way to go, if you are looking to index documents using Lucene from .NET, a very good alternative would be to use Solr. When I first started researching this topic, I had never heard of Solr, so for those of you who haven't either: Solr is a stand-alone search service, written in Java on top of Lucene. The idea is that you can fire up Solr on a firewalled machine and communicate with it via HTTP from your .NET application. Solr is truly written like a service and can do everything Lucene can do (including using Tika to extract text from .PDF, .XLS, .DOC, .PPT, etc), and then some. Solr also seems to have a very active community, which is one thing I am not too sure of with regard to Lucene.NET.
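As a rough illustration of that arrangement: Solr ships with an ExtractingRequestHandler ("Solr Cell") that runs Tika server-side, so pushing a binary document from .NET can be as simple as an HTTP file upload. The URL, document id, and file name below are placeholders, and the /update/extract handler must be enabled in solrconfig.xml (it is in the example configuration).

```csharp
// Sketch only: posts a PDF to a locally running Solr instance and lets
// Solr Cell / Tika do the text extraction and indexing.
using System.Net;

class SolrCellExample
{
    static void Main()
    {
        var client = new WebClient();
        // literal.id supplies the unique key for the new document;
        // commit=true makes it searchable immediately.
        string url = "http://localhost:8983/solr/update/extract"
                   + "?literal.id=doc1&commit=true";
        byte[] response = client.UploadFile(url, "report.pdf");
        System.Console.WriteLine(
            System.Text.Encoding.UTF8.GetString(response));
    }
}
```

Querying back out is then just another HTTP GET against /solr/select, which keeps the .NET side free of any Lucene or Tika dependency.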

You can also check out IFilters - there are a number of resources available if you do a search for "asp.net ifilters".

Of course, there is added hassle if you are distributing this to client systems, because you will either need to include the IFilters with your distribution and install them with your app on each machine, or the clients will lack the ability to extract text from any file types they don't have IFilters for.

This is one of the reasons I was dissatisfied with Lucene for a project I was working on. Xapian is a competing product, is orders of magnitude faster than Lucene in some cases, and has other compelling features (well, they were compelling to me at the time). The big issue? It's written in C++ and you have to interop to it. That covers indexing and retrieval. For the actual parsing of the text, that's where Lucene really falls down -- you have to do it yourself. Xapian has an Omega component that manages calling other third-party components to extract data. In my limited testing it worked pretty darn well. I did not finish the project (it was not much more than a POC), but I did write up my experience compiling it for 64-bit. Of course this was almost a year ago, so things might have changed.

If you dig into the Omega documentation you can see the tools that they use to parse documents:

  - PDF (.pdf) if pdftotext is available (comes with xpdf)
  - PostScript (.ps, .eps, .ai) if ps2pdf (from ghostscript) and pdftotext (comes with xpdf) are available
  - OpenOffice/StarOffice documents (.sxc, .stc, .sxd, .std, .sxi, .sti, .sxm, .sxw, .sxg, .stw) if unzip is available
  - OpenDocument format documents (.odt, .ods, .odp, .odg, .odc, .odf, .odb, .odi, .odm, .ott, .ots, .otp, .otg, .otc, .otf, .oti, .oth) if unzip is available
  - MS Word documents (.doc, .dot) if antiword is available
  - MS Excel documents (.xls, .xlb, .xlt) if xls2csv is available (comes with catdoc)
  - MS Powerpoint documents (.ppt, .pps) if catppt is available (comes with catdoc)
  - MS Office 2007 documents (.docx, .dotx, .xlsx, .xlst, .pptx, .potx, .ppsx) if unzip is available
  - Wordperfect documents (.wpd) if wpd2text is available (comes with libwpd)
  - MS Works documents (.wps, .wpt) if wps2text is available (comes with libwps)
  - Compressed AbiWord documents (.zabw) if gzip is available
  - Rich Text Format documents (.rtf) if unrtf is available
  - Perl POD documentation (.pl, .pm, .pod) if pod2text is available
  - TeX DVI files (.dvi) if catdvi is available
  - DjVu files (.djv, .djvu) if djvutxt is available
  - XPS files (.xps) if unzip is available
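For option a) from the Feb 05 update (calling the same tools Omega uses directly, rather than going through Xapian), the pattern is simply shelling out and capturing stdout. A sketch for PDF, assuming pdftotext is on the PATH:

```csharp
// Illustrative only: extracts text by invoking pdftotext, one of the
// external tools listed above. The "-" argument makes it write the
// extracted text to stdout instead of a file.
using System.Diagnostics;

class PdfToText
{
    static string Extract(string pdfPath)
    {
        var psi = new ProcessStartInfo("pdftotext", "\"" + pdfPath + "\" -")
        {
            RedirectStandardOutput = true,
            UseShellExecute = false
        };
        using (Process p = Process.Start(psi))
        {
            string text = p.StandardOutput.ReadToEnd();
            p.WaitForExit();
            return text;
        }
    }
}
```

The same pattern applies to antiword, catppt, xls2csv, and the rest; the cost is a process launch per document and a per-format dispatch table you maintain yourself.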

Apparently you can use Tika from .NET ( link ).

I have not tried this myself.

Another angle here is that Lucene indexes are binary compatible between Java and .NET. So you could write the index with Tika (on the Java side) and read it with C#.
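A sketch of the C# side of that split. This assumes the Java and .NET Lucene versions share an index format (e.g. 2.9.x on both sides; compatibility is per index-format version, not automatic across any pair of releases), and the index path and "contents" field name are placeholders for whatever the Java indexer used:

```csharp
// Opens an index directory produced by Java Lucene and searches it
// from Lucene.Net (2.9-era API).
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

class CrossPlatformSearch
{
    static void Main(string[] args)
    {
        // true = open the index read-only.
        var searcher = new IndexSearcher(
            FSDirectory.Open(new System.IO.DirectoryInfo("java-built-index")),
            true);

        // The analyzer should match what the Java side used at index time.
        var parser = new QueryParser(
            Lucene.Net.Util.Version.LUCENE_29, "contents",
            new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29));

        TopDocs hits = searcher.Search(parser.Parse(args[0]), 10);
        System.Console.WriteLine(hits.TotalHits + " hits");
        searcher.Close();
    }
}
```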
