
What is the fastest way to search through XML?

Suppose I have an XML file that I use as a local database, like this:

<root>
 <address>
  <firstName></firstName>
  <lastName></lastName>
  <phone></phone>
 </address>
</root>

I have a couple of questions:
1. What is the fastest way to find the address (or addresses) in the XML where firstName contains 'er', for example?
2. Is it possible to do this without loading the whole XML file into memory?

PS: I am not looking for alternatives to the XML file. Ideally, I need a search whose cost does not depend on the number of addresses in the XML file. But I am a realist, and it seems to me that this is not possible.

Update: I am using .NET 4.
Thanks for the suggestions, but this is more of a scientific task than a practical one. I am probably looking for ways that are faster than LINQ and XmlTextReader.

LINQ to XML works pretty well:

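// Requires: using System.Linq; using System.Xml.Linq;
XDocument doc = XDocument.Load("myfile.xml");
var addresses = from address in doc.Root.Elements("address")
                // The cast yields null (instead of throwing) when <firstName> is missing.
                let firstName = (string)address.Element("firstName")
                where firstName != null && firstName.Contains("er")
                select address;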

UPDATE: Take a look at this question on StackOverflow: Best way to search data in xml files?

Marc Gravell's accepted answer there suggests using SQL indexing:

First: how big are the xml files? XmlDocument doesn't scale to "huge"... but can handle "large" OK.

Second: can you perhaps put the data into a regular database structure (perhaps SQL Server Express Edition), index it, and access via regular TSQL? That will usually out-perform an xpath search. Equally, if it is structured, SQL Server 2005 and above supports the xml data-type, which shreds data - this allows you to index and query xml data in the database without having the entire DOM in memory (it translates xpath into relational queries).

UPDATE 2: Read also another link, taken from the previous question, that explains how the structure of the XML affects performance: http://www.15seconds.com/issue/010410.htm

And what about XmlReader? I think it could be the fastest way...

I tried it on a file of approx. 110 MB and it took about 1.1 sec. The same file with LINQ to XML (above) takes about 3 sec.

XmlReaderSettings settings = new XmlReaderSettings();
settings.DtdProcessing = DtdProcessing.Parse;
XmlReader reader = XmlReader.Create("C:\\Temp\\items.xml", settings);

String firstName = "", lastName = "", phone = "";
String lastTagName = "";
Boolean bItemFound = false;
long nCounter = 0;

Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();

reader.MoveToContent();
// Stream through the file one node at a time; only the current record is held in memory.
while (reader.Read())
{
    switch (reader.NodeType)
    {
        case XmlNodeType.Element:
            lastTagName = reader.Name;

            if (lastTagName == "address")
            {
                // A new record starts: count it and clear the previous record's fields.
                nCounter++;
                firstName = lastName = phone = "";
                bItemFound = false;
            }
            break;
        case XmlNodeType.Text:
            switch (lastTagName)
            {
                case "firstName":
                    firstName = reader.Value;
                    bItemFound = firstName.Contains("97331");
                    break;
                case "lastName":
                    lastName = reader.Value;
                    break;
                case "phone":
                    phone = reader.Value;
                    break;
            }
            break;
        case XmlNodeType.EndElement:
            // Print only once the whole record has been read, so that
            // lastName and phone belong to the matching firstName.
            if (reader.Name == "address" && bItemFound)
            {
                Console.Write("{0}\n{1}\n{2}\n", firstName, lastName, phone);
                bItemFound = false;
            }
            lastTagName = "";
            break;
    }
}

stopWatch.Stop();
TimeSpan ts = stopWatch.Elapsed;
string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
    ts.Hours, ts.Minutes, ts.Seconds,
    ts.Milliseconds / 10);
Console.WriteLine("RunTime " + elapsedTime);
Console.WriteLine("Searched items: {0}", nCounter);

Console.ReadKey();

If you have .NET 3.5+, consider using LINQ to XML.

Some sample code to give you an idea (the code below is lifted and liberally modified from the article):

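// "customer" is assumed to be an XElement or XDocument loaded elsewhere (it comes from the article's sample).
IEnumerable<string> addresses =
    from inv in customer.Descendants("Invoice")
    // XAttribute has no StartsWith; cast to string first (null if the attribute is missing).
    let productName = (string) inv.Attribute("ProductName")
    where productName != null && productName.StartsWith("er")
    select (string) inv.Attribute("StreetAddress");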

You can use XmlTextReader if you don't want to read the whole file into memory. Such a solution will probably run faster, but it will involve more coding.
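A middle ground between the two approaches is to stream the file with a reader and materialize only one address element at a time, so the query can still be written with LINQ. The following is a minimal sketch, not a benchmarked solution: it uses XmlReader.Create and XNode.ReadFrom rather than XmlTextReader, it assumes the root/address/firstName layout from the question, and the names StreamingSearch, StreamAddresses and FindFirstNames are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

static class StreamingSearch
{
    // Yields one <address> element at a time; the full document is never materialized.
    public static IEnumerable<XElement> StreamAddresses(TextReader input)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "address")
                {
                    // ReadFrom consumes the whole element and leaves the reader
                    // on the node after it, so Read() must not be called here.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    // Returns the first names that contain the given fragment (case-sensitive).
    public static List<string> FindFirstNames(TextReader input, string fragment)
    {
        return StreamAddresses(input)
            .Select(a => (string)a.Element("firstName"))
            .Where(n => n != null && n.Contains(fragment))
            .ToList();
    }

    static void Main()
    {
        const string xml =
            "<root>" +
            "<address><firstName>Peter</firstName></address>" +
            "<address><firstName>Anna</firstName></address>" +
            "</root>";

        foreach (string name in FindFirstNames(new StringReader(xml), "er"))
            Console.WriteLine(name);
    }
}
```

For a real file, a StreamReader over the file path can be passed in place of the StringReader. The scan is still linear in the number of addresses; only the memory use stays roughly constant.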

I'm worried you might be trying to optimize something that does not need it. How many email addresses are we talking about? Most of the time you would read in the input and build a structure that supports the kind of queries you will be running.

There are trees that can get to the kind of results you are looking for in O(log n) time. And you can store a ton of addresses in even a small amount of memory.
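As one possible illustration of such a structure (a sketch under stated assumptions, not necessarily what this answer had in mind), a sorted list of all suffixes of every first name turns a "contains" query into a prefix query, which a binary search can answer in logarithmic time plus the number of hits. A sorted list is used here instead of a tree, and the class name SuffixIndex and the sample names are hypothetical. The index has to be built once by reading the file, and rebuilt or updated whenever the file changes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: a sorted list of (suffix, record index) pairs.
sealed class SuffixIndex
{
    private readonly List<KeyValuePair<string, int>> entries =
        new List<KeyValuePair<string, int>>();

    public SuffixIndex(IList<string> firstNames)
    {
        // Store every suffix of every name together with the record it came from.
        for (int i = 0; i < firstNames.Count; i++)
            for (int start = 0; start < firstNames[i].Length; start++)
                entries.Add(new KeyValuePair<string, int>(firstNames[i].Substring(start), i));

        entries.Sort((a, b) => string.CompareOrdinal(a.Key, b.Key));
    }

    // Returns the indexes of the records whose first name contains the fragment.
    public SortedSet<int> Find(string fragment)
    {
        // Binary search for the first suffix that is >= fragment in ordinal order.
        int lo = 0, hi = entries.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(entries[mid].Key, fragment) < 0) lo = mid + 1;
            else hi = mid;
        }

        // All suffixes that start with the fragment are contiguous from this point.
        var result = new SortedSet<int>();
        for (int i = lo;
             i < entries.Count && entries[i].Key.StartsWith(fragment, StringComparison.Ordinal);
             i++)
        {
            result.Add(entries[i].Value);
        }
        return result;
    }
}

static class Program
{
    static void Main()
    {
        var names = new List<string> { "Peter", "Anna", "Bernard", "Eric" };
        var index = new SuffixIndex(names);

        // Matching is case-sensitive, so "Eric" does not match "er".
        Console.WriteLine(string.Join(", ", index.Find("er")));
    }
}
```

The trade-off is memory: a name of length k contributes k suffixes, so the index is noticeably larger than the raw data, which is usually acceptable for short fields such as first names.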

If you really need to avoid doing this on the server side, you can do it with regular expressions. But loading the XML into memory would be faster, I think...

