
Fastest way to parse XML files in C#?

I have to load many XML files from the Internet. To test at a better speed, I downloaded all of them (more than 500 files). They have the following format.

<player-profile>
  <personal-information>
    <id>36</id>
    <fullname>Adam Gilchrist</fullname>
    <majorteam>Australia</majorteam>
    <nickname>Gilchrist</nickname>
    <shortName>A Gilchrist</shortName>
    <dateofbirth>Nov 14, 1971</dateofbirth>
    <battingstyle>Left-hand bat</battingstyle>
    <bowlingstyle>Right-arm offbreak</bowlingstyle>
    <role>Wicket-Keeper</role>
    <teams-played-for>Western Australia, New South Wales, ICC World XI, Deccan Chargers, Australia</teams-played-for>
    <iplteam>Deccan Chargers</iplteam>
  </personal-information>
  <batting-statistics>
    <odi-stats>
      <matchtype>ODI</matchtype>
      <matches>287</matches>
      <innings>279</innings>
      <notouts>11</notouts>
      <runsscored>9619</runsscored>
      <highestscore>172</highestscore>
      <ballstaken>9922</ballstaken>
      <sixes>149</sixes>
      <fours>1000+</fours>
      <ducks>0</ducks>
      <fifties>55</fifties>
      <catches>417</catches>
      <stumpings>55</stumpings>
      <hundreds>16</hundreds>
      <strikerate>96.95</strikerate>
      <average>35.89</average>
    </odi-stats>
    <test-stats>
      .
      .
      .
    </test-stats>
    <t20-stats>
      .
      .
      .    
    </t20-stats>
    <ipl-stats>
      .
      .
      . 
    </ipl-stats>
  </batting-statistics>
  <bowling-statistics>
    <odi-stats>
      <matchtype>ODI</matchtype>
      <matches>378</matches>
      <ballsbowled>58</ballsbowled>
      <runsgiven>64</runsgiven>
      <wickets>3</wickets>
      <fourwicket>0</fourwicket>
      <fivewicket>0</fivewicket>
      <strikerate>19.33</strikerate>
      <economyrate>6.62</economyrate>
      <average>21.33</average>
    </odi-stats>
    <test-stats>
      .
      .
      . 
    </test-stats>
    <t20-stats>
      .
      .
      . 
    </t20-stats>
    <ipl-stats>
      .
      .
      . 
    </ipl-stats>
  </bowling-statistics>
</player-profile>

I am using

XmlNodeList list = _document.SelectNodes("/player-profile/batting-statistics/odi-stats");

and then looping over this list with foreach:

foreach (XmlNode stats in list)
  {
     _btMatchType = GetInnerString(stats, "matchtype"); // returns a null string if the node is not available
     .
     .
     .
     .
     _btAvg = Convert.ToDouble(stats["average"].InnerText);
  }

Even though I am loading all the files offline, parsing is very slow. Is there a faster way to parse them? Or is the problem with SQL? I am saving all the data extracted from the XML to a database using DataSets and TableAdapters with an insert command.

EDIT: To use XmlReader, please give some XmlReader code for the document above. For now, I have done this:

void Load(string url) 
{
    _reader = XmlReader.Create(url); 
    while (_reader.Read()) 
    { 
    } 
} 

The available methods of XmlReader are confusing. What I need is to get the batting and bowling stats completely. The batting stats and the bowling stats differ from each other, while the odi, t20, ipl etc. blocks have the same structure within batting and within bowling.

You can use XmlReader for fast, forward-only reading.
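As a rough sketch of what forward-only reading could look like for the profile format in the question: the names StatsReader and ReadStats are made up for illustration, and the code assumes that every child of a stats block (such as odi-stats) is a simple text element with no nested elements.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class StatsReader
{
    // Reads the leaf elements of <sectionName>/<statsName>, for example
    // "batting-statistics"/"odi-stats", into a name -> text dictionary.
    // Returns an empty dictionary when that block is not present.
    public static Dictionary<string, string> ReadStats(
        TextReader input, string sectionName, string statsName)
    {
        var result = new Dictionary<string, string>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            // Skip ahead to the section, then to the stats block inside it.
            if (!reader.ReadToFollowing(sectionName) || !reader.ReadToDescendant(statsName))
                return result;

            int statsDepth = reader.Depth;
            reader.Read(); // step from the stats start tag to its first child node
            while (!reader.EOF && reader.Depth > statsDepth)
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    // Reads the text and moves past the element's end tag,
                    // so no extra Read() is needed in this branch.
                    result[reader.Name] = reader.ReadElementContentAsString();
                }
                else
                {
                    reader.Read(); // whitespace, comments, stray text
                }
            }
        }
        return result;
    }
}
```

Calling ReadStats(File.OpenText(path), "bowling-statistics", "odi-stats") would return the bowling ODI values. Reading several blocks from one file this way re-reads the file each time; a single pass that switches on the element name would avoid that at the cost of more code.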

The overhead of throwing exceptions probably dwarfs the overhead of XML parsing. You need to rewrite your code so that it doesn't throw exceptions.

One way is to check for the existence of an element before you ask for its value. That will work, but it's a lot of code. Another way to do it would be to use a map:

Dictionary<string, string> map = new Dictionary<string, string>
{
  { "matchtype", null },
  { "matches", null },
  { "ballsbowled", null }
};

foreach (XmlElement elm in stats.SelectNodes("*"))
{
   if (map.ContainsKey(elm.Name))
   {
      map[elm.Name] = elm.InnerText;
   }
}

This code will handle all the elements whose names you care about and ignore the ones you don't. If the value in the map is null, it means that an element with that name didn't exist (or had no text).
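For comparison, the first option (checking that an element exists before reading it) might be sketched like this. SafeXml, GetText and GetDouble are hypothetical helper names, and using double.TryParse instead of Convert.ToDouble is my own assumption about how to keep values such as "1000+" from throwing.

```csharp
using System.Globalization;
using System.Xml;

public static class SafeXml
{
    // Returns the child element's text, or null when the child is missing.
    public static string GetText(XmlNode parent, string childName)
    {
        XmlElement child = parent[childName];
        return child == null ? null : child.InnerText;
    }

    // Returns the parsed number, or null when the child is missing
    // or its text is not a plain number. No exception is thrown either way.
    public static double? GetDouble(XmlNode parent, string childName)
    {
        double value;
        string text = GetText(parent, childName);
        if (text != null &&
            double.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture, out value))
        {
            return value;
        }
        return null;
    }
}
```

With these helpers, the loop in the question could use SafeXml.GetDouble(stats, "average") and decide what a null result means, rather than relying on a catch block.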

In fact, if you're putting the data into a DataTable, and the column names in the DataTable are the same as the element names in the XML, you don't even need to build a map, since the DataTable.Columns property is all the map you need. Also, since the DataColumn knows what data type it contains, you don't have to duplicate that knowledge in your code:

foreach (XmlElement elm in stats.SelectNodes("*"))
{
   if (myTable.Columns.Contains(elm.Name))
   {
      DataColumn c = myTable.Columns[elm.Name];
      if (c.DataType == typeof(string))
      {          
         myRow[elm.Name] = elm.InnerText;
         continue;
      }
      if (c.DataType == typeof(double))
      {
         myRow[elm.Name] = Convert.ToDouble(elm.InnerText);
         continue;
      }
      throw new InvalidOperationException("I didn't implement conversion logic for " + c.DataType.ToString() + ".");
   }
}

Note how I'm not declaring any variables to store this information in, so there's no chance of me screwing up and declaring a variable of a data type different from the column it's stored in, or creating a column in my table and forgetting to implement the logic that populates it.

Edit

Okay, here's something that's a bit tricksy. This is a pretty common technique in Python; in C# I think most people still think there's something weird about it.

If you look at the second example I gave, you can see that it's using the metainformation in the DataColumn to figure out what logic to use for converting an element's value from text to its base type. You can accomplish the same thing by building your own map, e.g.:

Dictionary<string, Type> typeMap = new Dictionary<string, Type>
{
   { "matchtype", typeof(string) },
   { "matches", typeof(int) },
   { "ballsbowled", typeof(int) }
};

and then do pretty much the same thing I showed in the second example:

if (typeMap[elm.Name] == typeof(int))
{
   result[elm.Name] = Convert.ToInt32(elm.InnerText);
   continue;
}

Your results can no longer be a Dictionary<string, string>, since now they can contain things that aren't strings; they have to be a Dictionary<string, object>.

But that logic seems a little ungainly; you're testing each item several times, there are continue statements to break out of it. It's not terrible, but it could be more concise. How? By using another map, one that maps types to conversion functions:

Dictionary<Type, Func<string, object>> conversionMap = 
   new Dictionary<Type, Func<string, object>>
{
   { typeof(string), (x => x) },
   { typeof(int), (x => Convert.ToInt32(x)) },
   { typeof(double), (x => Convert.ToDouble(x)) },
   { typeof(DateTime), (x => Convert.ToDateTime(x)) }
};

That's a little hard to read, if you're not used to lambda expressions. The type Func<string, object> specifies a function that takes a string as its argument and returns an object. And that's what the values in that map are: they're lambda expressions, which is to say functions. They take a string argument (x), and they return an object. (How do we know that x is a string? The Func<string, object> tells us.)

This means that converting an element can take one line of code:

result[elm.Name] = conversionMap[typeMap[elm.Name]](elm.InnerText);

Go from the inner to the outer expression: this looks up the element's type in typeMap, then looks up the conversion function in conversionMap, and calls that function, passing it elm.InnerText as an argument.
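Put together, the two maps could be used roughly as follows. This is only a sketch: StatsConverter and ToTypedValues are illustrative names, the element list is cut down to three entries, and passing CultureInfo.InvariantCulture to the conversions is an assumption on my part so that "35.89" parses the same way on every machine.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Xml;

public static class StatsConverter
{
    // Element name -> the type its text should be converted to.
    static readonly Dictionary<string, Type> TypeMap = new Dictionary<string, Type>
    {
        { "matchtype", typeof(string) },
        { "matches", typeof(int) },
        { "average", typeof(double) }
    };

    // Type -> the function that converts text to that type.
    static readonly Dictionary<Type, Func<string, object>> ConversionMap =
        new Dictionary<Type, Func<string, object>>
    {
        { typeof(string), x => x },
        { typeof(int), x => Convert.ToInt32(x, CultureInfo.InvariantCulture) },
        { typeof(double), x => Convert.ToDouble(x, CultureInfo.InvariantCulture) }
    };

    // Converts every child element listed in TypeMap; other children are ignored.
    public static Dictionary<string, object> ToTypedValues(XmlNode stats)
    {
        var result = new Dictionary<string, object>();
        foreach (XmlNode node in stats.ChildNodes)
        {
            Type type;
            if (node.NodeType == XmlNodeType.Element &&
                TypeMap.TryGetValue(node.Name, out type))
            {
                result[node.Name] = ConversionMap[type](node.InnerText);
            }
        }
        return result;
    }
}
```

Note that the Convert calls still throw on text such as "1000+", so an element like fours would need either a string entry in TypeMap or a conversion function that tolerates it.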

This may not be the ideal approach in your case. I really don't know. I show it here because there's a bigger issue at play. As Steve McConnell points out in Code Complete, it's easier to debug data than it is to debug code. This technique lets you turn program logic into data. There are cases where using this technique vastly simplifies the structure of your program. It's worth understanding.

You could try LINQ to XML. Or you can use this to figure out what to use.
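A minimal LINQ to XML sketch for the format in the question might look like this. LinqStats and ReadStats are made-up names, and the code assumes element names inside one stats block are unique (ToDictionary throws on duplicate keys).

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class LinqStats
{
    // Returns the child elements of <sectionName>/<statsName> as name -> text,
    // or an empty dictionary when the section or the stats block is missing.
    public static Dictionary<string, string> ReadStats(
        XDocument doc, string sectionName, string statsName)
    {
        XElement section = doc.Root.Element(sectionName);
        XElement stats = section == null ? null : section.Element(statsName);
        if (stats == null)
            return new Dictionary<string, string>();

        return stats.Elements().ToDictionary(e => e.Name.LocalName, e => e.Value);
    }
}
```

A file would be loaded with XDocument.Load(path) and passed in, for example LinqStats.ReadStats(doc, "batting-statistics", "odi-stats"). Like XmlDocument, XDocument builds the whole tree in memory, so this is more about convenience than raw speed.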

If the documents are large, then a stream-based parser (which is fine for your needs) will be faster than using XmlDocument, mostly because of the lower overhead. Check out the documentation for XmlReader.

I wouldn't say LINQ is the best approach. I searched Google and I saw some references to HTML Agility Pack.

I think that if you're going to have a speed bottleneck, it will be in your download process. In other words, it appears that your performance problems are not in your XML code. There may be ways to improve your download speeds or your file I/O, but I don't know what they would be.

If you know that the XML is consistent and well formed, you can simply avoid doing real XML parsing and just process the files as flat text. This is risky, non-portable, and brittle.

But it'll be the fastest (to run, not to code) solution.

An XmlReader is the solution for your problem. An XmlDocument stores lots of meta-information that makes the XML easy to access, but it becomes too heavy on memory. I have seen some XML files smaller than 50 KB turn into a few MB (10 or so) as an XmlDocument.

If you are already converting that information into a DataSet to insert it into tables, just use DataSet.ReadXml() and work with the default tables it creates from the data.
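A bare-bones sketch of that call, with DataSetLoader as a hypothetical wrapper name. It has only been reasoned through for a simple block such as personal-information; the full profile repeats the name odi-stats under both batting-statistics and bowling-statistics, and how schema inference copes with that repeated name is something to check before relying on this.

```csharp
using System.Data;
using System.IO;

public static class DataSetLoader
{
    // Lets DataSet infer tables and columns from the XML itself.
    // Inferred columns are all of type string.
    public static DataSet Load(TextReader input)
    {
        var dataSet = new DataSet();
        dataSet.ReadXml(input);
        return dataSet;
    }
}
```

After loading, dataSet.Tables lists the inferred tables, and a value can be read with something like dataSet.Tables["personal-information"].Rows[0]["fullname"].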

This toy app does that, and it works with the format you defined above.

Project file: http://www.dot-dash-dot.com/files/wtfxml.zip
Installer: http://www.dot-dash-dot.com/files/WTFXMLSetup_1_8_0.msi

It lets you browse and edit your XML file using a tree and grid format. The tables listed in the grid are the ones automatically created by the DataSet after ReadXml().
