
Search multiple XML files for string

I have a folder with 400k+ XML documents, and many more to come. Each file is named 'ID'.xml, and each belongs to a specific user. In a SQL Server database I have the 'ID' from the XML file matched with a userID, which is how I connect the XML document to the user. A user can have an unlimited number of XML documents attached (but let's say a maximum on the order of 10k documents).

All XML-documents have a few common elements, but the structure can vary a little.

Now, each user needs to be able to search the XML documents belonging to her, and what I've tried so far (looping through each file and reading it with a StreamReader) is too slow. I don't care whether it reads and matches the whole file, attributes and all, or just the text in each element. What should be returned in the first place is a list of the IDs from the file names.

What is the fastest and smartest method here, if any?

I think LINQ-to-XML is probably the direction you want to go.

Assuming you know the names of the tags that you want, you would be able to do a search for those particular elements and return the values.

var xDoc = XDocument.Load("yourFile.xml");

var result = from dec in xDoc.Descendants()
             where dec.Name == "tagName"
             select dec.Value;

result would then contain an IEnumerable of the values of every XML tag whose name matches "tagName".

The query could also be written like this:

var result = from dec in xDoc.Descendants("tagName")
             select dec.Value;

or this:

var result = xDoc.Descendants("tagName").Select(tag => tag.Value);

The output would be the same, it is just a different way to filter based on the element name.
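To tie this back to the question (a list of IDs from the file names), here is a minimal sketch; userFilePaths and searchTerm are assumptions, e.g. the paths resolved via the database and the user's input:

// Hypothetical: run the same query over each of the user's files and keep
// the IDs (file names without extension) of the documents that match.
// Requires System.IO, System.Linq and System.Xml.Linq.
var matchingIds =
    (from path in userFilePaths
     let doc = XDocument.Load(path)
     where doc.Descendants("tagName").Any(e => e.Value.Contains(searchTerm))
     select Path.GetFileNameWithoutExtension(path)).ToList();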

You'll have to open each file that contains relevant data, and if you don't know in advance which files contain it, you'll have to open every file that might match. So the only performance gain to be had is in the parsing routine.

When parsing XML, if speed is the requirement, you could use XmlReader, as it performs far better than the other parsers (most of them read the entire XML file before you can query it). The fact that it is forward-only should not be a limitation for this case.
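As a rough sketch of that approach (names here are illustrative, not from the question):

// Stream through a file with XmlReader and stop at the first text node
// containing the search term. Requires System.Xml.
static bool FileContains(string path, string searchTerm)
{
    using (var reader = XmlReader.Create(path))
    {
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Text && reader.Value.Contains(searchTerm))
                return true;   // early exit: no need to read the rest of the file
        }
    }
    return false;
}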

If parsing takes about as long as the disk I/O, you could try parsing files in parallel, so one thread can wait for a file to be read while another parses the data that has already been loaded. I don't think there's a big win to be had there, though.
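A hedged sketch of that idea with PLINQ, reusing the FileContains helper sketched above:

// Let the thread pool overlap disk reads and parsing across files.
// Requires System.Linq and System.IO.
var matchingIds = userFilePaths
    .AsParallel()
    .Where(path => FileContains(path, searchTerm))
    .Select(Path.GetFileNameWithoutExtension)
    .ToList();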

Also, what is "too slow", and what is acceptable? Would this many-files solution become slower over time?

Use LINQ to XML.

Check out this article over at MSDN.

XDocument doc = XDocument.Load(@"C:\file.xml");  // verbatim string, so the backslash is not treated as an escape

And don't forget that reading so many files will always be slow; you might try writing a multi-threaded program...

If I understood correctly, you don't want to open each XML file for a particular user because it's too slow, whether you are using LINQ to XML or some other method. Have you considered saving some values (tags) both in the XML file and in the relational database, together with the XML ID? In that case you could search for the values in the DB first and select only the XML files that contain the searched values.

For example:

ID         tagName1   tagName2
xmlDocID   value1     value2
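A minimal sketch of that lookup, assuming a table shaped like the example above (table, column, and variable names are all assumptions):

// Query the lookup table first; only the returned files ever need opening.
// Requires System.Data.SqlClient.
var matchingIds = new List<string>();
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "SELECT ID FROM XmlTagValues WHERE tagName1 = @term OR tagName2 = @term", conn))
{
    cmd.Parameters.AddWithValue("@term", searchTerm);
    conn.Open();
    using (var reader = cmd.ExecuteReader())
        while (reader.Read())
            matchingIds.Add(reader.GetString(0));
}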

My other question is: why have you chosen to store the XML documents in the file system? If you are using SQL Server 2005/2008, it has very good support for storing and searching through xml columns (and can even index values inside the XML).
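If the documents were stored in an xml column, a server-side filter could look roughly like this (table, column, and tag names are assumptions, not from the question):

// Hedged sketch: SQL Server's exist() method returns 1 when the XQuery
// matches, so only the matching document IDs ever leave the server.
const string sql = @"
    SELECT ID
    FROM   UserXmlDocuments
    WHERE  userID = @userId
      AND  DocXml.exist('//tagName[text() = ""searched value""]') = 1";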

Are you just looking for files that have a specific string somewhere in the content?

WARNING - Not a pure .NET solution. If this scares you, then stick with the other answers. :)

If that's what you're doing, another alternative is to get something like grep to do the heavy lifting for you. Shell out to it with the "-l" argument to specify that you are only interested in file names, and you are on to a winner. (For more usage examples, see this link.)
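For example, from .NET that could look roughly like this (grep must be available on the machine; the pattern and folder are placeholders):

// grep -l prints only the names of files that contain the pattern,
// -r recurses into the folder. Requires System.Diagnostics.
var psi = new ProcessStartInfo
{
    FileName = "grep",
    Arguments = "-l -r \"searchTerm\" \"/path/to/xml/folder\"",
    RedirectStandardOutput = true,
    UseShellExecute = false
};
using (var grep = Process.Start(psi))
{
    string output = grep.StandardOutput.ReadToEnd();  // one matching file path per line
    grep.WaitForExit();
    // strip each path down to the file name to recover the IDs
}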

LB has already made a valid point. This is a case where Lucene.Net (or any indexer) would be a must. It would give you steady (very fast) performance across all searches, and handling a very large amount of arbitrary data is one of the primary benefits of indexers.

Or is there any reason why you wouldn't use Lucene?

Lucene.NET (and Lucene) support incremental indexing. If you can re-open the index for reading every so often, you can keep adding documents to the index all day long; your searches will be up to date as of the last time you re-opened the index for searching.
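A hedged sketch against the classic Lucene.Net 3.x API; the index path and the field names ("id", "content") are assumptions:

// One entry per XML file: store the file ID, index the extracted text.
// Requires the Lucene.Net package (Lucene.Net.Documents, Lucene.Net.Index,
// Lucene.Net.Analysis.Standard, Lucene.Net.Store) plus System.IO.
var dir = FSDirectory.Open(new DirectoryInfo(@"C:\lucene-index"));
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
using (var writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
    var doc = new Document();
    doc.Add(new Field("id", fileId, Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("content", extractedText, Field.Store.NO, Field.Index.ANALYZED));
    writer.AddDocument(doc);
    writer.Commit();   // committed documents become visible to readers opened afterwards
}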
