简体   繁体   中英

How to parse text from MS Word document to string

I am trying to find a way to parse a word document's text to a string in my project.I have more than 600 word(.doc) files that I need to get the text content(with the new lines and tabs if possible) and assign it to a string for each one.

I've been reading stuff about the Open XML SDK but it looks quite complicated for something that looks so simple.

Open XML SDK is only for 2007 and newer formats and it is not trivial to use.

If performance is not an issue you could use Word Automation and have Word do this for you. It will look something like this:

var app = new Application();
var doc = app.Documents.Open(documentLocation);

string rangeText = doc.Range().Text;

doc.Save();
doc.Close();

Marshal.ReleaseComObject(doc);    
Marshal.ReleaseComObject(app);

Take a look at http://www.codeproject.com/Articles/18703/Word-2007-Automation or http://www.codeproject.com/Articles/21247/Word-Automation for more complete examples and instructions. Note that this may become a bit more tricky if your documents are move complex (footnotes, text boxes, tables...).

Another option is have word save the document as a text and then read the text file. Take a look at this - http://msdn.microsoft.com/en-us/library/microsoft.office.tools.word.document.saveas(v=vs.80).aspx

You could give a look at NPOI :

This project is the .NET version of POI Java project at http://poi.apache.org/ . POI is an open source project which can help you read/write xls, doc, ppt files. It has a wide application.

Take a look at this previous SO thread for more information.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM