How to get all text between two headings from a Word document (.doc or .docx) in C#

How can I get all the text between two headings, or the text under a specific heading? For example:

"Heading ABC"

"Heading XYZ"
This is the content under XYZ heading
Test..

"Sub heading or heading 2 of XYZ"
XYZ heading continue

"Heading 123"
Content under heading 123

I want to get all the content of the XYZ heading, including its sub heading, until the next heading (Heading 123) appears. How do I find that specific heading and then fetch all the content under it in C#? The file could be .doc or .docx.

You can use the NPOI library to read Word documents. Here is some sample code to get you started.

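using System.IO;
using System.Text;
using NPOI.HWPF;            // HWPFDocument (binary .doc format)
using NPOI.HWPF.Extractor;  // WordExtractor

public string ReadAllTextFromWordDocFile(string fileName)
{
    // Open the file as a raw byte stream; the using block closes it on exit.
    using (FileStream stream = File.OpenRead(fileName))
    {
        var document = new HWPFDocument(stream);
        var wordExtractor = new WordExtractor(document);
        var docText = new StringBuilder();
        // ParagraphText yields one string per paragraph, without style information.
        foreach (string text in wordExtractor.ParagraphText)
        {
            docText.AppendLine(text.Trim());
        }
        return docText.ToString();
    }
}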

Play around with it a little.
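The extractor above returns every paragraph as plain text, so the boundaries of the section still have to be found separately. A minimal sketch of that step, assuming each paragraph has already been tagged with a heading level (0 for body text, 1 for a top-level heading, 2 for a sub heading, and so on). `SectionExtractor`, `ExtractSection` and the tuple layout are hypothetical names for illustration and are not part of NPOI:

```csharp
using System;
using System.Collections.Generic;

public static class SectionExtractor
{
    // Returns the paragraphs that follow the heading named headingText,
    // stopping at the next heading of the same or a higher level.
    // HeadingLevel: 0 = body text, 1 = "Heading 1", 2 = "Heading 2", ...
    public static List<string> ExtractSection(
        IEnumerable<(string Text, int HeadingLevel)> paragraphs,
        string headingText)
    {
        var result = new List<string>();
        int sectionLevel = 0; // 0 means the target heading has not been seen yet

        foreach (var (text, level) in paragraphs)
        {
            if (sectionLevel == 0)
            {
                // Match the heading text, ignoring case and surrounding whitespace.
                if (level > 0 && string.Equals(text.Trim(), headingText.Trim(),
                        StringComparison.OrdinalIgnoreCase))
                {
                    sectionLevel = level;
                }
                continue; // the heading itself is not part of the content
            }

            // A heading at the same or a higher level (smaller number) ends the section.
            if (level > 0 && level <= sectionLevel)
            {
                break;
            }

            result.Add(text); // body text and deeper sub headings are kept
        }

        return result;
    }
}
```

With the layout from the question, calling `ExtractSection` with "Heading XYZ" would keep the two body lines, the sub heading and its continuation line, and stop at "Heading 123", provided the sub heading is tagged as level 2 and the other headings as level 1.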

You may also want to take a look at DocX; basic examples are here. The MagicText property of every paragraph might help you identify titles.

private void DocReader(string fileLocation, string headingText, string headingStyle)
{
    Microsoft.Office.Interop.Word.Application word = new Microsoft.Office.Interop.Word.Application();
    Microsoft.Office.Interop.Word.Document docs = null;
    string totaltext = "";
    try
    {
        object miss = System.Reflection.Missing.Value;
        object path = fileLocation;
        object readOnly = true;
        docs = word.Documents.Open(ref path, ref miss, ref readOnly, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss);

        bool inSection = false; // true while inside the requested heading's section

        // Word collections are 1-based, so the last paragraph sits at index Count.
        int paraCount = docs.Paragraphs.Count;
        for (int i = 1; i <= paraCount; i++)
        {
            Microsoft.Office.Interop.Word.Paragraph para = docs.Paragraphs[i];
            string text = para.Range.Text;
            Microsoft.Office.Interop.Word.Style style = para.get_Style() as Microsoft.Office.Interop.Word.Style;

            // Every paragraph in headingStyle starts a new section; it is the
            // wanted one only if its text matches headingText (case-insensitive).
            if (style != null && style.NameLocal.Equals(headingStyle))
            {
                inSection = string.Equals(text.TrimEnd('\r'), headingText, System.StringComparison.OrdinalIgnoreCase);
            }

            // Collects the heading itself and everything below it, including sub
            // headings in other styles, until the next headingStyle paragraph.
            if (inSection)
                totaltext += " \r\n " + text;
        }
    }
    finally
    {
        // Release the document and the Word process even if reading fails.
        if (docs != null) docs.Close();
        word.Quit();
    }
    if (totaltext == "") { totaltext = "No such data found!"; }
    richTextBox1.Text = totaltext;
}
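
The method above compares each paragraph's style against a single style name. If a numeric heading level is needed instead, for example to tell a "Heading 1" from a "Heading 2", one option is to parse the level out of the style name. This is only a sketch under an assumption: it expects the English built-in names "Heading 1" through "Heading 9", while NameLocal returns localized names on non-English Word installations. `HeadingStyles` and `GetHeadingLevel` are hypothetical names, not part of the interop API:

```csharp
using System;
using System.Globalization;

public static class HeadingStyles
{
    // Returns 1..9 for "Heading 1".."Heading 9", or 0 for any other style name.
    // Assumption: English built-in style names.
    public static int GetHeadingLevel(string styleName)
    {
        const string prefix = "Heading ";
        if (styleName == null ||
            !styleName.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
        {
            return 0;
        }

        // NumberStyles.None accepts digits only, so "Heading 1 Char" is rejected.
        string suffix = styleName.Substring(prefix.Length);
        int level;
        if (int.TryParse(suffix, NumberStyles.None, CultureInfo.InvariantCulture, out level)
            && level >= 1 && level <= 9)
        {
            return level;
        }

        return 0;
    }
}
```

Inside the loop, `HeadingStyles.GetHeadingLevel(style.NameLocal)` could then replace the exact style-name comparison when sections should end at any heading of the same or a higher level.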
