
Extract text between two strings from a Word document using Aspose.Words in C#

I have a Word document from which I need to extract a few lines of text. The text I need is located between two strings: "must haves" and "could haves". Does anyone know how I can achieve this?

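You can use IReplacingCallback to achieve what you need. For example, the following code prints the text found between <mytag> and </mytag>; replace the tags in the regular expression with your own start and end strings: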

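using System;
using System.Text.RegularExpressions;
using Aspose.Words;
using Aspose.Words.Replacing;

Document doc = new Document(@"C:\temp\in.docx");
FindReplaceOptions opt = new FindReplaceOptions();
// The callback is invoked once for every regex match found in the document.
opt.ReplacingCallback = new MyReplacingCallback();
// Group 1 captures the text between the opening and the closing tag.
Regex regex = new Regex(@"<mytag>(.*?)</mytag>");
doc.Range.Replace(regex, "", opt);

// Place this class after the statements above, or nest it in your own class.
class MyReplacingCallback : IReplacingCallback
{
    public ReplaceAction Replacing(ReplacingArgs args)
    {
        // Print the captured text, then skip the replacement so the document is not modified.
        Console.WriteLine(args.Match.Groups[1].Value);
        return ReplaceAction.Skip;
    }
}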
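
To adapt the pattern to the markers from the question, something like the following might work. This is only a sketch: MarkerRegex.Between is a hypothetical helper, it uses nothing but System.Text.RegularExpressions, and it has only been reasoned about against plain strings. Whether a single match can span several paragraphs inside doc.Range.Replace is an assumption you would need to check against your document.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MarkerRegex
{
    // Hypothetical helper: builds a regex whose group 1 captures
    // everything between two literal marker strings.
    public static Regex Between(string start, string end)
    {
        // Regex.Escape protects markers that contain regex metacharacters.
        // RegexOptions.Singleline lets '.' match newline characters as well.
        return new Regex(
            Regex.Escape(start) + "(.*?)" + Regex.Escape(end),
            RegexOptions.Singleline);
    }
}
```

The resulting object could then be passed to doc.Range.Replace in place of the <mytag> regex, e.g. MarkerRegex.Between("must haves", "could haves").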

Alternatively, use Tika to extract the text from the .docx file (NuGet package: https://www.nuget.org/packages/TikaOnDotNet.TextExtractor) and then take the substring between the two markers:

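var str = new TikaOnDotNet.TextExtraction.TextExtractor().Extract(@"C:\Users\Inconnu\Downloads\test.docx").Text;

// Start right after the first "must haves" and stop at the last "could haves".
int pFrom = str.IndexOf("must haves") + "must haves".Length;
int pTo = str.LastIndexOf("could haves");

// IndexOf and LastIndexOf return -1 when a marker is missing; validate both before calling Substring.
string result = str.Substring(pFrom, pTo - pFrom);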
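
The substring arithmetic above assumes both markers are present and in the right order. A more defensive variant could look like the sketch below. TextBetween.Extract is a hypothetical helper name (it is not part of Tika or Aspose.Words), and returning null for "not found" is just one possible convention.

```csharp
using System;

public static class TextBetween
{
    // Returns the text between the first 'start' and the last 'end'.
    // Returns null if 'start' is missing, or if 'end' is missing or
    // does not occur after the end of 'start'.
    public static string Extract(string text, string start, string end)
    {
        int startIdx = text.IndexOf(start, StringComparison.Ordinal);
        if (startIdx < 0) return null;

        int from = startIdx + start.Length;
        int to = text.LastIndexOf(end, StringComparison.Ordinal);
        if (to < from) return null; // also covers to == -1

        return text.Substring(from, to - from);
    }
}
```

With the extracted document text in str, the call would be TextBetween.Extract(str, "must haves", "could haves"), followed by a null check before using the result.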
        

