Extract bullets from a Word document using Aspose.Words in C#

I need to extract the text that has bullet styling from a Word document in C#. I am using the Aspose.Words library, but a solution with a different library is also welcome. I can already load documents and extract the text with Heading1 styling, but when I try the same with the bullet styling I get nothing.

I am using the code below to get the text with Heading1 styling and that works.

var heading1 = doc
    .GetChildNodes(NodeType.Paragraph, true)
    .Cast<Aspose.Words.Paragraph>()
    .ToArray()
    .Where(p => p.ParagraphFormat.StyleIdentifier == StyleIdentifier.Heading1);
    
foreach (var head1 in heading1)
{
    listBox11.Items.Add(head1.GetText());
}

I am trying to use the code below to get the text with bullet styling and this does NOT work.

var bullets = doc
    .GetChildNodes(NodeType.Paragraph, true)
    .Cast<Aspose.Words.Paragraph>()
    .ToArray()
    .Where(p => p.ParagraphFormat.StyleIdentifier == StyleIdentifier.ListBullet);
    
foreach (var bullet in bullets)
{
    listBox19.Items.Add(bullet.GetText());
}

I also tried the other ListBullet style identifiers (ListBullet2 through ListBullet5), but that does not fix the problem either.

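Most likely your code does not work because the bullets are not applied via a style. In an MS Word document, formatting can be applied at several levels: document defaults, theme, style, and direct formatting. If the bullets were applied as direct list formatting, the paragraph's StyleIdentifier is not ListBullet, so the filter matches nothing. In your case, I think the best approach is to check the ListFormat.IsListItem property instead.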
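As a rough sketch of that check, the console program below filters paragraphs by ListFormat.IsListItem rather than by style. The extra ListLevel.NumberStyle filter is my own addition, on the assumption that you want bulleted items only and not numbered ones. The class name and the command-line argument are hypothetical, and I have not run this against your document.

```csharp
using System;
using System.Linq;
using Aspose.Words;

class BulletExtractor
{
    static void Main(string[] args)
    {
        // Assumption: the path to the Word file is passed as the first argument.
        Document doc = new Document(args[0]);

        var bulletParagraphs = doc
            .GetChildNodes(NodeType.Paragraph, true)
            .Cast<Paragraph>()
            // True for any paragraph that belongs to a list, whether the list
            // was applied through a style or as direct formatting.
            .Where(p => p.ListFormat.IsListItem)
            // Keep bulleted levels only; numbered levels report other NumberStyle values.
            .Where(p => p.ListFormat.ListLevel.NumberStyle == NumberStyle.Bullet);

        foreach (Paragraph p in bulletParagraphs)
        {
            // GetText returns the paragraph text including the trailing
            // paragraph break character, so trim it.
            Console.WriteLine(p.GetText().Trim());
        }
    }
}
```

In a WinForms project you would add each string to the list box instead of writing it to the console, and you may need to write Aspose.Words.Paragraph in full if another Paragraph type is in scope.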

I am now using this to successfully extract the list items from a Word file and put them into a list box.

// Take the first file path from the list box.
string fileName = listBox1.Items.Cast<string>().FirstOrDefault();

// Open the document.
Document doc = new Document(fileName);

// Update the list labels for all list items in the document.
doc.UpdateListLabels();

NodeCollection paras = doc.GetChildNodes(NodeType.Paragraph, true);

// Visit every paragraph that is a list item, regardless of its style.
foreach (Aspose.Words.Paragraph paragraph in paras.OfType<Aspose.Words.Paragraph>().Where(p => p.ListFormat.IsListItem))
{
    // Export the paragraph as plain text and trim any paragraph formatting characters.
    string paragraphText = paragraph.ToString(SaveFormat.Text).Trim();

    // In this document the exported text starts with the bullet dot,
    // so drop the first two characters (the dot and the character after it).
    string bullet = paragraphText.Length >= 2 ? paragraphText.Remove(0, 2) : paragraphText;

    listBox19.Items.Add(bullet);
}
