Parsing text values from an XML file in Java

Question

I am currently using the SAX parser in Java to parse the "document.xml" file located inside a .docx file's archive. Below is a sample of what I am trying to parse.

Sample XML Document

<w:pStyle w:val="Heading2" /> 
  </w:pPr>
  <w:bookmarkStart w:id="0" w:name="_Toc258435889" /> 
  <w:bookmarkStart w:id="1" w:name="_Toc259085121" /> 
  <w:bookmarkStart w:id="2" w:name="_Toc259261685" /> 
  <w:r w:rsidRPr="00415FD6">
  <w:t>Text To Extract</w:t> 
  </w:r>
  <w:bookmarkEnd w:id="0" /> 
  <w:bookmarkEnd w:id="1" /> 
  <w:bookmarkEnd w:id="2" />

I already know how to read attribute values; that's not hard. However, I do not know how to get at the actual text inside the nodes. Does anyone have any ideas or prior experience with this? Thank you in advance.

Answer 1

Read this article on SAX parsing (it is old but still valid), and pay particular attention to how the characters method is implemented. It is very unintuitive and trips everybody up: you will get multiple calls to characters for what seems like no good reason.

The Java tutorial on SAX also has a short explanation of the characters method:

Parsers are not required to return any particular number of characters at one time. A parser can return anything from a single character at a time up to several thousand and still be a standard-conforming implementation. So if your application needs to process the characters it sees, it is wise to have the characters() method accumulate the characters in a java.lang.StringBuffer and operate on them only when you are sure that all of them have been found.

In your case (XML with no mixed content), that means storing the results of multiple characters() calls until the next call to endElement.
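The accumulate-until-endElement approach described above could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the class name RunTextExtractor and the extract helper are made up for the example, it matches on the qualified name "w:t" (which assumes the document uses the w prefix and that the parser is not namespace-aware, the default for SAXParserFactory), and it uses StringBuilder in place of the StringBuffer mentioned in the tutorial.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text content of every <w:t> element.
public class RunTextExtractor extends DefaultHandler {
    private final List<String> texts = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();
    private boolean insideText = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // Assumption: the element is written with the "w" prefix, as in the sample.
        if ("w:t".equals(qName)) {
            insideText = true;
            buffer.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one element, so only append here.
        if (insideText) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only now is the text of the element known to be complete.
        if ("w:t".equals(qName)) {
            texts.add(buffer.toString());
            insideText = false;
        }
    }

    // Hypothetical convenience method: parses an XML string and returns the collected texts.
    public static List<String> extract(String xml) throws Exception {
        RunTextExtractor handler = new RunTextExtractor();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.texts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:p xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">"
                + "<w:r w:rsidRPr=\"00415FD6\"><w:t>Text To Extract</w:t></w:r></w:p>";
        System.out.println(extract(xml));
    }
}
```

For the real file, the same handler could be passed to the parser along with the input stream of the "document.xml" entry instead of a string.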

Answer 2

See the characters() method of ContentHandler. Read the Javadoc carefully: you can get multiple calls where you might expect only one.

Answer 1: 3, accepted, 2011-07-05 20:17:59

Answer 2: 2, 2011-07-05 19:39:10
