
Extract annotated text from a text file using Java code

I have an annotated text file in the following format:

<paragraph><weakness>Buffer</weakness> <weakness>Overflow</weakness>
in <location>client/mysql.cc</location> in <application>Oracle</application> 
<application>MySQL</application> and <application>MariaDB</application> 
<version>before</version> <version>5.2</version> <vulnerability>allows
</vulnerability> <vulnerability>remote</vulnerability> 
<application>database</application> <application>servers</application> 
...
...

What I would like to do is write Java code that parses the above text file and outputs it in the following format:

Buffer  weakness
overflow  weakness
in   O <--- 'O' means the word has no annotation
Oracle  application
MySQL   application
...
...

I tried tokenizing the file, but the problem is that I would then have to parse and format the tokens again, and I could lose some useful information along the way.

Any help would be appreciated.

You can use an XML parser to parse your XML, e.g. dom4j or XOM.

You can also use the Java XPath library provided in JDK 1.5 and higher to extract content from XML, if you know the XPath for the elements you are looking for. For example, to extract all weaknesses you can use the following XPath: /paragraph/weakness

Choose the library that best suits your purpose.
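As a rough illustration of the parser and XPath suggestions above, here is a minimal sketch that uses only the JDK's built-in DOM parser and javax.xml.xpath rather than dom4j or XOM. It assumes the file is well-formed XML with a single <paragraph> root whose children are either annotated elements or plain text. The class and method names (XPathAnnotationDemo, select, toTaggedLines) are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class XPathAnnotationDemo {

    // Parses an XML string with the JDK's built-in DOM parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Returns the trimmed text of every node selected by the XPath expression.
    static List<String> select(Document doc, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    // Walks the root's direct children in document order and emits
    // one "word<TAB>tag" line per word; plain text gets the tag "O".
    static String toTaggedLines(Document doc) {
        StringBuilder out = new StringBuilder();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            String tag;
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                tag = child.getNodeName();
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                tag = "O";
            } else {
                continue; // skip comments, processing instructions, etc.
            }
            for (String word : child.getTextContent().trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.append(word).append('\t').append(tag).append('\n');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<paragraph><weakness>Buffer</weakness> <weakness>Overflow</weakness>\n"
                + "in <location>client/mysql.cc</location> in "
                + "<application>Oracle</application></paragraph>";
        Document doc = parse(xml);
        System.out.println(select(doc, "/paragraph/weakness"));
        System.out.print(toTaggedLines(doc));
    }
}
```

The XPath query is enough when only one kind of annotation is needed. The child walk is what keeps the original word order, including the unannotated words.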

Split your text on whitespace into a String array, then check each string in the array for a "<" character. If one is found, parse that token with XPath; otherwise write out the value followed by O, as you need.

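...
// split on runs of whitespace so repeated spaces do not yield empty tokens
String[] tokens = readLine.split("\\s+");
for (String token : tokens) {
  if (token.contains("<")) {
    // annotated token: XPath parsing
  } else {
    // plain word: no annotation
    System.out.println(token + " O");
  }
}
...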
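One caveat with reading line by line is that an annotation in the sample spans a line break (<vulnerability>allows followed by </vulnerability> on the next line). The sketch below is a variation on the tokenizing idea that swaps in a regular expression for the XPath step and scans the whole text at once. It assumes tags are never nested inside an annotation and carry no attributes. The class name RegexAnnotationTokenizer is invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class RegexAnnotationTokenizer {

    private static final Pattern TOKEN = Pattern.compile(
            "<(\\w+)>([^<]*)</\\1>"  // group 1 = tag, group 2 = annotated text
          + "|</?\\w+>"              // a lone tag such as <paragraph>: ignored
          + "|([^\\s<]+)");          // group 3 = plain, unannotated word

    // Emits one "word<TAB>tag" line per word; plain words get the tag "O".
    static String toTaggedLines(String text) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (m.group(1) != null) {
                // The annotated text may contain several words or a line break.
                for (String word : m.group(2).trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        out.append(word).append('\t').append(m.group(1)).append('\n');
                    }
                }
            } else if (m.group(3) != null) {
                out.append(m.group(3)).append("\tO\n");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "<paragraph><weakness>Buffer</weakness> <weakness>Overflow</weakness>\n"
                + "in <location>client/mysql.cc</location> <vulnerability>allows\n"
                + "</vulnerability> <vulnerability>remote</vulnerability>";
        System.out.print(toTaggedLines(text));
    }
}
```

Because it does not require a closing </paragraph>, this approach also tolerates input that is not well-formed XML, at the cost of ignoring anything the pattern does not recognize.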

If the file is indeed well-formed XML (with balanced tags, all & characters escaped as &amp;, etc.), then this is pretty straightforward with an XSLT 2.0 transformation:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="text" />
  <!-- ignore text nodes that are _entirely_ whitespace -->
  <xsl:strip-space elements="*" />

  <xsl:template match="/">
    <xsl:apply-templates select="//paragraph//text()" />
  </xsl:template>

  <xsl:template match="text()">
    <!-- name of the element that contains this text node -->
    <xsl:param name="tag" select="local-name(..)"/>
    <!-- for each word in the text node -->
    <xsl:for-each select="tokenize(normalize-space(), ' ')">
      <!-- word-TAB-tag-NL -->
      <xsl:value-of select="concat(., '&#9;', $tag, '&#10;')" />
    </xsl:for-each>
  </xsl:template>

  <!-- special case for nodes directly under <paragraph> - use "O" -->
  <xsl:template match="paragraph/text()">
    <xsl:next-match>
      <xsl:with-param name="tag" select="'O'" />
    </xsl:next-match>
  </xsl:template>

</xsl:stylesheet>

You can run this from Java using Saxon 9 HE.
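A minimal sketch of running a stylesheet from Java through the standard JAXP API follows. It assumes the Saxon HE jar is on the classpath, so that TransformerFactory.newInstance() resolves to Saxon. The JDK's built-in processor only implements XSLT 1.0 and would not accept tokenize() or xsl:next-match. The class name RunStylesheet and the file names annotate.xsl and input.xml are placeholders.

```java
import java.io.File;
import java.io.StringWriter;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, not part of any library.
class RunStylesheet {

    // Applies the stylesheet to the input and returns the result as a string.
    static String transform(Source stylesheet, Source input) {
        try {
            // Assumption: Saxon is on the classpath and is picked up here.
            // If several processors are present, the system property
            // javax.xml.transform.TransformerFactory can select one.
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer = factory.newTransformer(stylesheet);
            StringWriter out = new StringWriter();
            transformer.transform(input, new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder file names; pass real paths as arguments.
        String xsl = args.length > 0 ? args[0] : "annotate.xsl";
        String xml = args.length > 1 ? args[1] : "input.xml";
        System.out.print(transform(new StreamSource(new File(xsl)),
                                   new StreamSource(new File(xml))));
    }
}
```

Going through JAXP keeps the calling code independent of the processor, so the same class also works with XSLT 1.0 stylesheets on a plain JDK.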
