
Extract annotated text from a text file using Java

I have an annotated text file in the following format:

<paragraph><weakness>Buffer</weakness> <weakness>Overflow</weakness>
in <location>client/mysql.cc</location> in <application>Oracle</application> 
<application>MySQL</application> and <application>MariaDB</application> 
<version>before</version> <version>5.2</version> <vulnerability>allows
</vulnerability> <vulnerability>remote</vulnerability> 
<application>database</application> <application>servers</application> 
...
...

I would like to write Java code that parses the above text file and outputs it in the following format:

Buffer  weakness
overflow  weakness
in   O <--- 'O' means the word has no annotation
Oracle  application
MySQL   application
...
...

I tried to tokenize the file, but then I would have to parse and format the tokens a second time, and I could lose useful information along the way.

Any help would be appreciated.

You can use an XML parser library such as dom4j or XOM to parse your file.

You can also use the Java XPath API (javax.xml.xpath), included in the JDK since version 1.5, to extract content from the XML if you know the XPath of the elements you are looking for. For example, to extract all weakness elements you can use the XPath /paragraph/weakness

Choose the library that suits your purpose the most.
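
To make the XPath suggestion concrete, here is a minimal sketch that uses only the JDK's javax.xml.xpath package. The class name WeaknessXPath and the helper select are made up for illustration, and the sample input is a shortened copy of the question's text with a closing </paragraph> added (an assumption, since the file in the question is truncated).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class WeaknessXPath {

    /** Evaluates an XPath expression against an XML string and returns the text of each matched node. */
    static List<String> select(String xml, String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample: shortened from the question, with the root element closed.
        String xml = "<paragraph><weakness>Buffer</weakness> <weakness>Overflow</weakness> "
                + "in <location>client/mysql.cc</location></paragraph>";
        System.out.println(select(xml, "/paragraph/weakness")); // prints [Buffer, Overflow]
    }
}
```

Note that selecting by tag name returns one list per tag. It does not by itself preserve the original word order or the untagged words such as "in", so it covers only part of the requested output.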

Split each line of your text on whitespace into a String array. Then, for each string in the array, check whether it contains a "<" character: if it does, parse it with XPath; otherwise print the value followed by O, as you need.

...
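// readLine holds the current line of the input file
String[] tokens = readLine.trim().split("\\s+");
for (String token : tokens) {
  if (token.isEmpty()) {
    continue; // the line was blank
  }
  if (token.contains("<")) {
    // XPath parsing
  } else {
    // no tag: the word is not annotated
    System.out.println(token + " O");
  }
}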
...
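
The "XPath parsing" branch above is left open, and splitting on whitespace first can cut a tag in two when an element spans a line break (as <vulnerability>allows ... </vulnerability> does in the question). A different technique, sketched here rather than a completion of the loop above, is to parse the whole paragraph with the JDK's built-in DOM parser and walk the children of the root element in order. The class name AnnotationExtractor is hypothetical, and the sketch assumes the input is well-formed XML with a single <paragraph> root whose tagged elements are not nested.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class AnnotationExtractor {

    /** Returns one "word<TAB>tag" entry per word, in document order; untagged words get "O". */
    static List<String> extract(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> rows = new ArrayList<>();
        // Assumption: the root element is <paragraph> and annotations are its direct children.
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            String tag;
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                tag = child.getNodeName();   // e.g. "weakness"
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                tag = "O";                   // text directly under <paragraph>
            } else {
                continue;                    // skip comments and other node types
            }
            for (String word : child.getTextContent().trim().split("\\s+")) {
                if (!word.isEmpty()) {       // whitespace-only nodes yield one empty string
                    rows.add(word + "\t" + tag);
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample: shortened from the question, with the root element closed.
        String xml = "<paragraph><weakness>Buffer</weakness> <weakness>Overflow</weakness>\n"
                + "in <location>client/mysql.cc</location> in <application>Oracle</application>\n"
                + "<application>MySQL</application></paragraph>";
        extract(xml).forEach(System.out::println);
    }
}
```

If the real file contains several <paragraph> elements without a common root, it would first need to be wrapped in one root element, since an XML document can have only one.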

If the file is indeed well-formed XML (with balanced tags, all & characters escaped as &amp;, etc.), then this is pretty straightforward with an XSLT 2.0 transformation:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="text" />
  <!-- ignore text nodes that are _entirely_ whitespace -->
  <xsl:strip-space elements="*" />

  <xsl:template match="/">
    <xsl:apply-templates select="//paragraph//text()" />
  </xsl:template>

  <xsl:template match="text()">
    <!-- name of the element that contains this text node -->
    <xsl:param name="tag" select="local-name(..)"/>
    <!-- for each word in the text node -->
    <xsl:for-each select="tokenize(normalize-space(), ' ')">
      <!-- word-TAB-tag-NL -->
      <xsl:value-of select="concat(., '&#9;', $tag, '&#10;')" />
    </xsl:for-each>
  </xsl:template>

  <!-- special case for text nodes directly under <paragraph> - use "O" -->
  <xsl:template match="paragraph/text()">
    <xsl:next-match>
      <xsl:with-param name="tag" select="'O'" />
    </xsl:next-match>
  </xsl:template>

</xsl:stylesheet>

You can run this from Java using Saxon 9 HE.
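
As a rough sketch of what running it from Java can look like, the standard JAXP API (javax.xml.transform) is enough. The class name RunStylesheet and the command-line file arguments are hypothetical. The sketch assumes the Saxon HE jar is on the classpath so that TransformerFactory.newInstance() resolves to Saxon. The XSLT processor built into the JDK supports only XSLT 1.0 and cannot run the stylesheet above.

```java
import java.io.File;
import java.io.StringWriter;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, not part of Saxon or the JDK.
class RunStylesheet {

    /** Applies the stylesheet to the input and returns the transformation result as a string. */
    static String transform(Source stylesheet, Source input) throws TransformerException {
        // Assumption: with Saxon on the classpath, JAXP picks Saxon's factory here.
        // If it does not, the system property javax.xml.transform.TransformerFactory
        // can be set to net.sf.saxon.TransformerFactoryImpl.
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer(stylesheet);
        StringWriter out = new StringWriter();
        transformer.transform(input, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the stylesheet, args[1]: path to the annotated XML file
        // (both file names are supplied by the caller).
        System.out.print(transform(new StreamSource(new File(args[0])),
                                   new StreamSource(new File(args[1]))));
    }
}
```

Going through JAXP rather than Saxon's own classes keeps the code free of compile-time dependencies on a particular processor; the processor is chosen at run time from the classpath.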

