简体   繁体   中英

How to extract rtf tables

I have an rtf file. It has lots of tables in it. I have been trying to use java (POI and tika) to extract the tables. This is easy enough in a .doc where the tables are defined as such. However in a rtf file there doesn't seem to be any 'this is a table' tag as part of the meta data. Does anyone know what the best strategy is for extracting a table from such a file? Would converting it to another file format help. Any clues for me to look up?

There is a linux tool called unrtf, look at manual

With the app you can transform your rtf file into html:

unrtf --html your_input_file.rtf > your_output_file.html

Now you can use any programming api for manipulation of html/xml and extract tables easily. Is it enough you need?

Thanks hexin for your answer. In the end I was able to use Tika by using the TXTParser and then putting all the segments between bold tags(which is how my tables are separated) into an arraylist. I had to use the tab seperators to define tables from there. Here is the code without the bit to extract the tables based on tabs (still working on it):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.parser.rtf.RTFParser;
import org.apache.tika.parser.txt.TXTParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;


public class TextParser {
public static void main(final String[] args) throws IOException,TikaException{
 //detecting the file type
 BodyContentHandler handler = new BodyContentHandler(-1);
 Metadata metadata = new Metadata();

 FileInputStream inputstream = new FileInputStream(new File("/Users/mydoc.rtf"));
 ParseContext pcontext = new ParseContext();

 //Text document parser
 TXTParser TXTParser = new TXTParser();
 try {
     TXTParser.parse(inputstream, handler, metadata,pcontext);

} catch (SAXException e) {

    e.printStackTrace();
} 
 String s=handler.toString();

Pattern pattern = Pattern.compile("(\\\\b\\\\f1\\\\fs24.+?\\\\par .+?)\\\\b\\\\f1\\\\fs24.*?\\{\\\\",Pattern.DOTALL);

Matcher matcher = pattern.matcher(s);
ArrayList<String> arr= new ArrayList<String>();

while (matcher.find()) {
       arr.add(matcher.group(1));
     }

 for(String name : arr){
     System.out.println("The array number is: "+arr.indexOf(name)+" \n\n "+name);
 }

 }
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM