I have an RTF file with lots of tables in it. I have been trying to use Java (POI and Tika) to extract the tables. This is easy enough in a .doc file, where the tables are defined as such. In an RTF file, however, there doesn't seem to be any 'this is a table' tag in the metadata. Does anyone know the best strategy for extracting a table from such a file? Would converting it to another file format help? Any clues for me to look up?
There is a Linux tool called unrtf; have a look at its manual page.
With it you can convert your RTF file to HTML:
unrtf --html your_input_file.rtf > your_output_file.html
Now you can use any HTML/XML API to work with the output and extract the tables easily. Is that enough for what you need?
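To illustrate the extraction step, here is a minimal sketch in plain Java. It assumes the converted HTML marks tables with ordinary `<table>`, `<tr>` and `<td>`/`<th>` elements and that tables are not nested; check the actual unrtf output first. `HtmlTableSketch` and `extractTables` are hypothetical names made up for this example. It uses regular expressions only to stay dependency-free, and it does not decode HTML entities; a real HTML parser such as jsoup would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of unrtf or any library.
final class HtmlTableSketch {
    private static final int FLAGS = Pattern.DOTALL | Pattern.CASE_INSENSITIVE;
    private static final Pattern TABLE = Pattern.compile("<table\\b[^>]*>(.*?)</table>", FLAGS);
    private static final Pattern ROW = Pattern.compile("<tr\\b[^>]*>(.*?)</tr>", FLAGS);
    private static final Pattern CELL = Pattern.compile("<t[dh]\\b[^>]*>(.*?)</t[dh]>", FLAGS);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns one list per table; each table is a list of rows, each row a list of cell texts.
    // Assumption: tables are not nested (a nested table would end the outer match early).
    static List<List<List<String>>> extractTables(String html) {
        List<List<List<String>>> tables = new ArrayList<>();
        Matcher table = TABLE.matcher(html);
        while (table.find()) {
            List<List<String>> rows = new ArrayList<>();
            Matcher row = ROW.matcher(table.group(1));
            while (row.find()) {
                List<String> cells = new ArrayList<>();
                Matcher cell = CELL.matcher(row.group(1));
                while (cell.find()) {
                    // Drop inline markup such as <b> or <font> and keep the text only.
                    cells.add(TAG.matcher(cell.group(1)).replaceAll("").trim());
                }
                rows.add(cells);
            }
            tables.add(rows);
        }
        return tables;
    }
}
```

You would read your_output_file.html into a String (for example with `java.nio.file.Files.readAllBytes`) and pass it to `extractTables`.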
Thanks hexin for your answer. In the end I was able to use Tika: I ran the file through the TXTParser and then put all the segments between bold tags (which is how my tables are separated) into an ArrayList. From there I had to use the tab separators to define the tables. Here is the code, without the part that extracts the tables based on tabs (still working on it):
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.txt.TXTParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TextParser {
    public static void main(final String[] args) throws IOException, TikaException {
        // A write limit of -1 disables the handler's default character limit
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();
        // Plain-text parser: reads the RTF source as raw text, control words included
        TXTParser txtParser = new TXTParser();
        // try-with-resources closes the stream even if parsing fails
        try (InputStream inputstream = new FileInputStream(new File("/Users/mydoc.rtf"))) {
            txtParser.parse(inputstream, handler, metadata, pcontext);
        } catch (SAXException e) {
            e.printStackTrace();
        }
        String s = handler.toString();
        // Capture each segment that starts with a bold header (\b\f1\fs24) and runs up to the next one
        Pattern pattern = Pattern.compile("(\\\\b\\\\f1\\\\fs24.+?\\\\par .+?)\\\\b\\\\f1\\\\fs24.*?\\{\\\\", Pattern.DOTALL);
        Matcher matcher = pattern.matcher(s);
        List<String> arr = new ArrayList<>();
        while (matcher.find()) {
            arr.add(matcher.group(1));
        }
        // Print by index; indexOf would report the wrong position for duplicate segments
        for (int i = 0; i < arr.size(); i++) {
            System.out.println("The array number is: " + i + " \n\n " + arr.get(i));
        }
    }
}
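For the tab step that is still missing, a rough sketch could look like the following. It rests on assumptions about the data that I have not verified: cells are separated either by literal tab characters or by the RTF control word `\tab`, and rows end with a line break or with `\par`. `TabTableSketch` and `toRows` are hypothetical names, and other control words (for example the `\b\f1\fs24` header) are left in the cell text untouched.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; adjust the delimiters to your actual data.
final class TabTableSketch {
    // Splits one captured segment into rows of cells.
    static List<List<String>> toRows(String segment) {
        // Map the RTF control words \tab and \par onto real tab and newline characters.
        // The lookahead keeps longer control words such as \pard from matching.
        String text = segment
                .replaceAll("\\\\tab(?![a-zA-Z]) ?", "\t")
                .replaceAll("\\\\par(?![a-zA-Z]) ?", "\n");
        List<List<String>> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            // Assumption: a line without any tab is a heading or plain text, not a table row.
            if (!line.contains("\t")) {
                continue;
            }
            List<String> cells = new ArrayList<>();
            // Limit -1 keeps trailing empty cells so every row keeps its column count.
            for (String cell : line.split("\t", -1)) {
                cells.add(cell.trim());
            }
            rows.add(cells);
        }
        return rows;
    }
}
```

Calling `TabTableSketch.toRows(name)` for each entry of `arr` would give one list of rows per segment.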