
Java: Extracting text from a Korean RTF file

I am extracting text from an RTF EULA file to display to a user in a JSP page. The text in the RTF file is along the lines of:

적용 범위. 본 최종 

However, when I extract the text and print it to the console, I get garbled output like this:

Àû¿ë ¹üÀ§. º» ÃÖ

I believe it has to do with the encoding, but files that contain English, Spanish, and Russian characters work fine. Why is it displaying these odd characters, and how do I get the expected output?

private static String rtfToHtml(Reader rtf, String contentType) throws IOException
{
    final JEditorPane p = new JEditorPane();
    p.setContentType("text/rtf");
    EditorKit kitRtf = p.getEditorKitForContentType("text/rtf");
    try
    {
        kitRtf.read(rtf, p.getDocument(), 0);
        kitRtf = null;
        final EditorKit kitHtml = p.getEditorKitForContentType(contentType);
        final Writer writer = new StringWriter();
        // writer.write("Content-Type: text/plain; charset=utf-8\n\n");
        kitHtml.write(writer, p.getDocument(), 0, p.getDocument().getLength());
        // Utf-8 encoding the string

        return writer.toString();
    }
    catch (final BadLocationException e)
    {
        e.printStackTrace();
    }
    return null;
}

public static String extractEulaToPlain(String eulaDocumentLocation) throws FileNotFoundException, IOException
{
    final FileInputStream is = new FileInputStream(eulaDocumentLocation);
    final InputStreamReader isr = new InputStreamReader(is, "UTF-8");
    final BufferedReader buffReader = new BufferedReader(isr);

    final String plain = rtfToHtml(buffReader, "text/plain");

EDIT: sample RTF file

  {\rtf1\ansi\ansicpg1252\deff0\nouicompat\deflang1033\deflangfe3079{\fonttbl{\f0\fswiss\fprq2\fcharset0 Calibri;}{\f1\froman\fprq2\fcharset129 Batang;}{\f2\fnil\fcharset0 Malgun Gothic Bold;}{\f3\fswiss\fprq2\fcharset129 Malgun Gothic;}{\f4\froman\fprq2\fcharset0 Times New Roman;}{\f5\fnil\fcharset0 Calibri;}}
{\colortbl ;\red0\green0\blue0;}
{\*\generator Riched20 6.3.9600}\viewkind4\uc1 
\pard\nowidctlpar\cf1\f0\fs17\lang1042 1.\b\f1\'c0\'fb\'bf\'eb\f2  \f1\'b9\'fc\'c0\'a7\b0\f2 . \f3\'ba\'bb \'c3\'d6\'c1\'be \'bb\'e7\'bf\'eb\'c0\'da \'b6\'f3\'c0\'cc\'bc\'be\'bd\'ba \'b0\'e8\'be\'e0(\'c0\'cc\'c7\'cf "\'b0\'e8\'be\'e0")\'c0\'ba \'b5\'bf\'ba\'c0\'b5\'c8 \'bc\'d2\'c7\'c1\'c6\'ae\'bf\'fe\'be\'ee\'c0\'c7 \'bb\'e7\'bf\'eb\'bf\'a1 \'c0\'fb\'bf\'eb\'b5\'c7\'b8\'e7\f4 ,\cf0\fs24\par
\cf1\f3\fs17\'b1\'cd\'c7\'cf\'bf\'cd\'b9\'d7 \'c0\'da\'c8\'b8\'bb\'e7(\'c3\'d1\'c4\'aa\'b0\'a3\'bf\'a1 \'c3\'bc\'b0\'e1\'b5\'c8 \'ba\'b0\'b5\'b5 \'b0\'e8\'be\'e0\'c0\'c7\cf0\f4\fs24\par
\cf1\f3\fs17\'b1\'b8\'bc\'d3\'c0\'bb \'b9\'de\'b4\'c2 \'b0\'e6\'bf\'ec\'b4\'c2 \'c1\'a6\'bf\'dc\'b5\'cb\'b4\'cf\'b4\'d9. \'b1\'cd\'c7\'cf\'b0\'a1 \'bc\'d2\'c7\'c1\'c6\'ae\'bf\'fe\'be\'ee\'b8\'a6 \'b4\'d9\'bf\'ee\'b7\'ce\'b5\'e5\'c7\'cf\'b0\'c5\'b3\'aa, \'ba\'b9\'bb\'e7\'c7\'cf\'b0\'c5\'b3\'aa, \'bb\'e7\'bf\'eb\'c7\'cf\'b4\'c2\cf0\f4\fs24\par
\cf1\f3\fs17\'b0\'e6\'bf\'ec \'ba\'bb \'b0\'e8\'be\'e0\'bf\'a1 \'b5\'bf\'c0\'c7\'c7\'cf\'b4\'c2 \'b0\'cd\'c0\'b8\'b7\'ce \'b0\'a3\'c1\'d6\'b5\'cb\'b4\'cf\'b4\'d9. HPE\'b4\'c2 \'ba\'bb \'b0\'e8\'be\'e0\'c0\'bb \'bf\'b5\'be\'ee \'c0\'cc\'bf\'dc\'c0\'c7 \'c6\'af\'c1\'a4 \'be\'f0\'be\'ee\'b7\'ce\cf0\f4\fs24\par
\cf1\f3\fs17\'b9\'f8\'bf\'aa\'c7\'cf\'bf\'a9 \'b4\'d9\'c0\'bd \'c0\'a7\'c4\'a1\'bf\'a1\'bc\'ad \'c1\'a6\'b0\'f8\'c7\'d5\'b4\'cf\'b4\'d9\cf0\f5\fs22\lang9\par
}

I used RTF Parser Kit to perform the conversion. Here's the converted text from your sample RTF file:

1.적용 범위. 본 최종 사용자 라이센스 계약(이하 "계약")은 동봉된 소프트웨어의 사용에 적용되며, 귀하와및 자회사(총칭간에 체결된 별도 계약의 구속을 받는 경우는 제외됩니다. 귀하가 소프트웨어를 다운로드하거나, 복사하거나, 사용하는 경우 본 계약에 동의하는 것으로 간주됩니다. HPE는 본 계약을 영어 이외의 특정 언어로 번역하여 다음 위치에서 제공합니다

That certainly looks more promising than the output you were getting!
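As for why the original output was garbled: the symptoms are consistent with the Korean bytes being decoded with the document-level code page (\ansicpg1252) rather than the Korean charset declared on the fonts (\fcharset129). I have not traced this through the Swing RTF reader itself, so treat that as a likely explanation rather than a confirmed one. The sketch below uses only the JDK to illustrate the effect: it pulls the \'xx escapes out of a fragment of your sample and decodes the same bytes both ways. The class name RtfHexEscapeDemo and the helper decodeEscapes are made up for this illustration, and it assumes the EUC-KR charset is available in your JDK (it is in standard JDK builds). It is not an RTF parser.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Swing or RTF Parser Kit.
class RtfHexEscapeDemo {

    // Matches one RTF hex escape such as \'c0 and captures the two hex digits.
    private static final Pattern HEX_ESCAPE = Pattern.compile("\\\\'([0-9a-fA-F]{2})");

    // Collects the byte value of every \'xx escape in the fragment,
    // then decodes the resulting byte sequence with the given charset.
    // Anything that is not a hex escape is ignored.
    public static String decodeEscapes(String rtfFragment, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Matcher m = HEX_ESCAPE.matcher(rtfFragment);
        while (m.find()) {
            bytes.write(Integer.parseInt(m.group(1), 16));
        }
        return new String(bytes.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // First Korean word of the sample file: \'c0\'fb\'bf\'eb
        String fragment = "\\'c0\\'fb\\'bf\\'eb";

        // Decoded one byte per character, as the document-level code page suggests.
        String asCp1252 = decodeEscapes(fragment, Charset.forName("windows-1252"));

        // Decoded as two-byte Korean characters, as the font charset suggests.
        String asKorean = decodeEscapes(fragment, Charset.forName("EUC-KR"));

        System.out.println(asCp1252);
        System.out.println(asKorean);
    }
}
```

Decoded as windows-1252, the four bytes become the four characters at the start of your garbled output; decoded as EUC-KR, they become the first two Hangul syllables of the expected text. What actually shows up in a console also depends on the console's own encoding. This would also explain why English, Spanish, and Russian files worked: their text was presumably not stored as multi-byte sequences under a charset that differs from the one the reader assumed.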

RTF Parser Kit can work directly with streams:

new StreamTextConverter().convert(new RtfStreamSource(inputStream), outputStream, "UTF-8");

Or, as a convenience, there is a converter that returns the output as a string:

StringTextConverter converter = new StringTextConverter();
converter.convert(new RtfStreamSource(inputStream));
String extractedText = converter.getText();
