(JAVA)Finding a substring in a string which is in UTF-8 encoded format

Question

Say we have a main string containing some UTF-8 text, and another string holding a single word, also in UTF-8. I want to check whether the word occurs in the main string. Please help me do this in Java. Thank you.

import java.awt.Component;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import javax.swing.JFileChooser;

public class Example {
    private static Component frame;

    public static void main(String args[]) throws FileNotFoundException, IOException {
        JFileChooser fc = new JFileChooser();
        int returnVal = fc.showOpenDialog(frame); // Where frame is the parent component

        File file = null;
        if (returnVal == JFileChooser.APPROVE_OPTION) {
            file = fc.getSelectedFile();
            // Now you have your file to do whatever you want to do
            String str = file.getName();
            str = "c:\\" + str;
            BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(str), "UTF8"));
            String line;
            String wordfname = "c:\\word.txt";
            BufferedReader innew = new BufferedReader(new InputStreamReader(new FileInputStream(wordfname), "UTF8"));
            String word;
            word = innew.readLine();
            System.out.println(word);
            File fileDir = new File("c:\\test.txt");
            Writer out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(fileDir), "UTF8"));
            while ((line = in.readLine()) != null) {
                System.out.println(line);
                out.append(line).append("\r\n");
                boolean r = line.contains(word);
                System.out.println(r);
            }
            out.flush();
            out.close();
            System.out.println(str);
        } else {
            // User did not choose a valid file
        }
    }
}

Links to the two files: https://www.dropbox.com/s/4ej0hii6gnlwtga/kannada.txt and https://www.dropbox.com/s/emncfr7bsi8mvwn/word.txt

Answer 1

In fact your code is fine, apart from a few Unicode details. Java's Reader, Writer and String all handle Unicode.

(Please close the readers too. The flush() before close() is not needed, because close() already flushes the writer.)
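
The remark about closing can be sketched with try-with-resources (Java 7 or later). This is only an illustration: the class `LineSearch` and the method `anyLineContains` are made-up names, and the in-memory `StringReader` stands in for the `InputStreamReader` over a file used in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class LineSearch {
    // Hypothetical helper: true if any line read from source contains word.
    static boolean anyLineContains(Reader source, String word) {
        // try-with-resources closes the reader even if readLine() throws.
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains(word)) {
                    return true;
                }
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In the question this would wrap a FileInputStream decoded as UTF-8.
        Reader demo = new StringReader("first line\ncaf\u00e9 au lait\n");
        System.out.println(anyLineContains(demo, "caf\u00e9")); // prints true
    }
}
```

With a real file, `new InputStreamReader(new FileInputStream(path), "UTF-8")` could be passed as the `source` argument.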

There is one thing to watch for: zero-width combining diacritical marks. Small c-circumflex, ĉ, is a single character in the Unicode table, code point U+0109, in Java "\u0109", but it can also be written as two Unicode code points: c followed by a zero-width combining circumflex, "c\u0302".

Java provides text normalization (java.text.Normalizer), which transforms both spellings into one specific form.

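// Requires: import java.text.Normalizer;
String cCircumflex = "\u0109"; // precomposed c-circumflex (one code point)
String cWithCircumflex = "c\u0302"; // c + combining circumflex (two code points)

String cx = Normalizer.normalize(cCircumflex, Normalizer.Form.NFKC);
String cx2 = Normalizer.normalize(cWithCircumflex, Normalizer.Form.NFKC);
assert cx.equals(cx2); // only checked when the JVM runs with -ea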

Which normalization form you choose is more or less irrelevant. Composition (...C) seems most natural (and gives better font rendering), but decomposition (...D) allows natural sorting such as "aäá...cĉ...eé...".
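
Applied to the substring search from the question, the idea would be to normalize both the text and the word to the same form before calling contains. A minimal sketch, assuming NFC as the common form; the class `NormalizedSearch` and the method `containsNormalized` are names invented for this example.

```java
import java.text.Normalizer;

public class NormalizedSearch {
    // Hypothetical helper: compare text and word in the same normalization form.
    static boolean containsNormalized(String text, String word) {
        String t = Normalizer.normalize(text, Normalizer.Form.NFC);
        String w = Normalizer.normalize(word, Normalizer.Form.NFC);
        return t.contains(w);
    }

    public static void main(String[] args) {
        String text = "la \u0109evalo";  // precomposed c-circumflex
        String word = "c\u0302evalo";    // c followed by a combining circumflex
        System.out.println(text.contains(word));            // prints false
        System.out.println(containsNormalized(text, word)); // prints true
    }
}
```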

You could even search for words with the diacritical marks removed (cafe versus café):

word = Normalizer.normalize(word, Normalizer.Form.NFKD); // Decompose.
word = word.replaceAll("\\p{M}", ""); // Remove diacriticals.
word = word.replaceAll("\\p{C}", ""); // Optional: invisible control characters.
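
Put together as a search, the same stripping has to be applied to both the text and the word. The sketch below uses the hypothetical names `FoldedSearch`, `stripMarks` and `containsIgnoringMarks`. One caveat that matters for the Kannada files in the question: in Indic scripts the dependent vowel signs and the virama are themselves combining marks, so removing \p{M} there changes the words rather than just dropping accents. The approach fits Latin-script text best.

```java
import java.text.Normalizer;

public class FoldedSearch {
    // Decompose, then drop every combining mark (Unicode category M).
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Hypothetical helper: substring search that ignores diacritical marks.
    static boolean containsIgnoringMarks(String text, String word) {
        return stripMarks(text).contains(stripMarks(word));
    }

    public static void main(String[] args) {
        // "cafe" matches the accented spelling once marks are removed.
        System.out.println(containsIgnoringMarks("un caf\u00e9 noir", "cafe")); // prints true

        // Kannada KA + vowel sign AA: the vowel sign is a mark and is removed too.
        System.out.println(stripMarks("\u0c95\u0cbe").length()); // prints 1
    }
}
```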

After running the original code

It works for me without any change (Java 8), though I had to put kannada.txt in C:\ . The output:

ಅದರಲ್ಲಿ
್ರಪಂಚದಲ್ಲಿ ಅನೇಕ ಮಾಧ್ಯಮಗಳು ಇದೆ. ಆಕಾಶವಾಣಿ, ದೂರದರ್ಶನ, ವಾರ್ತಾ ಪತ್ರಿಕೆ ಮುಂತಾದವು ಅದರಲ್ಲಿ ದೂರದರ್ಶನಪ ಪ್ರಮುಖವಾದ ಕಾರ್ಯವನ್ನು ಹೊಂದಿದ್ದು  ಅದನ್ನು ಚಿಕ್ಕವರಿಂದ ಹಿಡಿದು ದೊಡ್ಡವರವರೆಗೂ ನೋಡುತ್ತಾರೆ. ಇದಕ್ಕೆ ಇಂಗ್ಲೀಷ್‌ನಲ್ಲಿ ಟೆಲಿವಿಷನ್ ಎಂದು ಚಿಕ್ಕದಾಗಿ ಟಿ.ವಿ. ಎಂದು ಕರೆಯುವ ಬದಲು ಟಿ.ಕೆ. ಎಂದು  ಕರೆಯಬೇಕಾಗಿತ್ತು. ಏಕೆಂದರೆ ಇದು ಟೆಲಿವಿಷನ್ ಅಷ್ಟೇ ಅಲ್ಲ ಟೈಮ್ ಕಿಲ್ಲರ್ ಕೂಡ. ಇದನ್ನು ಪ್ರಮುಖವಾಗಿ ವಯಸ್ಸಾದವರು ನೋಡುತ್ತಾರೆ. ಆದರೆ ಕೆಲಸಕ್ಕೆ ಬಂದ  ಕೆಲಸದವರು ತಾವು ಕೆಲಸ ಮಾಡುವ ಬದಲು ಮನೆಯಲ್ಲಿ ಕುಳಿತು ನೋಡುತ್ತಾರೆ. 
true

false
ನನ್ನ ಪ್ರಕಾರ ಹೇಳಬೇಕಾದರೆ ಡಾಕ್ಷರ್‌ಗಳಿಗೆ ದುಡ್ಡು ಕೊಡುವ ಮಹಾಲಕ್ಷ್ಮಿ ಈ ಟಿ.ವಿ. 
false
c:\kannada.txt

Answer 2

A Java String is always a sequence of UTF-16 code units.

A byte[] technically has no encoding, but you can interpret it with one. So if you need UTF-8 encoded data, you need a byte[].

So my approach would be

byte[] text = str.getBytes("UTF-8"); // str is any String instance

to get a UTF-8 byte[].

IMHO, though, finding a substring in a String "which is UTF-8 encoded" makes no sense, because the String itself is always UTF-16 :)
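
To illustrate the point: the encoding only matters at the boundary between bytes and String. Once UTF-8 bytes are decoded, contains works on the decoded characters. A small sketch, with `EncodingDemo` and `containsDecoded` as made-up names:

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: decode both UTF-8 byte arrays, then search as Strings.
    static boolean containsDecoded(byte[] utf8Text, byte[] utf8Word) {
        String text = new String(utf8Text, StandardCharsets.UTF_8);
        String word = new String(utf8Word, StandardCharsets.UTF_8);
        return text.contains(word);
    }

    public static void main(String[] args) {
        byte[] text = "na\u00efve caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] word = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Four characters, but five bytes: the accented e takes two bytes in UTF-8.
        System.out.println(word.length);                  // prints 5
        System.out.println(containsDecoded(text, word));  // prints true
    }
}
```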

Answer 3

Thank you all for your help. Now I'm able to find the substring. It worked when I moved the word to the second line of word.txt and read it with a second readLine() call.
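
The thread does not say why moving the word to the second line helped. One plausible cause (an assumption, not confirmed above) is a UTF-8 byte order mark at the start of word.txt: Java's UTF-8 decoder does not remove it, so the first line read would begin with an invisible U+FEFF character and contains would fail. If that is the case, the first line could be cleaned as in this sketch, where `BomStripper` and `stripBom` are hypothetical names:

```java
public class BomStripper {
    // Remove a leading byte order mark (U+FEFF) if the line starts with one.
    static String stripBom(String line) {
        if (line != null && !line.isEmpty() && line.charAt(0) == '\uFEFF') {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        // Simulates the first readLine() result of a file saved with a BOM.
        String firstLine = "\uFEFFword";
        String text = "some word here";

        System.out.println(text.contains(firstLine));           // prints false
        System.out.println(text.contains(stripBom(firstLine))); // prints true
    }
}
```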

3 answers

solution1
1 2014-01-22 18:02:53

solution2
0 2014-01-22 16:56:05

solution3
0 2014-01-23 04:51:10
