
(JAVA) Finding a substring in a string which is in UTF-8 encoded format

Say we have a main string containing some UTF-8 text, and another string holding a single word, also in UTF-8. How can I check in Java whether the word occurs in the main string? Thank you.

import java.awt.Component;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import javax.swing.JFileChooser;

public class Example {
    private static Component frame;

    public static void main(String args[]) throws FileNotFoundException, IOException {
        JFileChooser fc = new JFileChooser();
        int returnVal = fc.showOpenDialog(frame); // Where frame is the parent component

        File file = null;
        if (returnVal == JFileChooser.APPROVE_OPTION) {
            file = fc.getSelectedFile();
            // Now you have your file to do whatever you want to do
            String str = file.getName();
            str = "c:\\" + str; // only the file name is kept, so the file must be in c:\
            BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(str), "UTF8"));
            String line;
            String wordfname = "c:\\word.txt";
            BufferedReader innew = new BufferedReader(new InputStreamReader(new FileInputStream(wordfname), "UTF8"));
            String word;
            word = innew.readLine(); // the word to search for: first line of word.txt
            System.out.println(word);
            File fileDir = new File("c:\\test.txt");
            Writer out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(fileDir), "UTF8"));
            while ((line = in.readLine()) != null) {
                System.out.println(line);
                out.append(line).append("\r\n");
                boolean r = line.contains(word);
                System.out.println(r);
            }
            out.flush();
            out.close();
            System.out.println(str);
        } else {
            // User did not choose a valid file
        }
    }
}

Links to the two files: https://www.dropbox.com/s/4ej0hii6gnlwtga/kannada.txt and https://www.dropbox.com/s/emncfr7bsi8mvwn/word.txt

In fact you did everything fine, apart from some UTF-8 details. Java's Reader, Writer and String all handle Unicode.

(Please close the readers too; calling flush before close is not needed.)
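
A minimal sketch of that advice, using try-with-resources so that the reader and the writer are both closed automatically (closing a writer also flushes it). The class name `Utf8LineSearch`, the method `copyAndCount` and its parameters are hypothetical names chosen for illustration; they are not from the question.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8LineSearch {
    // Copies every line of `in` to `out` as UTF-8 (CRLF line ends, as in the question)
    // and returns the number of lines that contain `word`.
    static int copyAndCount(Path in, Path out, String word) throws IOException {
        int matches = 0;
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.append(line).append("\r\n");
                if (line.contains(word)) {
                    matches++;
                }
            }
        } // reader and writer are closed here; close() flushes the writer
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical demo data written to temporary files.
        Path in = Files.createTempFile("in", ".txt");
        Path out = Files.createTempFile("out", ".txt");
        Files.write(in, "un caf\u00e9\nabc\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(copyAndCount(in, out, "caf\u00e9"));
    }
}
```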

There is one thing: zero-width combining diacritical marks. Small c-circumflex, ĉ, is one character in the Unicode table, code point U+0109, in Java "\u0109", but it can also be written as two Unicode code points: c plus a zero-width ^, "c\u0302".

Java offers text normalization, which transforms both spellings into one specific form.

String cCircumflex = "\u0109"; // c^
String cWithCircumflex = "c\u0302"; // c^

String cx = Normalizer.normalize(cCircumflex, Normalizer.Form.NFKC);
String cx2 = Normalizer.normalize(cWithCircumflex, Normalizer.Form.NFKC);
assert cx.equals(cx2);
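
Applied to the substring search from the question, the same idea could look roughly like this. It is only a sketch: `NormalizedContains` and `containsNormalized` are made-up names, and NFC is used here instead of NFKC, which is an arbitrary choice for the example.

```java
import java.text.Normalizer;

class NormalizedContains {
    // True if `word` occurs in `text` once both are brought to the same normal form (NFC).
    static boolean containsNormalized(String text, String word) {
        String t = Normalizer.normalize(text, Normalizer.Form.NFC);
        String w = Normalizer.normalize(word, Normalizer.Form.NFC);
        return t.contains(w);
    }

    public static void main(String[] args) {
        String text = "la \u0109ambro"; // precomposed c-circumflex (U+0109)
        String word = "c\u0302ambro";   // plain c followed by combining circumflex (U+0302)
        System.out.println(text.contains(word));            // false: the code points differ
        System.out.println(containsNormalized(text, word)); // true: both sides are NFC
    }
}
```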

Which normalization to choose is more or less irrelevant. Composition (...C) seems most natural (and gives better font rendering), but decomposition (...D) allows natural sorting to be "aäá...cĉ...eé...".

You could even search for words with the diacritical marks removed (cafe versus café):

word = Normalizer.normalize(word, Normalizer.Form.NFKD); // Decompose.
word = word.replaceAll("\\p{M}", ""); // Remove diacriticals.
word = word.replaceAll("\\p{C}", ""); // Optional: invisible control characters.
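
As a rough sketch, the stripping can be wrapped in a helper and applied to both the text and the word before searching. `AccentInsensitiveSearch`, `stripMarks` and `containsIgnoringMarks` are hypothetical names. One caveat: in Indic scripts such as Kannada the vowel signs and the virama are themselves combining marks (category M), and the zero-width non-joiner is a format character (category C), so this kind of stripping is probably only appropriate for Latin-style accents, not for the Kannada files in the question.

```java
import java.text.Normalizer;

class AccentInsensitiveSearch {
    // Decomposes the text (NFKD) and drops every combining mark (Unicode category M).
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD).replaceAll("\\p{M}", "");
    }

    // Compares text and word with diacritical marks removed from both sides.
    static boolean containsIgnoringMarks(String text, String word) {
        return stripMarks(text).contains(stripMarks(word));
    }

    public static void main(String[] args) {
        // "caf\u00e9" is cafe with an acute accent on the e.
        System.out.println(containsIgnoringMarks("un caf\u00e9 noir", "cafe"));
    }
}
```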

After running the original code:

It seems to work for me without any change (Java 8), though I had to put kannada.txt in C:\.

ಅದರಲ್ಲಿ
್ರಪಂಚದಲ್ಲಿ ಅನೇಕ ಮಾಧ್ಯಮಗಳು ಇದೆ. ಆಕಾಶವಾಣಿ, ದೂರದರ್ಶನ, ವಾರ್ತಾ ಪತ್ರಿಕೆ ಮುಂತಾದವು ಅದರಲ್ಲಿ ದೂರದರ್ಶನಪ ಪ್ರಮುಖವಾದ ಕಾರ್ಯವನ್ನು ಹೊಂದಿದ್ದು  ಅದನ್ನು ಚಿಕ್ಕವರಿಂದ ಹಿಡಿದು ದೊಡ್ಡವರವರೆಗೂ ನೋಡುತ್ತಾರೆ. ಇದಕ್ಕೆ ಇಂಗ್ಲೀಷ್‌ನಲ್ಲಿ ಟೆಲಿವಿಷನ್ ಎಂದು ಚಿಕ್ಕದಾಗಿ ಟಿ.ವಿ. ಎಂದು ಕರೆಯುವ ಬದಲು ಟಿ.ಕೆ. ಎಂದು  ಕರೆಯಬೇಕಾಗಿತ್ತು. ಏಕೆಂದರೆ ಇದು ಟೆಲಿವಿಷನ್ ಅಷ್ಟೇ ಅಲ್ಲ ಟೈಮ್ ಕಿಲ್ಲರ್ ಕೂಡ. ಇದನ್ನು ಪ್ರಮುಖವಾಗಿ ವಯಸ್ಸಾದವರು ನೋಡುತ್ತಾರೆ. ಆದರೆ ಕೆಲಸಕ್ಕೆ ಬಂದ  ಕೆಲಸದವರು ತಾವು ಕೆಲಸ ಮಾಡುವ ಬದಲು ಮನೆಯಲ್ಲಿ ಕುಳಿತು ನೋಡುತ್ತಾರೆ. 
true

false
ನನ್ನ ಪ್ರಕಾರ ಹೇಳಬೇಕಾದರೆ ಡಾಕ್ಷರ್‌ಗಳಿಗೆ ದುಡ್ಡು ಕೊಡುವ ಮಹಾಲಕ್ಷ್ಮಿ ಈ ಟಿ.ವಿ. 
false
c:\kannada.txt

String objects actually have a fixed UTF-16 encoding.

Technically a byte[] has no encoding, but you can attach an encoding to a byte[]. So if you need UTF-8 encoded data, you need a byte[].

So my approach would be

byte[] text = str.getBytes("UTF-8"); // str is the String to encode

to get a UTF-8 byte[].

IMHO, though, finding a substring in a String (which is always UTF-16!) that is "UTF-8 encoded" makes no sense :)
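
To illustrate the difference between a String and its UTF-8 bytes, here is a small sketch. The class `Utf8Bytes` and its helper methods are invented for this example. The encoding only matters when converting to or from bytes; once the text is decoded into a String, contains() works on characters.

```java
import java.nio.charset.StandardCharsets;

class Utf8Bytes {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Encodes to UTF-8 and decodes again; the result equals the input for valid text.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "caf\u00e9";
        System.out.println(word.length());     // 4 UTF-16 code units
        System.out.println(utf8Length(word));  // 5 bytes: the accented e needs two bytes
        System.out.println(roundTrip(word).contains("f\u00e9")); // true
    }
}
```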

Thank you all for your help. Now I'm able to find the substring. It worked when I moved the text to the next line in the word.txt file and read that word with a second readLine() call.
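
The thread does not say why the second line worked. One plausible explanation, and this is only an assumption, is that word.txt starts with a UTF-8 byte order mark (BOM). Java's UTF-8 decoder does not strip it, so the first readLine() returns the word prefixed with the invisible character U+FEFF and contains() fails. If that is the cause, removing the mark from the first line should work as well. `BomStripper` and `stripBom` are hypothetical names for this sketch.

```java
class BomStripper {
    // Removes a leading U+FEFF (byte order mark) if present; otherwise returns the line unchanged.
    static String stripBom(String line) {
        if (line != null && !line.isEmpty() && line.charAt(0) == '\uFEFF') {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        // What readLine() would return for the first line of a file that starts with a BOM.
        String firstLine = "\uFEFFword";
        System.out.println("a word here".contains(firstLine));           // false
        System.out.println("a word here".contains(stripBom(firstLine))); // true
    }
}
```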
