
Socket InputStream and UTF-8

I'm trying to make a chat with Java. Everything works fine, except that special characters don't work. I think it's an encoding problem, because in my OutputStream I encode the string in UTF-8 like this:

  protected void send(String msg) {
    
        try {
          msg+="\r\n";            
          OutputStream outStream = socket.getOutputStream();              
          outStream.write(msg.getBytes("UTF-8"));
          System.out.println(msg.getBytes("UTF-8"));
          outStream.flush();
        }
        catch(IOException ex) {
          ex.printStackTrace();
        }
      }

But in my receive method I didn't find a way to do this:

public String receive() throws IOException {
   
    String line = "";
    InputStream inStream = socket.getInputStream();    
                
    int read = inStream.read();
    while (read!=10 && read > -1) {
      line+=String.valueOf((char)read);
      read = inStream.read();
    }
    if (read==-1) return null;
    line+=String.valueOf((char)read);       
    return line; 
    
  }

So is there a quick way to specify that the bytes read from the stream are encoded in UTF-8?

EDIT: Okay, I tried with the BufferedReader like this:

 public String receive() throws IOException {
    
    String line = "";           
    in = new BufferedReader(new InputStreamReader(socket.getInputStream(), "UTF-8"));           
    String readLine = "";   
    
    while ((readLine = in.readLine()) != null) {
        line+=readLine;
    }
    
    System.out.println("Line:"+line);
    
    return line;
   
  }

But it doesn't work. It seems that the socket doesn't receive anything.

Trying to shed more light for future visitors.

Rule of thumb: server and client HAVE TO agree on the encoding scheme. If the client sends data encoded with one scheme and the server reads it with another, the expected results can NEVER be achieved.

A note for anyone who tries to test this: do not encode as ASCII on the client side and decode as UTF-8 on the server side. UTF-8 is backward compatible with ASCII, so the text will come through intact and you may feel that the "rule of thumb" is wrong. It is not. Use UTF-8 on the client side and UTF-16 on the server side instead, and you will see the problem.

Encoding with sockets

I guess the single most important thing to understand is this: what you finally send over the socket is BYTES, and everything depends on how those bytes were encoded.

For example, if I send input to the server (over a client-server socket) from my Windows command prompt, the data will be encoded using some encoding scheme (I really do not know which). If I send data to the server from another client program, I can specify the encoding scheme I want for my client socket's output stream, and all the data will be converted into BYTES using that scheme and sent over the socket.

In the end I am still sending BYTES over the wire, but they are encoded with the scheme I specified. If the server uses a different encoding scheme when reading the socket's input stream, the expected results cannot be achieved. If the server uses the same encoding scheme as the client, everything will be fine.
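
To make the point concrete without any sockets, here is a minimal sketch that encodes a string with one charset and decodes the resulting bytes with another. The class name EncodingMismatchDemo and the helper roundTrip are made up for this illustration; only String.getBytes(Charset), new String(byte[], Charset) and StandardCharsets come from the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    // Hypothetical helper: encodes text the way a sender would and decodes
    // the same bytes the way a receiver would.
    public static String roundTrip(String text, Charset senderCharset, Charset receiverCharset) {
        byte[] wireBytes = text.getBytes(senderCharset);   // what travels over the socket
        return new String(wireBytes, receiverCharset);     // what the receiver reconstructs
    }

    public static void main(String[] args) {
        String msg = "h\u00e9llo"; // "hello" with an e-acute

        // Same charset on both ends: the text survives.
        System.out.println(roundTrip(msg, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(msg));  // true

        // UTF-8 on one end, UTF-16 on the other: the text is corrupted.
        System.out.println(roundTrip(msg, StandardCharsets.UTF_8, StandardCharsets.UTF_16).equals(msg)); // false

        // ASCII-only text sent as ASCII and read as UTF-8 still matches,
        // which is why that pairing hides the problem.
        System.out.println(roundTrip("hello", StandardCharsets.US_ASCII, StandardCharsets.UTF_8).equals("hello")); // true
    }
}
```

The booleans are printed instead of the decoded strings because the console itself may use yet another encoding.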

Answering this question

In Java, there are special "bridge" classes, InputStreamReader and OutputStreamWriter, which you can use to specify the encoding of a stream.

PLEASE NOTE: in Java, InputStream and OutputStream are BYTE streams, so everything read from or written to them is BYTES. You cannot specify an encoding on InputStream and OutputStream objects themselves, which is why you use the bridge classes.

Below are client and server snippets that show how to specify the encoding on the client's output stream and the server's input stream.

As long as I specify the same encoding at both ends, everything will be fine.

Client side:

        Socket clientSocket = new Socket("abc.com", 25050);
        OutputStreamWriter clientSocketWriter = (new OutputStreamWriter(clientSocket.getOutputStream(), "UTF8"));

Server side:

    ServerSocket serverSocket = new ServerSocket(8001);
    Socket clientSocket = serverSocket.accept();
    // PLEASE NOTE: the important part is that I specify the encoding on my socket's input stream.
    // Java's InputStream is a BYTE stream, so to specify the encoding I use the bridge class
    // InputStreamReader and pass it UTF8.
    // The data still arrives from the client socket as bytes, but those bytes are decoded as UTF-8.
    // If I specify a different encoding here than the client uses on its output stream,
    // the data cannot be read properly and may show up as all "?".
    InputStreamReader clientSocketReader = (new InputStreamReader(clientSocket.getInputStream(), "UTF8"));

Try:

BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream(), "UTF-8"));

Then:

String line = "";
String readLine;
// readLine() returns null only when the other side closes the connection
while ((readLine = in.readLine()) != null) {
    line += readLine;
}
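
Note that a loop which runs until readLine() returns null only finishes once the other side closes the connection, so in a chat it blocks for as long as the connection stays open. A minimal sketch of an alternative, assuming each message is exactly one line: create the reader once and return a single line per call. The class name LineReceiver is made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class LineReceiver {
    private final BufferedReader in;

    // Wrap the stream once. Creating a new BufferedReader on every call
    // could throw away bytes that an earlier reader had already buffered.
    public LineReceiver(InputStream stream) {
        this.in = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
    }

    // Returns the next line without its line terminator,
    // or null once the end of the stream has been reached.
    public String receive() throws IOException {
        return in.readLine();
    }
}
```

With a socket this would be used as new LineReceiver(socket.getInputStream()), calling receive() once per expected message.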

Use an InputStreamReader and an OutputStreamWriter, both created with UTF-8 as the character encoding.

If you want to read entire lines of content, you can wrap the InputStreamReader with a BufferedReader. Similarly, you can use a BufferedWriter or PrintWriter wrapped around the OutputStreamWriter to write out data as lines.
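
A minimal sketch of that wrapping, using in-memory streams as stand-ins for socket.getOutputStream() and socket.getInputStream(). The class name Utf8LineIo and the helpers lineWriter and lineReader are hypothetical names chosen for this example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

public class Utf8LineIo {
    // Writer side: chars -> UTF-8 bytes. The second PrintWriter argument
    // enables autoflush, so println() pushes each line out immediately.
    public static PrintWriter lineWriter(OutputStream out) {
        return new PrintWriter(
            new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8)), true);
    }

    // Reader side: UTF-8 bytes -> chars, read one line at a time.
    public static BufferedReader lineReader(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream wire = new ByteArrayOutputStream(); // stands in for the socket
        PrintWriter out = lineWriter(wire);
        out.println("gr\u00fc\u00df dich");

        BufferedReader in = lineReader(new ByteArrayInputStream(wire.toByteArray()));
        System.out.println("gr\u00fc\u00df dich".equals(in.readLine())); // true
    }
}
```

Keep in mind that PrintWriter does not throw IOException on write failures; checkError() has to be polled if failed sends need to be detected.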

You should understand the difference between Unicode chars and bytes. The short of it is that Unicode code points (Java chars, more or less) are the same regardless of the encoding. The encoding determines which chars a given byte sequence translates to.

In your code, you've got a String, which is really just a sequence of chars. You translate that to a sequence of bytes using getBytes("UTF-8"). When you read it back, you're reading each individual byte (as an int, but that's a detail), not each char. You try to convert these bytes to chars using plain casting, which only works when the code point value of the char is exactly equal to the int value of the byte. For UTF-8, this is only the case for ASCII characters (code points 0 to 127); every other character is encoded as two or more bytes.
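
The effect of casting can be reproduced without a socket. The sketch below mimics the byte-by-byte cast on an in-memory array; ByteCastDemo and decodeByCasting are names invented for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class ByteCastDemo {
    // Mimics the receive() from the question: each byte becomes one char by casting.
    public static String decodeByCasting(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            // InputStream.read() yields values 0..255, hence the mask on the signed byte.
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] ascii = "abc".getBytes(StandardCharsets.UTF_8);
        byte[] accented = "\u00e9".getBytes(StandardCharsets.UTF_8); // e-acute is 0xC3 0xA9 in UTF-8

        System.out.println(decodeByCasting(ascii).equals("abc"));             // true: ASCII survives
        System.out.println(accented.length);                                  // 2
        System.out.println(decodeByCasting(accented).equals("\u00c3\u00a9")); // true: two wrong chars
    }
}
```

One character sent becomes two unrelated characters received, which is the typical symptom of this bug.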

You should instead reconstruct a String based on the bytes from the input stream and the charset. One way to do this is to read the InputStream into a byte[] and then call new String(byte[] bytes, String charset).
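
A minimal sketch of this byte-array approach, keeping the behavior of the question's receive() (read up to and including the line feed, return null at end of stream). The class name ByteLineReader is made up, and the sketch uses the new String(byte[], Charset) overload with StandardCharsets.UTF_8 rather than a charset name string, which avoids the checked UnsupportedEncodingException.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ByteLineReader {
    // Collects raw bytes up to and including '\n', then decodes them together as UTF-8.
    // Returns null if the stream ends before a complete line has arrived.
    public static String receive(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        int read;
        while ((read = in.read()) != -1) {
            buffer.write(read);
            if (read == '\n') {
                return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            }
        }
        return null;
    }
}
```

Scanning for the byte value 10 is safe with UTF-8 because every byte of a multi-byte sequence is 0x80 or higher, so a line feed byte can never appear in the middle of a character.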

You could also use a Reader, which represents a readable stream of characters. InputStreamReader reads an InputStream as the source of its character stream, and BufferedReader can then take that character stream and use it to produce Strings, one line at a time, as ProgrammerJeff's answer illustrates.

This worked for me, Server side code:

    try {
        Scanner input = new Scanner(new File("myfile.txt"), "UTF-8");
        // send the first line only
        String line = input.nextLine();
        ServerSocket server = new ServerSocket(12345);
        Socket client = server.accept();
        PrintWriter out = new PrintWriter(
            new BufferedWriter(new OutputStreamWriter(
                client.getOutputStream(), "UTF-8")), true);
        out.println(line);
        out.flush();
        input.close();
        server.close();
    } catch (Exception e) {
        e.printStackTrace();
    }

Client side:

Socket mysocket = new Socket(SERVER_ADDR, 12345);
BufferedReader bfr = new BufferedReader(
        new InputStreamReader(mysocket.getInputStream(), "UTF-8"));
String tmp = bfr.readLine();

The text file itself should be encoded as UTF-8.

BufferedReader rd  = null;
rd  = new BufferedReader(new InputStreamReader(connection.getInputStream(),"UTF-8"));
