
Node.js base64 string into UTF-8 issue

I have a base64 string and I need to convert it to UTF-8.

base64_string "VABpAG0AZQAgAHMAZQByAGUAaQBzAA=="

I am trying to convert base64_string to UTF-8 in the following environments:

In the browser

method : atob(base64_string)

`Result = "Time series",` 

which is correct. We can verify the same at https://www.base64decode.org

In Node.js I am converting with the npm package "atob"

method : atob(base64_string)

Result = "T i m e  s e r i e s".

For some reason I am getting spaces between each character, and I don't know why. I have tried trim(), but that does not work either.

TL;DR

Your string is actually UTF-16, not UTF-8. Here's how to decode it properly.

function atob(b64txt) {
  const buff = Buffer.from(b64txt, 'base64');
  const txt = buff.toString('utf16le');
  return txt;
}

Explanation: Your base64 encoded string isn't actually UTF-8 or ASCII data. It's UTF-16 (little-endian). That means every character takes at least two bytes (characters outside the Basic Multilingual Plane take four).

UTF-8 is different: any byte below 128 (0x00-0x7F) is a complete single-byte character. A character outside that range is encoded as two to four bytes: the first byte indicates how many bytes follow, and every following byte is in the range 128-191 (0x80-0xBF).
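To make the size difference concrete, here is a small sketch that counts the bytes each encoding needs for a few sample characters. The helper name byteLengths is made up for this illustration; Buffer.byteLength is the standard Node.js call.

```javascript
// Hypothetical helper: report how many bytes a string needs in each encoding.
function byteLengths(text) {
  return {
    utf8: Buffer.byteLength(text, 'utf8'),
    utf16le: Buffer.byteLength(text, 'utf16le'),
  };
}

// Escapes are used so the example does not depend on the source file encoding.
console.log(byteLengths('T'));          // { utf8: 1, utf16le: 2 }
console.log(byteLengths('\u20ac'));     // euro sign: { utf8: 3, utf16le: 2 }
console.log(byteLengths('\u{1F600}'));  // emoji: { utf8: 4, utf16le: 4 }
```

Plain ASCII text is therefore twice as large in UTF-16 as in UTF-8, with a zero byte after every letter.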

So let's decode your string to character codes and see what it looks like:

const b64txt = 'VABpAG0AZQAgAHMAZQByAGUAaQBzAA==';
const buff = Buffer.from(b64txt, 'base64');
console.log(JSON.stringify(buff));
// >> {"type":"Buffer","data":[84,0,105,0,109,0,101,0,32,0,115,0,101,0,114,0,101,0,105,0,115,0]}

The first byte (84) is the ASCII code for T. It's below 128, so in UTF-8 it would be a complete character on its own, yet it is followed by a 0 byte. So...not UTF-8.
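Those 0 bytes are also where the apparent "spaces" in the question come from. The sketch below assumes that the npm atob package, like the browser function, returns one character per decoded byte; in Node.js that corresponds to the 'latin1' encoding. Under that assumption the gaps are U+0000 characters, not spaces, which is why trim() has no effect.

```javascript
const b64txt = 'VABpAG0AZQAgAHMAZQByAGUAaQBzAA==';

// Decode one character per byte (assumed to mirror what atob returns).
const binary = Buffer.from(b64txt, 'base64').toString('latin1');

console.log(binary.length);                       // 22 characters for an 11-letter text
console.log(JSON.stringify(binary.slice(0, 4)));  // "T\u0000i\u0000"

// trim() only strips whitespace, and U+0000 is not whitespace.
console.log(binary.trim().length === binary.length); // true
```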

That's the clue that this string has two bytes per character, making it UTF-16. And the fact that the 0 follows the character is the clue that it's "little-endian" (the low-order byte comes first and the high-order byte comes second).
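The same clue can be checked in code. This is only a rough heuristic under a strong assumption (the text is plain ASCII stored as UTF-16LE); looksLikeUtf16le is a hypothetical helper, not a real encoding detector, and it will reject valid UTF-16 text that contains non-ASCII characters.

```javascript
// Hypothetical helper: true when every byte pair is (ASCII byte, 0).
function looksLikeUtf16le(buf) {
  if (buf.length === 0 || buf.length % 2 !== 0) return false;
  for (let i = 0; i < buf.length; i += 2) {
    // Low byte must be a non-zero ASCII value, high byte must be zero.
    if (buf[i] === 0 || buf[i] > 127 || buf[i + 1] !== 0) return false;
  }
  return true;
}

const fromQuestion = Buffer.from('VABpAG0AZQAgAHMAZQByAGUAaQBzAA==', 'base64');
console.log(looksLikeUtf16le(fromQuestion));                        // true
console.log(looksLikeUtf16le(Buffer.from('Time series!', 'utf8'))); // false
```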

If you want to turn this buffer into text, you need to interpret it with the correct encoding:

const txt = buff.toString('utf16le'); // <- UTF-16, little-endian
console.log(txt);
// >> "Time sereis"

So in Node.js, if you combine those two commands, you end up with a full-fledged solution that decodes your string properly, as in the TL;DR above.
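For the browser side mentioned in the question, a comparable sketch can be built from atob and TextDecoder instead of Buffer. This is an illustration, not part of the original answer: decodeUtf16leBase64 is a hypothetical name, and the code assumes an environment where both atob and TextDecoder are available as globals (browsers and recent Node.js versions).

```javascript
// Hypothetical helper: base64 -> bytes -> UTF-16LE text, without Buffer.
function decodeUtf16leBase64(b64txt) {
  // atob yields one character per byte; turn those back into real bytes.
  const binary = atob(b64txt);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  // Interpret the bytes as little-endian UTF-16.
  return new TextDecoder('utf-16le').decode(bytes);
}

console.log(decodeUtf16leBase64('VABpAG0AZQAgAHMAZQByAGUAaQBzAA==')); // Time sereis
```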

Of course, if your encoding type changes, you'd have to change this as well and call toString('utf8') or whatever the appropriate encoding is.
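One way to keep that flexible is to pass the encoding in as a parameter. A minimal sketch, assuming the caller knows which encoding the data uses; decodeBase64 is a hypothetical name and 'utf8' is just a chosen default.

```javascript
// Hypothetical helper: decode base64 using a caller-supplied text encoding.
function decodeBase64(b64txt, encoding = 'utf8') {
  return Buffer.from(b64txt, 'base64').toString(encoding);
}

// The string from the question needs 'utf16le'.
console.log(decodeBase64('VABpAG0AZQAgAHMAZQByAGUAaQBzAA==', 'utf16le')); // Time sereis

// A round trip through UTF-8 uses the default.
const utf8Base64 = Buffer.from('h\u00e9llo', 'utf8').toString('base64');
console.log(decodeBase64(utf8Base64) === 'h\u00e9llo'); // true
```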

(credit: I referenced this and this as I was drafting this answer.)
