Node.js: convert a string from ISO-8859-2 to UTF-8

When I download page content with the Node.js request library and the content is encoded in ISO-8859-2, I cannot convert it to UTF-8.

I am using node-iconv for the conversion.

Code:

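const request = require('request');
const Iconv = require('iconv').Iconv;

request('https://www.jakpsatweb.cz', function(err, resp, body){
    // regexToRetrieveTitle is the asker's own helper (not shown)
    const title = regexToRetrieveTitle(body);
    const iconv = new Iconv('ISO-8859-2', 'UTF-8');
    const buffer = iconv.convert(title);
    console.log(buffer);
    console.log(buffer.toString('UTF8'));
})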

Console:

<Buffer 52 65 6b 6c 61 6d 61 3a 20 6a 61 6b 20 66 75 6e 67 75 6a 65 20 77 65 62 6f 76 c4 8f c5 bc cb 9d 20 72 65 6b 6c 61 6d 61>
Reklama: jak funguje webovďż˝ reklama

Expected result:

Reklama: jak funguje webová reklama

Does anyone know where the problem is?

EDIT:

For example, I download this page. I recognised ISO-8859-2 from the meta tags (Chrome detects it too), and I need to convert the page content and save it to a database. My database is UTF-8, so I need to convert the content.
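Reading the declared charset from the meta tags can be sketched roughly as below. This is only an illustration: detectCharset is a hypothetical helper name, the 1024-byte window and the regular expression are assumptions, and a real page may declare its charset in an HTTP header instead.

```javascript
// Hypothetical helper: look for a charset declaration in the first bytes of a page.
function detectCharset(raw, fallback = 'utf-8') {
  // 'latin1' maps every byte to exactly one character, so the ASCII markup
  // stays readable whatever the real encoding of the page is.
  const head = raw.subarray(0, 1024).toString('latin1');
  const match = head.match(/<meta[^>]*charset\s*=\s*["']?([\w-]+)/i);
  return match ? match[1].toLowerCase() : fallback;
}

// Sample markup standing in for a downloaded page.
const sample = Buffer.from(
  '<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-2"></head>',
  'latin1'
);
const detected = detectCharset(sample);
console.log(detected);
```

The helper works on the raw Buffer on purpose, so nothing is decoded before the charset is known.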

The conversion from ISO-8859-2 to UTF-8 worked fine. It was the input (the title variable) that had the wrong contents: the title contains the bytes EF BF BD. This means that the title was already UTF-8 encoded, but with a U+FFFD (REPLACEMENT CHARACTER) in the place where you would expect the letter á (LATIN SMALL LETTER A WITH ACUTE). iconv then read those three bytes as the ISO-8859-2 characters ď, ż and ˝, which is the c4 8f c5 bc cb 9d sequence in your output.
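The chain of events can be reproduced with plain Buffers. This is a sketch, not the asker's actual code: the sample bytes are just the word "webová", and the iso88592 table is a partial lookup covering only the three bytes involved, written out by hand for illustration.

```javascript
// "webová" as the server sends it, in ISO-8859-2 (0xE1 is "á").
const served = Buffer.from([0x77, 0x65, 0x62, 0x6f, 0x76, 0xe1]);

// Step 1: the bytes are decoded as UTF-8 too early. A lone 0xE1 is not
// valid UTF-8, so the decoder substitutes U+FFFD.
const title = served.toString('utf8');

// Step 2: the string is turned back into bytes, giving EF BF BD for U+FFFD.
const titleBytes = Buffer.from(title, 'utf8');

// Step 3: those bytes are interpreted as ISO-8859-2.
// Partial table, only the three relevant bytes (assumption for illustration).
const iso88592 = { 0xef: '\u010f', 0xbf: '\u017c', 0xbd: '\u02dd' };
const garbled = Array.from(titleBytes, (b) => iso88592[b] || String.fromCharCode(b)).join('');

// Step 4: the result is written out as UTF-8, as in the question's console output.
const garbledHex = Buffer.from(garbled, 'utf8').toString('hex');
console.log(titleBytes.toString('hex'), garbledHex);
```

The last hex string ends in c48fc5bccb9d, the same bytes that appear in the Buffer dump from the question.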

Now, the original web page https://www.jakpsatweb.cz/reklama/index.html is correctly encoded in ISO-8859-2 and also has the required charset declaration in the <head> section.

Therefore the problem must be either in the software that downloads the web page (Node.js) or in the regexToRetrieveTitle function.

The problem is in the Node.js request call. Its encoding option defaults to UTF-8, so the body is decoded to a string before iconv ever sees it. I had to set encoding to null, which makes request return the body as a Buffer, and now everything works fine.

request({ uri: 'https://www.jakpsatweb.cz', encoding: null}, function(err, resp, body){
    // body is now a Buffer of raw bytes, not a UTF-8 decoded string
    .....
})
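Once the body arrives as a Buffer, the raw bytes can be decoded exactly once with the right charset. Here is a minimal sketch of that decoding step using the built-in TextDecoder rather than node-iconv. It assumes a Node.js build with full ICU, so that the 'iso-8859-2' label is supported; decodeBody is a hypothetical helper name, and the sample bytes stand in for a real response body.

```javascript
// Hypothetical helper: decode raw response bytes with the page's declared charset.
function decodeBody(raw, charset) {
  // TextDecoder is a Node.js global; legacy labels such as 'iso-8859-2'
  // are assumed to be available (they need full ICU support).
  return new TextDecoder(charset).decode(raw);
}

// Sample bytes standing in for a response body: "webová" in ISO-8859-2.
const raw = Buffer.from([0x77, 0x65, 0x62, 0x6f, 0x76, 0xe1]);
const text = decodeBody(raw, 'iso-8859-2');

// Re-encoding as UTF-8 now gives the proper two-byte sequence C3 A1 for "á".
const utf8Hex = Buffer.from(text, 'utf8').toString('hex');
console.log(text, utf8Hex);
```

Inside the request callback this would be called as decodeBody(body, 'iso-8859-2'). Passing the same Buffer to iconv.convert, as in the question, should work equally well.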
