Correct encoding for body from Request NodeJS

I'm trying to scrape a web page for some data. I managed to POST a request and got the right data back. The problem is that I get something like:

"Kannst du bitte noch einmal ... erzýhlen , wie du wýhrend der Safari einen Lýwen verjagt hast?"

It should read erzählen and während, so characters such as Ä, Ö, ß, and Ü are not displayed correctly.

Here is my code:

var querystring = require('querystring');
var iconv = require('iconv-lite')
var request = require('request');
var fs = require('fs');
var writer = fs.createWriteStream('outputBodyutf8String.html');


var form = {
    id:'2974',
    opt1:'',
    opt2:'30',
    ref:'A1',
    tid:'157',
    tid2:'',
    fnum:'2'
};

var formData = querystring.stringify(form);
var contentLength = formData.length;

request({
    headers: {
        'Content-Length': contentLength,
        'Content-Type': 'application/x-www-form-urlencoded'
    },
    uri: 'xxxxxx.php',
    body: formData,
    method: 'POST'
}, function (err, res, body) {
    var utf8String = iconv.decode(body,"ISO-8859-1");
     console.log(utf8String);
    writer.write(utf8String);
});

How do I get the HTML body with the correct letters?

How do I find out the correct encoding of a response?

I went to the website you are attempting to scrape and found this:

[image not preserved]

And another character encoding declaration here:

[image not preserved]

This website defines two different character encodings! Which do I use?

Well, this doesn't apply to you. When an HTML file is read from the local machine, the charset or content-type defined in the meta tags determines the encoding.

Since you are retrieving this document over HTTP, it will be decoded according to the response header.

Here's the response header I received after visiting the website:

[image not preserved]

As you can see, it doesn't define a character set. The charset should appear in the Content-Type header, like this:

[image not preserved]
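To make the header check concrete, here is a minimal sketch of pulling the charset out of a Content-Type value. The helper name charsetFromContentType and the sample header strings are assumptions for illustration; they are not taken from the site in question.

```javascript
// Hypothetical helper: extract the charset parameter from a Content-Type
// header value such as "text/html; charset=UTF-8".
// Returns the charset in lower case, or null when none is declared.
function charsetFromContentType(contentType) {
  if (typeof contentType !== 'string') return null;
  // Accept an optionally quoted value: charset=utf-8 or charset="utf-8"
  const match = /;\s*charset\s*=\s*("?)([^";\s]+)\1/i.exec(contentType);
  return match ? match[2].toLowerCase() : null;
}

// Sample values (assumed, for illustration only)
console.log(charsetFromContentType('text/html; charset=UTF-8'));
console.log(charsetFromContentType('text/html'));
```

In the request callback, the value to pass in would be res.headers['content-type']. A null result corresponds to the situation described above, where the server does not declare a charset.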

Since they don't indicate a charset in the response header, then, according to this post, the meta declaration should be used.

But wait, there were two meta charset declarations.

Since the parser reads the file top to bottom, I assumed the second declared charset is the one that applies. (Note that the HTML Standard's encoding prescan stops at the first valid declaration it finds, so it is worth checking which one comes first.)

Conclusion: They use UTF-8
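If you want to check the declarations yourself rather than rely on screenshots, the sketch below lists every charset declared in meta tags, in document order. The function name findMetaCharsets and the sample markup are assumptions, and the regular expressions are a rough approximation, not a full HTML parser.

```javascript
// Hypothetical helper: list the charsets declared in <meta> tags, in the
// order they appear. Handles both <meta charset="..."> and
// <meta http-equiv="Content-Type" content="text/html; charset=...">.
function findMetaCharsets(html) {
  const found = [];
  const metaTag = /<meta\b[^>]*>/gi;
  let tag;
  while ((tag = metaTag.exec(html)) !== null) {
    const m = /charset\s*=\s*["']?\s*([^\s"'\/>;]+)/i.exec(tag[0]);
    if (m) found.push(m[1].toLowerCase());
  }
  return found;
}

// Sample markup with two conflicting declarations (assumed, for illustration)
const sampleHtml =
  '<head>' +
  '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">' +
  '<meta charset="utf-8">' +
  '</head>';

console.log(findMetaCharsets(sampleHtml));
```

Seeing both values side by side makes it clear when a page contradicts itself, in which case neither declaration should be trusted without looking at the actual bytes.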

Also, I don't think you need the conversion. I may be wrong, but you should be able to use the response body directly:

request({
    headers: {
        'Content-Length': contentLength,
        'Content-Type': 'application/x-www-form-urlencoded'
    },
    uri: 'xxxxxx.php',
    body: formData,
    method: 'POST'
}, function (err, res, body) {
    console.log(body);
    writer.write(body);
});

Edit: I don't believe the error is on their side. I believe it's on your side. Give this a try:

Remove the writer:

var writer = fs.createWriteStream('outputBodyutf8String.html');

And in the request callback, replace everything with this:

function (err, res, body) {
    console.log(body);
    fs.writeFile('outputBodyutf8String.html', body, 'utf8', function(error) {
        if(error)
            console.log('Error Occurred', error);
    });
}

All the code should look like this:

var querystring = require('querystring');
var iconv = require('iconv-lite')
var request = require('request');
var fs = require('fs');

var form = {
    id:'2974',
    opt1:'',
    opt2:'30',
    ref:'A1',
    tid:'157',
    tid2:'',
    fnum:'2'
};

var formData = querystring.stringify(form);
var contentLength = formData.length;

request({
    headers: {
        'Content-Length': contentLength,
        'Content-Type': 'application/x-www-form-urlencoded'
    },
    uri: 'xxxxxxx.php',
    body: formData,
    method: 'POST'
}, function (err, res, body) {
    console.log(body);
    fs.writeFile('outputBodyutf8String.html', body, 'utf8', function(error) {
        if(error)
            console.log('Error Occurred', error);
    });
});

You can make the request with encoding: 'binary' and then decode the body:

const util = require('util');
// b is the response body received with encoding: 'binary' (one character per byte)
const decoder = new util.TextDecoder('windows-1251');
console.log(decoder.decode(Buffer.from(b.toString(), 'binary')));
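The idea of fetching the raw bytes and decoding them yourself can be checked without any network access. The sketch below is an illustration under assumptions: the byte array stands in for a server response encoded as ISO-8859-1, and the two-step mangling mirrors what, as far as I can tell, happens when the body is first turned into a UTF-8 string and then handed to a Latin-1 decoder.

```javascript
// "erzählen" as ISO-8859-1 bytes: ä is the single byte 0xE4.
// These bytes are a stand-in for the server response (an assumption).
const latin1Bytes = Buffer.from([0x65, 0x72, 0x7a, 0xe4, 0x68, 0x6c, 0x65, 0x6e]);

// Step 1: the bytes are decoded as UTF-8, which is what happens when the
// body arrives as a default string. 0xE4 is invalid here, so it is replaced
// by U+FFFD and the original byte is lost.
const decodedAsUtf8 = latin1Bytes.toString('utf8');

// Step 2: converting that string back to bytes with 'binary' keeps only the
// low byte of each character, so U+FFFD becomes 0xFD, which is ý in Latin-1.
const mangled = Buffer.from(decodedAsUtf8, 'binary').toString('latin1');

// Decoding the untouched bytes with the matching charset gives the real text.
const correct = latin1Bytes.toString('latin1');

console.log(mangled);
console.log(correct);
```

This reproduces the ý seen in the question, which suggests the page is served in a single-byte encoding and the damage happens before the conversion runs. With the request library, setting encoding: null in the options returns the body as a Buffer, so it can then be decoded with iconv-lite or TextDecoder using the actual charset.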
