Web scraping: fix bad character symbols

To begin, I'm French, so sorry if my English isn't perfect.

I'm a role player and I need to scrape http://www.gemmaline.com/sorts/liste-classe-pretre.htm, which is encoded in ISO-8859-1, to extract the text of each list item and, later, the information behind each link in the list of names. I already have formulas in Google Sheets that scrape it, but as many people know, IMPORTXML in Google Sheets allows 50 requests at most and is very slow. So I am trying a different approach with JavaScript and Node.js, using axios and cheerio to do the scraping. It works, but the result is incorrect for every accented character and single quote. After many attempts I still haven't solved the issue. This is the code:

const PORT = 8000
const axios = require('axios')
const cheerio = require('cheerio')
const express = require('express')
const fs = require('fs')

const app = express()

const url = 'http://www.gemmaline.com/sorts/liste-classe-pretre.htm'

axios(url)
    .then(response => {
        const html = response.data
        const $ = cheerio.load(html)
        const data = []

        $('body:nth-child(2) ul li').each(function() {
            
            //const encoder = new TextEncoder()
            //const name = new TextDecoder().decode(new Uint8Array(encoder.encode($(this).text())))

            const name = $(this).text()
            const url = $(this).find('a').attr('href')

            data.push({ name, url})
        })

        fs.writeFileSync('test.json', JSON.stringify(data, null, 1))
    }).catch(err => console.log(err))

    

app.listen(PORT, ()=>console.log(`server running on port ${PORT}`))

And this is my result in a file (by the way, it's the same if I just do a console.log(data)):

[image: result]

Now you can see the strange symbols. If someone knows how to fix this, I will be really happy.
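For context, here is a minimal sketch of why the accents break, assuming (as stated above) that the page is served as ISO-8859-1. When no encoding is specified, axios in Node decodes the response body as UTF-8. A lone byte such as 0xE9 (é in ISO-8859-1) is not valid UTF-8, so it is replaced by U+FFFD, the "strange symbol". The bytes below are illustrative and were not captured from the site.

```javascript
// "Nécromancie" encoded as ISO-8859-1: the é is the single byte 0xE9.
// (Illustrative bytes, not captured from the real page.)
const bytes = Buffer.from([0x4e, 0xe9, 0x63, 0x72, 0x6f, 0x6d, 0x61, 0x6e, 0x63, 0x69, 0x65])

// Decoding as UTF-8: 0xE9 is not a valid sequence on its own,
// so it is replaced by U+FFFD (the replacement character).
const asUtf8 = bytes.toString('utf8')

// Decoding as latin1 (ISO-8859-1) maps each byte to the matching code point.
const asLatin1 = bytes.toString('latin1')

console.log(asUtf8)
console.log(asLatin1)
```

Once the text has been decoded with the wrong charset, the original byte is lost, so the fix has to happen before or while the bytes are turned into a string.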

It works!

const PORT = 8000
const axios = require('axios')
const cheerio = require('cheerio')
const express = require('express')
const fs = require('fs')

const app = express()

const url = 'http://www.gemmaline.com/sorts/liste-classe-pretre.htm'


axios(url, {
    // 'binary' is Node's alias for latin1 (ISO-8859-1): each byte becomes one character
    responseEncoding: 'binary'
})
    .then(response => {

        const $ = cheerio.load(response.data.toString('ISO-8859-1'),{decodeEntities: false})
        const data = []

        $('body:nth-child(2) ul li').each(function() {

            const name = $(this).text()
            const url = $(this).find('a').attr('href')

            data.push({ name, url })
        })

        fs.writeFileSync('test2.json', JSON.stringify(data, null, 1))
    }).catch(err => console.log(err))


app.listen(PORT, () => console.log(`server running on port ${PORT}`))

[image: wrong character on line 11]

I now have a new issue: the single quote on line 11 is replaced by a wrong symbol. That line should normally read Conservation d'organe Nécromancie, but I get Conservation dorgane Nécromancie, with a stray symbol where the quote should be. I could simply substitute a single quote for this symbol every time I get it, but maybe there is a cleaner way to handle it directly through the encoding.
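One possible cause, offered as an assumption since the raw bytes are not shown here: pages labelled ISO-8859-1 are often really Windows-1252, where byte 0x92 is the typographic apostrophe ’ (U+2019). In strict ISO-8859-1 that byte is an invisible control character, which would match the symptom. Below is a minimal sketch of decoding raw bytes as Windows-1252. `decodeWindows1252` is a hypothetical helper written for this example, and the sample bytes are illustrative.

```javascript
// Code points for Windows-1252 bytes 0x80..0x9F, the only range where it
// differs from ISO-8859-1. Unassigned bytes keep their C1 control value.
const CP1252_HIGH = [
  0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
  0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178,
]

// Hypothetical helper: decode a Buffer (or any iterable of bytes) as Windows-1252.
function decodeWindows1252(bytes) {
  let out = ''
  for (const b of bytes) {
    out += String.fromCharCode(b >= 0x80 && b <= 0x9f ? CP1252_HIGH[b - 0x80] : b)
  }
  return out
}

// "d’organe" with the apostrophe stored as byte 0x92 (illustrative bytes).
const sample = Buffer.from([0x64, 0x92, 0x6f, 0x72, 0x67, 0x61, 0x6e, 0x65])
const text = decodeWindows1252(sample)

// Optionally normalise the typographic apostrophe to a plain single quote.
const plain = text.replace(/\u2019/g, "'")

console.log(text)
console.log(plain)
```

In the scraper, this would mean requesting the body as raw bytes (axios's `responseType: 'arraybuffer'`), wrapping it with `Buffer.from(response.data)`, and passing the decoded string to `cheerio.load`. On Node builds with full ICU, the built-in `TextDecoder` with the `'windows-1252'` label should give the same result without a hand-written table.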
