
cheerio sometimes returns empty string

I'm scraping Genius.com for lyrics. I've googled and can't find a reason why my code isn't working. I am scraping the text from the lyrics div on a Genius.com page (e.g., https://genius.com/Britney-spears-baby-one-more-time-lyrics ).

Viewing the page source, the div appears to exist and to be populated with text in the source itself, not by JavaScript or otherwise (if it were, wouldn't cheerio fail 100% of the time in this context?). When I run my code, it works about 50% of the time; the rest of the time it returns an empty string.

I saw this, but it seems like a hacky solution, and I don't see why my async/await isn't waiting for the full response from phin.

Here's the code in question:

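const scraperRouter = require('express').Router()
const p = require('phin')
const cheerio = require('cheerio')

scraperRouter.get('/', async (req, res) => {
    // The target page URL is passed in a custom request header
    const url = req.header('geniusUrl')

    try {
        // Fetch inside the try block so a failed request is also caught
        const _res = await p(url)

        const $ = cheerio.load(_res.body)
        const lyrics = $('.lyrics').text()

        res.send(lyrics)
    }
    catch (e) {
        console.log(e)
        // Error objects serialize to {} in JSON, so send the message explicitly
        res.status(500).json({ error: e.message })
    }
})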

Any advice appreciated. Thanks.

Converting my comment to an answer after OP confirmed it as the solution:

Sometimes this happens when a site is A/B testing: the same URL may serve one of a couple of different DOMs. There might also be regional differences. I recommend accessing the page from a few different IPs, browsers, regions, etc. to figure out whether there's a pattern. If you can narrow it down to a couple of different DOMs, you can conditionally try a selector for each.
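As a rough sketch of "conditionally try both", the helper below walks a list of candidate selectors and returns the first one that yields non-empty text. `extractFirstMatch` and `LYRICS_SELECTORS` are names made up for this illustration. `.lyrics` comes from the question; the second selector is only an assumption about what the alternate DOM might use, so check the HTML you actually receive before relying on it.

```javascript
// Try each candidate selector in order and return the first non-empty text.
// `$` is the function returned by cheerio.load(html).
function extractFirstMatch($, selectors) {
  for (const selector of selectors) {
    const text = $(selector).text().trim()
    if (text.length > 0) {
      return { selector, text }
    }
  }
  return null // none of the known DOM variants matched
}

// '.lyrics' is the selector from the question. The second entry is an
// ASSUMPTION standing in for whatever the alternate DOM really uses.
const LYRICS_SELECTORS = ['.lyrics', '[data-lyrics-container="true"]']

// Usage inside the route handler (sketch):
//   const $ = cheerio.load(_res.body)
//   const match = extractFirstMatch($, LYRICS_SELECTORS)
//   if (match) res.send(match.text)
//   else res.status(502).send('No known lyrics container found')
```

Returning the matching selector alongside the text makes it easy to log which variant was served, which helps when looking for a pattern.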

This can also occur due to rate limiting.
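If rate limiting is the suspect, one option is to retry with an increasing delay when the result comes back empty. The sketch below is not tied to phin or Genius: `retryUntilNonEmpty` and `backoffDelays` are hypothetical helpers, and the attempt count and delays are arbitrary assumptions to tune.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Delays between attempts, doubling each time: base, 2*base, 4*base, ...
function backoffDelays(maxAttempts, baseDelayMs) {
  const delays = []
  for (let i = 0; i < maxAttempts - 1; i++) delays.push(baseDelayMs * 2 ** i)
  return delays
}

// Calls `attempt()` until it resolves to a non-empty string or the
// attempts run out. Resolves to '' if every attempt came back empty.
async function retryUntilNonEmpty(attempt, { maxAttempts = 3, baseDelayMs = 500 } = {}) {
  const delays = backoffDelays(maxAttempts, baseDelayMs)
  for (let i = 0; i < maxAttempts; i++) {
    const result = await attempt()
    if (typeof result === 'string' && result.trim() !== '') return result
    if (i < delays.length) await sleep(delays[i])
  }
  return ''
}

// Usage inside the route handler (sketch):
//   const lyrics = await retryUntilNonEmpty(async () => {
//     const _res = await p(url)
//     return cheerio.load(_res.body)('.lyrics').text()
//   })
```

Logging the response status code on each attempt would help tell a throttled response apart from an alternate DOM.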
