
Is it possible for JavaScript to generate DOM HTML that Cheerio cannot extract?

I am trying to extract the price from this webpage: https://www.allbirds.com/products/mens-wool-runner-up-mizzles-natural-grey?size=13

I narrowed it down to these divs:

<div class="jsx-3947815802 Container">
<div class="jsx-526902087 Grid">
<div class="jsx-2943457050 Grid__cell Grid__cell--small-12 Grid__cell--medium-7 Grid__cell--large-up-8">...

The jsx-{random_number} part of the class names looks suspicious to me; it seems to be generated on the fly. The price I need is inside divs like these. However, these divs do not exist in the page source or in the cheerio object I am using at runtime. They just disappear.

How common is this technique? It seems like a pretty good way to stop web scrapers. How do I get around it?

If those classes are random, it might be annoying, but it's not a deal-breaker, because the other classes look to be static.

For example, the element that includes the price looks something like:

<p class="jsx-3188494938 Paragraph PdpMasterProductDetails__paragraph">$135</p>

The PdpMasterProductDetails__paragraph class does not change, so you can retrieve the text by using it as a selector:

$('.PdpMasterProductDetails__paragraph').text()

You can also retrieve the price from a meta tag:

<meta property="og:price:amount" content="135">

which can be selected via the selector string:

meta[property="og:price:amount"]

How common is this technique?

Very.

Building websites as Single Page Applications with tools like React is very common.

It seems like a pretty good way to stop web scrapers.

It isn't.

How do I get around it?

Request the raw data directly from the web service that the React code fetches it from. The endpoint is easy to discover via the Network tab in the browser's developer tools.
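
A rough sketch of that approach, assuming Node.js 18 or later (for the built-in fetch). The endpoint URL and the JSON shape used here are placeholders; the real ones have to be read off the Network tab for the site in question:

```javascript
// Hypothetical endpoint: replace with the request URL seen in the Network tab.
const PRODUCT_URL = 'https://example.com/api/products/some-product.json';

// Downloads and parses a JSON response, failing loudly on HTTP errors.
async function fetchProductJson(url) {
  const response = await fetch(url, { headers: { Accept: 'application/json' } });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}

// Assumed payload shape: { product: { variants: [{ price: "135.00" }] } }.
// Returns the first variant's price as a number, or null if it is absent.
function extractPrice(payload) {
  const variant = payload?.product?.variants?.[0];
  if (!variant || variant.price == null) {
    return null;
  }
  const price = Number(variant.price);
  return Number.isNaN(price) ? null : price;
}

// Example with an inline payload, so no network access is needed:
const samplePayload = { product: { variants: [{ price: '135.00' }] } };
console.log(extractPrice(samplePayload)); // prints 135

// Against a real endpoint, usage would look like:
// fetchProductJson(PRODUCT_URL).then((data) => console.log(extractPrice(data)));
```

Keeping the parsing step separate from the network call means the extraction logic can be checked against a saved response before any live requests are made.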
