Compare two arrays with node.js and puppeteer
I'm building a web scraper that, let's say, scrapes URLs from Google.
I get an array of URLs from the Google results:
const linkSelector = 'div.yuRUbf > a'
let links = await page.$$eval(linkSelector, anchors => {
  // $$eval passes an array of all matched elements to the callback
  return anchors.map(x => x.href)
})
The output of 'links' is something like this:
[
  'https://google.com/.../antyhing',
  'https://amazon.com/.../antyhing',
  'https://twitter.com/.../antyhing'
]
Now I have a 'blacklist' that looks something like this:
[
'https://amazon.com'
]
At the moment I'm stuck at the point where I need to compare both arrays and remove from 'links' the URLs that are listed in my blacklist.
So I came up with the idea of extracting the domain of each URL in my links array, like so:
const linkList = []
for (const link of links) {
  const url = new URL(link)
  const domain = url.origin
  linkList.push(domain)
}
Yes, now I have two arrays that I can compare against each other to remove the blacklisted domain, but I've lost the complete URL that I need to work with...
for (let i = linkList.length - 1; i >= 0; i--) {
  for (let j = 0; j < blacklist.length; j++) {
    if (linkList[i] === blacklist[j]) {
      linkList.splice(i, 1);
    }
  }
}
The code snippet is taken from the answer given here: Compare two Javascript Arrays and remove Duplicates
Any ideas how I can do this with Puppeteer and Node.js?
I couldn't find an obvious dupe, so converting my comments to an answer:
Using .includes:
const allowedLinks = links.filter(link => !blacklist.some(e => link.includes(e)))
Using .startsWith:
const allowedLinks = links.filter(link => !blacklist.some(e => link.startsWith(e)))
The second version is more precise. If you want to use the URL version, this should work:
const links = [
  "https://google.com/.../antyhing",
  "https://amazon.com/.../antyhing",
  "https://twitter.com/.../antyhing",
];
const blacklist = ["https://amazon.com"];
const allowedLinks = links.filter(link =>
  !blacklist.some(black =>
    black.startsWith(new URL(link).origin) // or use ===
  )
);
console.log(allowedLinks);
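To make the difference between the variants concrete, here is a small sketch that runs all three checks on a few made-up links. The example.com and amazon.com.evil.example URLs are hypothetical inputs picked to trigger false positives; they are not from the question. Note that every variant filters the original links array, so the full URLs are preserved.

```javascript
const blacklist = ["https://amazon.com"];

// Hypothetical links, chosen to show where the looser checks over-match.
const links = [
  "https://amazon.com/some/product",
  "https://example.com/?ref=https://amazon.com",
  "https://amazon.com.evil.example/page",
];

// Substring check: drops any link that mentions a blacklisted entry anywhere.
const byIncludes = links.filter(
  link => !blacklist.some(e => link.includes(e))
);

// Prefix check: drops links that begin with a blacklisted entry.
const byStartsWith = links.filter(
  link => !blacklist.some(e => link.startsWith(e))
);

// Exact origin check: drops links whose origin equals a blacklisted entry.
const byOrigin = links.filter(
  link => !blacklist.some(e => new URL(link).origin === e)
);

console.log(byIncludes);   // nothing survives the substring check
console.log(byStartsWith); // only the example.com link survives
console.log(byOrigin);     // both non-amazon.com links survive
```

Which behavior is right depends on what the blacklist is meant to express. If entries are always bare origins, the exact comparison is probably the safest choice.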
As for Puppeteer, I doubt it matters whether you do this Node-side or browser-side, unless these arrays are enormous. On that train of thought, technically we have a quadratic algorithm here, but I wouldn't worry about it unless you have many hundreds of thousands of elements and are noticing slowness. In that case, you can put the blacklisted origins into a Set and look up each link's origin in that. The problem with this is that it's a precise === lookup, so you'd have to build a prefix set if you need to preserve .startsWith semantics. This is likely unnecessary and out of scope for this answer, but worth mentioning briefly.
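For reference, the Set-based lookup could be sketched roughly as follows. This assumes every link and every blacklist entry is a valid absolute URL (new URL throws otherwise) and that exact origin matching is acceptable. The name blockedOrigins is only illustrative.

```javascript
const links = [
  "https://google.com/.../antyhing",
  "https://amazon.com/.../antyhing",
  "https://twitter.com/.../antyhing",
];
const blacklist = ["https://amazon.com"];

// Normalize blacklist entries to origins so they match new URL(link).origin.
const blockedOrigins = new Set(blacklist.map(e => new URL(e).origin));

// One Set lookup per link instead of scanning the whole blacklist each time.
const allowedLinks = links.filter(
  link => !blockedOrigins.has(new URL(link).origin)
);

console.log(allowedLinks);
```

Building the Set is a single pass over the blacklist, and each lookup afterwards takes roughly constant time, so the filter scales with the number of links rather than links times blacklist entries.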