Compare two arrays with Node.js and Puppeteer

Question

I'm building a web scraper that, let's say, scrapes URLs from Google.

I get an array of URLs from the Google results:

const linkSelector = 'div.yuRUbf > a'
// $$eval passes the array of all matching elements to the callback
let links = await page.$$eval(linkSelector, anchors => {
     return anchors.map( x => x.href)
})

The output of 'links' looks something like this:

[
'https://google.com/.../antyhing',
'https://amazon.com/.../antyhing',
'https://twitter.com/.../antyhing'
]

Now I have a 'blacklist' that looks something like this:

[
'https://amazon.com'
]

At the moment I'm stuck at the point where I need to compare both arrays and remove from 'links' the URLs that are listed in my blacklist.

So I came up with the idea of getting the domain of each URL in my links array, like so:

const linkList = []
for ( const link of links ) {
  const url = new URL(link)
  const domain = url.origin
  linkList.push(domain)
}

Yes, now I have two arrays that I can compare against each other to remove the blacklisted domain, but I've lost the complete URL I need to work with...

for( let i = linkList.length - 1; i >= 0; i--){
  for( let j=0; j < blacklist.length; j++){
    if( linkList[i] === blacklist[j]){
      linkList.splice(i, 1);
    }
  }
}

The code snippet is part of the answer given here: Compare two Javascript Arrays and remove Duplicates

Any ideas how I can do this with Puppeteer and Node.js?

Answer 1

I couldn't find an obvious dupe, so I'm converting my comments to an answer:

.includes:

const allowedLinks = links.filter(link => !blacklist.some(e => link.includes(e)))

.startsWith:

const allowedLinks = links.filter(link => !blacklist.some(e => link.startsWith(e)))

The second version is more precise. If you want to use the URL version, this should work:

const links = [
  "https://google.com/.../antyhing",
  "https://amazon.com/.../antyhing",
  "https://twitter.com/.../antyhing",
];
const blacklist = ["https://amazon.com"];
const allowedLinks = links.filter(link =>
  !blacklist.some(black =>
    black.startsWith(new URL(link).origin) // or use ===
  )
);
console.log(allowedLinks);
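To illustrate why the choice of check matters, here is a minimal sketch that runs the three strategies side by side. The sample URLs (the example.org and amazon.com.example.net entries in particular) are made up for illustration and are not from the question; they are chosen only to expose false positives.

```javascript
// Hypothetical sample data, chosen to expose false positives.
const links = [
  "https://amazon.com/dp/123",
  "https://example.org/?ref=https://amazon.com",
  "https://amazon.com.example.net/page",
];
const blacklist = ["https://amazon.com"];

// Substring match: also drops links that merely mention a blacklisted URL.
const byIncludes = links.filter(
  link => !blacklist.some(e => link.includes(e))
);

// Prefix match: ignores mentions later in the URL, but still matches a
// different host whose name begins with the same characters.
const byStartsWith = links.filter(
  link => !blacklist.some(e => link.startsWith(e))
);

// Exact origin match: compares scheme + host + port only.
const byOrigin = links.filter(
  link => !blacklist.includes(new URL(link).origin)
);

console.log(byIncludes);   // []
console.log(byStartsWith); // only the example.org link
console.log(byOrigin);     // the example.org and amazon.com.example.net links
```

Which behavior is right depends on what the blacklist is meant to express. If entries are always bare origins, the exact comparison is probably the least surprising.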

As for Puppeteer, I doubt it matters whether you do this Node-side or browser-side, unless these arrays are enormous. On that train of thought, technically we have a quadratic algorithm here, but I wouldn't worry about it unless you have many hundreds of thousands of elements and are noticing slowness. In that case, you can put the blacklisted origins into a Set and look up each link's origin in it. The problem with this is that it's an exact === match, so you'd have to build a prefix set if you need to preserve .startsWith semantics. This is likely unnecessary and out of scope for this answer, but worth mentioning briefly.
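The Set idea can be sketched roughly as follows. This assumes every blacklist entry is a full URL that new URL() can parse, and it uses exact origin matching rather than .startsWith semantics; the name blockedOrigins is made up for this example.

```javascript
const links = [
  "https://google.com/.../antyhing",
  "https://amazon.com/.../antyhing",
  "https://twitter.com/.../antyhing",
];
const blacklist = ["https://amazon.com"];

// Build the Set once so each lookup is constant time on average.
// Assumption: blacklist entries are parseable URLs; normalizing them
// through URL#origin drops any path or trailing slash.
const blockedOrigins = new Set(blacklist.map(e => new URL(e).origin));

// Keep the full link, but test only its origin against the Set.
const allowedLinks = links.filter(
  link => !blockedOrigins.has(new URL(link).origin)
);

console.log(allowedLinks);
```

This keeps the complete URLs in the result while doing one Set lookup per link instead of scanning the whole blacklist each time.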

Solution 1
0 Accepted 2022-09-29 19:09:41