
Compare two arrays with node.js and puppeteer

I am building a web scraper that, let's say, scrapes URLs from Google.

I get an array of URLs from the Google results:

const linkSelector = 'div.yuRUbf > a'
let links = await page.$$eval(linkSelector, link => {
     return link.map( x => x.href)
})

The output of 'links' looks something like this:

[
'https://google.com/.../antyhing',
'https://amazon.com/.../antyhing',
'https://twitter.com/.../antyhing'
]

Now I have a 'blacklist' that looks something like this:

[
'https://amazon.com'
]

At the moment I am stuck at the point where I need to compare both arrays and remove from 'links' the URLs that are listed in my blacklist.

So I came up with the idea of getting the domain (origin) of each URL in my links array, like so:

const linkList = []
for (const link of links) {
  const url = new URL(link)
  const domain = url.origin
  linkList.push(domain)
}

Yes, now I have two arrays that I can compare against each other to remove the blacklisted domain, but I have lost the complete URLs I need to work with...

for( let i = linkList.length - 1; i >= 0; i--){
  for( let j=0; j < blacklist.length; j++){
    if( linkList[i] === blacklist[j]){
      linkList.splice(i, 1);
    }
  }
}

The code snippet is part of the given answer here: Compare two Javascript Arrays and remove Duplicates

Any ideas how I can do this with puppeteer and node.js?

I couldn't find an obvious dupe, so I'm converting my comments to an answer:

.includes:

const allowedLinks = links.filter(link => !blacklist.some(e => link.includes(e)))

.startsWith:

const allowedLinks = links.filter(link => !blacklist.some(e => link.startsWith(e)))
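A quick sketch of how the two filters can differ. The second link below is a made-up redirect URL (not from the question) that merely contains the blacklisted string in its query; the variable names byIncludes and byStartsWith are also just for illustration.

```javascript
const blacklist = ["https://amazon.com"];
const links = [
  "https://amazon.com/some/product",
  // hypothetical URL that only mentions the blacklisted site in its query string
  "https://example.com/redirect?to=https://amazon.com/some/product",
];

// .includes drops any link that contains a blacklisted string anywhere
const byIncludes = links.filter(link => !blacklist.some(e => link.includes(e)));

// .startsWith only drops links that begin with a blacklisted string
const byStartsWith = links.filter(link => !blacklist.some(e => link.startsWith(e)));

console.log(byIncludes);   // both links are removed
console.log(byStartsWith); // only the example.com link remains
```

Whether the redirect link should survive depends on your use case, so treat this as an illustration of the behavior rather than a recommendation.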

The second version is more precise. If you want to use the URL version, this should work:

const links = [
  "https://google.com/.../antyhing",
  "https://amazon.com/.../antyhing",
  "https://twitter.com/.../antyhing",
];
const blacklist = ["https://amazon.com"];
const allowedLinks = links.filter(link =>
  !blacklist.some(black =>
    black.startsWith(new URL(link).origin) // or use ===
  )
);
console.log(allowedLinks);

As for Puppeteer, I doubt it matters whether you do this Node-side or browser-side, unless these arrays are enormous. On that train of thought, technically we have a quadratic algorithm here, but I wouldn't worry about it unless you have many hundreds of thousands of elements and are noticing slowness. In that case, you can put the blacklisted origins into a Set and look up each link's origin in it. The catch is that a Set lookup is an exact === match, so you'd have to build a prefix set if you need to preserve .startsWith semantics. This is likely unnecessary and out of scope for this answer, but worth mentioning briefly.
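For reference, a minimal sketch of the Set idea, assuming every blacklist entry and every link is an absolute URL (new URL throws otherwise) and that exact origin matching is acceptable. The name blockedOrigins is made up for this example.

```javascript
const links = [
  "https://google.com/.../antyhing",
  "https://amazon.com/.../antyhing",
  "https://twitter.com/.../antyhing",
];
const blacklist = ["https://amazon.com"];

// Normalize each blacklist entry to its origin once, up front
const blockedOrigins = new Set(blacklist.map(entry => new URL(entry).origin));

// One Set lookup per link instead of scanning the whole blacklist
const allowedLinks = links.filter(
  link => !blockedOrigins.has(new URL(link).origin)
);

console.log(allowedLinks);
```

The full URLs are kept in allowedLinks; only the comparison uses the origin.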


 