Common Crawl Keyword Lookup

I want to find a list of all the websites that contain specific keywords. For example, if I search for the keyword "Sports" or "Football", only the URL, title, description, and image of the related websites should be extracted from the Common Crawl WARC files. At present I am able to read a WARC file fine with the following code.

import warc

f = warc.open("firsttest.warc.gz")
name = "sports"
for record in f:
    # Header lookups are case-insensitive in the warc library
    url = record.header.get('warc-target-uri', 'none')
    date = record.header.get("WARC-Date")
    ip = record.header.get('WARC-IP-Address')
    payload_digest = record.header.get('WARC-Payload-Digest')
    # This tests the header field names only, not the page content
    search = name in record.header
    print("URL :" + str(url))
    #print("date :" + str(date))
    #print("IP :" + str(ip))
    #print("payload_digest :" + str(payload_digest))
    #print("search :" + str(search))
    # Raw record body: HTTP headers followed by the page content
    text = record.payload.read()
    #print("Text :" + str(text))

But it returns every URL in the specified WARC file. I need only the URLs that match "sports" or "football". How can I search for that keyword in WARC files? Please help me with this, as I am new to Common Crawl. I have also checked a lot of posts, but none of them worked.

I also need to grab the article image, if the page has one. How can I get it, given that Common Crawl saves the entire web page?

You can use AWS Athena to query the Common Crawl index on S3. For example, here is my SQL query to find URLs whose path matches "sports" and "football" in the CC-MAIN-2019-13 crawl index. See this page: http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

SELECT *
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2019-13'
  AND subset = 'warc'
  -- both keywords must appear in the path; use OR to match either one
  AND url_path LIKE '%sports%'
  AND url_path LIKE '%football%'
LIMIT 10
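
The query above only narrows down URLs by their path; the title, description, and image still have to come from the page itself. Below is a minimal sketch of that second step, not a definitive implementation. It assumes each index row gives you `warc_record_offset` and `warc_record_length` (columns of the columnar index), that you have already fetched and decoded the HTML of the record, and that pages expose their image through an Open Graph `og:image` meta tag, which many but not all pages do. The names `record_range_header`, `PageMetaParser`, `extract_page_meta`, and `matches_keywords` are hypothetical helpers made up for this illustration.

```python
from html.parser import HTMLParser

KEYWORDS = ("sports", "football")  # keywords from the question


def record_range_header(offset, length):
    """Build an HTTP Range header value covering one WARC record.

    Assumption: offset and length come from the index columns
    warc_record_offset and warc_record_length.
    """
    return "bytes={}-{}".format(offset, offset + length - 1)


class PageMetaParser(HTMLParser):
    """Collect the <title>, meta description, and og:image of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.image = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            content = attrs.get("content") or ""
            if (attrs.get("name") or "").lower() == "description":
                self.description = content
            elif (attrs.get("property") or "").lower() == "og:image":
                self.image = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_page_meta(html):
    """Return the title, description, and image URL found in an HTML string."""
    parser = PageMetaParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return {
        "title": parser.title.strip(),
        "description": parser.description.strip(),
        "image": parser.image.strip(),
    }


def matches_keywords(meta, keywords=KEYWORDS):
    """True if any keyword occurs in the title or description (case-insensitive)."""
    haystack = (meta["title"] + " " + meta["description"]).lower()
    return any(keyword.lower() in haystack for keyword in keywords)
```

To use it, request only the bytes of one record from the file named in `warc_filename` with the header `Range: record_range_header(offset, length)`, decompress the gzip member, drop the WARC and HTTP headers that precede the HTML, and pass the HTML to `extract_page_meta`. Keeping a page only when `matches_keywords` returns true filters on page content rather than on the URL path alone.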

