
Cannot find URL from a WARC file crawled from Common Crawl

Original post 2017-07-17 11:56:45 4 1 python/ record/ common-crawl/ warc

I have crawled data from Common Crawl, and I want to find the URL that corresponds to each record.

for record in files:
    print record['WARC-Target-URI']

This prints an empty list. I am following the post at https://dmorgan.info/posts/common-crawl-python/. Do we get a target URI for each record, or only one target URI per WARC file path?

1 Answer

The information you are looking for is part of the record header. Try:

print record.header['WARC-Target-URI']
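To illustrate why the URI lives in the header, and that each record carries its own, here is a minimal sketch that reads WARC headers using only the Python 3 standard library. It is not the `warc` library used above: the names `iter_warc_headers`, `target_uris`, and `sample` are hypothetical, and the parser assumes well-formed records with a `Content-Length` field. As far as I know, `warcinfo` records normally have no `WARC-Target-URI`, so the sketch skips records without that field.

```python
import io


def iter_warc_headers(stream):
    """Yield a dict of WARC header fields for each record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        headers = {}
        while True:
            line = stream.readline()
            if not line or line in (b"\r\n", b"\n"):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Consume the record body so the next readline starts at the next record.
        stream.read(int(headers.get("Content-Length", 0)))
        yield headers


def target_uris(stream):
    """Collect WARC-Target-URI from every record that has one."""
    return [h["WARC-Target-URI"] for h in iter_warc_headers(stream)
            if "WARC-Target-URI" in h]


# Hypothetical two-record sample: a warcinfo record, then a response record.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: warcinfo\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
    b"\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

print(target_uris(io.BytesIO(sample)))
```

For a real `.warc.gz` file, the same function should work on a stream opened with `gzip.open(path, "rb")`, where `path` is whichever file you downloaded. For production use, a maintained WARC library is likely the better choice.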

