Crawling web for specific file type

As part of a research project, I need to download as many freely available RDF (Resource Description Framework, *.rdf) files from the web as possible. What are the ideal libraries/frameworks available in Python for doing this?

Are there any websites/search engines capable of doing this? I've tried a Google filetype:rdf search. Initially, Google shows 6,960,000 results. However, as you browse the individual result pages, the count drops drastically to 205 results. I wrote a script to screen-scrape and download the files, but 205 is not enough for my research, and I am sure there are more than 205 such files on the web. So I really need a file crawler. I'd like to know whether there are any online or offline tools that can be used for this purpose, or frameworks/sample scripts in Python to achieve this. Any help in this regard is highly appreciated.
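To make the goal concrete, the kind of file-type crawler I have in mind looks roughly like this minimal sketch. The seed URL, crawl budget, and output directory are placeholders, and it assumes the requests and beautifulsoup4 packages:

```python
# Minimal single-filetype crawler sketch (assumes: requests, beautifulsoup4).
# SEED_URLS, MAX_PAGES, and OUT_DIR are illustrative placeholders.
import os
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

SEED_URLS = ["https://www.w3.org/RDF/"]  # hypothetical starting point
MAX_PAGES = 500                          # crawl budget
OUT_DIR = "rdf_files"

def crawl():
    os.makedirs(OUT_DIR, exist_ok=True)
    seen, queue = set(SEED_URLS), deque(SEED_URLS)
    fetched = 0
    while queue and fetched < MAX_PAGES:
        url = queue.popleft()
        fetched += 1
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        if url.lower().endswith(".rdf"):
            # Save the target file, named after the last path segment.
            name = os.path.basename(urlparse(url).path) or "index.rdf"
            with open(os.path.join(OUT_DIR, name), "wb") as f:
                f.write(resp.content)
            continue
        if "html" not in resp.headers.get("Content-Type", ""):
            continue
        # Extract links from HTML pages and enqueue anything new.
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(1)  # politeness delay between page fetches

if __name__ == "__main__":
    crawl()
```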

Crawling RDF content from the Web is no different than crawling any other content. That said, if your question is "what is a good Python web crawler?", then you should read this question: Anyone know of a good Python based web crawler that I could use? If your question is related to processing RDF with Python, then there are several options, one being RDFLib.
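To illustrate the RDFLib option, here is a minimal sketch of loading and inspecting a downloaded file; the file path is a placeholder, not something from the question:

```python
# Load a downloaded RDF file with RDFLib and inspect its triples.
# "rdf_files/example.rdf" is a placeholder path.
from rdflib import Graph

g = Graph()
g.parse("rdf_files/example.rdf")  # format inferred from the extension; pass format="xml" if needed

print(f"{len(g)} triples loaded")
for subj, pred, obj in list(g)[:10]:  # print the first few triples
    print(subj, pred, obj)
```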

I know that I'm a bit late with this answer, but for future searchers: http://sindice.com/ is a great index of RDF documents.

Teleport Pro, although it probably can't copy from Google itself (too big), can likely handle proxy sites that return Google results. I know for a fact that I could download 10,000 PDFs in a day with it if I wanted to. It has filetype specifiers and many options.

Here's one workaround:

1. Get "Download Master" from the Chrome extensions store, or a similar program.
2. Search on Google (or elsewhere) for results, and set Google to show 100 results per page.
3. Select "show all files".
4. Type your file extension, .rdf, and press Enter.
5. Press Download.

You can get 100 files per click, which is not bad. A programmatic equivalent is sketched below.
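For anyone who prefers to script this step, here is a minimal sketch of the same idea: extract every .rdf link from a saved results page and download each one. The requests and beautifulsoup4 libraries, the results.html path, and the output directory are assumptions, not part of the original answer:

```python
# Programmatic equivalent of the manual workflow above: given a saved
# search-results page, pull out every link ending in .rdf and download it.
# "results.html", the base URL, and the output directory are placeholders.
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def download_rdf_links(results_html_path, base_url, out_dir="rdf_files"):
    os.makedirs(out_dir, exist_ok=True)
    with open(results_html_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    # Collect absolute links whose path ends in .rdf (query strings stripped).
    links = {urljoin(base_url, a["href"])
             for a in soup.find_all("a", href=True)
             if a["href"].lower().split("?")[0].endswith(".rdf")}
    for i, link in enumerate(sorted(links)):
        try:
            resp = requests.get(link, timeout=10)
            with open(os.path.join(out_dir, f"result_{i}.rdf"), "wb") as f:
                f.write(resp.content)
        except requests.RequestException:
            pass  # skip dead links

download_rdf_links("results.html", "https://www.google.com/")
```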

Did you notice text like "Google has hidden similar results, click here to show all results" at the bottom of one of the pages? That might help.
