
Scraping subfiles from URL using Python

A webpage I would like to scrape consists of several files:

[image: webpage]

I'm interested in scraping only the highlighted file, that is: mboxFrame.

My method of scraping pages

import requests
from bs4 import BeautifulSoup

webPage = requests.get(URL, verify=False)

soup = BeautifulSoup(webPage.content, "html.parser")

is able to scrape only the file mail.html. Is there a way to scrape only what I want?

I would appreciate any hints or tips.

The way to open a file from a server is to request it with a URL. In fact, in the early days of the World Wide Web this was the only way to get content: content creators would put various files on servers, and clients would open or download those files. The dynamic processing of URIs and parameters is a later invention. That is why commenters are asking for the URL you use. We want to see it and modify it accordingly, to help you see which parts need changing in order to get that particular file. You can omit the password, or replace it with some other string of letters.

In general, the file you want would be under the URL you use, but ending with the file name. If the starting URL is www.example.com/mail/ , then this file would be at www.example.com/mail/mbox.msc .
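As a rough sketch of that idea, the standard library's urljoin can append a file name to a directory-style URL. The https:// scheme and the helper name subfile_url are assumptions made for illustration; the real address and file name depend on your server.

```python
from urllib.parse import urljoin


def subfile_url(base_url: str, file_name: str) -> str:
    """Return the URL of file_name located under base_url (hypothetical helper)."""
    # urljoin replaces the last path segment unless the base ends with "/",
    # so add the trailing slash when it is missing.
    if not base_url.endswith("/"):
        base_url += "/"
    return urljoin(base_url, file_name)


# Placeholder address from the answer, with an assumed https:// scheme.
print(subfile_url("https://www.example.com/mail/", "mbox.msc"))
```

The resulting string can then be passed to requests.get in place of the original URL.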

Please note that any parameters should follow the path, so www.example.com/mail?user=hendrra&password=hendras_password would turn into www.example.com/mail/mbox.msc?user=hendrra&password=hendras_password
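One possible way to do this rewrite in Python is to split the URL, extend only the path, and put the query string back unchanged. This is a minimal sketch: the helper name with_subfile and the https:// scheme are assumptions, and the user and password values are the placeholders used above.

```python
from urllib.parse import urlsplit, urlunsplit


def with_subfile(url: str, file_name: str) -> str:
    """Insert file_name at the end of the path, keeping the query string (hypothetical helper)."""
    parts = urlsplit(url)
    # Extend only the path component; the parameters stay after it.
    path = parts.path.rstrip("/") + "/" + file_name
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


start = "https://www.example.com/mail?user=hendrra&password=hendras_password"
print(with_subfile(start, "mbox.msc"))
```

The returned URL can be used with requests.get and BeautifulSoup exactly as in the question. Whether the server actually serves the file at that address is something only a real request can confirm.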
