How do I extract data from the URLs?
I have an xlsx file that stores many URLs along with their serial ids. Each of these URLs redirects to a webpage containing an article. My question is: how do I scan all the URLs using Python and store the title and text of each article in a new text file, with the URL's serial id as the file name?
You can do this with web scraping. As you said, you have an xlsx containing (id, url) tuples. You can start by loading it into Python:
import pandas as pd
urls = pd.read_excel(filename)
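Once the spreadsheet is loaded you can iterate over the (id, url) pairs. A minimal sketch: the column names `id` and `url` are assumptions (match them to your actual spreadsheet), and an inline DataFrame stands in for the file loaded by `pd.read_excel`:

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel(filename); the real file is
# assumed to have one column of serial ids and one of URLs.
urls = pd.DataFrame({
    "id": [1, 2],
    "url": ["https://example.com/a", "https://example.com/b"],
})

# itertuples yields one row at a time as a lightweight namedtuple
pairs = [(row.id, row.url) for row in urls.itertuples(index=False)]
```

Each `(id, url)` pair can then be fed to the scraping step below.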
Then, to read the content of each URL, you can use one of the most popular web-scraping libraries in Python: BeautifulSoup.
from bs4 import BeautifulSoup
import requests

# get the raw HTML from the request
content = requests.get(url).content

# build the soup (passing an explicit parser avoids a bs4 warning)
soup = BeautifulSoup(content, "html.parser")

# get the title
title_tag = soup.find("title")  # the tag: <title>ActualTitle</title>
title = title_tag.string        # the string: ActualTitle

# get the whole text contained in the page
text_content = soup.get_text()
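Putting it together, the last step from your question is writing the title and text to a file named after the serial id. A minimal sketch: in practice `content` would come from `requests.get(url).content` and `serial_id` from the spreadsheet; an inline HTML string and a hard-coded id stand in for them here, and the output directory is a temporary one for illustration:

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Stand-in for requests.get(url).content
content = "<html><head><title>ActualTitle</title></head><body><p>Article text.</p></body></html>"
serial_id = "42"  # hypothetical serial id taken from the spreadsheet

soup = BeautifulSoup(content, "html.parser")
title = soup.find("title").string
text = soup.get_text(separator="\n", strip=True)

# one text file per URL, named after its serial id
out_dir = Path(tempfile.mkdtemp())
out_file = out_dir / f"{serial_id}.txt"
out_file.write_text(f"{title}\n\n{text}", encoding="utf-8")
```

Run this body inside the loop over your (id, url) pairs, with `out_dir` pointing at wherever you want the text files to land.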