How to get data from a link inside a webpage in Python?

I need to collect data from the website https://webgate.ec.europa.eu/rasff-window/portal/?event=notificationsList&StartRow= and store it in a dataframe using pandas. For this I use the following code and get the data quite easily:

import pandas as pd
import requests

url = "https://webgate.ec.europa.eu/rasff-window/portal/?event=notificationsList&StartRow="

link = requests.get(url).text
df = pd.read_html(link)
df = df[-1]

Notice, however, that the rightmost column of every row in the table on the webpage holds another hyperlink named "Details". I would also like to add the data from inside that hyperlink to every row in our dataframe. How do we do that?

As suggested by Shi XiuFeng, BeautifulSoup is better suited to your problem. If you still want to proceed with your current code, you would have to use a regular expression to extract the URLs and add them as a column, like this:

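import re

import pandas as pd
import requests

url = "https://webgate.ec.europa.eu/rasff-window/portal/?event=notificationsList&StartRow="

response = requests.get(url)
html = response.text

# Take the first <tbody>, then capture the href of every "Details" anchor in it.
# re.DOTALL lets ".*?" match across the line breaks inside the table markup.
tbody = re.findall(r'<tbody.*?>.*?</tbody>', html, flags=re.DOTALL)[0]
links = re.findall(r'<a href="(.*?)">Details</a>', tbody)

# Parse the tables exactly as in the question and keep the last one.
df = pd.read_html(html)
df = df[-1]

# One link per table row; this assignment raises if the counts differ.
df['links'] = links
print(df)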
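
For comparison, here is a minimal sketch of the BeautifulSoup route mentioned above. It assumes the links can be identified purely by their visible text "Details", which is what the regex version relies on too. The function name extract_detail_links and its base_url parameter are made up for illustration. Nothing in the question says whether the hrefs are relative or absolute, so the sketch resolves them against the page URL to be safe.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_detail_links(html, base_url):
    """Return the absolute URL of every link whose visible text is 'Details'."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        if anchor.get_text(strip=True) == "Details":
            # urljoin keeps absolute hrefs as they are and resolves relative ones.
            links.append(urljoin(base_url, anchor["href"]))
    return links
```

With the page text in hand, this could replace the two re.findall calls: df['links'] = extract_detail_links(requests.get(url).text, url). Unlike the regex, it does not depend on the exact spacing or attribute order inside the anchor tag.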

Hope that solves your problem.
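
To go one step further and pull the contents of each Details page into its row, something along these lines might work. It is only a sketch: the layout of the detail pages is not described here, so it assumes they contain HTML tables that pd.read_html can parse, and it simply stores whatever tables it finds in a new column. The helper names and the 'details' column name are hypothetical, and the 'links' column is assumed to hold full URLs.

```python
import io

import pandas as pd
import requests


def fetch_html(url):
    """Download one page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def parse_tables(html):
    """Return every table on the page as a list of DataFrames (empty if none)."""
    try:
        return pd.read_html(io.StringIO(html))
    except ValueError:  # read_html raises ValueError when no table is found
        return []


def attach_details(df, link_column="links", fetch=fetch_html, parse=parse_tables):
    """Return a copy of df with a 'details' column holding the parsed page per row."""
    result = df.copy()
    parsed = [parse(fetch(link)) for link in result[link_column]]
    # dtype=object keeps each parsed result as a single cell value.
    result["details"] = pd.Series(parsed, index=result.index, dtype=object)
    return result
```

Calling df = attach_details(df) makes one request per row, so it can be slow on a long table. The fetch and parse arguments exist so either step can be swapped out, for example to add a delay between requests or to extract specific fields once the detail page structure is known.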
