
How to get data from a link inside a webpage in Python?

I need to collect data from the page at https://webgate.ec.europa.eu/rasff-window/portal/?event=notificationsList&StartRow= and store it in a pandas dataframe. The following code gets the table quite easily:

import pandas as pd
import requests

url = "https://webgate.ec.europa.eu/rasff-window/portal/?event=notificationsList&StartRow="

link = requests.get(url).text
df = pd.read_html(link)
df = df[-1]

However, the rightmost column of every row in the table on that page contains a hyperlink named "Details". I would also like to add the data behind that hyperlink to the corresponding row of the dataframe. How can I do that?

As Shi XiuFeng suggested, BeautifulSoup is better suited to this problem. If you still want to stay with your current code, you can use a regex to extract the URLs and add them as a column, like this:

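import re
from io import StringIO

import pandas as pd
import requests

url = "https://webgate.ec.europa.eu/rasff-window/portal/?event=notificationsList&StartRow="

html = requests.get(url).text

# Isolate the first table body, then capture the href of every "Details" anchor.
# re.DOTALL lets '.' match newlines, because the markup spans several lines.
tbody = re.findall(r'<tbody.*?>.*?</tbody>', html, re.DOTALL)[0]
links = re.findall(r'<a href="(.*?)">Details</a>', tbody, re.DOTALL)

# read_html returns every table on the page; the last one is the one we want.
# Wrapping the text in StringIO avoids the literal-HTML deprecation in newer pandas.
df = pd.read_html(StringIO(html))[-1]

# Assumes exactly one "Details" link per table row, in row order.
df['links'] = links
print(df)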
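
A regex like the one above tends to break as soon as the markup changes slightly (extra attributes, single quotes, different spacing), so an HTML parser is usually the safer route. Below is a minimal sketch of the same link extraction using only the standard library's html.parser, not BeautifulSoup. The names DetailsLinkParser and extract_details_links and the sample markup are made up for illustration, and the sketch assumes that each row contains one anchor whose visible text is exactly "Details".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DetailsLinkParser(HTMLParser):
    """Collect the href of every <a> whose visible text is 'Details'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip() == "Details":
                # Resolve relative hrefs against the page URL.
                self.links.append(urljoin(self.base_url, self._href))
            self._href = None


def extract_details_links(html, base_url):
    """Return absolute URLs of all 'Details' links, in document order."""
    parser = DetailsLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Made-up sample markup; the real page layout may differ.
sample_html = """
<table><tbody>
<tr><td>row 1</td><td><a href="?event=details&amp;id=1">Details</a></td></tr>
<tr><td>row 2</td><td><a href="/other">Other</a>
    <a href="?event=details&amp;id=2">Details</a></td></tr>
</tbody></table>
"""
base = "https://webgate.ec.europa.eu/rasff-window/portal/"
links = extract_details_links(sample_html, base)
print(links)
```

On the live page, the result of extract_details_links(requests.get(url).text, url) could be assigned to df['links'] in the same way, provided the number of links matches the number of rows.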

Hope that solves your problem.
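
The links column only holds URLs. To get the data behind each "Details" link into the row, every link still has to be fetched and parsed in turn. Here is a rough sketch of that loop. collect_details and parse_title are hypothetical helpers, and the layout of the details pages is not known here, so parse_title just grabs the page title as a stand-in for whatever fields you actually need. It also assumes the links are absolute URLs; relative ones would need urllib.parse.urljoin first.

```python
import re
import time


def collect_details(links, fetch, parse, delay=0.0):
    """Fetch and parse each link, returning one dict per link, in order."""
    records = []
    for link in links:
        try:
            records.append(parse(fetch(link)))
        except Exception as exc:
            # Keep one record per link so rows stay aligned with the table.
            records.append({"error": str(exc)})
        if delay:
            time.sleep(delay)  # pause between requests to go easy on the server
    return records


def parse_title(html):
    """Hypothetical parser: keep only the page <title> as a placeholder field."""
    match = re.search(r"<title>(.*?)</title>", html, re.DOTALL | re.IGNORECASE)
    return {"detail_title": match.group(1).strip() if match else None}


# Offline demonstration with made-up pages instead of real HTTP requests.
fake_pages = {
    "https://example.invalid/1": "<html><title>Notification 1</title></html>",
    "https://example.invalid/2": "<html><title> Notification 2 </title></html>",
}
demo_links = list(fake_pages) + ["https://example.invalid/missing"]
records = collect_details(demo_links, fetch=fake_pages.__getitem__, parse=parse_title)
print(records)

# With the dataframe from the answer, the call might look like this:
#   records = collect_details(df['links'], lambda u: requests.get(u).text,
#                             parse_title, delay=1)
#   df = pd.concat([df, pd.DataFrame(records)], axis=1)
```

Passing the fetch function in as an argument keeps the loop testable without network access, and the delay parameter lets you throttle requests when there are many rows.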


 