Extracting URLs from HTML in Python

I am trying to obtain a list of the URLs that appear (partially) in the HTML source of this webpage:

https://www.renfe.com/es/es/cercanias/cercanias-valencia/lineas

I have tried a couple of different approaches, but neither of them works.

First attempt

from bs4 import BeautifulSoup
import requests
import html
import urllib
import json
import re

url = 'https://www.renfe.com/es/es/cercanias/cercanias-valencia/lineas'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
links = soup.find_all('div', class_ = "rftabdetailline accordion aem-GridColumn aem-GridColumn--default--12")

links contains the following:

 [<div class="rftabdetailline accordion aem-GridColumn aem-GridColumn--default--12">
 <!-- rf-tab-detail-line en resto de modos -->
 <rf-tab-detail-line content='[{"color":"120,180,225","name":"C1","active":"true","stations":"València Nord \u2013 Gandía","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1.html"},{"color":"245,150,40","name":"C2","active":"false","stations":"València Nord \u2013 Xàtiva \u2013 Moixent","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html"},{"color":"125,37,130","name":"C3","active":"false","stations":"València Sant Isidre \u2013 Buñol \u2013 Utiel","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014184209.html"},{"color":"215,0,30","name":"C4","active":"false","stations":"València Sant Isidre \u2013 Xirivella L\u2019Alter","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014185974.html"},{"color":"0,139,41","name":"C5","active":"false","stations":"València Nord \u2013 Caudiel","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014187588.html"},{"color":"15,50,135","name":"C6","active":"false","stations":"València Nord \u2013 Castelló","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014189921.html"},{"color":"150,100,40","name":"ER02","active":"false","stations":"Castelló - Vinaròs","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1598612779629.html"}]' title-text="Seleccione una línea:">
 </rf-tab-detail-line>
 </div>]

In it, you can see the pieces that I want, for example "url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1.html". I would like to collect every /content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_WHATEVER.html path in a list. To do so, I have tried using extract and regular expressions, but I have not been successful.

Second attempt

Following the steps shown in the answer to the question "Extractinf info form HTML that has no tags", I ended up with the following code:

import requests
import html
import json

url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas'
response = requests.get(url)
data = response.text  # get data from site
raw_list = data.split("'")[8]  # extract attributes
json_list = html.unescape(raw_list)  # decode html symbols
parsed_list = json.loads(json_list)  # parse json 

I thought it would work because the output it deals with looks similar, but the line that defines parsed_list raises the following error:

  • JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Does anyone have any thoughts? Thank you all in advance!

This way:

import html
import json
import re
import requests

url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas'
response = requests.get(url)
page_text = response.text  # get data from site

regex = r"<rf-tab-detail-line title-text=\"Seleccione una línea:\" content=\"([^\"]+)"
encoded_content = re.findall(regex, page_text)

if len(encoded_content) == 0:
    print("Nothing found, possibly page structure changed.")
    exit()

encoded_content = html.unescape(encoded_content[0])
json_content = json.loads(encoded_content)

for item in json_content:
    print(item["url"])

Output:

/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014184209.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014185974.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014187588.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014189921.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1598612779629.html
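
The printed paths are relative to the site root. If absolute URLs are needed, one option is urllib.parse.urljoin from the standard library. The sketch below is only an illustration: BASE and paths are names introduced here, the two paths are copied from the output above, and whether the site actually serves the resulting addresses is an assumption that is not checked.

```python
from urllib.parse import urljoin

# Page the paths were scraped from (taken from the question).
BASE = "https://www.renfe.com/es/es/cercanias/cercanias-valencia/lineas"

# Two of the root-relative paths printed above, copied verbatim.
paths = [
    "/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1.html",
    "/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html",
]

# A path starting with "/" replaces the path of BASE and keeps its scheme and host.
absolute = [urljoin(BASE, path) for path in paths]

for link in absolute:
    print(link)
```

Because each path starts with a slash, only the scheme and host of BASE are reused; the /es/es/... part of BASE is dropped.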

Hope this is what you needed.

I would instead use a CSS attribute = value selector to target the single element holding that data, as it is more intuitive to read. Then you simply need to extract the content attribute and parse it with the json library, filtering for the url key-value pairs.
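
As for why the second attempt fails: data.split("'")[8] picks a chunk by position, so if the page contains single quotes elsewhere, or the attribute is double-quoted as the regex above assumes, that chunk is not the JSON string and json.loads gives up at the first character. Below is a minimal offline sketch of that failure mode. SAMPLE, parse_by_position and parse_by_pattern are hypothetical names, and SAMPLE is a made-up stand-in for the page text, so the real markup may differ.

```python
import html
import json
import re

# Hypothetical stand-in for the raw page text; not the real Renfe markup.
SAMPLE = (
    "<script>var a = 'x'; var b = 'y';</script>"
    '<rf-tab-detail-line title-text="Seleccione una línea:" '
    'content="[{&quot;name&quot;:&quot;C1&quot;,'
    '&quot;url&quot;:&quot;/lineas/item_1.html&quot;}]">'
    "</rf-tab-detail-line>"
)


def parse_by_position(text, index):
    """Fragile: split on single quotes and trust a fixed chunk index."""
    chunk = text.split("'")[index]
    return json.loads(html.unescape(chunk))


def parse_by_pattern(text):
    """Capture the content attribute by name rather than by position."""
    match = re.search(r'<rf-tab-detail-line[^>]*\bcontent="([^"]+)"', text)
    if match is None:
        raise ValueError("content attribute not found")
    return json.loads(html.unescape(match.group(1)))


position_error = None
try:
    # Chunk 2 is a piece of JavaScript here, not JSON.
    parse_by_position(SAMPLE, 2)
except json.JSONDecodeError as exc:
    position_error = str(exc)
    print("positional split failed:", position_error)

urls = [item["url"] for item in parse_by_pattern(SAMPLE)]
print(urls)
```

Printing the chunk before passing it to json.loads is a quick way to confirm what the split actually returned on the live page.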

import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.renfe.com/es/es/cercanias/cercanias-valencia/lineas'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
data = json.loads(soup.select_one('[title-text="Seleccione una línea:"]')['content'])
links = [i['url'] for i in data]
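
The same idea (find the element by its title-text attribute, read its content attribute, parse the JSON) can be tried offline without any network access. The sketch below uses the standard library's html.parser as a stand-in for BeautifulSoup, so it is not the code above, only an illustration of the steps. SAMPLE and ContentAttrFinder are hypothetical names, and the sample markup is invented for the example.

```python
import json
from html.parser import HTMLParser

# Hypothetical markup with an entity-escaped JSON attribute; not the real page.
SAMPLE = (
    '<div><rf-tab-detail-line title-text="Seleccione una línea:" '
    'content="[{&quot;name&quot;:&quot;C1&quot;,&quot;url&quot;:&quot;/lineas/item_1.html&quot;},'
    '{&quot;name&quot;:&quot;C2&quot;,&quot;url&quot;:&quot;/lineas/item_2.html&quot;}]">'
    "</rf-tab-detail-line></div>"
)


class ContentAttrFinder(HTMLParser):
    """Remember the content attribute of the first tag with a given title-text."""

    def __init__(self, title_text):
        super().__init__()
        self.title_text = title_text
        self.content = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if self.content is None and attr_map.get("title-text") == self.title_text:
            # Attribute values arrive with HTML entities already decoded.
            self.content = attr_map.get("content")


finder = ContentAttrFinder("Seleccione una línea:")
finder.feed(SAMPLE)
finder.close()

links = [item["url"] for item in json.loads(finder.content)]
print(links)
```

No html.unescape call is needed in this approach because the HTML parser decodes the &quot; entities in attribute values before the string reaches json.loads.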
