How can I replace the href of links in lxml with Python?

I'm trying to scrape a website using lxml.

What I want to do is fetch the HTML of the web page, find all of the stylesheet links on it, and replace those links with updated ones, so that I can write the HTML with the updated links into a new HTML file.

The code I have so far is this:

import requests
from lxml import etree
from lxml import html

page = requests.get('https://www.flashscore.co.uk/basketball/') 

root = html.fromstring(page.content)

def get_original_list():
    # Collect the root-relative hrefs (those starting with '/') from every <link> element.
    original_list = []
    for link in root.xpath('//link'):
        href = link.get('href')
        if href and href.startswith('/'):
            original_list.append(href)

    return original_list

def get_new_list():
    original_list = get_original_list()

    new_list = []
    for x in original_list:
        new_list.append(x.lstrip('/'))

    return new_list

def replace_links(root):
    og_list = get_original_list()
    n_list = get_new_list()

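    for o, n in zip(og_list, n_list):
        print(o, n)
        # Problem: the tree is re-serialized on every pass, and str.replace()
        # returns a new string that is discarded here, so nothing changes.
        get_tree = etree.tostring(root).decode()
        get_tree.replace(o, n)

    print(get_tree)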

replace_links(root)

I'm stuck on replacing the links. How can I get the HTML of a page and replace the href of its links so that I can save the result to a new HTML file?

The "Elements are lists" section of the lxml tutorial should help.

This may also help

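You should be able to assign new values directly to the elements under root, for example by setting the href attribute on each link element, instead of editing the serialized string. Because you're scraping from the web rather than reading a local file, you can then create a new HTML file with open() and write the serialized tree to it.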
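
As a rough sketch of that approach (not necessarily what the answerer had in mind): the snippet below parses an inline sample document instead of fetching the live page, so SAMPLE_HTML, the helper name strip_leading_slash, and the output filename updated.html are all made up for illustration. It sets each href on the element itself with set(), then serializes the tree once and writes it out.

```python
from lxml import html

# Hypothetical sample markup standing in for page.content from requests.
SAMPLE_HTML = """<html><head>
<link rel="stylesheet" href="/res/styles/core.css">
<link rel="stylesheet" href="https://cdn.example.com/theme.css">
<link rel="icon">
</head><body><p>Hello</p></body></html>"""


def strip_leading_slash(root):
    """Rewrite root-relative hrefs on <link> elements in place.

    Returns the (old, new) pairs that were changed.
    """
    changed = []
    # Only <link> elements that actually carry an href attribute.
    for link in root.xpath('//link[@href]'):
        old = link.get('href')
        if old.startswith('/'):
            new = old.lstrip('/')
            link.set('href', new)  # modifies the element inside the tree
            changed.append((old, new))
    return changed


root = html.fromstring(SAMPLE_HTML)
changed = strip_leading_slash(root)

# Serialize once, after all elements have been updated.
output = html.tostring(root, encoding='unicode')

with open('updated.html', 'w', encoding='utf-8') as f:
    f.write(output)
```

With a real page, root would come from html.fromstring(page.content) as in the question. Note that startswith('/') also matches protocol-relative URLs such as //host/style.css, which lstrip('/') would mangle, so those may need a separate check.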
