I'm trying to crawl and scrape URLs from a nested XML sitemap using Python and Beautiful Soup.
I believe I've got the first part down. I've built a simple loop that accesses the main XML sitemap, pulls out the XML files that match certain criteria, and stores that index of XML files in a list.
The next part is where it gets fuzzy.
I'm trying to loop through each item in the above list, pull out each URL, and append it to a new list that will be written to a text file.
Here's my code for this section:
When I loop through and build the list I'm getting a weird output:
My first thought was that Python is appending '\n' after each line break. But when I try to loop through the URLs I get this:
Any help or guidance would be greatly appreciated!
Cheers
Somehow Python did not interpret \n as a newline character in this case (maybe caused by the marshalling of the XML contents), so the string holds a literal backslash followed by n. That's why it is not a valid URL and why you got that error from requests.
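One way to check which case you are in is to compare how the string behaves and what its repr() looks like. This is only a minimal sketch: the example.com URLs are made up and stand in for whatever your sitemap loop actually returns.

```python
# Hypothetical values for illustration; the real strings come from the sitemap.
real_newline = "https://example.com/a\nhttps://example.com/b"
literal_backslash_n = "https://example.com/a\\nhttps://example.com/b"

# A real newline is one character; the literal form is two (backslash + 'n').
print(len(real_newline), len(literal_backslash_n))

# splitlines() only splits on real newline characters.
print(real_newline.splitlines())
print(literal_backslash_n.splitlines())

# repr() makes the difference visible: the literal form shows a doubled backslash.
print(repr(real_newline))
print(repr(literal_backslash_n))
```

If splitlines() leaves your string in one piece, you are most likely dealing with the literal two-character sequence rather than a real newline.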
A workaround would be to do a string.split("\\n") to get the URLs back into a list.
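Here is a rough sketch of that workaround. The split_urls helper and the sample raw value are hypothetical and not taken from your code, so adapt the names to your own loop.

```python
def split_urls(raw: str) -> list[str]:
    """Split a string on literal backslash-n sequences and drop empty pieces."""
    # "\\n" is the two-character sequence backslash + 'n', not a newline.
    parts = raw.split("\\n")
    # strip() also removes any real whitespace or newlines left around each URL.
    return [part.strip() for part in parts if part.strip()]


# Made-up sample input standing in for the scraped sitemap text.
raw = "https://example.com/page-1\\nhttps://example.com/page-2\\n"
urls = split_urls(raw)
print(urls)
```

Each entry in urls should then be a clean string that you can pass to requests one at a time or write to your text file.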