Reading an XML file from URL in Python

I would like to read the integers inside the <count> tags.

This is the code I have written:

import xml.etree.ElementTree as ET
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url =  'http://py4e-data.dr-chuck.net/comments_42.xml'
content1 = urllib.request.urlopen(url, context = ctx).read()
soup = BeautifulSoup(content1, 'html.parser')

tree = ET.fromstring(soup)
tags = tree.findall('count')
print(tags)

It throws an error:

Traceback (most recent call last):
  File "C:\Users\Name\Desktop\Py4e\Me\Assi_15_01.py", line 15, in <module>
    tree = ET.fromstring(soup)

  File "C:\Users\Name\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1320, in XML
    parser.feed(text)
TypeError: a bytes-like object is required, not 'BeautifulSoup'

What can I do?

More information: http://py4e-data.dr-chuck.net/comments_42.xml

There's no need to use xml.etree; just select all <count> tags with BeautifulSoup:

import requests
from bs4 import BeautifulSoup


url =  'http://py4e-data.dr-chuck.net/comments_42.xml'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for c in soup.select('count'):
    print(int(c.text))

Prints:

97
97
90
90
88
87
87
80
79
79
78
76
76
72
72
66
66
65
65
64
61
61
59
58
57
57
54
51
49
47
40
38
37
36
36
32
25
24
22
21
19
18
18
14
12
12
9
7
3
2

I don't think you need to use ElementTree. Just change BeautifulSoup to use the lxml parser (change 'html.parser' to 'lxml') and call the find_all method on soup, not tree (i.e. soup.find_all('count')).
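
If you would rather keep ElementTree, as in the question, the traceback points at the fix: ET.fromstring expects a string or bytes, so it can take the result of urlopen(...).read() directly, with no BeautifulSoup step in between. The following is a minimal sketch under that assumption. The names extract_counts, fetch_counts and SAMPLE are made up for illustration, and SAMPLE only imitates what the feed is assumed to look like (<count> elements nested a few levels below the root). It also uses './/count' instead of 'count', because findall('count') only matches direct children of the root element.

```python
import urllib.request
import xml.etree.ElementTree as ET


def extract_counts(xml_data):
    """Return the integer value of every <count> element, at any depth."""
    # fromstring accepts bytes or str, but not a BeautifulSoup object.
    root = ET.fromstring(xml_data)
    # './/count' searches all descendants; plain 'count' would only
    # match elements that are direct children of the root.
    return [int(node.text) for node in root.findall('.//count')]


def fetch_counts(url):
    """Download the XML at `url` and return its <count> values."""
    with urllib.request.urlopen(url) as response:
        return extract_counts(response.read())


# Hypothetical sample with an assumed layout (not the real data), so the
# parsing step can be tried without network access.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<commentinfo>
  <comments>
    <comment><name>A</name><count>97</count></comment>
    <comment><name>B</name><count>90</count></comment>
  </comments>
</commentinfo>"""

if __name__ == '__main__':
    print(extract_counts(SAMPLE))  # [97, 90]
    # For the live feed from the question:
    # print(fetch_counts('http://py4e-data.dr-chuck.net/comments_42.xml'))
```

Since the URL in the question is plain http, the custom ssl context should not be needed for this request.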
