
What is the ideal way to use xml data in python html parsing with Beautiful Soup?

What is the ideal way to convert xml to text in python html parsing with Beautiful Soup?

When parsing HTML with the BeautifulSoup library in Python 2.7, I can get as far as creating the "soup" object, but I don't know how to extract the data I need, so I tried converting everything to a string.

In the following example, I want to extract all the numbers in the span tags and add them up. Is there a better way?

Data (an HTML page): http://python-data.dr-chuck.net/comments_324255.html

CODE:

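import urllib2
import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2 only)

url = 'http://python-data.dr-chuck.net/comments_324255.html'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
spans = soup('span')  # shorthand for soup.findAll('span')
span_str = str(spans)  # stringify the whole list of tags
sp = re.findall('([0-9]+)', span_str)  # pull every run of digits out of that string
count = 0
for i in sp:
    count = count + int(i)
print 'Sum:', count  # Python 2 print statement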

You don't need a regex:

from bs4 import BeautifulSoup
from requests import get

url = 'http://python-data.dr-chuck.net/comments_324255.html'
html = get(url).text
soup = BeautifulSoup(html, 'lxml')

count = sum(int(n.text) for n in soup.find_all('span'))
print(count)

Alternatively, select the spans by their class:

import bs4
import requests

r = requests.get("http://python-data.dr-chuck.net/comments_324255.html")
soup = bs4.BeautifulSoup(r.text, 'lxml')

print(sum(int(span.text) for span in soup.find_all(class_="comments")))

output:

2788
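
If it helps to try the idea without hitting the network, here is a minimal sketch of the same approach run against an inline HTML string. The sample markup is made up (it only assumes the real page uses <span class="comments"> the way the answers above suggest), and sum_span_numbers is a hypothetical helper name, not part of Beautiful Soup. It assumes Python 3 with bs4 installed, and it uses the built-in 'html.parser' backend instead of 'lxml' so no extra parser is required.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the downloaded page; the real page is only assumed
# to use the same <span class="comments"> markup.
SAMPLE_HTML = """
<table>
  <tr><td>Alice</td><td><span class="comments">97</span></td></tr>
  <tr><td>Bob</td><td><span class="comments">12</span></td></tr>
  <tr><td>Carol</td><td><span class="comments"> 5 </span></td></tr>
  <tr><td>Dave</td><td><span class="note">n/a</span></td></tr>
</table>
"""


def sum_span_numbers(html, css_class="comments"):
    """Sum the integer text of every <span> that has the given class."""
    # 'html.parser' ships with Python, so lxml does not need to be installed.
    soup = BeautifulSoup(html, "html.parser")
    total = 0
    for span in soup.find_all("span", class_=css_class):
        text = span.get_text(strip=True)  # tag text with surrounding whitespace removed
        if text.isdigit():  # skip spans whose text is not a plain non-negative integer
            total += int(text)
    return total


total = sum_span_numbers(SAMPLE_HTML)
print("Sum:", total)
```

Working on the parsed tags rather than on str(spans) avoids picking up digits that happen to appear in attributes or other markup, which is the main risk of the regex version. The isdigit() check is a design choice for this sketch: it silently skips non-numeric spans instead of raising ValueError the way a bare int() would.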


 