
Python BS4 unwrap() scraped xml data

I'm a journalist working on a project that uses web scraping to pull data from the county jail site. I'm still teaching myself Python and am trying to get a list of charges and the bail that was assigned for each charge. The site uses XML, and I've been able to pull the data for charges and bail and write it to a CSV file, but I'm having trouble using the unwrap() function to remove the tags. I've tried it in a few places and can't figure out how it is meant to be used. I'd really like to do this in the code rather than run a find and replace in the spreadsheet.

from bs4 import BeautifulSoup
import requests
import csv
from datetime import datetime

url="https://legacyweb.randolphcountync.gov/sheriff/jailroster.xml"
xml = requests.get(url)
response = requests.get(url)
if response.status_code == 200:
   print("Connecting to jail website:")
   print("Connected - Response code:", response)
   print("Scraping Started at ", datetime.now())

   soup = BeautifulSoup(xml.content, 'lxml')

   charges = soup.find_all('ol')
   bail_amt = soup.find_all('ob')

with open('charges-bail.csv', 'a', newline='') as csvfile:
    chargesbail = csv.writer(csvfile, delimiter=',')
    chargesbail.writerow([charges.unwrap(), bail_amt.unwrap()])

CSV File

"[<ol>BREAKING AND OR ENTERING (F)</ol>, <ol>POSS STOLEN GOODS/PROP (F)</ol>, <...

There is no need to use the unwrap() function; you just need to access the text within each element. I suggest searching for the <of> elements, each of which contains both the <ol> and <ob> entries. Doing this keeps your lists of ol and ob entries from getting out of sync, since not all entries have an ob.
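
To see why unwrap() is the wrong tool here, a minimal sketch may help. The one-record sample string is made up for illustration (the real feed's layout is assumed to be an of element holding an ol and an ob), and it uses the built-in html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical one-record sample; the real feed's layout is assumed.
sample = "<of><ol>HABEAS CORPUS</ol><ob>100000</ob></of>"
soup = BeautifulSoup(sample, "html.parser")

charges = soup.find_all("ol")  # a ResultSet (a list of tags), not a single tag

try:
    charges.unwrap()  # fails: unwrap() is a method of individual tags
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

# What the CSV actually needs: the text inside each tag.
texts = [tag.get_text(strip=True) for tag in charges]

# What unwrap() does: it edits the tree, replacing the tag with its contents.
soup.ol.unwrap()
after_unwrap = str(soup)

print(error_name)
print(texts)
print(after_unwrap)
```

So unwrap() is for restructuring the parsed document, while get_text() (or .text) is what returns a plain string you can write to a file.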

Try the following:

from bs4 import BeautifulSoup
import requests
import csv
from datetime import datetime

url = "https://legacyweb.randolphcountync.gov/sheriff/jailroster.xml"
print("Connecting to jail website:")
req_xml = requests.get(url)
print("Connected - Response code:", req_xml.status_code)

if req_xml.status_code == 200:
    with open('charges-bail.csv', 'a', newline='') as csvfile:
        chargesbail = csv.writer(csvfile)
        
        print("Scraping Started at ", datetime.now())
        soup = BeautifulSoup(req_xml.content, 'lxml')

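        # Each <of> holds one charge (<ol>) and, optionally, its bail (<ob>)
        for of in soup.find_all('of'):
            if of.ob:
                ob = of.ob.text
            else:
                ob = ''  # no bail listed for this charge

            chargesbail.writerow([of.ol.text, ob])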

Which would give you an output CSV file starting:

BREAKING AND OR ENTERING (F),
LARCENY AFTER BREAK/ENTER,
POSS STOLEN GOODS/PROP (F),5000
HABEAS CORPUS,100000
ELECTRONIC HOUSE ARREST VIOLAT,25000
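
The reason for looping over of rather than over two separate lists can be shown with a small sketch. The two-record sample and the roster wrapper element are assumptions made for illustration, and html.parser is used only so the snippet runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first charge has no <ob>, the second does.
sample = """
<roster>
  <of><ol>BREAKING AND OR ENTERING (F)</ol></of>
  <of><ol>POSS STOLEN GOODS/PROP (F)</ol><ob>5000</ob></of>
</roster>
"""
soup = BeautifulSoup(sample, "html.parser")

# Two independent lists: 2 charges but only 1 bail, so pairing them by
# position attaches the bail to the wrong charge.
charges = [tag.get_text() for tag in soup.find_all("ol")]
bails = [tag.get_text() for tag in soup.find_all("ob")]
misaligned = list(zip(charges, bails))

# One pass over the parent keeps each charge with its own bail (or '').
aligned = [
    (of.ol.get_text(), of.ob.get_text() if of.ob else "")
    for of in soup.find_all("of")
]

print(misaligned)
print(aligned)
```

Here the zip pairs the first charge with a bail of 5000 that belongs to the second one, while the per-of loop leaves the first bail empty.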

The expression of.ob.text is shorthand for: find the first ob entry inside the of element and return the text it contains. It is equivalent to:

of.find('ob').get_text()
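
A quick check of that equivalence, as a sketch on a made-up record (the sample string and the tag name nosuchtag are hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical one-record sample, parsed with the built-in html.parser.
soup = BeautifulSoup(
    "<of><ol>HABEAS CORPUS</ol><ob>100000</ob></of>", "html.parser"
)
of = soup.find("of")

short_form = of.ob.text               # attribute-style access
long_form = of.find("ob").get_text()  # explicit find() + get_text()

# A child that does not exist comes back as None, which is why the
# answer's code tests `if of.ob:` before reading its text.
missing = of.nosuchtag

print(short_form, long_form, missing)
```

Both forms return the same string; the attribute form is just less typing.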

To only write rows when both are present, you could change it to:

for of in soup.find_all('of'):
    if of.ob and of.ob.get_text(strip=True):
        chargesbail.writerow([of.ol.text, of.ob.get_text(strip=True)]) 
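
If you want to try that filter without hitting the site, here is a rough sketch that writes to an in-memory buffer instead of a file. The sample records (including the whitespace-only ob) and the roster wrapper are assumptions for illustration, not taken from the real feed:

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample: one <ob> is blank, one has a value.
sample = (
    "<roster>"
    "<of><ol>LARCENY AFTER BREAK/ENTER</ol><ob> </ob></of>"
    "<of><ol>HABEAS CORPUS</ol><ob>100000</ob></of>"
    "</roster>"
)
soup = BeautifulSoup(sample, "html.parser")

buffer = io.StringIO()  # stands in for the CSV file
writer = csv.writer(buffer)

for of in soup.find_all("of"):
    # strip=True turns a whitespace-only <ob> into an empty string
    bail = of.ob.get_text(strip=True) if of.ob else ""
    if of.ol and bail:
        writer.writerow([of.ol.get_text(strip=True), bail])

rows = buffer.getvalue().splitlines()
print(rows)
```

Only the record with a real bail amount is written; the blank one is skipped.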

