
Scraping and parsing data table using beautiful soup and python

Hi everyone. I am trying to scrape a table from the CIA website that lists road data for different countries, split into paved and unpaved roads. I wrote the script below to extract it. I also want to split the information in the second column into separate fields, but I don't know how to do that. After that, I want to save the data to a CSV file with a header for each column.

Here is my code:

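import csv
import requests
from bs4 import BeautifulSoup

course_list = []
url = "https://www.cia.gov/library/publications/the-world-factbook/fields/print_2085.html"
r = requests.get(url)
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(r.content, "html.parser")

# Skip the header row, then print the second cell of each remaining row
for tr in soup.find_all('tr')[1:]:
    tds = tr.find_all('td')
    if len(tds) > 1:  # ignore rows that have no second cell
        print(tds[1].text)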

The second column contains three pieces of information that I want to parse out. How do I do that?

Thanks!

Depending on how you want to do the extraction, you could do the following:

roadways = tds[1].text.strip().split('\n')

This strips whitespace from the beginning and end of the second column's text and splits it on the newline character. The result is a list like this:

['total: 97,267 km', 'paved: 18,481 km', 'unpaved: 78,786 km (2002)']

From here you could remove the labels, such as total or paved, from each item:

roadways = [x[x.index(':')+1:].strip() for x in tds[1].text.strip().split('\n')]

Which would result in the following list:

['97,267 km', '18,481 km', '78,786 km (2002)']
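
If you would rather not rely on the three values always appearing in the same order, a possible variation is to key each value by its label. This is only a sketch: parse_roadways and km_to_int are hypothetical helper names, the sample string is built from the list shown above, and it assumes every line has the shape "label: value" with a whole number of kilometres.

```python
import re

def parse_roadways(cell_text):
    """Split 'label: value' lines into a dict keyed by label."""
    fields = {}
    for line in cell_text.strip().split('\n'):
        label, sep, value = line.partition(':')
        if not sep:
            continue  # skip lines that do not look like "label: value"
        fields[label.strip()] = value.strip()
    return fields

def km_to_int(value):
    """Return the first number in a string like '78,786 km (2002)' as an int."""
    match = re.search(r'\d[\d,]*', value)
    return int(match.group().replace(',', '')) if match else None

# Sample text assembled from the list shown above
sample = "\ntotal: 97,267 km\npaved: 18,481 km\nunpaved: 78,786 km (2002)\n"
parsed = parse_roadways(sample)
print(parsed)
print(km_to_int(parsed['unpaved']))
```

With a dict you can look up parsed.get('paved') and get None for a row that has no paved figure, instead of having the columns shift.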

You can then store this in your CSV file. Open the file once and write the header row:

export_file = open(..., 'w', newline='')
wr = csv.writer(export_file, quoting=csv.QUOTE_ALL)
wr.writerow(['total','paved','unpaved'])

Then, for each row you extract, write the values:

wr.writerow(roadways)
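
Putting the writing steps together, a minimal sketch might look like this. The write_roadways function is a hypothetical helper, and the single row is the sample from above; an in-memory buffer stands in for the output file so the example runs without touching disk.

```python
import csv
import io

header = ['total', 'paved', 'unpaved']
# One sample row taken from the example above; real rows would come from the scrape
rows = [
    ['97,267 km', '18,481 km', '78,786 km (2002)'],
]

def write_roadways(out, rows):
    """Write the header and then every row to an open text stream."""
    wr = csv.writer(out, quoting=csv.QUOTE_ALL)
    wr.writerow(header)
    wr.writerows(rows)

# StringIO behaves like a file opened in text mode
buffer = io.StringIO()
write_roadways(buffer, rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, pass a file object opened with mode 'w' and newline='' inside a with block, so that the file is closed (and flushed) when you are done.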
