
Scraping a site to move data into multiple CSV columns

I am scraping a page with multiple categories into a CSV. I am succeeding in getting the first category into a column, but the second column's data is not being written to the CSV. The code I am using:

import urllib2
import csv
from bs4 import BeautifulSoup
url = "http://digitalstorage.journalism.cuny.edu/sandeepjunnarkar/tests/jazz.html"
page = urllib2.urlopen(url)
soup_jazz = BeautifulSoup(page)
all_years = soup_jazz.find_all("td",class_="views-field views-field-year")
all_category = soup_jazz.find_all("td",class_="views-field views-field-category-code")
with open("jazz.csv", 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow([u'Year Won', u'Category'])
    for years in all_years:
        year_won = years.string
        if year_won:
            csv_writer.writerow([year_won.encode('utf-8')])
    for categories in all_category:
        category_won = categories.string
        if category_won:
            csv_writer.writerow([category_won.encode('utf-8')])

It's writing the column headers but not the category_won into the second column.

Based on your suggestion, I have changed it to read:

with open("jazz.csv", 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow([u'Year Won', u'Category'])
for years, categories in zip(all_years, all_category):
    year_won = years.string
    category_won = categories.string
    if year_won and category_won:
        csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')])

But I am now getting the following error:

csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')])
ValueError: I/O operation on closed file

You could zip() the two lists together:

for years, categories in zip(all_years, all_category):
    year_won = years.string
    category_won = categories.string
    if year_won and category_won:
        csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')])
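The ValueError in the edit above comes from indentation: the rewritten for loop was dedented out of the with open(...) block, so the file had already been closed by the time writerow ran. Keeping the loop indented inside the with block fixes it. A minimal sketch, with placeholder lists standing in for the scraped cell values (the real code would fill them from BeautifulSoup as in the question):

```python
import csv

# Placeholder data standing in for the scraped <td> values (assumption:
# the real lists come from soup_jazz.find_all as in the question).
all_years = ['2012', '2012', '1959']
all_category = ['Best Jazz Vocal Album', 'Best Jazz Instrumental Album',
                'Best Jazz Performance - Group']

with open('jazz.csv', 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(['Year Won', 'Category'])
    # The loop must stay indented inside the `with` block; once the block
    # exits, the file is closed and any writerow call raises ValueError.
    for year_won, category_won in zip(all_years, all_category):
        if year_won and category_won:
            csv_writer.writerow([year_won, category_won])
```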

Unfortunately, that HTML page is somewhat broken and you cannot search for table rows like you'd expect to be able to.

Next best thing is to search for the years, then find sibling cells:

soup_jazz = BeautifulSoup(page)
with open("jazz.csv", 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow([u'Year Won', u'Category'])
    for year_cell in soup_jazz.find_all('td', class_='views-field-year'):
        year = year_cell and year_cell.text.strip().encode('utf8')
        if not year:
            continue
        category = next((e for e in year_cell.next_siblings
                         if getattr(e, 'name', None) == 'td' and
                            'views-field-category-code' in e.attrs.get('class', [])),
                        None)
        category = category and category.text.strip().encode('utf8')
        if year and category:
            csv_writer.writerow([year, category])

This produces:

Year Won,Category
2012,Best Improvised Jazz Solo
2012,Best Jazz Vocal Album
2012,Best Jazz Instrumental Album
2012,Best Large Jazz Ensemble Album
....
1960,Best Jazz Composition Of More Than Five Minutes Duration
1959,Best Jazz Performance - Soloist
1959,Best Jazz Performance - Group
1958,"Best Jazz Performance, Individual"
1958,"Best Jazz Performance, Group"
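As a side note, BeautifulSoup's find_next_sibling() can replace the manual next_siblings generator scan above, since it takes the same filters as find(). A sketch against an inline fragment (the cell classes are copied from the question; the snippet is illustrative, not the real page):

```python
from bs4 import BeautifulSoup

# Inline fragment mimicking the page's sibling <td> cells (assumption:
# class names match the real page, per the question's find_all calls).
html = """
<td class="views-field views-field-year">2012</td>
<td class="views-field views-field-category-code">Best Jazz Vocal Album</td>
<td class="views-field views-field-year">1959</td>
<td class="views-field views-field-category-code">Best Jazz Performance - Group</td>
"""
soup = BeautifulSoup(html, 'html.parser')

pairs = []
for year_cell in soup.find_all('td', class_='views-field-year'):
    # find_next_sibling returns the first following sibling matching the
    # filters, replacing the explicit next_siblings scan.
    category = year_cell.find_next_sibling(
        'td', class_='views-field-category-code')
    if category:
        pairs.append((year_cell.text.strip(), category.text.strip()))
```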
