
Python BeautifulSoup to csv scraping

I am attempting to scrape some simple dictionary information from an HTML page. So far I am able to print all the words I need in the IDE. My next step was to transfer the words to an array, and my last step was to save the array as a csv file. When I run my code, it seems to stop taking information after the 1309th or 1311th word, although I believe there to be over a million on the web page. I am stuck and would be very appreciative of any help. Thank you.

from bs4 import BeautifulSoup
from urllib import urlopen
import csv

html = urlopen('http://www.mso.anu.edu.au/~ralph/OPTED/v003/wb1913_a.html').read()

soup = BeautifulSoup(html,"lxml")

words = []

for section in soup.findAll('b'):
    words.append(section.renderContents())

print ('success')
print (len(words))

myfile = open('A.csv', 'wb')
wr = csv.writer(myfile)
wr.writerow(words)


I was not able to reproduce the problem (I always get 11616 items), but I suspect you have an outdated beautifulsoup4 or lxml version installed. Upgrade both:

pip install --upgrade beautifulsoup4
pip install --upgrade lxml

Of course, this is just a theory.

I suspect a good deal of your problem lies in how you're processing the scraped content. Do you need to scrape all the content before you write it to the file, or can you write it as you go?

Instead of appending to a list over and over, you should use yield:

def tokenize(soup_):
    for section in soup_.findAll('b'):
        yield section.renderContents()

This gives you a generator; as long as section.renderContents() returns a string, the csv module can write it out with no problem.
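Putting it together, here is a minimal sketch of the generator approach, writing one word per row as you go. It assumes Python 3, uses a tiny inline HTML snippet in place of the full OPTED page (the real page has the same <b>word</b> shape), and uses the stdlib "html.parser" backend so it runs without lxml; swap in "lxml" and urlopen() for the real page:

```python
import csv
import io

from bs4 import BeautifulSoup


def tokenize(soup_):
    # Yield each bolded headword as it is found, instead of
    # building the whole list in memory first.
    for section in soup_.findAll('b'):
        yield section.renderContents()


# Tiny stand-in for the OPTED page; the question fetches the real
# page with urlopen() and parses it with the "lxml" backend.
html = b"<p><b>Aardvark</b> (n.) ...</p><p><b>Abacus</b> (n.) ...</p>"
soup = BeautifulSoup(html, "html.parser")

buf = io.StringIO()  # stands in for open('A.csv', 'w', newline='')
wr = csv.writer(buf)
for word in tokenize(soup):
    # renderContents() returns bytes, so decode before writing one word per row
    wr.writerow([word.decode("utf-8")])

print(buf.getvalue())
```

Writing row by row also means the file is flushed and closed properly if you wrap the open() in a with block, instead of holding a million-entry list until the end.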
