
Python BeautifulSoup to csv scraping

I am attempting to scrape some simple dictionary information from an HTML page. So far I am able to print all the words I need in the IDE. My next step was to transfer the words to an array, and my last step was to save the array as a csv file. When I run my code, it seems to stop taking information after the 1309th or 1311th word, although I believe there are over 1 million on the web page. I am stuck and would be very appreciative of any help. Thank you.

from bs4 import BeautifulSoup
from urllib import urlopen
import csv

html = urlopen('http://www.mso.anu.edu.au/~ralph/OPTED/v003/wb1913_a.html').read()

soup = BeautifulSoup(html,"lxml")

words = []

for section in soup.findAll('b'):

    words.append(section.renderContents())

print ('success')
print (len(words))

myfile = open('A.csv', 'wb')
wr = csv.writer(myfile)
wr.writerow(words)


I was not able to reproduce the problem (I always get 11616 items), but I suspect you have outdated beautifulsoup4 or lxml versions installed. Upgrade:

pip install --upgrade beautifulsoup4
pip install --upgrade lxml

Of course, this is just a theory.
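To test that theory before upgrading, it may help to see which versions are actually installed. The following is a minimal sketch, assuming Python 3.8 or newer (the question's code is Python 2); `installed_version` is a hypothetical helper name, not part of either library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing.

    Hypothetical helper for illustration; it only wraps importlib.metadata.version.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as used by pip in the upgrade commands above.
for name in ("beautifulsoup4", "lxml"):
    print(name, installed_version(name) or "not installed")
```

If the reported versions are already current, the outdated-package theory is less likely to explain the truncated result.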

I suspect a good deal of your problem may lie in how you're processing the scraped content. Do you need to scrape all the content before you output it to the file? Or can you do it as you go?

Instead of appending over and over to a list, you should use yield.

def tokenize(soup_):
    for section in soup_.findAll('b'):
        yield section.renderContents()

This gives you a generator that the csv module can write out with no problem, as long as section.renderContents() returns a string.
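As a rough sketch of how such a generator could be consumed, the snippet below writes each yielded word as its own CSV row while iterating, so the full list is never held in memory. It assumes Python 3; `write_words` and `fake_tokens` are hypothetical names, and `fake_tokens` merely stands in for `tokenize(soup)` so the example runs without network access. Writing one word per row is a departure from the question's code, which puts every word into a single row.

```python
import csv
import io


def write_words(fileobj, words):
    """Write each item of any iterable (list or generator) as a one-column CSV row.

    Returns the number of rows written. Hypothetical helper for illustration.
    """
    writer = csv.writer(fileobj)
    count = 0
    for word in words:
        writer.writerow([word])
        count += 1
    return count


def fake_tokens():
    """Stand-in for tokenize(soup): yields a few sample headwords."""
    yield "A"
    yield "Aam"
    yield "Aard-vark"


# An in-memory buffer keeps the demo free of side effects; for a real file,
# the csv docs recommend open('A.csv', 'w', newline='').
buffer = io.StringIO()
rows = write_words(buffer, fake_tokens())
print(rows, "rows written")
```

Under Python 3, I believe `renderContents()` returns bytes rather than text, so `section.get_text()` may be the more convenient call inside the generator. The context manager form (`with open(...) as f:`) also ensures the file is flushed and closed, which the question's code never does explicitly.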


 