Python script gets killed without any error

Question

I am running a script that downloads xls files containing HTML tags and strips the tags to produce clean CSV files.

Code:

#!/usr/bin/env python

from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv
import sys
#from pympler.asizeof import asizeof
from pympler import muppy
from pympler import summary

f = urlopen('http://localhost/Classes/sample.xls') #This is 75KB
#f = urlopen('http://supplier.com/xmlfeed/products.xls') #This is 75MB
soup = BeautifulSoup(f)
stable = soup.find('table')
print 'table found'
rows = []
for row in stable.find_all('tr'):
    rows.append([val.text.encode('utf8') for val in row.find_all('th')])
    rows.append([val.text.encode('utf8') for val in row.find_all('td')])

#print sys.getsizeof(rows)
#print asizeof(rows)

print 'row list created'
soup.decompose()
print 'soup decomposed'
f.close()
print 'file closed'

with open('output_file.csv', 'wb') as file:
    writer = csv.writer(file)
    print 'writer started'
    #writer.writerow(headers)
    writer.writerows(row for row in rows if row)

all_objects = muppy.get_objects()
sum1 = summary.summarize(all_objects)
summary.print_(sum1)

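The code above works fine for the 75KB file, but with the 75MB file the process is killed without any error.

I am quite new to Beautiful Soup and Python; please help me identify the problem. The script runs on a machine with 3GB of RAM.

Output for the small file: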

table found
row list created
soup decomposed
file closed
writer started
                                types |   # objects |   total size
===================================== | =========== | ============
                                 dict |        5615 |      4.56 MB
                                  str |        8457 |    713.23 KB
                                 list |        3525 |    375.51 KB
  <class 'bs4.element.NavigableString |        1810 |    335.76 KB
                                 code |        1874 |    234.25 KB
              <class 'bs4.element.Tag |        3097 |    193.56 KB
                              unicode |        3102 |    182.65 KB
                                 type |         137 |    120.95 KB
                   wrapper_descriptor |        1060 |     82.81 KB
           builtin_function_or_method |         718 |     50.48 KB
                    method_descriptor |         580 |     40.78 KB
                              weakref |         416 |     35.75 KB
                                  set |         137 |     35.04 KB
                                tuple |         431 |     31.56 KB
                  <class 'abc.ABCMeta |          20 |     17.66 KB

I don't understand what "dict" is here; it takes up far more memory than the 75KB file itself.

Thanks,

Answer 1

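It is hard to say without the actual file, but one thing you can do is avoid building the intermediate list of rows and write directly to the open CSV file.

Also, you can let BeautifulSoup use lxml.html under the hood (lxml should be installed).

Improved code: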

#!/usr/bin/env python

from urllib2 import urlopen
import csv

from bs4 import BeautifulSoup    

f = urlopen('http://localhost/Classes/sample.xls')
soup = BeautifulSoup(f, 'lxml')

with open('output_file.csv', 'wb') as file:
    writer = csv.writer(file)

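    # Write each row as soon as it is extracted; no intermediate list.
    for row in soup.select('table tr'):
        for tag in ('th', 'td'):
            # writerow() takes one list of cells; writerows() would treat
            # each string as a row and split it into characters.
            cells = [val.text.encode('utf8') for val in row.find_all(tag)]
            if cells:  # skip rows that have no cells of this kind
                writer.writerow(cells)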

soup.decompose()
f.close()
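
If writing rows directly is still not enough, note that BeautifulSoup builds the whole document tree in memory before the loop starts. The following is a minimal sketch of a different technique, not the code above: it swaps BeautifulSoup for the standard library's html.parser, reads the input in chunks, and writes each row as soon as its closing tr tag is seen. It is written for Python 3, while the code above is Python 2. The names TableToCsv and html_table_to_csv and the chunk_size parameter are hypothetical, and it assumes the file contains no nested tables.

```python
import csv
import io
from html.parser import HTMLParser


class TableToCsv(HTMLParser):
    """Hypothetical helper: writes each table row to CSV as soon as it ends."""

    def __init__(self, writer):
        super().__init__(convert_charrefs=True)
        self.writer = writer
        self.row = None   # cells of the current row, None outside <tr>
        self.cell = None  # text pieces of the current cell, None outside a cell

    def _end_cell(self):
        if self.cell is not None and self.row is not None:
            self.row.append(''.join(self.cell).strip())
        self.cell = None

    def _end_row(self):
        self._end_cell()
        if self.row:  # skip rows without any cells
            self.writer.writerow(self.row)
        self.row = None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._end_row()  # tolerate a missing </tr>
            self.row = []
        elif tag in ('th', 'td') and self.row is not None:
            self._end_cell()  # tolerate a missing </td> or </th>
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('th', 'td'):
            self._end_cell()
        elif tag == 'tr':
            self._end_row()


def html_table_to_csv(source, out, chunk_size=64 * 1024):
    """Stream a text-mode file-like object of HTML into CSV rows on `out`."""
    parser = TableToCsv(csv.writer(out))
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    parser.close()
```

Unlike the code above, this sketch puts th and td cells of the same row into a single CSV row and strips surrounding whitespace from each cell. For a download, the response from urllib.request.urlopen could be wrapped in io.TextIOWrapper with the file's encoding (assumed, for example, to be utf-8) and passed as source, with the output file opened in text mode with newline=''.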
