
Web scraping and transferring data into Excel using Python

I'm able to fully scrape the material I need; the problem is that I can't get the data into Excel.

from lxml import html
import requests
import xlsxwriter

page = requests.get('website that gets mined')
tree = html.fromstring(page.content)

items = tree.xpath('//h4[@class="item-title"]/text()')
prices = tree.xpath('//span[@class="price"]/text()')
description = tree.xpath('//div[@class="description text"]/text()')
print 'items: ', items
print 'Prices: ', prices
print 'description', description

Everything works fine until this section, where I try to get the data into Excel. This is the error message:

for items,prices,description in (array):
ValueError: too many values to unpack
Exception Exception: Exception('Exception caught in workbook destructor. Explicit close() may be required for workbook.',) in <bound method Workbook.__del__ of <xlsxwriter.workbook.Workbook object at 0x104735e10>> ignored

This is what it was trying to do:

array = [items,prices,description]
workbook   = xlsxwriter.Workbook('test1.xlsx')
worksheet = workbook.add_worksheet()
row = 0
col = 0

for items,prices,description in (array):
    worksheet.write(row, col, items)
    worksheet.write(row, col + 1, prices)
    worksheet.write(row, col + 2, description)
    row += 1
workbook.close()
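The cause of the ValueError: `array` is a list of three lists, so the loop yields each whole inner list in turn, and Python then tries to unpack that list of scraped strings into exactly three names. A minimal reproduction with made-up placeholder data (Python 3 syntax):

```python
# Hypothetical scraped data: four entries per list
items = ['hat', 'scarf', 'gloves', 'boots']
prices = ['9.99', '14.50', '7.25', '39.00']
description = ['wool', 'silk', 'leather', 'rubber']

array = [items, prices, description]

# The loop yields each inner 4-element list; unpacking it into
# three names raises ValueError: too many values to unpack.
try:
    for a, b, c in array:
        pass
except ValueError as e:
    print(e)
```

The second message in the traceback just means the exception escaped before `workbook.close()` was reached, so xlsxwriter's destructor complained.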

Assuming that `items`, `prices`, and `description` all have the same length, you could rewrite the final part of the code as:

for item, price, desc in zip(items, prices, description):
    worksheet.write(row, col, item)
    worksheet.write(row, col + 1, price)
    worksheet.write(row, col + 2, desc)
    row += 1

If the lists can have unequal lengths, you should look into alternatives to the zip method, but I would be worried about the data consistency.
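For example, `itertools` offers a padding variant (`izip_longest` in Python 2, `zip_longest` in Python 3) that fills missing entries instead of silently truncating at the shortest list. A small sketch in Python 3 with placeholder data:

```python
from itertools import zip_longest  # izip_longest in Python 2

items = ['hat', 'scarf', 'gloves']
prices = ['9.99', '14.50']  # one price missing

# zip() would stop after two pairs; zip_longest pads the gap instead
rows = list(zip_longest(items, prices, fillvalue='N/A'))
print(rows)  # [('hat', '9.99'), ('scarf', '14.50'), ('gloves', 'N/A')]
```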

Inevitably, it will be easier to write to a CSV file or a text file than to an Excel file.

import urllib2

listOfStocks = ["AAPL", "MSFT", "GOOG", "FB", "AMZN"]

urls = []

for company in listOfStocks:
    urls.append('http://real-chart.finance.yahoo.com/table.csv?s=' + company + '&d=6&e=28&f=2015&g=m&a=11&b=12&c=1980&ignore=.csv')

Output_File = open('C:/your_path_here/Data.csv','w')

New_Format_Data = ''

for counter in range(0, len(urls)):

    Original_Data = urllib2.urlopen(urls[counter]).read()
    rows = Original_Data.splitlines(1)  # keep the line endings

    if counter == 0:
        # Reuse the header row from the first download instead of fetching it again
        New_Format_Data = "Company," + rows[0]

    # Skip each file's header row and prefix every data row with the ticker
    for row in range(1, len(rows)):
        New_Format_Data = New_Format_Data + listOfStocks[counter] + ',' + rows[row]

Output_File.write(New_Format_Data)
Output_File.close()

OR

from bs4 import BeautifulSoup
import urllib2

var_file = urllib2.urlopen("http://www.imdb.com/chart/top")

var_html  = var_file.read()

text_file = open("C:/your_path_here/Text1.txt", "wb")
var_file.close()
soup = BeautifulSoup(var_html, "html.parser")  # name a parser explicitly to avoid the bs4 warning
for item in soup.find_all(class_='lister-list'):
    for link in item.find_all('a'):
        #print(link)
        z = str(link)
        text_file.write(z + "\r\n")
text_file.close()

As a developer, it's difficult to programmatically manipulate Excel files, since the Excel format is proprietary. This is especially true for languages other than .NET. On the other hand, CSV files are easy to manipulate programmatically since, after all, they are simple text files.
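For the question's three lists, the standard-library `csv` module makes that switch straightforward. A sketch in Python 3 with placeholder data standing in for the scraped values:

```python
import csv

# Placeholder values standing in for the scraped lists
items = ['item A', 'item B']
prices = ['1.99', '2.99']
description = ['first, with a comma', 'second']

# csv.writer handles quoting, so commas inside a field stay intact
with open('test1.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['item', 'price', 'description'])
    for row in zip(items, prices, description):
        writer.writerow(row)
```

The resulting file opens directly in Excel, without xlsxwriter or any unpacking pitfalls.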
