
Scrapy - How to crawl website & store data in Microsoft SQL Server database?

I'm trying to extract content from a website created by our company. I've created a table in MSSQL Server for the Scrapy data, and I've set up Scrapy and configured Python to crawl and extract webpage data. My question is: how do I export the data crawled by Scrapy into my local MSSQL Server database?

This is the Scrapy code for extracting the data:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # each quote block on the page becomes one scraped item
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

You can use the pymssql module to send data to SQL Server from an item pipeline, something like this:

import pymssql

class DataPipeline(object):
    def __init__(self):
        # open one connection to the SQL Server instance for the whole crawl
        self.conn = pymssql.connect(host='host', user='user', password='passwd', database='db')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:
            # 'tags' is a list, so join it into a single string before inserting
            self.cursor.execute(
                "INSERT INTO MYTABLE (text, author, tags) VALUES (%s, %s, %s)",
                (item['text'], item['author'], ', '.join(item['tags'])))
            self.conn.commit()
        except pymssql.Error as e:
            self.conn.rollback()
            spider.logger.error("Insert failed: %s", e)

        return item
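One thing the snippet above leaves open is closing the connection. Scrapy calls close_spider on a pipeline when the spider finishes, so a small addition to the same class (just a sketch) keeps the connection from leaking:

    def close_spider(self, spider):
        # called by Scrapy once the spider has finished; release the connection
        self.conn.close()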

Also, you will need to add 'spider_name.pipelines.DataPipeline': 300 to the ITEM_PIPELINES dict in settings.py.
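For reference, a minimal sketch of that settings.py entry, assuming the Scrapy project is actually named spider_name as in the path above:

# settings.py -- register the pipeline so every scraped item is passed through it
ITEM_PIPELINES = {
    'spider_name.pipelines.DataPipeline': 300,
}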

I think the best thing to do is to save the data to a CSV first, and then load the CSV into your SQL Server table (a loading sketch follows the script below).

import csv
import requests
import bs4

res = requests.get('http://www.ebay.com/sch/i.html?LH_Complete=1&LH_Sold=1&_from=R40&_sacat=0&_nkw=gerald%20ford%20autograph&rt=nc&LH_Auction=1&_trksid=p2045573.m1684')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")

# grab all the links and store its href destinations in a list
links = [e['href'] for e in soup.find_all(class_="vip")]

# grab all the bid spans and split its contents in order to get the number only
bids = [e.span.contents[0].split(' ')[0] for e in soup.find_all("li", "lvformat")]

# grab all the prices and store those in a list
prices = [e.contents[0] for e in soup.find_all("span", "bold bidsold")]

# zip each entry out of the lists we generated before in order to combine the entries
# belonging to each other and write the zipped elements to a list
l = [e for e in zip(links, prices, bids)]

# write each entry of the rowlist `l` to the csv output file
with open('ebay.csv', 'w', newline='') as csvfile:
    w = csv.writer(csvfile)
    for e in l:
        w.writerow(e)
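For the second step (getting ebay.csv into SQL Server), one option is to read the file back and insert the rows with pymssql. This is only a sketch: the table name ebay_auctions, its three columns, and the credentials are all assumptions for illustration.

import csv
import pymssql

# placeholder credentials; assumes a table like
#   CREATE TABLE ebay_auctions (link VARCHAR(500), price VARCHAR(50), bids VARCHAR(50))
conn = pymssql.connect(host='host', user='user', password='passwd', database='db')
cursor = conn.cursor()

with open('ebay.csv', newline='') as csvfile:
    rows = [tuple(row) for row in csv.reader(csvfile)]

# rows are (link, price, bids) tuples, matching the zip() order above
cursor.executemany(
    "INSERT INTO ebay_auctions (link, price, bids) VALUES (%s, %s, %s)", rows)
conn.commit()
conn.close()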

OR

import requests, bs4
import numpy as np
import pandas as pd

res = requests.get('http://www.ebay.com/sch/i.html?LH_Complete=1&LH_Sold=1&_from=R40&_sacat=0&_nkw=gerald%20ford%20autograph&rt=nc&LH_Auction=1&_trksid=p2045573.m1684')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "lxml")

# grabs the link, selling price, and # of bids from historical auctions
l = []
p = []
b = []

# store only the values (href, bid count, price text) rather than the whole tags
for links in soup.find_all(class_="vip"):
    l.append(links['href'])

for bids in soup.find_all("li", "lvformat"):
    b.append(bids.span.contents[0].split(' ')[0])

for prices in soup.find_all("span", "bold bidsold"):
    p.append(prices.contents[0])

# stack the three lists and transpose so each row is one auction
x = np.array((l, b, p))
z = x.transpose()
df = pd.DataFrame(z)
df.to_csv('/Users/toasteez/ebay.csv')
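Since the original goal is SQL Server rather than CSV, the same DataFrame could also be written straight to the database. Below is a minimal sketch using pandas' to_sql on top of SQLAlchemy and the pyodbc driver; the connection string, ODBC driver name, column names, and table name are all assumptions to adapt to your environment.

from sqlalchemy import create_engine

# placeholder connection string; adjust user, password, host, database and driver
engine = create_engine(
    "mssql+pyodbc://user:passwd@host/db?driver=ODBC+Driver+17+for+SQL+Server")

# give the DataFrame column names before writing (names are assumed, not from the page)
df.columns = ['link', 'bids', 'price']

# assumes the target table already exists; use if_exists='replace' to let pandas create it
df.to_sql('ebay_auctions', engine, if_exists='append', index=False)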
