
Scraping web content from the first two pages and exporting the scraped data to CSV using Python and BS4

I am new to Python (using Python 3.6.2) and I am trying to scrape data for a specific keyword from the first 2 pages. So far I can get the data into the Python IDLE window, but I am having trouble exporting it to CSV. I have tried BeautifulSoup 4 and pandas but could not get the export to work. This is what I have done so far. Any help would be greatly appreciated.

import csv   
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://www.amazon.in/s/ref=nb_sb_noss?url=search-
alias%3Dautomotive&field-
keywords=helmets+for+men&rh=n%3A4772060031%2Ck%3Ahelmets+for+men&ajr=0"
request = requests.get(url)    
soup = BeautifulSoup(request.content, "lxml")
#filename = auto.csv
#with open(str(auto.csv,"r+","\n")) as csvfile:
    #headers = "Count , Asin \n"
    #fo.writer(headers)
for url in soup.find_all('li'):
    Nand = url.get('data-asin')
    #print(Nand)   
    Result = url.get('id')
    #print(Result)
    #d=(str(Nand), str(Result))


df=pd.Index(url.get_attribute('url'))
#with open("auto.txt", "w",newline='') as dumpfile:
   #dumpfilewriter = csv.writer(dumpfile)
   #for Nand in soup:
       #value =  Nand.__gt__        
       #if value:
           #dumpfilewriter.writerows([value])
df.to_csv(dumpfile)
dumpfile.close()
csvfile.csv.writer("auto.csv," , ',' ,'|' , "\n")

Question: Please help me export the data in the variables "Nand" and "Result" to a CSV file.

with open("auto.csv", 'w') as fh:
    writer = csv.DictWriter(fh, fieldnames=['Nand', 'Result'])
    writer.writeheader()
    data = {}
    for url in soup.find_all('li'):
        data['Nand'] = url.get('data-asin')
        data['Result'] = url.get('id')
        writer.writerow(data)

Tested with Python: 3.4.2
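As a side note, many <li> elements on the Amazon search page carry no data-asin attribute, so the snippet above also writes rows of empty values. A minimal variant that skips those rows (assuming the same soup object built in the question) could look like this:

import csv

with open("auto.csv", 'w', newline='') as fh:
    writer = csv.DictWriter(fh, fieldnames=['Nand', 'Result'])
    writer.writeheader()
    for url in soup.find_all('li'):
        nand = url.get('data-asin')
        result = url.get('id')
        if nand is None:  # skip <li> tags that are not search results
            continue
        writer.writerow({'Nand': nand, 'Result': result})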

I added a user-agent to the request to get around the site's automatic blocking of bots. You were getting a lot of None values because you did not explicitly specify which <li> tags to use; I added that to the code as well.

import requests
from bs4 import BeautifulSoup
import pandas as pd


url = "http://www.amazon.in/s/ref=nb_sb_noss?url=search-alias%3Dautomotive&field-keywords=helmets+for+men&rh=n%3A4772060031%2Ck%3Ahelmets+for+men&ajr=0"
request = requests.get(url, headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'})    
soup = BeautifulSoup(request.content, "lxml")

res = []

for url in soup.find_all('li', class_ = 's-result-item'):
    res.append([url.get('data-asin'), url.get('id')])

df = pd.DataFrame(data=res, columns=['Nand', 'Result'])    
df.to_csv('path/where/you/want/to/store/file.csv')
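One detail worth noting: to_csv writes the DataFrame index as an extra first column by default. If only the Nand and Result columns are wanted in the file, pass index=False:

df.to_csv('path/where/you/want/to/store/file.csv', index=False)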

EDIT: To process all pages, you need to build a loop that generates the URLs and passes them into the main processing block you already have. Check out this page: http://www.amazon.in/s/ref=sr_pg_2?rh=n%3A4772060031%2Ck%3Ahelmets+for+men&page=2&keywords=helmets+for+men&ie=UTF8&qid=1501133688&spIA=B01N0MAT2E,B01MY1ZZDS,B01N0RMJ1H and note the page=2& parameter in it.

EDIT_2: Let's loop over the page parameter. You can append page manually to the url passed to requests.get().

import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = "http://www.amazon.in/s/ref=sr_pg_2?rh=n%3A4772060031%2Ck%3Ahelmets+for+men&keywords=helmets+for+men&ie=UTF8"
# page parameter excluded from base_url; it is appended inside the loop
res = []

for page in range(1,72): # last page for this category is 71

    request = requests.get(base_url + '&page=' + str(page), headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}) # here adding page    
    if request.status_code == 404: # stop if the requested page does not exist
        break
    soup = BeautifulSoup(request.content, "lxml")

    for url in soup.find_all('li', class_ = 's-result-item'):
        res.append([url.get('data-asin'), url.get('id')])

df = pd.DataFrame(data=res, columns=['Nand', 'Result'])    
df.to_csv('path/where/you/want/to/store/file.csv')
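Since the answer above mentions Amazon's automatic blocking of bots, a common precaution when requesting many pages in a row is to pause between requests. A minimal sketch of the same loop with a delay added (the one-second value is an arbitrary assumption, not part of the original answer):

import time
import requests
from bs4 import BeautifulSoup

base_url = "http://www.amazon.in/s/ref=sr_pg_2?rh=n%3A4772060031%2Ck%3Ahelmets+for+men&keywords=helmets+for+men&ie=UTF8"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
res = []

for page in range(1, 72):
    request = requests.get(base_url + '&page=' + str(page), headers=headers)
    if request.status_code == 404:
        break
    soup = BeautifulSoup(request.content, "lxml")
    for url in soup.find_all('li', class_='s-result-item'):
        res.append([url.get('data-asin'), url.get('id')])
    time.sleep(1)  # pause between page requests; 1 second is an assumed value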

