Web Scraping, How to extract data from two same tags using bs4 in python
Scraping web contents from first two pages and export scraped data to csv using python and BS4
I am new to python, using Python 3.6.2, and trying to scrape data from the first 2 pages for a specific keyword. So far I have been able to get the data into the Python IDLE window, but I am having difficulty exporting it to CSV. I tried using BeautifulSoup 4 and pandas but could not export the data. This is what I have done so far. Any help would be much appreciated.
import csv
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://www.amazon.in/s/ref=nb_sb_noss?url=search-alias%3Dautomotive&field-keywords=helmets+for+men&rh=n%3A4772060031%2Ck%3Ahelmets+for+men&ajr=0"
request = requests.get(url)
soup = BeautifulSoup(request.content, "lxml")
#filename = auto.csv
#with open(str(auto.csv,"r+","\n")) as csvfile:
#headers = "Count , Asin \n"
#fo.writer(headers)
for url in soup.find_all('li'):
    Nand = url.get('data-asin')
    #print(Nand)
    Result = url.get('id')
    #print(Result)
    #d=(str(Nand), str(Result))
    df=pd.Index(url.get_attribute('url'))
    #with open("auto.txt", "w",newline='') as dumpfile:
    #dumpfilewriter = csv.writer(dumpfile)
    #for Nand in soup:
    #value = Nand.__gt__
    #if value:
    #dumpfilewriter.writerows([value])
    df.to_csv(dumpfile)
dumpfile.close()
csvfile.csv.writer("auto.csv," , ',' ,'|' , "\n")
Question: help me export the data in the variables "Nand" and "Result" to a csv file.
with open("auto.csv", 'w') as fh:
    writer = csv.DictWriter(fh, fieldnames=['Nand', 'Result'])
    writer.writeheader()
    data = {}
    for url in soup.find_all('li'):
        data['Nand'] = url.get('data-asin')
        data['Result'] = url.get('id')
        writer.writerow(data)
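Since soup.find_all('li') also matches non-product list items, the loop above writes rows full of None. A stdlib-only sketch of skipping such rows, using hypothetical sample pairs in place of the scraped values:

```python
import csv

# Hypothetical (data-asin, id) pairs standing in for scraped <li> attributes;
# None values come from non-product list items.
rows = [("B01N0MAT2E", "result_0"), (None, "pagnNextString"), ("B01MY1ZZDS", "result_1")]

with open("auto.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["Nand", "Result"])
    writer.writeheader()
    for asin, li_id in rows:
        if asin is not None:  # skip <li> tags that carry no ASIN
            writer.writerow({"Nand": asin, "Result": li_id})
```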
Tested with Python: 3.4.2
I added a user-agent to the request to avoid automatic bot blocking by the site. You get a lot of None values because you did not explicitly specify which <li> tags to use; I added that to the code as well.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://www.amazon.in/s/ref=nb_sb_noss?url=search-alias%3Dautomotive&field-keywords=helmets+for+men&rh=n%3A4772060031%2Ck%3Ahelmets+for+men&ajr=0"
request = requests.get(url, headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'})
soup = BeautifulSoup(request.content, "lxml")

res = []
for url in soup.find_all('li', class_='s-result-item'):
    res.append([url.get('data-asin'), url.get('id')])

df = pd.DataFrame(data=res, columns=['Nand', 'Result'])
df.to_csv('path/where/you/want/to/store/file.csv')
EDIT: To process all pages, you need to build a loop that generates the urls and passes them to the main processing block you already have. Check out this page: http://www.amazon.in/s/ref=sr_pg_2?rh=n%3A4772060031%2Ck%3Ahelmets+for+men&page=2&keywords=helmets+for+men&ie=UTF8&qid=1501133688&spIA=B01N0MAT2E,B01MY1ZZDS,B01N0RMJ1H — note the page=2 parameter in the url.
EDIT_2: Let's loop over the page parameter. You can manually append page to the url passed to requests.get().
import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = "http://www.amazon.in/s/ref=sr_pg_2?rh=n%3A4772060031%2Ck%3Ahelmets+for+men&keywords=helmets+for+men&ie=UTF8"
# excluding page from base_url for further adding

res = []
for page in range(1, 72):  # such a range because the last page for the needed category is 71
    request = requests.get(base_url + '&page=' + str(page),  # here adding page
                           headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'})
    if request.status_code == 404:  # added just in case of error
        break
    soup = BeautifulSoup(request.content, "lxml")
    for url in soup.find_all('li', class_='s-result-item'):
        res.append([url.get('data-asin'), url.get('id')])

df = pd.DataFrame(data=res, columns=['Nand', 'Result'])
df.to_csv('path/where/you/want/to/store/file.csv')
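A small post-processing step (not part of the original answer; the sample rows below are hypothetical) can drop the None rows before saving, and index=False keeps the DataFrame's row-number column out of the csv:

```python
import pandas as pd

# Hypothetical scraped rows; None values come from non-product <li> tags.
res = [["B01N0MAT2E", "result_0"], [None, "pagn_bar"], ["B01MY1ZZDS", "result_1"]]

df = pd.DataFrame(data=res, columns=["Nand", "Result"])
df = df.dropna(subset=["Nand"])        # drop rows without a data-asin
df.to_csv("helmets.csv", index=False)  # index=False omits the index column
```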
Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license; if you need to repost, please credit this site or the original source. For any questions contact: yoyou2525@163.com.