Python Web Scraping using BS
I have a web scraping program that gets multiple pages, but I have to set the while loop to a fixed number. I want a condition that stops the loop once it reaches the last page, or recognizes that there are no more items to scrape. Assume I don't know how many pages exist. How do I change the while loop condition so it stops without my putting in an arbitrary number?
    import requests
    from bs4 import BeautifulSoup
    import csv

    filename = "output.csv"
    f = open(filename, 'w', newline="", encoding='utf-8')
    headers = "Date, Location, Title, Price\n"
    f.write(headers)

    i = 0
    while i < 5000:
        if i == 0:
            page_link = "https://portland.craigslist.org/search/sss?query=xbox&sort=date"
        else:
            page_link = "https://portland.craigslist.org/search/sss?s={}&query=xbox&sort=date".format(i)
        res = requests.get(page_link)
        soup = BeautifulSoup(res.text, 'html.parser')
        for container in soup.select('.result-info'):
            date = container.select('.result-date')[0].text
            try:
                location = container.select('.result-hood')[0].text
            except:
                try:
                    location = container.select('.nearby')[0].text
                except:
                    location = 'NULL'
            title = container.select('.result-title')[0].text
            try:
                price = container.select('.result-price')[0].text
            except:
                price = "NULL"
            print(date, location, title, price)
            f.write(date + ',' + location.replace(",", " ") + ',' + title.replace(",", " ") + ',' + price + '\n')
        i += 120
    f.close()
I use while True to run an endless loop, and break to exit when there is no data:

    data = soup.select('.result-info')
    if not data:
        print('END: no data')
        break
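This stop condition can be sketched without any network calls. The fetcher below is a hypothetical stand-in for the real requests + BeautifulSoup code; it fakes three pages of results and then runs dry, which mimics select() returning an empty list past the last page:

```python
# Stand-in for the real page fetch (hypothetical, for illustration only).
FAKE_PAGES = {0: ['a'], 120: ['b'], 240: ['c']}

def fetch_results(offset):
    # Past the last page this returns [] — just like soup.select()
    # when no '.result-info' elements match.
    return FAKE_PAGES.get(offset, [])

def scrape_all():
    collected = []
    offset = 0
    while True:
        data = fetch_results(offset)
        if not data:       # empty list is falsy -> last page reached
            break
        collected.extend(data)
        offset += 120      # Craigslist result offsets step by 120
    return collected

print(scrape_all())  # → ['a', 'b', 'c']
```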
I use the module csv to save the data, so I don't have to use replace(",", " "). It will put the text in "..." if there is a , in the text.
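The quoting behavior is easy to check in memory with io.StringIO (the sample field values here are made up): csv.writer wraps only the fields that contain the delimiter in double quotes, so embedded commas survive a round trip without any replace() call.

```python
import csv
import io

# Write one row whose middle fields contain commas.
buf = io.StringIO()
csv.writer(buf).writerow(["Dec 1", "Portland, OR", "Xbox One, boxed", "$150"])

print(buf.getvalue())
# Dec 1,"Portland, OR","Xbox One, boxed",$150

# Reading it back recovers the original fields, commas intact.
row = next(csv.reader(io.StringIO(buf.getvalue())))
print(row)  # → ['Dec 1', 'Portland, OR', 'Xbox One, boxed', '$150']
```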
s={} can be in any place after the ?, so I put it at the end to make the code more readable.
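Query parameters after the ? are order-independent, which is why moving s= to the end is safe. One way to see this (and to avoid hand-formatting the URL at all) is to build the query string with the standard library's urlencode; the parameter names here match the Craigslist URL from the question:

```python
from urllib.parse import urlencode

# Build the query string from (name, value) pairs; the order of the
# pairs only affects readability, not which page the server returns.
params = [("query", "xbox"), ("sort", "date"), ("s", 120)]
url = "https://portland.craigslist.org/search/sss?" + urlencode(params)

print(url)
# https://portland.craigslist.org/search/sss?query=xbox&sort=date&s=120
```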
The portal gives the first page even if you use s=0, so I don't have to check i == 0. (BTW: in my code it has the more readable name offset.)
Full code:
    import requests
    from bs4 import BeautifulSoup
    import csv

    filename = "output.csv"
    f = open(filename, 'w', newline="", encoding='utf-8')
    csvwriter = csv.writer(f)
    csvwriter.writerow(["Date", "Location", "Title", "Price"])

    offset = 0
    while True:
        print('offset:', offset)
        url = "https://portland.craigslist.org/search/sss?query=xbox&sort=date&s={}".format(offset)
        response = requests.get(url)
        if response.status_code != 200:
            print('END: request status:', response.status_code)
            break
        soup = BeautifulSoup(response.text, 'html.parser')
        data = soup.select('.result-info')
        if not data:
            print('END: no data')
            break
        for container in data:
            date = container.select('.result-date')[0].text
            try:
                location = container.select('.result-hood')[0].text
            except:
                try:
                    location = container.select('.nearby')[0].text
                except:
                    location = 'NULL'
            # location = location.replace(",", " ")  # not needed with csvwriter
            title = container.select('.result-title')[0].text
            try:
                price = container.select('.result-price')[0].text
            except:
                price = "NULL"
            # title = title.replace(",", " ")  # not needed with csvwriter
            print(date, location, title, price)
            csvwriter.writerow([date, location, title, price])
        offset += 120
    f.close()
Disclaimer: The technical posts on this site follow the CC BY-SA 4.0 license. If you need to republish, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.