I think I have a pagination problem when I do web scraping
I want to collect every listing URL (urll = base_url + a_['href']), and I paginate the first link with (URL + str(page)), but I think something is wrong: when I scrape 10 pages (for page in range(1, 11):) it gives me only 55 rows, but it should be 260 rows. I don't know what the problem is.
import requests
from bs4 import BeautifulSoup as bs
import bs4
import pandas as pd

URL = 'https://yeniemlak.az/elan/axtar?emlak=1&elan_nov=1&seher%5B%5D=0&metro%5B%5D=0&qiymet=&qiymet2=&mertebe=&mertebe2=&otaq=&otaq2=&sahe_m=&sahe_m2=&sahe_s=&sahe_s2=&page='
base_url = 'https://yeniemlak.az/'

urla = []
featuress = []

for page in range(6, 11):
    result = requests.get(URL + str(page))
    soup = bs(result.text, 'html.parser')
    case = soup.find_all('table', class_='list')
    for fix in case:
        a_ = fix.find('a')
        urll = base_url + a_['href']
        URLL = requests.get(urll)
        soup = bs(URLL.text, 'html.parser')
        aa = soup.find_all('div', class_='box')
        for iss in aa:
            feature = (aa[0].text)
            if 'Təmirli' in feature:
                Təmiri = 1
            else:
                Təmiri = 0
            urla.append(urll)
            featuress.append(Təmiri)

df = pd.DataFrame({'URL': urla, 'Təmiri': featuress})
df = df.drop_duplicates()
df.to_excel('jdjd.xlsx', index=False)
The site has DDoS protection, so when the server receives a lot of traffic from one IP it blocks service to that IP, which means using requests is not a viable approach. An alternative is to scrape the data with selenium, since it works on some sites with Cloudflare DDoS protection, for example https://www.askgamblers.com/online-casinos/reviews/casino-friday. Hope this helps. Happy coding :)
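For reference, a minimal sketch of that selenium approach (an assumption on my part, not tested against this site: it presumes selenium 4 with Chrome installed, the table.list markup from the question, and a query string trimmed to the essential parameters, so adjust as needed):

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()  # assumes Chrome is installed; selenium 4 manages the driver itself
urls = []
for page in range(1, 11):
    # Same search URL as in the question, paginated via the `page` parameter (trimmed query string)
    driver.get(f'https://yeniemlak.az/elan/axtar?emlak=1&elan_nov=1&page={page}')
    time.sleep(2)  # crude wait so any protection interstitial can pass
    for a in driver.find_elements(By.CSS_SELECTOR, 'table.list a'):
        href = a.get_attribute('href')  # absolute URL as resolved by the browser
        if href:
            urls.append(href)
driver.quit()
print(len(urls))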
There is a problem with your requests client itself: you have to use a client that supports http2, because the site is using it. For example, you can use httpx as shown below. Don't thread it unless you use rotating proxies.
import httpx
import trio
from bs4 import BeautifulSoup, SoupStrainer
import pandas as pd
from urllib.parse import urljoin

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Pragma": "no-cache",
    "Cache-Control": "no-cache"
}

# Only one request in flight at a time; raise this only if you rotate proxies.
limiter = trio.CapacityLimiter(1)


async def get_soup(content):
    # Parse only the listing tables instead of the whole page.
    return BeautifulSoup(content, 'html.parser', parse_only=SoupStrainer('table', attrs={'class': 'list'}))


async def worker(client, page, sender):
    async with limiter, sender:
        params = {
            "elan_nov": "1",
            "emlak": "1",
            "mertebe": "",
            "mertebe2": "",
            "metro[]": "0",
            "otaq": "",
            "otaq2": "",
            "page": page,
            "qiymet": "",
            "qiymet2": "",
            "sahe_m": "",
            "sahe_m2": "",
            "sahe_s": "",
            "sahe_s2": "",
            "seher[]": "0"
        }
        # Retry until the page is fetched successfully.
        while True:
            try:
                r = await client.get('axtar', params=params)
                if r.is_success:
                    break
            except httpx.RequestError:
                continue
        soup = await get_soup(r.content)
        await sender.send([urljoin(str(client.base_url), x['href'])
                           for x in soup.select('td[rowspan="2"] > a')])


async def main():
    async with httpx.AsyncClient(headers=headers, http2=True, base_url='https://yeniemlak.az/elan/') as client, trio.open_nursery() as nurse:
        sender, receiver = trio.open_memory_channel(0)
        nurse.start_soon(rec, receiver)
        async with sender:
            for page in range(1, 11):
                nurse.start_soon(worker, client, page, sender.clone())
                await trio.sleep(1)


async def rec(receiver):
    allin = []
    async with receiver:
        async for val in receiver:
            allin += val
    df = pd.DataFrame(allin, columns=['URL'])
    print(df)


if __name__ == "__main__":
    trio.run(main)
Output:
URL
0 https://yeniemlak.az/elan/satilir-2-otaqli-bin...
1 https://yeniemlak.az/elan/satilir-5-otaqli-bin...
2 https://yeniemlak.az/elan/satilir-3-otaqli-bin...
3 https://yeniemlak.az/elan/satilir-3-otaqli-bin...
4 https://yeniemlak.az/elan/satilir-2-otaqli-bin...
.. ...
245 https://yeniemlak.az/elan/satilir-2-otaqli-bin...
246 https://yeniemlak.az/elan/satilir-2-otaqli-bin...
247 https://yeniemlak.az/elan/satilir-3-otaqli-bin...
248 https://yeniemlak.az/elan/satilir-3-otaqli-bin...
249 https://yeniemlak.az/elan/satilir-3-otaqli-bin...
[250 rows x 1 columns]
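As a quick sanity check of the http2 claim, you can ask httpx which protocol the server actually negotiated (this needs the optional extra, installed via pip install httpx[http2]); a minimal sketch, assuming the site still negotiates HTTP/2:

import httpx

# Prints "HTTP/2" if the server negotiated it, otherwise "HTTP/1.1"
with httpx.Client(http2=True, headers={'User-Agent': 'Mozilla/5.0'}) as client:
    r = client.get('https://yeniemlak.az/')
    print(r.http_version)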