
I think I have a pagination problem when I do web scraping

I want to get every listing URL (urll = base_url + a_['href']), and I paginate the search results with (URL + str(page)), but I think something is wrong: when I scrape 10 pages (for page in range(1, 11):) it only gives me 55 rows, but it should be 260. I don't know what the problem is.

import requests
from bs4 import BeautifulSoup as bs
import bs4
import pandas as pd

URL = 'https://yeniemlak.az/elan/axtar?emlak=1&elan_nov=1&seher%5B%5D=0&metro%5B%5D=0&qiymet=&qiymet2=&mertebe=&mertebe2=&otaq=&otaq2=&sahe_m=&sahe_m2=&sahe_s=&sahe_s2=&page='

base_url = 'https://yeniemlak.az/'

urla =[]
featuress = []

for page in range(6,11):
    result = requests.get(URL + str(page))
    soup = bs(result.text, 'html.parser')
    case = soup.find_all('table', class_ = 'list')
    for fix in case:
        a_ = fix.find('a')
        urll = base_url + a_['href']
        URLL = requests.get(urll)
        soup = bs(URLL.text, 'html.parser')
        aa = soup.find_all('div', class_ = 'box')
        for iss in aa:
            feature = (aa[0].text)
            if 'Təmirli' in feature:
                Təmiri  = 1
            else:
                Təmiri = 0    
            urla.append(urll)
            featuress.append(Təmiri)            
            df = pd.DataFrame({'URL':urla,'Təmiri':featuress})
            df = df.drop_duplicates() 
            df.to_excel('jdjd.xlsx', index = False)


The site has DDoS protection, so when the server receives a lot of traffic from one IP it blocks service to that IP, which makes requests an unviable approach. An alternative is to scrape the data with selenium, since that works for some sites with Cloudflare DDoS protection, such as https://www.askgamblers.com/online-casinos/reviews/casino-friday. Hope this helps. Happy coding :)
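
For illustration, a minimal sketch of that selenium route, assuming a local Chrome/chromedriver setup (Selenium 4 can locate the driver itself); the URL and selectors are taken from the question, and no waits or anti-bot tuning are attempted, so heavily protected pages may still block it:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

URL = 'https://yeniemlak.az/elan/axtar?emlak=1&elan_nov=1&seher%5B%5D=0&metro%5B%5D=0&qiymet=&qiymet2=&mertebe=&mertebe2=&otaq=&otaq2=&sahe_m=&sahe_m2=&sahe_s=&sahe_s2=&page='
base_url = 'https://yeniemlak.az/'

options = Options()
# Headless mode is more likely to be flagged by anti-bot checks; drop this
# argument if pages come back empty.
options.add_argument('--headless=new')

driver = webdriver.Chrome(options=options)
try:
    for page in range(1, 11):
        driver.get(URL + str(page))
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        for table in soup.find_all('table', class_='list'):
            a_ = table.find('a')
            if a_ is not None:
                print(base_url + a_['href'])
finally:
    driver.quit()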

There is a problem with requests itself: you have to use a client that supports HTTP/2, because that's what the site is using.
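
As a quick check (a sketch, assuming httpx is installed with its HTTP/2 extra, e.g. pip install "httpx[http2]"), you can see which protocol the server negotiates:

import httpx

with httpx.Client(http2=True) as client:
    r = client.get('https://yeniemlak.az/')
    print(r.http_version)  # "HTTP/2" when the server negotiates it, else "HTTP/1.1"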

For example, you can use httpx as shown below. Don't run it with more concurrency unless you use rotating proxies.

import httpx
import trio
from bs4 import BeautifulSoup, SoupStrainer
import pandas as pd
from urllib.parse import urljoin

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Pragma": "no-cache",
    "Cache-Control": "no-cache"
}

# Allow only one request in flight at a time, to avoid tripping the site's rate limiting.
limiter = trio.CapacityLimiter(1)


async def get_soup(content):
    # Parse only the listing tables to keep the parse tree small.
    return BeautifulSoup(content, 'html.parser', parse_only=SoupStrainer('table', attrs={'class': 'list'}))


async def worker(client, page, sender):
    # One task per page; the limiter serializes the requests, and closing the
    # cloned sender tells the receiver when all pages are done.
    async with limiter, sender:
        params = {
            "elan_nov": "1",
            "emlak": "1",
            "mertebe": "",
            "mertebe2": "",
            "metro[]": "0",
            "otaq": "",
            "otaq2": "",
            "page": page,
            "qiymet": "",
            "qiymet2": "",
            "sahe_m": "",
            "sahe_m2": "",
            "sahe_s": "",
            "sahe_s2": "",
            "seher[]": "0"
        }
        # Retry until the request succeeds.
        while True:
            try:
                r = await client.get('axtar', params=params)
                if r.is_success:
                    break
            except httpx.RequestError:
                continue
        soup = await get_soup(r.content)
        await sender.send([urljoin(str(client.base_url), x['href'])
                           for x in soup.select('td[rowspan="2"] > a')])


async def main():
    # HTTP/2 client with browser-like headers; relative paths like 'axtar'
    # resolve against base_url.
    async with httpx.AsyncClient(headers=headers, http2=True, base_url='https://yeniemlak.az/elan/') as client, trio.open_nursery() as nurse:
        sender, receiver = trio.open_memory_channel(0)
        nurse.start_soon(rec, receiver)
        async with sender:
            for page in range(1, 11):
                nurse.start_soon(worker, client, page, sender.clone())
                await trio.sleep(1)


async def rec(receiver):
    # Collect links from every worker until all sender clones have closed.
    allin = []
    async with receiver:
        async for val in receiver:
            allin += val
    df = pd.DataFrame(allin, columns=['URL'])
    print(df)

if __name__ == "__main__":
    trio.run(main)

Output:

                                                   URL
0    https://yeniemlak.az/elan/satilir-2-otaqli-bin...
1    https://yeniemlak.az/elan/satilir-5-otaqli-bin...
2    https://yeniemlak.az/elan/satilir-3-otaqli-bin...
3    https://yeniemlak.az/elan/satilir-3-otaqli-bin...
4    https://yeniemlak.az/elan/satilir-2-otaqli-bin...
..                                                 ...
245  https://yeniemlak.az/elan/satilir-2-otaqli-bin...
246  https://yeniemlak.az/elan/satilir-2-otaqli-bin...
247  https://yeniemlak.az/elan/satilir-3-otaqli-bin...
248  https://yeniemlak.az/elan/satilir-3-otaqli-bin...
249  https://yeniemlak.az/elan/satilir-3-otaqli-bin...

[250 rows x 1 columns]
