
Scrape data from a website with images and data

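I need to get all of the following data from https://www.airliners.net/. On https://www.airliners.net/, click "last 7 days"; a list of aircraft photos then appears. Is it possible to loop through all of them? For example, for the first photo I want to get: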

Aeroflot-Russian Airlines / Sukhoi SSJ-100-95-LR-100 Superjet 100 (RRJ-95LR) / 
Moscow - Sheremetyevo (SVO / UUEE) / Russia - May 5, 2019 / REG: RA-89098 / MSN: 95135

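In this example there are 56 pages to loop through. At the moment I have to spend a whole weekend copying and pasting for my aviation project. I hope this can be solved with Python.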

I tried some web-scraping code but could not get it to work.

I would like to save the data in a comma-separated (CSV) file, if possible.

This may help. It is not 100% tested, but it is a starting point.

# -*- coding: utf-8 -*-

import pandas
import requests
import lxml.html

data = []
with requests.Session() as session:

    loop = 1
    while True:

        response = session.get('https://www.airliners.net/search', headers={
                'authority': 'www.airliners.net',
                'upgrade-insecure-requests': '1',
                'dnt': '1',
                'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
                'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
                'referer': 'https://www.airliners.net/',
                # 'br' left out: requests only decodes Brotli when a brotli package is installed
                'accept-encoding': 'gzip, deflate',
                'accept-language': 'es-ES,es;q=0.9,en;q=0.8',
            }, params=(
                ('dateAccepted', '7'),
                ('sortBy', 'viewCount'),
                ('page', loop)
            ))

        root = lxml.html.fromstring(response.content.strip().decode("utf-8"))
        elements = root.xpath('//div[@class="ps-v2-results ps-v2-results-display-detail photo-grid"]/div')

        if len(elements) == 0:
            break

        # Each element 
        for index,row in enumerate(elements):
            element = row.xpath('div/div')

            if 'spacer' in row.xpath('@class')[0]:
                continue

            caption = None 
            state = None 

            try:
                photo,aircraft,id_number,location_date,photographer = element
            except ValueError:
                # ps-v2-results-col ps-v2-results-col-caption
                photo,aircraft,id_number,location_date,photographer,caption = element

            # ps-v2-results-col ps-v2-results-col-photo
            # ps-v2-results-col ps-v2-results-col-aircraft
            # ps-v2-results-col ps-v2-results-col-id-numbers
            # ps-v2-results-col ps-v2-results-col-location-date
            # ps-v2-results-col ps-v2-results-col-photographer

            # Photo
            photo = photo.xpath('div[2]/div/a/img/@src')[0].strip()

            # Aircraft
            try:
                aircraft = aircraft.xpath('div[2]/div/div[2]/a/text()')[0].strip()
            except IndexError:
                aircraft = None

            # Reg , MSN
            try:
                reg,msn = id_number.xpath('div[2]/div/div')
                reg = reg.xpath('a/text()')[0].strip()
                msn = msn.xpath('a/text()')[0].strip()

            except ValueError:

                try:
                    reg = id_number.xpath('div[2]/div/div/a/text()')[0].strip()
                except IndexError:
                    reg = None 

                msn = None 

            # Location, Date
            city,date = location_date.xpath('div[2]/div/div')
            city = city.xpath('a/text()')[0].strip()

            # The second block holds either (country, date) or (state, country, date) links
            links = date.xpath('a')
            if len(links) == 2:
                country,date = links
            elif len(links) == 3:
                state,country,date = links
            else:
                country,date = (None,None)

            # Convert whichever link elements were found to plain text
            if state is not None:
                state = state.xpath('text()')[0].strip()
            if country is not None:
                country = country.xpath('text()')[0].strip()
            if date is not None:
                date = date.xpath('text()')[0].strip()

            # Photographer
            photographer = photographer.xpath('div[2]/div/div/div/div/div[1]/a/text()')[0].strip()

            # Caption
            if caption is not None:
                caption = caption.xpath('div[2]/text()')[0].strip()

            data.append({
                'photo' :photo,
                'aircraft' : aircraft,
                'reg' : reg,
                'msn' : msn,
                'city' : city,
                'date': date,
                'country': country,
                'photographer': photographer,
                'caption' : caption,
                'state' : state
            })


        print('LOOP', loop)
        loop += 1

print("Total " , len(data), "items")
df = pandas.DataFrame(data)
df.to_csv('data.csv',encoding='utf-8',index= False)
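
The script above keeps requesting pages until one comes back with no result rows, then writes everything to CSV. Here is a minimal sketch of that stop condition and of the CSV step, kept separate from the network code so it can be tried offline. The names `collect_pages`, `rows_to_csv`, `fake_fetch` and `max_pages` are hypothetical and not part of the answer. The standard `csv` module is used here in place of pandas, and the rows are made up.

```python
import csv
import io


def collect_pages(fetch_page, max_pages=1000):
    """Call fetch_page(1), fetch_page(2), ... until a page returns no rows."""
    rows = []
    for page in range(1, max_pages + 1):
        page_rows = fetch_page(page)
        if not page_rows:
            # An empty page means we ran past the last page of results.
            break
        rows.extend(page_rows)
    return rows


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with the standard csv module."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-in for the real HTTP request: three pages of made-up rows, then nothing.
def fake_fetch(page):
    if page > 3:
        return []
    return [{'reg': 'REG-%d' % page, 'msn': None}]


rows = collect_pages(fake_fetch)
csv_text = rows_to_csv(rows, ['reg', 'msn'])
```

The `max_pages` cap is only a safety net in case the site never returns an empty page. In a real run, `fetch_page` would wrap the `session.get(...)` call and the XPath parsing from the script.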

Log:

LOOP 1
LOOP 2
LOOP 3
LOOP 4
LOOP 5
LOOP 6
LOOP 7
LOOP 8
LOOP 9
LOOP 10
LOOP 11
LOOP 12
LOOP 13
LOOP 14
LOOP 15
LOOP 16
LOOP 17
LOOP 18
LOOP 19
LOOP 20
LOOP 21
LOOP 22
LOOP 23
LOOP 24
LOOP 25
LOOP 26
LOOP 27
LOOP 28
LOOP 29
LOOP 30
LOOP 31
LOOP 32
LOOP 33
LOOP 34
LOOP 35
LOOP 36
LOOP 37
LOOP 38
LOOP 39
LOOP 40
LOOP 41
LOOP 42
LOOP 43
LOOP 44
LOOP 45
LOOP 46
LOOP 47
LOOP 48
LOOP 49
LOOP 50
LOOP 51
LOOP 52
LOOP 53
LOOP 54
LOOP 55
LOOP 56
Total  2009 items

CSV:

[screenshot of the resulting data.csv]
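
The question asks for one slash-separated line per photo. Here is a small sketch of how a row from the CSV could be turned back into that form. `format_row` is a hypothetical helper, and the sample values are copied from the example in the question. Which text the scraper actually puts in each column is an assumption, and the airline name is not among the scraped columns, so it is left out.

```python
def format_row(row):
    """Join the available fields of one scraped row into a single line."""
    parts = [row.get('aircraft'), row.get('city')]
    # Country and date share one segment in the question's example.
    country_date = ' - '.join(
        value for value in (row.get('country'), row.get('date')) if value
    )
    parts.append(country_date)
    if row.get('reg'):
        parts.append('REG: ' + row['reg'])
    if row.get('msn'):
        parts.append('MSN: ' + row['msn'])
    # Skip anything that is missing (None or empty string).
    return ' / '.join(part for part in parts if part)


sample = {
    'aircraft': 'Sukhoi SSJ-100-95-LR-100 Superjet 100 (RRJ-95LR)',
    'city': 'Moscow - Sheremetyevo (SVO / UUEE)',
    'country': 'Russia',
    'date': 'May 5, 2019',
    'reg': 'RA-89098',
    'msn': '95135',
}
line = format_row(sample)
```

Missing fields are skipped instead of being printed as `None`, which matches the script storing `None` when a registration or MSN link is absent.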

