Scrape data from a website with images and data

I need to get all of the following data from https://www.airliners.net/. After clicking "Last 7 days", a list of aircraft photos appears. Is it possible to loop through all of them? For example, for the first image I would like to get:

Aeroflot-Russian Airlines / Sukhoi SSJ-100-95-LR-100 Superjet 100 (RRJ-95LR) / 
Moscow - Sheremetyevo (SVO / UUEE) / Russia - May 5, 2019 / REG: RA-89098 / MSN: 95135

In this example there are 56 pages to loop through. At present I would have to spend my whole weekend copying and pasting for my aviation project. I am hoping there is a way to do this with Python.

I tried some web scraping code, but could not get it to work.

I would like the data to be saved to a comma-delimited (CSV) file if possible.

This could help. It is not 100% tested, but it is a start.

# -*- coding: utf-8 -*-

import pandas
import requests
import lxml.html

data = []
with requests.Session() as session:

    loop = 1
    while True:

        response = session.get('https://www.airliners.net/search', headers={
                'authority': 'www.airliners.net',
                'upgrade-insecure-requests': '1',
                'dnt': '1',
                'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
                'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
                'referer': 'https://www.airliners.net/',
                'accept-encoding': 'gzip, deflate, br',
                'accept-language': 'es-ES,es;q=0.9,en;q=0.8',
            }, params=(
                ('dateAccepted', '7'),
                ('sortBy', 'viewCount'),
                ('page', loop)
            ))

        root = lxml.html.fromstring(response.content.strip().decode("utf-8"))
        elements = root.xpath('//div[@class="ps-v2-results ps-v2-results-display-detail photo-grid"]/div')

        if len(elements) == 0:
            break

        # Each element is one result row; spacer rows are skipped
        for index,row in enumerate(elements):
            element = row.xpath('div/div')

            if 'spacer' in row.xpath('@class')[0]:
                continue

            caption = None 
            state = None 

            try:
                photo,aircraft,id_number,location_date,photographer = element
            except ValueError:
                # ps-v2-results-col ps-v2-results-col-caption
                photo,aircraft,id_number,location_date,photographer,caption = element

            # ps-v2-results-col ps-v2-results-col-photo
            # ps-v2-results-col ps-v2-results-col-aircraft
            # ps-v2-results-col ps-v2-results-col-id-numbers
            # ps-v2-results-col ps-v2-results-col-location-date
            # ps-v2-results-col ps-v2-results-col-photographer

            # Photo
            photo = photo.xpath('div[2]/div/a/img/@src')[0].strip()

            # Aircraft
            try:
                aircraft = aircraft.xpath('div[2]/div/div[2]/a/text()')[0].strip()
            except IndexError:
                aircraft = None

            # Reg , MSN
            try:
                reg,msn = id_number.xpath('div[2]/div/div')
                reg = reg.xpath('a/text()')[0].strip()
                msn = msn.xpath('a/text()')[0].strip()

            except ValueError:

                try:
                    reg = id_number.xpath('div[2]/div/div/a/text()')[0].strip()
                except IndexError:
                    reg = None 

                msn = None 

            # Location, Date
            city,date = location_date.xpath('div[2]/div/div')
            city = city.xpath('a/text()')[0].strip()

            # The second block links to (country, date) or (state, country, date)
            links = [a.text_content().strip() for a in date.xpath('a')]
            if len(links) == 2:
                country,date = links
            elif len(links) == 3:
                state,country,date = links
            else:
                country,date = (None,None)

            # Photographer
            photographer = photographer.xpath('div[2]/div/div/div/div/div[1]/a/text()')[0].strip()

            # Caption
            if caption is not None:
                caption = caption.xpath('div[2]/text()')[0].strip()

            data.append({
                'photo' :photo,
                'aircraft' : aircraft,
                'reg' : reg,
                'msn' : msn,
                'city' : city,
                'date': date,
                'country': country,
                'photographer': photographer,
                'caption' : caption,
                'state' : state
            })


        print('LOOP', loop)
        loop += 1

print("Total ", len(data), "items")
df = pandas.DataFrame(data)
df.to_csv('data.csv', encoding='utf-8', index=False)

LOG:

LOOP 1
LOOP 2
LOOP 3
LOOP 4
LOOP 5
LOOP 6
LOOP 7
LOOP 8
LOOP 9
LOOP 10
LOOP 11
LOOP 12
LOOP 13
LOOP 14
LOOP 15
LOOP 16
LOOP 17
LOOP 18
LOOP 19
LOOP 20
LOOP 21
LOOP 22
LOOP 23
LOOP 24
LOOP 25
LOOP 26
LOOP 27
LOOP 28
LOOP 29
LOOP 30
LOOP 31
LOOP 32
LOOP 33
LOOP 34
LOOP 35
LOOP 36
LOOP 37
LOOP 38
LOOP 39
LOOP 40
LOOP 41
LOOP 42
LOOP 43
LOOP 44
LOOP 45
LOOP 46
LOOP 47
LOOP 48
LOOP 49
LOOP 50
LOOP 51
LOOP 52
LOOP 53
LOOP 54
LOOP 55
LOOP 56
Total  2009 items
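
The log stops at page 56 because the loop breaks as soon as a page yields no result rows. As a rough sketch of that stop condition in isolation (Python 3), the page fetcher can be passed in as a function. That makes it easy to add a page cap and a pause between requests, and to try the loop without touching the network. The names `collect_pages`, `fetch_page`, `max_pages`, `delay` and `fake_fetch` are illustrative assumptions and do not come from the script above.

```python
import time


def collect_pages(fetch_page, max_pages=500, delay=0.0):
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty.

    fetch_page is a hypothetical callable: it takes a page number and
    returns a list of row dicts (an empty list means "no more results").
    """
    rows = []
    for page in range(1, max_pages + 1):
        page_rows = fetch_page(page)
        if not page_rows:
            break  # same stop condition as `len(elements) == 0`
        rows.extend(page_rows)
        if delay:
            time.sleep(delay)  # optional pause between requests
    return rows


# Stand-in fetcher for a dry run: three pages of two rows each, then nothing.
def fake_fetch(page):
    if page > 3:
        return []
    return [{'page': page, 'n': n} for n in (1, 2)]


rows = collect_pages(fake_fetch)
print("Total ", len(rows), "items")
```

In a real run, `fetch_page` would wrap the `session.get(...)` call and the XPath parsing, and a `delay` of a second or so would be gentler on the site.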

CSV:

[Screenshot of the resulting data.csv]
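
To get from data.csv back to a slash-separated line like the one in the question, the rows can be read with the standard `csv` module and joined. This is only a sketch: it assumes the column names written by the script (`aircraft`, `city`, `country`, `date`, `reg`, `msn`), and the `format_row` helper is made up for illustration. The sample row reuses the values from the question, and how the site actually splits them across columns is an assumption.

```python
import csv
import io


def format_row(row):
    """Join one CSV row into a single slash-separated line.

    Empty fields (for example a missing MSN) are skipped.
    """
    place = ' - '.join(v for v in (row.get('country'), row.get('date')) if v)
    parts = [
        row.get('aircraft'),
        row.get('city'),
        place,
        'REG: ' + row['reg'] if row.get('reg') else '',
        'MSN: ' + row['msn'] if row.get('msn') else '',
    ]
    return ' / '.join(p for p in parts if p)


# In-memory stand-in for data.csv; values are taken from the question's example.
sample = io.StringIO(
    'aircraft,city,country,date,reg,msn\r\n'
    '"Sukhoi SSJ-100-95-LR-100 Superjet 100 (RRJ-95LR)",'
    '"Moscow - Sheremetyevo (SVO / UUEE)",Russia,"May 5, 2019",RA-89098,95135\r\n'
)

lines = [format_row(row) for row in csv.DictReader(sample)]
print(lines[0])
```

To run it against the real file, `sample` could be replaced with `open('data.csv', newline='', encoding='utf-8')`.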
