
Pandas returning Empty Data Frame

I am trying to scrape a JavaScript-heavy website and get the contents of specific columns. The page needs to load and then navigate to a new page. I would like to extract the sport info from that page.

I am using Pandas, BeautifulSoup and Selenium.

Navigating to the next page and the loading wait times both work fine. Below is the BeautifulSoup code:

soup = BeautifulSoup(results.get_attribute("outerHTML"), 'html.parser')
time = []  # Time
sport = []  # Sport Name
description = []  # Sport Description

Below is the code that searches for the XPath of specific parts of the page.

# Programme time
for item in soup.select("guide___1Ogg9"):
    if item.find_next(find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[3]/div/ul/div[2]/ul/li/ul/li[2]/a/div[1]'):
        time.append(item.find_next(
            find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[3]/div/ul/div[2]/ul/li/ul/li[2]/a/div[1]').text.strip())
    else:
        time.append("Nan")

# Sport Name
    if item.find_next(find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[3]/div/ul/div[2]/ul/li/ul/li[1]/a/div/div[2]/div[1]/span'):
        sport.append(item.find_next(
            find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[3]/div/ul/div[2]/ul/li/ul/li[1]/a/div/div[2]/div[1]/span').text.strip())
    else:
        sport.append("Nan")

# Programme info
    if item.find_next(find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[4]/div[2]/div/div/ul/div[2]/ul/li/ul/li[4]/a/div[2]'):
        description.append(item.find_next(
            find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[4]/div[2]/div/div/ul/div[2]/ul/li/ul/li[4]/a/div[2]').text.strip())
    else:
        description.append("Nan")

Below is the code that writes all the data to the CSV file.

df = pd.DataFrame(
    {"Time": time, "Sport": sport, "Info": description})
print("Here is your data. Right I am off to sleep then!")

print(df)
df.to_csv("canalPlusSport.csv")

I have also tried searching by CSS_SELECTOR and CLASS_NAME.

The website is https://www.canalplus.com/programme-tv/

You're right that the site is JavaScript-heavy, but that often means there's an API on the backend. In this case, there is one.

You can use it to fetch the data you want.

Here's how:

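import datetime

import pendulum
import requests
from tabulate import tabulate

api_url = "https://secure-webtv-static.canal-plus.com/metadata/cpfra/all/v2.2/globalchannels.json"
# Fetch the full programme guide as JSON and fail early on HTTP errors.
response = requests.get(api_url, timeout=30)
response.raise_for_status()
data = response.json()

# Map each channel name to rows of [title, subtitle, start time, duration].
tv_programme = {
    channel["name"]: [
        [
            e["title"],
            e["subTitle"],
            pendulum.parse(e["timecodes"][0]["start"]).time().strftime("%H:%M"),
            # The duration is in milliseconds; drop any fractional seconds.
            str(datetime.timedelta(
                milliseconds=e["timecodes"][0]["duration"],
            )).split(".")[0],
        ] for e in channel["events"]
    ] for channel in data["channels"]
}


# "simple" is the layout tabulate falls back to for an unknown format name.
print(tabulate(
    tv_programme["CANAL+"],
    headers=["Title", "Subtitle", "Time", "Duration"],
    tablefmt="simple",
))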

This outputs the following (for CANAL+, but you can try any channel):

Title                                                                     Subtitle                         Time    Duration
------------------------------------------------------------------------  -------------------------------  ------  ----------
Canal Football Club - Samedi - 1re édition                                Mag Foot                         19:30   0:23:00
Avant-match Ligue 1                                                       Mag Foot                         19:58   0:04:36
Nice / Lyon                                                               16e journée                      20:02   0:50:00
Canal Football Club - Samedi - 2ème édition                               Mag Foot                         21:59   0:55:00
Zapsport                                                                  Mag Sport                        22:56   0:03:41
Le Plus                                                                   Le Show de Noël Must Go on Date  23:00   0:01:59
Le journal du hard                                                        Mag Adultes                      23:02   0:01:07
Une nuit à Budapest                                                       Film Adultes                     23:03   1:32:14
Furie                                                                     Film Suspense                    00:35   1:33:49
Zombi Child                                                               Film Emotion                     02:10   1:38:39
Veuillez parler sans arrêt et décrire vos expériences au fur et à mesure  Court-Metrage                    03:49   0:09:04
Le grand rendez-vous                                                      Court-Metrage                    03:58   0:05:39
Golf - US Open féminin                                                    3e tour                          04:05   1:08:26
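Since the question was about getting the data into a table and a CSV file, here is a minimal sketch of how the same events could be flattened into one row per programme. It runs on a small hand-made sample rather than the live API: the payload shape (an ISO 8601 `start` string and a millisecond `duration` inside `timecodes`) is assumed from the fields the code above reads, the date in the sample is made up, and the helper names `format_duration`, `event_rows` and `rows_to_csv` are hypothetical.

```python
import csv
import datetime
import io

# Hand-made sample shaped like the fields the answer's code reads.
# This is an assumption about the payload, not a captured API response.
SAMPLE = {
    "channels": [
        {
            "name": "CANAL+",
            "events": [
                {
                    "title": "Nice / Lyon",
                    "subTitle": "16e journée",
                    "timecodes": [
                        {"start": "2020-12-19T20:02:00+01:00", "duration": 3000000}
                    ],
                },
                {
                    "title": "Zapsport",
                    "subTitle": "Mag Sport",
                    "timecodes": [
                        {"start": "2020-12-19T22:56:00+01:00", "duration": 221000}
                    ],
                },
            ],
        },
    ],
}


def format_duration(milliseconds):
    """Render a millisecond duration as H:MM:SS, dropping fractional seconds."""
    return str(datetime.timedelta(seconds=milliseconds // 1000))


def event_rows(payload):
    """Flatten the payload into one dict per programme."""
    rows = []
    for channel in payload["channels"]:
        for event in channel["events"]:
            timecode = event["timecodes"][0]
            start = datetime.datetime.fromisoformat(timecode["start"])
            rows.append({
                "Channel": channel["name"],
                "Title": event["title"],
                "Subtitle": event["subTitle"],
                "Time": start.strftime("%H:%M"),
                "Duration": format_duration(timecode["duration"]),
            })
    return rows


def rows_to_csv(rows):
    """Serialise non-empty rows to CSV text; write the text to a file to keep it."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = event_rows(SAMPLE)
print(rows_to_csv(rows))
```

With pandas installed, `pd.DataFrame(rows).to_csv("canalPlusSport.csv", index=False)` should be able to take the place of `rows_to_csv`, since a list of dicts maps directly onto DataFrame columns.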

EDIT:

To list all the channels, just add this: print("\n".join(sorted(list(tv_programme.keys()))))

This prints:

6TER
AB1
ACTION
ALTICE STUDIO
ANIMAUX
ARTE
ASTROCENTER TV
AUTOMOTO LA CHAINE
BBC WORLD NEWS
BEIN SPORTS 1
BEIN SPORTS 2
BEIN SPORTS 3
BEIN SPORTS MAX 10
BEIN SPORTS MAX 4
BEIN SPORTS MAX 5
BEIN SPORTS MAX 6
BEIN SPORTS MAX 7
BEIN SPORTS MAX 8
BEIN SPORTS MAX 9
BET
BFM BUSINESS
BFM TV
BOB TV
BOING
BOOMERANG
BSMART TV
C8
C8 (CH)
CANAL 9
CANAL ALPHA NE
CANAL J
CANAL+
CANAL+ (CH)
CANAL+ CINEMA
CANAL+ CINEMA (CH)
CANAL+ DECALE
CANAL+ DECALE (CH)
CANAL+ FAMILY
CANAL+ FAMILY (CH)
CANAL+ FORMULA1
CANAL+ LIGUE1
CANAL+ MOTOGP
CANAL+ PREMIER LEAGUE
CANAL+ SERIES
CANAL+ SPORT
CANAL+ SPORT (CH)
CANAL+ TOP14
CANAL+ UHD
...
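Because the goal was sport listings, one way to narrow this list down is a plain substring match on the channel name. The sketch below uses a handful of names copied from the list above; the helper `sport_channels` and the "SPORT" keyword match are illustrative assumptions, not something the API itself provides.

```python
def sport_channels(names, keyword="SPORT"):
    """Return the sorted channel names that contain the keyword, ignoring case."""
    needle = keyword.upper()
    return sorted(name for name in names if needle in name.upper())


# A few names taken from the channel list above.
channel_names = [
    "ARTE",
    "BEIN SPORTS 1",
    "CANAL+",
    "CANAL+ SPORT",
    "CANAL+ SPORT (CH)",
    "C8",
]
print(sport_channels(channel_names))
```

Iterating over a dictionary yields its keys, so `sport_channels(tv_programme)` should give the names to look up in the programme dictionary.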
