Unable to plot a Pandas dataframe in Jupyter notebook

I am coding in a Jupyter notebook that I opened through a GCP cluster. I am reading data in from BigQuery using the Spark-BigQuery connector. I'm trying to take a subset of this data and plot it, but whenever I run the command, the kernel disconnects and reconnects. This has happened before when I was doing something wrong and hadn't noticed, so I know it isn't just disconnecting at random. But in this case, I really have no idea what I'm doing wrong.

What I'm doing is very similar to a tutorial on GitHub. I read the data into a Spark DataFrame, then convert it into a Pandas dataframe and try to plot it. This is where the error occurs.

I've experimented with different sized subsets, so I know this isn't happening because my dataset is too big. I've also tried creating a "test" dataframe with random numbers and plotting that, and it works perfectly. So it has to be a problem with my dataset; I'm just not sure what. Code below:

Reading the data in:

import pandas as pd
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .appName('Jupyter BigQuery Storage')\
  .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar') \
  .getOrCreate()

table = "bigquery-public-data.ncaa_basketball.mbb_pbp_sr"
df = spark.read \
  .format("bigquery") \
  .option("table", table) \
  .load()
df.printSchema()

df.createOrReplaceTempView('df')

query_string = """
    SELECT event_type,
    season,
    type,
    team_alias,
    team_market,
    team_name,
    team_basket,
    event_id,
    event_coord_x,
    event_coord_y,
    three_point_shot,
    shot_made
    FROM df
    WHERE type = "fieldgoal"
        AND event_coord_x IS NOT NULL
        AND event_coord_y IS NOT NULL
    ORDER BY season
"""

df_shots = spark.sql(query_string)
df_shots.orderBy("season", "event_id").toPandas().head(5)

import matplotlib.pyplot as plt
%matplotlib inline

df_test = df_shots.toPandas()

df_test.plot(x='event_coord_x', y='event_coord_y', kind='line', figsize=(12, 6))

The output for the last part is just:

<matplotlib.axes._subplots.AxesSubplot at 0x7f355a732950>

And then the kernel disconnects and reconnects. For reference, both event_coord_x and event_coord_y are of type float64. I don't see why that would cause any problems, but I even tried converting them to integers before plotting, and the issue still arises.

I have a feeling that this may be something really trivial, but right now I'm stumped. Sorry I don't have anything specific like an error message (because there isn't one). Any suggestions would be immensely helpful.

When using the Cloud Dataproc 1.5 image version, the kernel appears to die and restart while plotting the figure. This can be seen in the Jupyter logs. The problem is connected to Apache Knox, which is used by the Cloud Dataproc cluster.

Knox limits the websocket message size to its buffer size, which is insufficient for some Jupyter interactions. This should be fixed in the next image release.
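To get a rough feel for how much data a plot sends back over the websocket, one option is to render the same figure to an in-memory PNG and look at its size. This is only a sketch: `df_demo` and `png_payload_bytes` are hypothetical names, the random data stands in for the real `df_test`, and the actual message sent by the inline backend will differ somewhat (it is base64-encoded and cropped differently). The Knox buffer size itself is not stated here, so no threshold is assumed.

```python
import io

import matplotlib
matplotlib.use("Agg")  # only for running as a script; in the notebook keep %matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def png_payload_bytes(frame, figsize):
    """Render the line plot to an in-memory PNG and return its size in bytes."""
    ax = frame.plot(x="event_coord_x", y="event_coord_y", kind="line", figsize=figsize)
    fig = ax.get_figure()
    buffer = io.BytesIO()
    fig.savefig(buffer, format="png")
    plt.close(fig)  # free the figure so repeated calls do not accumulate memory
    return buffer.getbuffer().nbytes


# Hypothetical stand-in for df_test (random coordinates, not real shot data)
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "event_coord_x": rng.uniform(0, 1000, size=1000),
    "event_coord_y": rng.uniform(0, 600, size=1000),
})

size_large = png_payload_bytes(df_demo, (12, 6))
size_small = png_payload_bytes(df_demo, (6, 3))
print(f"figsize=(12, 6): {size_large} bytes")
print(f"figsize=(6, 3):  {size_small} bytes")
```

A dense line plot like this one typically produces a noticeably larger PNG at the bigger figure size, which may explain why a small random "test" dataframe plots fine while the real data does not.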

For now, the workaround is to use the Cloud Dataproc 1.4 image version or to change the figsize parameter to smaller values.
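A minimal sketch of the figsize workaround, assuming the same column names as in the question. The dataframe `df_demo` and the value `(6, 3)` are illustrative assumptions; the size that stays under the Knox limit on a given cluster may need some trial and error.

```python
import matplotlib
matplotlib.use("Agg")  # only for running as a script; in the notebook keep %matplotlib inline
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_test (random coordinates, not real shot data)
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "event_coord_x": rng.uniform(0, 1000, size=1000),
    "event_coord_y": rng.uniform(0, 600, size=1000),
})

# Same plot call as in the question, but with a smaller figure than (12, 6)
ax = df_demo.plot(x="event_coord_x", y="event_coord_y", kind="line", figsize=(6, 3))
fig = ax.get_figure()
```

In the notebook, the same change amounts to replacing `figsize=(12,6)` in the original `plot` call with a smaller pair of values.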
