
How to get pandas dataframe using pyspark

I want to convert a "pyspark.sql.dataframe.DataFrame" to pandas. On the last line, the error "ConnectionRefusedError: [WinError 10061] Connection failed because the destination computer refused the connection" occurred. How can I fix it?

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, Row
import pandas as pd
import numpy as np
import os
import sys

# spark setting
# local
conf = SparkConf().set("spark.driver.host", "127.0.0.1")
sc = SparkContext(conf=conf)

# session
spark = SparkSession.builder.master("local[1]").appName("test_name").getOrCreate()

# file
path = "./data/fhvhv_tripdata_2022-10.parquet"
# add the header option if the file has a header
data = spark.read.option("header", True).parquet(path)

# Error occurred here
pd_df = data.toPandas()



First, ensure you're running PySpark 3.2 or higher, as that's the version where Koalas was merged natively (as the pandas API on Spark).

Then, connection errors can have many causes, but they have nothing to do with pandas. Your code is correct; it's the network/configuration that is not. For example, on Windows you'll need to configure an external binary called winutils.
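As a sketch of the winutils setup: after downloading winutils.exe for your Hadoop version, point HADOOP_HOME at the folder that contains it (the `C:\hadoop` path below is a hypothetical example, not a required location):

```python
import os

# Hypothetical install location -- HADOOP_HOME must be the directory
# that contains bin\winutils.exe for your Hadoop version.
os.environ["HADOOP_HOME"] = r"C:\hadoop"
# Make winutils.exe discoverable on the PATH as well.
os.environ["PATH"] += os.pathsep + r"C:\hadoop\bin"
```

Set these before creating the SparkSession (or set them system-wide in the Windows environment variables dialog).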

Note: You don't need a SparkContext here. You can pass options via SparkSession builder.

Otherwise, if you're not using Hadoop, don't use Spark at all; see How to read a Parquet file into Pandas DataFrame?
