I'm working on an API using FastAPI that lets users request a Spark DataFrame and download it as a Parquet file. I can't quite figure out how to deliver the file to the user in Parquet format, for a few reasons:
df.write.parquet('out/path.parquet')

writes the data into a directory at out/path.parquet rather than a single file, which presents a challenge when I try to pass it to starlette.responses.FileResponse. On top of that, FileResponse just seems to print the binary to my console (as demonstrated in my code below). Is this even possible in FastAPI? Is it possible in Flask using send_file()?
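For context on the first point: Spark's df.write.parquet creates a directory containing one or more part files plus a _SUCCESS marker, and the part filenames are generated by Spark, so they can't be hard-coded. A minimal sketch (the helper name find_parquet_part is my own, not from the post) that globs for the actual data file inside that directory, so its path could be handed to FileResponse:

```python
import glob
import os

def find_parquet_part(output_dir: str) -> str:
    """Return the first part file inside a Spark parquet output directory.

    Spark writes e.g. out.parquet/part-00000-<uuid>.snappy.parquet, so the
    exact filename can't be guessed in advance; glob for it instead.
    """
    parts = sorted(glob.glob(os.path.join(output_dir, "part-*.parquet")))
    if not parts:
        raise FileNotFoundError(f"no part files found under {output_dir!r}")
    return parts[0]
```

Calling df.coalesce(1) before the write should make Spark emit a single part file, which keeps this lookup unambiguous.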
Here's the code I have so far. Note that I've tried a few things like the commented code to no avail.
import tempfile

from fastapi import APIRouter
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from starlette.responses import FileResponse

router = APIRouter()

sc = SparkContext('local')
spark = SparkSession(sc)

df = spark.read.parquet('gs://my-bucket/sample-data/my.parquet')

@router.get("/applications")
def applications():
    # Spark writes a directory named temp.parquet, not a single file.
    df.write.parquet("temp.parquet", compression="snappy")
    # The part file's actual name is generated by Spark, so this guess fails.
    return FileResponse("part-some-compressed-file.snappy.parquet")
    # with tempfile.TemporaryFile() as f:
    #     f.write(df.rdd.saveAsPickleFile("temp.parquet"))
    #     return FileResponse("test.parquet")
Thanks!
Edit: I tried using the answers and info provided here, but I can't quite get it working.
I was able to solve the issue, but it's far from elegant. If anyone can provide a solution that doesn't write to disk, I will greatly appreciate it, and will select your answer as the correct one.
I was able to serialize the DataFrame using df.rdd.saveAsPickleFile(), zip the resulting directory, pass it to a Python client, write the received zip file to disk, unzip it, and then use SparkContext().pickleFile before finally loading the DataFrame. Far from ideal, I think.
API:
import shutil
import tempfile

from fastapi import APIRouter
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from starlette.responses import FileResponse

router = APIRouter()

sc = SparkContext('local')
spark = SparkSession(sc)

df = spark.read.parquet('gs://my-bucket/my-file.parquet')

@router.get("/applications")
def applications():
    # Reserve a unique path, then let Spark create a directory there.
    temp_parquet = tempfile.NamedTemporaryFile()
    temp_parquet.close()
    df.rdd.saveAsPickleFile(temp_parquet.name)
    # Zip the pickle directory so there is a single file to serve.
    shutil.make_archive('test', 'zip', temp_parquet.name)
    return FileResponse('test.zip')
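The shutil.make_archive call above writes test.zip to disk next to the pickle directory. One way to avoid that second disk write (a sketch only; zip_dir_to_bytes is a name I made up, and returning the bytes via starlette's Response is an assumption, not something from the post) is to build the archive in memory:

```python
import io
import os
import zipfile

def zip_dir_to_bytes(directory: str) -> bytes:
    """Zip every file under `directory` into an in-memory archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(directory):
            for name in files:
                full = os.path.join(root, name)
                # Store paths relative to the directory being archived.
                zf.write(full, os.path.relpath(full, directory))
    return buf.getvalue()

# In the endpoint, the bytes could then be returned without touching disk, e.g.:
# from starlette.responses import Response
# return Response(zip_dir_to_bytes(temp_parquet.name), media_type="application/zip")
```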
Client:
import io
import zipfile

import requests
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local')
spark = SparkSession(sc)

response = requests.get("http://0.0.0.0:5000/applications")
file_like_object = io.BytesIO(response.content)
with zipfile.ZipFile(file_like_object) as z:
    z.extractall('temp.data')

rdd = sc.pickleFile("temp.data")
df = spark.createDataFrame(rdd)
print(df.head())
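On the client side, the unzip step can be isolated into a small helper (extract_zip_bytes is my own name, not from the post) that unpacks the response body into a fresh temporary directory, whose path can then be fed to sc.pickleFile:

```python
import io
import tempfile
import zipfile

def extract_zip_bytes(payload: bytes) -> str:
    """Extract a zip archive received as raw bytes into a new temp directory.

    Returns the directory path, suitable for e.g. sc.pickleFile(path).
    """
    target = tempfile.mkdtemp(prefix="pickle_df_")
    with zipfile.ZipFile(io.BytesIO(payload)) as z:
        z.extractall(target)
    return target
```

Using a unique temp directory per response avoids collisions with a fixed path like temp.data when the client is called more than once.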