
Loading a CSV file from Blob Storage Container using PySpark

I am unable to load a CSV file directly from Azure Blob Storage into an RDD using PySpark in a Jupyter Notebook.

I have read through just about all of the other answers to similar problems, but I haven't found specific instructions for what I am trying to do. I know I could also load the data into the Notebook using Pandas, but then I would need to convert the Pandas DF into an RDD afterwards.

My ideal solution would look something like this, but this specific code gives me the error that it can't infer a schema for CSV.

#Load Data
source = <Blob SAS URL>
elog = spark.read.format("csv").option("inferSchema", "true").option("url", source).load()

I have also taken a look at this answer: reading a csv file from azure blob storage with PySpark, but I am having trouble defining the correct path.

Thank you very much for your help!

Here is my sample code that uses Pandas to read a blob URL with a SAS token and then converts the Pandas dataframe to a PySpark one.

First, get a Pandas dataframe object by reading the blob URL.

import pandas as pd

# The SAS token embedded in the URL authorizes the HTTP request,
# so pandas can read the blob directly.
source = '<a csv blob url with SAS token>'
df = pd.read_csv(source)
print(df)

Then, you can convert it to a PySpark dataframe.

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session, then build a Spark dataframe
# from the Pandas one.
spark = SparkSession.builder.appName("testDataFrame").getOrCreate()
spark_df = spark.createDataFrame(df)
spark_df.show()
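Since you mentioned needing an RDD rather than a dataframe: a Spark dataframe exposes its underlying RDD of Row objects through the .rdd attribute. A minimal sketch (the variable name is just for illustration):

# A Spark dataframe wraps an RDD of pyspark.sql.Row objects; .rdd exposes it.
elog_rdd = spark_df.rdd
print(elog_rdd.take(5))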

Or, you can get the same result with the older SQLContext API below.

from pyspark.sql import SQLContext
from pyspark import SparkContext

# Older API: build an SQLContext on top of a SparkContext.
# Note: SparkContext() raises an error if a context already exists
# (as it often does in a notebook); SparkContext.getOrCreate() is safer there.
sc = SparkContext()
sqlContext = SQLContext(sc)
spark_df = sqlContext.createDataFrame(df)
spark_df.show()
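If you would rather skip the Pandas round trip and load the CSV straight into Spark, a common pattern is to register the SAS token with the wasbs connector and pass a wasbs:// path to load(). This is only a sketch, assuming the hadoop-azure package is available on your cluster; <container name>, <account name>, <sas token>, and <path to csv> are placeholders, not values from your setup.

# Sketch: direct load through the wasbs connector.
# Assumes hadoop-azure is on the classpath; all angle-bracket values are placeholders.
spark.conf.set(
    "fs.azure.sas.<container name>.<account name>.blob.core.windows.net",
    "<sas token>")
elog = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .load("wasbs://<container name>@<account name>.blob.core.windows.net/<path to csv>")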

Hope it helps.
