
Loading a CSV file from Blob Storage Container using PySpark

I am unable to load a CSV file directly from Azure Blob Storage into an RDD using PySpark in a Jupyter Notebook.

I have read through just about all of the other answers to similar problems, but I haven't found specific instructions for what I am trying to do. I know I could also load the data into the Notebook using Pandas, but then I would need to convert the Pandas DF into an RDD afterwards.

My ideal solution would look something like this, but this specific code gives me the error that it can't infer a schema for CSV.

#Load Data
source = <Blob SAS URL>
elog = spark.read.format("csv").option("inferSchema", "true").option("url", source).load()

I have also taken a look at this answer: reading a csv file from azure blob storage with PySpark, but I am having trouble defining the correct path.

Thank you very much for your help!

Here is my sample code that uses Pandas to read a blob URL with a SAS token and then converts the Pandas dataframe to a PySpark one.

First, get a Pandas dataframe object by reading the blob URL.

import pandas as pd

# The SAS token embedded in the URL authorizes the HTTP request,
# so pandas can read the blob directly.
source = '<a csv blob url with SAS token>'
df = pd.read_csv(source)
print(df)

Then, you can convert it to a PySpark dataframe.

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session, then build a Spark dataframe
# from the Pandas one.
spark = SparkSession.builder.appName("testDataFrame").getOrCreate()
spark_df = spark.createDataFrame(df)
spark_df.show()
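Since you mentioned needing an RDD rather than a dataframe: a Spark dataframe exposes its underlying RDD of Row objects through the .rdd attribute. A minimal sketch (the variable name is just for illustration):

# A Spark dataframe wraps an RDD of pyspark.sql.Row objects; .rdd exposes it.
elog_rdd = spark_df.rdd
print(elog_rdd.take(5))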

Or, you can get the same result with the older SQLContext API below.

from pyspark.sql import SQLContext
from pyspark import SparkContext

# Older API: build an SQLContext on top of a SparkContext.
# Note: SparkContext() raises an error if a context already exists
# (as it often does in a notebook); SparkContext.getOrCreate() is safer there.
sc = SparkContext()
sqlContext = SQLContext(sc)
spark_df = sqlContext.createDataFrame(df)
spark_df.show()
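If you would rather skip the Pandas round trip and load the CSV straight into Spark, a common pattern is to register the SAS token with the wasbs connector and pass a wasbs:// path to load(). This is only a sketch, assuming the hadoop-azure package is available on your cluster; <container name>, <account name>, <sas token>, and <path to csv> are placeholders, not values from your setup.

# Sketch: direct load through the wasbs connector.
# Assumes hadoop-azure is on the classpath; all angle-bracket values are placeholders.
spark.conf.set(
    "fs.azure.sas.<container name>.<account name>.blob.core.windows.net",
    "<sas token>")
elog = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .load("wasbs://<container name>@<account name>.blob.core.windows.net/<path to csv>")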

Hope it helps.
