
Using my own python module in Azure Pyspark, the module reads and prepares the data

My Jupyter notebook reads a module I created. In Jupyter I add:

sc.addPyFile('wasb:///HdiNotebooks/PySpark/project/read_test_data.py')

and it loads the module OK.

However, my .py file opens the data with:

data_file= open('wasb:///example/data/fruits.txt', 'rU') 
to prepare it and do different calculations.

However, I get the following error:

Traceback (most recent call last):
FileNotFoundError: [Errno 2] No such file or directory: 'wasb:///example/data/fruits.txt'

If I try to create a DataFrame from the same data in Jupyter, I run:

df = sqlContext.read.csv('wasb:///example/data/fruits.txt', header='true', inferSchema='true')

And I don't get any errors. What am I doing wrong?

Python's built-in open does not support the wasb protocol used by the HDInsight distributed file system based on Azure Blob Storage; open only works with local file paths.

If you want to read the file directly on HDInsight without PySpark, the only way is to use the Azure Storage SDK for Python with the account_name and account_key of the Azure Blob Storage account attached to HDInsight, as the documentation says. Please refer to the official tutorial for Azure Storage in Python.
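For example, here is a minimal sketch using the azure-storage-blob package (v12). The account name, key, container and blob path below are placeholders, and the exact SDK version referenced by the official tutorial may differ:

from azure.storage.blob import BlobServiceClient

# Placeholders: substitute your HDInsight storage account details.
account_name = "<storage_account_name>"
account_key = "<storage_account_key>"
container = "<hdinsight_container>"      # container that backs wasb:///
blob_path = "example/data/fruits.txt"    # path after wasb:///

# Connect to the storage account with the account key.
service = BlobServiceClient(
    account_url="https://{}.blob.core.windows.net".format(account_name),
    credential=account_key,
)

# Download the blob as text instead of calling open() on a wasb:// URL.
blob_client = service.get_blob_client(container=container, blob=blob_path)
data = blob_client.download_blob().readall().decode("utf-8")

for line in data.splitlines():
    print(line)

The module can then prepare the returned text just as it would the contents of a locally opened file.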

Hope it helps.
