Using my own Python module in Azure PySpark; the module reads and prepares the data
My Jupyter notebook loads a module I created. In Jupyter I add
sc.addPyFile('wasb:///HdiNotebooks/PySpark/project/read_test_data.py')
and the module loads fine.
However, my .py file opens the data like this:
data_file = open('wasb:///example/data/fruits.txt', 'rU')
in order to prepare it and run various calculations.
However, I get the following error:
Traceback (most recent call last):
FileNotFoundError: [Errno 2] No such file or directory: 'wasb:///example/data/fruits.txt'
If I try to create a DataFrame from the same data in Jupyter, I run
df = sqlContext.read.csv('wasb:///example/data/fruits.txt', header='true', inferSchema='true')
and I don't get any errors. What am I doing wrong?
Python's built-in open does not support the wasb protocol used by the HDInsight DFS based on Azure Blob Storage; it only accepts local filesystem paths.
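To make the failure mode concrete: open treats its argument as a literal local path, so the wasb scheme is never resolved. A minimal sketch (reusing the path from the question) that detects the remote-storage scheme up front:

```python
import os
from urllib.parse import urlparse

path = 'wasb:///example/data/fruits.txt'

# open() would look for a local file literally named 'wasb:///example/...',
# which does not exist on the node's local filesystem.
parsed = urlparse(path)
print(parsed.scheme)          # 'wasb' -- a remote-storage scheme, not a local path
print(os.path.exists(path))   # False on any normal filesystem
```

This is why sqlContext.read.csv succeeds on the same URI: Spark's Hadoop input layer understands the wasb scheme, while plain Python I/O does not.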
If you want to read the file directly on HDInsight without PySpark, the only way is to use the Azure Storage SDK for Python with the account_name and account_key of the Azure Blob Storage account backing the HDInsight cluster, as the documentation describes; please refer to the official tutorial for Azure Storage in Python.
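A minimal sketch of that approach, assuming the legacy azure-storage SDK's BlockBlobService (the wasb_to_blob helper, the default container name, and the placeholder credentials below are illustrative assumptions, not part of the original answer):

```python
from urllib.parse import urlparse

def wasb_to_blob(uri, default_container):
    """Split a wasb URI into (container, blob path).

    With the short form wasb:///... the container is not in the URI,
    so the cluster's default container must be supplied by the caller.
    """
    parsed = urlparse(uri)
    # When present, netloc looks like 'container@account.blob.core.windows.net'
    container = parsed.netloc.split('@')[0] if parsed.netloc else default_container
    return container, parsed.path.lstrip('/')

def read_blob_text(account_name, account_key, uri, default_container):
    # Assumes the legacy azure-storage blob SDK (BlockBlobService);
    # newer azure-storage-blob releases use BlobServiceClient instead.
    from azure.storage.blob import BlockBlobService
    container, blob = wasb_to_blob(uri, default_container)
    service = BlockBlobService(account_name=account_name, account_key=account_key)
    return service.get_blob_to_text(container, blob).content

# Usage (account name, key, and container are placeholders):
# text = read_blob_text('myaccount', '<account-key>',
#                       'wasb:///example/data/fruits.txt', 'mycontainer')
```

The module could then parse the returned text in memory instead of calling open on a wasb URI.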
Hope it helps.