Error in reading files in Azure blob storage from laptop spark
Azure Blob storage error can't parse a date in spark
I am trying to read a file stored in Azure Data Lake Gen2 into a Spark DataFrame using Python.
The code is:
from pyspark import SparkConf
from pyspark.sql import SparkSession
# create spark session
key = "some_key"
appName = "DataExtract"
master = "local[*]"
sparkConf = SparkConf() \
    .setAppName(appName) \
    .setMaster(master) \
    .set("fs.azure.account.key.myaccount.dfs.core.windows.net", key)
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
data_csv="abfs://test-file-system@myaccount.dfs.core.windows.net/data.csv"
data_out = "abfs://test-file-system@myaccount.dfs.core.windows.net/data_out.csv"
# read csv
df = spark.read.csv(data_csv)
# write csv
df.write.csv(data_out)
The file is read and written correctly, but the following error is logged:
ERROR AzureBlobFileSystemStore: Failed to parse the date Thu, 09 Sep 2021 10:12:34 GMT
The date appears to be the file's creation date.
How can I parse the date so that the error does not occur?
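For reference, the string in the log message is a standard RFC 1123 style HTTP date, and it parses cleanly with Python's standard library. The sketch below only checks that the value itself is well-formed. It does not reproduce the parser inside the ABFS driver, so it says nothing about why the driver logs the error.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

# Date string copied from the error message above
raw_date = "Thu, 09 Sep 2021 10:12:34 GMT"

# parsedate_to_datetime handles RFC 2822 / HTTP style date strings
parsed = parsedate_to_datetime(raw_date)

# Print the timezone-aware result in ISO 8601 form
print(parsed.isoformat())
```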
I tried to reproduce the same issue and found that these lines caused the error:
data_csv="abfs://test-file-system@myaccount.dfs.core.windows.net/data.csv"
data_out = "abfs://test-file-system@myaccount.dfs.core.windows.net/data_out.csv"
# read csv
df = spark.read.csv(data_csv)
Here is the code that worked for me once I replaced abfs with abfss in the lines above:
from pyspark import SparkConf
from pyspark.sql import SparkSession
# create spark session
key = "<Your Storage Account Key>"
appName = "<Synapse App Name>"
master = "local[*]"
sparkConf = SparkConf() \
    .setAppName(appName) \
    .setMaster(master) \
    .set("fs.azure.account.key.<Storage Account Name>.dfs.core.windows.net", key)
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
data_csv="abfss://<ContainerName>@<Storage Account Name>.dfs.core.windows.net/<Directory>"
# read csv
df1 = spark.read.option('header','true')\
    .option('delimiter', ',')\
    .csv(data_csv + '/sample1.csv')
df1.show()
# write csv (DataFrameWriter.csv returns None, so there is nothing to assign)
df1.write.csv(data_csv + '/<Give the name of blob you want to write to>.csv')
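Since the change from the question's code is the URI scheme (abfs to abfss), existing paths can be converted mechanically. The helper below is a minimal sketch. The name `to_abfss` is hypothetical and not part of PySpark or any Azure library, and it only rewrites the scheme prefix without validating the rest of the URI.

```python
def to_abfss(uri: str) -> str:
    """Return the URI with an abfs:// scheme replaced by abfss://.

    URIs that already use abfss:// (or any other scheme) are returned unchanged.
    """
    insecure_prefix = "abfs://"
    if uri.startswith(insecure_prefix):
        return "abfss://" + uri[len(insecure_prefix):]
    return uri


# Path taken from the question; the account and container names are the asker's placeholders
data_csv = to_abfss("abfs://test-file-system@myaccount.dfs.core.windows.net/data.csv")
print(data_csv)
```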
Alternatively, you can try the code below, which worked well for me:
from pyspark.sql import SparkSession
# reuse the active session (e.g. in a notebook) or create one;
# this assumes access to the storage account is already configured
spark = SparkSession.builder.getOrCreate()
account_name = "<StorageAccount Name>"
container_name = "<Storage Account Container Name>"
relative_path = "<Directory path>"
adls_path = 'abfss://%s@%s.dfs.core.windows.net/%s'%(container_name,account_name,relative_path)
# read csv
dataframe1 = spark.read.option('header','true')\
    .option('delimiter', ',')\
    .csv(adls_path + '/sample1.csv')
dataframe1.show()
# write csv (DataFrameWriter.csv returns None, so there is nothing to assign)
dataframe1.write.csv(adls_path + '/<Give the name of blob you want to write to>.csv')
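The path and the account-key setting both follow a fixed pattern built from the account and container names, so they can be generated in one place. This is a sketch only. `abfss_path` and `account_key_option` are hypothetical helper names, and the sample values are the placeholders from the question rather than real resources.

```python
def abfss_path(container_name: str, account_name: str, relative_path: str = "") -> str:
    """Build an abfss:// URI for a path inside an ADLS Gen2 container."""
    base = f"abfss://{container_name}@{account_name}.dfs.core.windows.net"
    # Drop leading/trailing slashes so the result never contains '//' after the host
    cleaned = relative_path.strip("/")
    return f"{base}/{cleaned}" if cleaned else f"{base}/"


def account_key_option(account_name: str) -> str:
    """Return the configuration key under which the storage account key is set."""
    return f"fs.azure.account.key.{account_name}.dfs.core.windows.net"


# Placeholder names taken from the question
print(abfss_path("test-file-system", "myaccount", "data.csv"))
print(account_key_option("myaccount"))
```

The returned strings can be passed to `spark.read.csv(...)` and `SparkConf.set(...)` in the same way as the hand-written literals above.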