Spark.read() multiple paths at once instead of one-by-one in a for loop
I am running the following code.

list_of_paths is a list of paths, each ending in an .avro file. For example,

['folder_1/folder_2/0/2020/05/15/10/41/08.avro', 'folder_1/folder_2/0/2020/05/15/11/41/08.avro', 'folder_1/folder_2/0/2020/05/15/12/41/08.avro']

Note: the above paths are stored in Azure Data Lake Storage, and the process below is executed in Databricks.
spark.conf.set("fs.azure.account.key.{0}.dfs.core.windows.net".format(storage_account_name), storage_account_key)
spark.conf.set("spark.sql.execution.arrow.enabled", "false")

begin_time = time.time()

for i in range(len(list_of_paths)):
    try:
        read_avro_data, avro_decoded = None, None
        # Read paths from Azure Data Lake over "abfss"
        read_avro_data = spark.read.format("avro").load("abfss://{0}@{1}.dfs.core.windows.net/{2}".format(storage_container_name, storage_account_name, list_of_paths[i]))
    except Exception as e:
        custom_log(e)
Schema
read_avro_data.printSchema()
root
|-- SequenceNumber: long (nullable = true)
|-- Offset: string (nullable = true)
|-- EnqueuedTimeUtc: string (nullable = true)
|-- SystemProperties: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- member0: long (nullable = true)
| | |-- member1: double (nullable = true)
| | |-- member2: string (nullable = true)
| | |-- member3: binary (nullable = true)
|-- Properties: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- member0: long (nullable = true)
| | |-- member1: double (nullable = true)
| | |-- member2: string (nullable = true)
| | |-- member3: binary (nullable = true)
|-- Body: binary (nullable = true)
# this is the content of the AVRO file.
Number of rows and columns
print("ROWS: ", read_avro_data.count(), ", NUMBER OF COLUMNS: ", len(read_avro_data.columns))
ROWS: 2 , NUMBER OF COLUMNS: 6
What I want is not to read 1 AVRO file per iteration (i.e., 2 rows of content per iteration). Instead, I want to read all the AVRO files at once, so 2x3 = 6 rows of content in my final Spark DataFrame.

Is this feasible with spark.read()? Something like the following:
spark.read.format("avro").load("abfss://{0}@{1}.dfs.core.windows.net/folder_1/folder_2/0/2020/05/15/*")
[Update] Sorry for the misunderstanding about the wildcard (*): it implies that all AVRO files are in the same folder. But instead I have 1 folder per AVRO file, so 3 AVRO files, 3 folders. In this case that wildcard won't work. The solution, as answered below, is to pass a list [] of path names.
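(As a side note: Hadoop-style globbing, which load() understands, allows one wildcard per directory level, so a pattern like folder_1/folder_2/0/2020/05/15/*/*/*.avro could in principle still cover the hour/minute/second sub-folders above. A rough local illustration of the pattern shape, using Python's fnmatch only as a stand-in; note fnmatch's "*" also crosses "/", while Hadoop's does not, so this is just a sketch:)

```python
import fnmatch

# One "*" per remaining directory level (hour/minute) plus one for the
# file name; the three example paths from the question all fit this shape.
pattern = "folder_1/folder_2/0/2020/05/15/*/*/*.avro"

paths = [
    "folder_1/folder_2/0/2020/05/15/10/41/08.avro",
    "folder_1/folder_2/0/2020/05/15/11/41/08.avro",
    "folder_1/folder_2/0/2020/05/15/12/41/08.avro",
]

matches = [p for p in paths if fnmatch.fnmatch(p, pattern)]
print(matches)  # all three example paths match the pattern
```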
Thank you in advance for your help and advice.
load(path=None, format=None, schema=None, **options)

This method accepts either a single path or a list of paths. For example, you can directly pass a list of paths like below:
spark.read.format("avro").load(["/tmp/dataa/userdata1.avro","/tmp/dataa/userdata2.avro"]).count()
1998
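Applied to the question's setup, this means building the full "abfss://" URIs from list_of_paths and handing the whole list to a single load() call. A minimal sketch (storage_container_name and storage_account_name are placeholders standing in for the values already configured in the question; the spark.read line assumes a live SparkSession, so it is shown as a comment):

```python
# Placeholders for the account/container values used in the question.
storage_container_name = "my-container"
storage_account_name = "myaccount"

list_of_paths = [
    "folder_1/folder_2/0/2020/05/15/10/41/08.avro",
    "folder_1/folder_2/0/2020/05/15/11/41/08.avro",
    "folder_1/folder_2/0/2020/05/15/12/41/08.avro",
]

# Build one fully qualified URI per relative path.
full_paths = [
    "abfss://{0}@{1}.dfs.core.windows.net/{2}".format(
        storage_container_name, storage_account_name, p
    )
    for p in list_of_paths
]

# With a live SparkSession, the whole list is read into one DataFrame:
# read_avro_data = spark.read.format("avro").load(full_paths)
# read_avro_data.count()  # 2 rows per file x 3 files = 6
```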