Counting lines in Azure Data Lake

Question

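I have some files in Azure Data Lake, and I need to count how many lines they contain to make sure they are complete. What is the best way to do this?

I am using Python: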

from azure.datalake.store import core, lib
adl_creds = lib.auth(tenant_id='fake_value', client_secret='fake_another value', client_id='fake key', resource='https://my_web.azure.net/')
adl = core.AzureDLFileSystem(adl_creds, store_name='fake account')

file_path_in_azure = "my/path/to/file.txt"
if adl.exists(file_path_in_azure):
    # Block sizes in bytes: 1 MB = 1048576, 5 MB = 5242880, 100 MB = 104857600, 500 MB = 524288000
    counter = 0
    with adl.open(file_path_in_azure, mode="rb", blocksize=5242880) as f:
        # I tried a comprehension, but memory grew because it builds a list
        # of ones ([1, 1, 1, ...]) and then sums them:
        # counter1 = sum(1 for line in f)
        for line in f:
            counter = counter + 1

    # Print inside the branch so counter is always defined when it is used
    print(counter)

This approach works, but it takes hours for files of 1 or 2 GB. Shouldn't this be faster? Is there a better way?

Answer 1

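Do you actually need to count the lines? Maybe getting the file size is enough? You have AzureDLFileSystem.stat to get the file size, and if you know the average line length, you can calculate the expected number of lines.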
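
The estimate described above can be sketched as follows. This is only an illustration, not part of the original answer: `estimate_line_count` and its parameters are hypothetical names, and the sketch assumes the file size is available as the `length` field of the dictionary returned by `adl.stat(path)`.

```python
def estimate_line_count(size_bytes: int, avg_line_bytes: float) -> int:
    """Estimate the number of lines from a file size and an average line length.

    avg_line_bytes should include the newline terminator of each line.
    """
    if avg_line_bytes <= 0:
        raise ValueError("avg_line_bytes must be positive")
    return round(size_bytes / avg_line_bytes)


# Assumption: in real use the size would come from the data lake, e.g.
#   size_bytes = adl.stat("my/path/to/file.txt")["length"]
# A fixed number is used here so the sketch runs on its own.
estimated = estimate_line_count(size_bytes=1_048_576, avg_line_bytes=64)
print(estimated)
```

The result is only as good as the average line length, so this fits a quick completeness check better than an exact count.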

Answer 2

You can try:

counter = 0
for file in adl.walk('path/to/folder'):
    # splitlines() avoids counting an extra empty line after a trailing newline
    counter += len(adl.cat(file).splitlines())

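I'm not sure whether this is actually faster, but it uses the built-in cat method (named after the Unix command) to fetch the file contents in one call, which may be faster than explicit line-by-line I/O.

Edit: one pitfall of this approach is a file larger than the RAM of the machine you run it on, because cat loads the entire contents into memory.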
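
One way around that memory pitfall is to count newline bytes in fixed-size chunks instead of loading the whole file. The sketch below is an assumption-laden illustration rather than part of the original answer: `count_lines_chunked` is a hypothetical helper, and it assumes the object returned by `adl.open(path, mode="rb")` supports `read(n)` like an ordinary binary file. An in-memory stream stands in for the remote file so the example runs on its own.

```python
import io


def count_lines_chunked(stream, chunk_size=5 * 1024 * 1024):
    """Count lines in a binary stream while holding at most one chunk in memory."""
    count = 0
    last_byte = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        count += chunk.count(b"\n")
        last_byte = chunk[-1:]
    # A final line without a trailing newline still counts as a line.
    if last_byte and last_byte != b"\n":
        count += 1
    return count


# Stand-in for a remote file; with the data lake one would pass the object
# returned by adl.open(path, mode="rb") instead (assumption).
sample = io.BytesIO(b"a\nb\nc")
print(count_lines_chunked(sample, chunk_size=2))
```

Memory use stays bounded by `chunk_size` regardless of the file size, at the cost of one read request per chunk.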

Answer 3

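The only faster way I found is to actually download the file to the machine where the script runs:

adl.get(remote_file, locally)

Then count line by line without loading the whole file into memory. Downloading 500 MB takes about 30 seconds, and reading 1 million lines takes about 4 seconds =)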
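
The local counting step might look like the sketch below. `count_lines_local` is a hypothetical helper name, and a temporary file stands in for the downloaded copy so the example is self-contained. The download itself is left out.

```python
import os
import tempfile


def count_lines_local(path):
    """Count lines by iterating over the file, one line in memory at a time."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


# Stand-in for the downloaded file (assumption: in real use this would be
# the local path the data lake file was saved to).
with tempfile.NamedTemporaryFile(mode="wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"first\nsecond\nthird\n")
    local_path = tmp.name

line_count = count_lines_local(local_path)
os.remove(local_path)
print(line_count)
```

Opening the file in binary mode skips text decoding, which is not needed just to count lines.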
