
Counting lines in Azure Data Lake

I have some files in Azure Data Lake and I need to count how many lines they have to make sure they are complete. What would be the best way to do it?

I am using Python:

from azure.datalake.store import core, lib

adl_creds = lib.auth(tenant_id='fake_value', client_secret='fake_another value', client_id='fake key', resource='https://my_web.azure.net/')
adl = core.AzureDLFileSystem(adl_creds, store_name='fake account')

file_path_in_azure = "my/path/to/file.txt"
counter = 0
if adl.exists(file_path_in_azure):
    # blocksize in bytes: 1 MB = 1048576, 5 MB = 5242880, 100 MB = 104857600, 500 MB = 524288000
    with adl.open(file_path_in_azure, mode="rb", blocksize=5242880) as f:
        # I also tried counter = sum(1 for line in f), but memory usage grew,
        # so I count line by line instead
        for line in f:
            counter += 1

print(counter)

This works, but it takes hours for files that are 1 or 2 gigabytes. Shouldn't this be faster? Might there be a better way?

Do you need to count lines? Maybe it is enough to get the size of the file? AzureDLFileSystem.stat will give you the file size, and if you know the average line length you can calculate the expected line count.
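For example, a size-based estimate might look like the sketch below. This is an illustration, not code from the answer: it assumes adl.info(path) (the stat mentioned above) returns metadata with a 'length' field in bytes, and AVG_LINE_BYTES is a placeholder for whatever average line length your data actually has.

AVG_LINE_BYTES = 120  # hypothetical average line length for your files
meta = adl.info(file_path_in_azure)  # file metadata; 'length' is the size in bytes
estimated_lines = meta['length'] // AVG_LINE_BYTES
print("expected roughly", estimated_lines, "lines")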

You could try:

counter = 0
for file in adl.walk('path/to/folder'):
    # adl.walk lists every file under the folder; adl.cat reads each one fully into memory
    counter += len(adl.cat(file).decode().split('\n'))

I'm not sure if this is actually faster, but it uses the Unix-style built-ins to get the file output, which might be quicker than explicit I/O.

EDIT: The one pitfall of this method is when the file size exceeds the RAM of the device you run this on, since cat loads the entire contents into memory.
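One way around that pitfall (not from the original answer) is to stream the file in fixed-size blocks and count newline bytes, so memory use stays bounded no matter how large the file is. A minimal sketch, assuming every line ends with b"\n" and adl is an authenticated AzureDLFileSystem:

def count_lines_chunked(adl, path, chunk_size=4 * 1024 * 1024):
    # Read 4 MB at a time and count newline bytes; only one chunk is ever in memory.
    count = 0
    with adl.open(path, mode="rb", blocksize=chunk_size) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count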

The only faster way I found was to actually download the file locally, to where the script is running, with

 adl.get(remote_file, local_file)

and then count line by line without loading the whole file into memory. Downloading 500 MB takes around 30 seconds, and reading a million lines takes around 4 seconds =)
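Put together, the download-then-count approach might look like the following sketch; "local_copy.txt" is a placeholder path, and adl.get is the download counterpart of put in azure-datalake-store:

local_file = "local_copy.txt"  # placeholder local path
adl.get(file_path_in_azure, local_file)  # stream the remote file to local disk

counter = 0
with open(local_file, "rb") as f:
    for _ in f:  # buffered line-by-line iteration; the whole file is never held in memory
        counter += 1
print(counter)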
