How to keep the line number when using bag.read_text to read multiple files?

I want to read multiple .txt files with bag.read_text and record the file path and the line number of each line for further processing. The read_text function has an argument include_path=True which keeps the path, but how can I get the line number? Does read_text() preserve the line order after reading the files?

import dask.bag as db

def add_line_number(element):
    line, path = element

    # how to get the line number?
    line_index = ...
    return line, path, line_index

b = db.read_text([file1, file2, ...], blocksize='10 MiB', include_path=True)
b = b.map(add_line_number).compute()
# expect b to be: [('line 1', 'file path', ith line in that file), ...]

Edit 2: Thanks for your help. Here are further questions. Does read_text guarantee to preserve the line order within each partition when files_per_partition=1 is set? In the source code, files seem to be read line by line:

with OpenFile(...) as f:
    for line in f:
        yield (line, path) if include_path else line

Will parallelization affect its sequence, or will this job only be executed by one worker?

What you want is not really possible. When dask splits up a large file, it reads from arbitrary offsets (10MB in your example) and assumes that the next newline character marks a new line. Thus, for any given chunk (which gets processed in parallel with the preceding chunks), it cannot know how many lines preceded it.

You can easily enough enumerate the lines of a given chunk (with .map_partitions(), not .map()), and you can find the number of lines in all chunks, but not in a single pass. To find the number of lines per chunk:

b.map_partitions(len).compute()
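
From there, a two-pass approach can attach line numbers despite the chunking. Below is a minimal sketch, assuming the same read_text call as in the question; the file names and the number_lines helper are illustrative, not part of dask's API. The first pass counts the lines in each partition, the second enumerates each partition shifted by the number of lines before it. Note the resulting indices are global across all files in reading order; recovering per-file numbers would additionally require grouping the counts by path.

import dask
import dask.bag as db

b = db.read_text(['file1.txt', 'file2.txt'], blocksize='10 MiB',
                 include_path=True)

# Pass 1: count the lines in every partition.
counts = b.map_partitions(len).compute()

# Cumulative offsets: how many lines precede each partition.
offsets = [0]
for c in counts[:-1]:
    offsets.append(offsets[-1] + c)

# Pass 2: enumerate each partition, shifted by its offset.
def number_lines(partition, offset):
    return [(line, path, offset + i)
            for i, (line, path) in enumerate(partition)]

parts = b.to_delayed()
numbered = dask.compute(*[
    dask.delayed(number_lines)(part, off)
    for part, off in zip(parts, offsets)
])
result = [item for part in numbered for item in part]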

This is probably not what you are looking for, but a non-Python option is to use nl:

cat some_file.txt | nl -s "," -w 1 > new_file.txt

The options are:

  • -w 1 (to avoid blank spaces between the line number and the line content),
  • -s "," (to use a comma as the delimiter).

For example:

seq 3 | nl -s "," -w 1
#1,1
#2,2
#3,3

Later, these modified files could be processed with dask.bag. This is not optimal, since storage requirements increase, but it might be a workable solution if one really needs consecutive line numbers per file.
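
As a rough sketch of that processing step (assuming the "number,content" format produced by nl -s "," -w 1 above; new_file.txt and split_numbered are illustrative names), splitting each line on its first comma recovers the prepended number even when the original content itself contains commas:

import dask.bag as db

def split_numbered(element):
    # nl produced lines of the form "<n>,<content>";
    # partition on the first comma only, so commas inside
    # the original content are left untouched.
    line, path = element
    num, _, content = line.partition(',')
    return content, path, int(num)

b = db.read_text('new_file.txt', include_path=True)
result = b.map(split_numbered).compute()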

It can also be modified to attach sequential numbers across multiple files:

cat *.txt | nl -s "," -w 1 > big_file.txt
