Splitting a large file into chunks

I have a file with 7946479 records, and I want to read the file line by line and insert the records into a database (SQLite). My first approach was to open the file, read the records line by line, and insert them into the database at the same time; since it deals with a huge amount of data, it takes a very long time. I wanted to change this naive approach, so when I searched the internet I found [python-csv-to-sqlite][1] ( https://stackoverflow.com/questions/5942402/python-csv-to-sqlite ). There the data is in a CSV file, while the file I have is in dat format, but I liked the answer to that problem, so now I am trying to do it like in that solution.

The approach they use is to first split the whole file into chunks and then do the database writes as batched transactions, instead of writing each record one at a time.
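
As a rough sketch of that batched-insert idea (the table name weather and its columns are hypothetical placeholders here, matching the sample columns shown further down), sqlite3's executemany can write one whole chunk per transaction:

import sqlite3

def insert_chunks(db_path, chunks):
    # chunks: an iterable of lists of row tuples, e.g. produced by a chunking generator
    # assumes the (hypothetical) weather table already exists in the database
    conn = sqlite3.connect(db_path)
    for chunk in chunks:
        with conn:  # one transaction per chunk
            conn.executemany(
                "INSERT INTO weather (lat, lon, day, mon, t2m, rh2m, sf, ws) "
                "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                chunk,
            )
    conn.close()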

So I started writing code to split my file into chunks. Here is my code:

file = r'files/jan.dat'
test_file = r'random_test.txt'


def chunks(file_obj, size=10000):
    counter = 0
    file_chunks = []
    temp_chunks = []

    for line in file_obj:
        if line == '\n':
            continue
        if counter != size:
            temp_chunks.append(line)
            counter += 1
        else:
            file_chunks.append(temp_chunks)
            temp_chunks = []
            counter = 0
    file_obj.close()
    if len(temp_chunks) != 0:
        file_chunks.append(temp_chunks)

    yield file_chunks

if __name__ == '__main__':
    split_files = chunks(open(test_file))
    for chunk in split_files:
        print(len(chunk))

The output is 795, but what I wanted is to split the whole file into chunks of size 10000.

I can't figure out what is going wrong here. I can't share my whole file, so for testing you can use this code to generate a file with 7946479 lines:

TEXT = 'Hello world'
FILE_LENGTH = 7946479

counter = 0
with open(r'random_test.txt', 'w') as f:
    for _ in range(FILE_LENGTH):
        f.write(f"{TEXT}\n")

This is how my original file looks (the file format is dat):

lat lon day mon t2m rh2m    sf  ws
5   60  1   1   299.215 94.737  209.706 5.213
5   60.25   1   1   299.25  94.728  208.868 5.137
5   60.5    1   1   299.295 94.695  207.53  5.032
5   60.75   1   1   299.353 94.623  206.18  4.945
5   61  1   1   299.417 94.522  204.907 4.833
5   61.25   1   1   299.447 94.503  204.219 4.757
5   61.5    1   1   299.448 94.525  203.933 4.68
5   61.75   1   1   299.443 94.569  204.487 4.584
5   62  1   1   299.44  94.617  204.067 4.464

An easy way to chunk the file is to use f.read(size) until there is no content left. However, this method works with a number of characters instead of a number of lines.

test_file = 'random_test.txt'


def chunks(file_name, size=10000):
    with open(file_name) as f:
        while content := f.read(size):
            yield content


if __name__ == '__main__':
    split_files = chunks(test_file)
    for chunk in split_files:
        print(len(chunk))

For the last chunk, it will take whatever is left, here 143 characters.


Same function, but with lines:

test_file = "random_test.txt"


def chunks(file_name, size=10000):
    with open(file_name) as f:
        while content := f.readline():
            for _ in range(size - 1):
                content += f.readline()

            yield content.splitlines()


if __name__ == '__main__':
    split_files = chunks(test_file)

    for chunk in split_files:
        print(len(chunk))


For the last chunk, it will take whatever is left, here 6479 lines.

test_file = r'random_test.txt'

def chunks(file_obj, size=10000):
    counter, chunks = 0, []
    for line in file_obj:
        if line == '\n':
            continue
        counter += 1
        chunks.append(line)
        if counter == size:
            yield chunks
            counter, chunks = 0, []
    file_obj.close()
    if counter:
        yield chunks

if __name__ == '__main__':
    split_files = chunks(open(test_file))
    for chunk in split_files:
        print(len(chunk))

This outputs a ton of 10000s and then 6479 at the end. Note that the yield keyword really is more suitable here, but it was absolutely useless in the place where you used it. yield helps to create a lazy iterator: a new chunk is read from the file only when we request it. This way we don't read the full file into memory.
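
For instance, a quick check of that laziness using the generator above (a small sketch; the generator only pulls lines from the file as each chunk is requested):

gen = chunks(open(test_file))
first = next(gen)    # only the first 10000 lines have been read at this point
print(len(first))    # 10000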

Simply read it using pandas.read_csv with the chunksize argument:

import pandas as pd

chunks = pd.read_csv('jan.dat', sep=r'\s+', chunksize=1000)

for chunk in chunks:
    # Process each chunk (a DataFrame) here
    pass

You can also use pandas.DataFrame.to_sql to push it to the database.
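
For example, a minimal sketch of combining the two, assuming a SQLite file named weather.db and a table named weather (both names are made up for the example); to_sql accepts a plain sqlite3 connection:

import sqlite3
import pandas as pd

conn = sqlite3.connect('weather.db')  # hypothetical database file
for chunk in pd.read_csv('jan.dat', sep=r'\s+', chunksize=10000):
    # Append each chunk to the (hypothetical) weather table
    chunk.to_sql('weather', conn, if_exists='append', index=False)
conn.close()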

As a solution to your problem of the task taking too long, I would suggest using multiprocessing instead of chunking the text (as that would take just as long, only in more steps). Using the multiprocessing library allows multiple processing cores to perform the same task in parallel, resulting in a shorter run time. Here is an example.

import multiprocessing as mp

# Step 1: Use multiprocessing.Pool() and specify the number of cores to use (here I use 4).
pool = mp.Pool(4)

# Step 2: Use pool.starmap, which takes multiple iterable arguments
results = pool.starmap(My_Function, [(variable1, variable2, variable3) for i in data])

# Step 3: Don't forget to close
pool.close()
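
As a self-contained illustration of that pattern (the worker function and the toy data below are made up for the example, loosely mirroring the sample rows above), pool.starmap unpacks each tuple into the worker's arguments:

import multiprocessing as mp

def parse_record(line, scale):
    # Hypothetical worker: split one line into floats and scale the last column
    values = [float(x) for x in line.split()]
    values[-1] *= scale
    return values

if __name__ == '__main__':
    # Toy data standing in for lines read from the .dat file
    lines = [
        '5 60 1 1 299.215 94.737 209.706 5.213',
        '5 60.25 1 1 299.25 94.728 208.868 5.137',
    ]
    with mp.Pool(4) as pool:
        results = pool.starmap(parse_record, [(line, 1.0) for line in lines])
    print(results)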
