
How do I make a progress bar for loading pandas DataFrame from a large xlsx file?

From https://pypi.org/project/tqdm/:

import pandas as pd
import numpy as np
from tqdm import tqdm

df = pd.DataFrame(np.random.randint(0, 100, (100000, 6)))
tqdm.pandas(desc="my bar!")
df.progress_apply(lambda x: x**2)

I took this code and edited it so that I create a DataFrame with pd.read_excel rather than using random numbers:

import pandas as pd
from tqdm import tqdm
import numpy as np

df = pd.DataFrame(pd.read_excel(filename))
df.progress_apply(lambda x: x**2)

This gave me an error, so I changed df.progress_apply to this:

df.progress_apply(lambda x: x)

Here is the final code:

import pandas as pd
from tqdm import tqdm
import numpy as np

df = pd.DataFrame(pd.read_excel(filename))
df.progress_apply(lambda x: x)

This results in a progress bar, but it doesn't actually show any progress. Instead, it displays the bar and, once the operation is done, jumps to 100%, which defeats the purpose.

My question is this: How do I make this progress bar work?
What does the function inside of progress_apply actually do?
Is there a better approach? Maybe an alternative to tqdm?

Any help is greatly appreciated.

This will not work. pd.read_excel blocks until the file is read, and there is no way to get information from this function about its progress during execution.

It would work for read operations that you can do chunk-wise, like

chunks = []
for chunk in pd.read_csv(..., chunksize=1000):
    chunks.append(chunk)  # progress can be reported here, once per chunk
df = pd.concat(chunks)

But as far as I understand, tqdm also needs the number of chunks in advance, so for a proper progress report you would need to read the full file first.
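For the xlsx case itself, one possible workaround is to bypass pd.read_excel and stream the rows yourself. The following is only a minimal sketch, assuming openpyxl is installed and that the first row of the sheet holds the column names; read_xlsx_with_progress is a hypothetical helper name, not part of pandas or tqdm.

```python
import pandas as pd
from openpyxl import load_workbook
from tqdm import tqdm


def read_xlsx_with_progress(path, sheet_name=None):
    """Load one worksheet into a DataFrame, advancing tqdm once per row.

    Hypothetical helper: assumes the first row contains the column names.
    """
    # read_only=True streams rows instead of building the whole sheet in memory.
    wb = load_workbook(path, read_only=True, data_only=True)
    try:
        ws = wb[sheet_name] if sheet_name else wb.worksheets[0]
        # max_row is taken from the dimensions stored in the file and may be
        # None in read-only mode; tqdm then shows a counter, not a percentage.
        rows = list(
            tqdm(ws.iter_rows(values_only=True), total=ws.max_row, desc="Loading rows")
        )
    finally:
        wb.close()
    header, data = rows[0], rows[1:]
    return pd.DataFrame(data, columns=list(header))
```

Usage would look like df = read_xlsx_with_progress('sample.xlsx'). Note that this does not reproduce pd.read_excel's type inference or its many options; it only trades them for a bar that moves while the file is being read.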

The following is a one-liner solution utilizing tqdm:

import pandas as pd
from tqdm import tqdm

df = pd.concat([chunk for chunk in tqdm(pd.read_csv(file_name, chunksize=1000), desc='Loading data')])

If you know the total number of lines to be loaded, you can pass that information to tqdm through its total parameter, which yields a percentage output. Because tqdm advances once per chunk here, total should be the number of chunks (the line count divided by chunksize, rounded up).
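A minimal sketch of the total idea, assuming the CSV has a single header line and no line breaks inside quoted fields (so that counting lines gives the row count). The helper name read_csv_with_progress is hypothetical.

```python
import math

import pandas as pd
from tqdm import tqdm


def read_csv_with_progress(path, chunksize=1000):
    """Read a CSV in chunks while showing a percentage progress bar."""
    # First pass: count the data rows. This is far cheaper than parsing,
    # but it assumes one header line and no embedded newlines in fields.
    with open(path, "rb") as f:
        n_rows = sum(1 for _ in f) - 1
    n_chunks = math.ceil(n_rows / chunksize)

    # Second pass: tqdm advances once per chunk, so total is the chunk count.
    reader = pd.read_csv(path, chunksize=chunksize)
    chunks = list(tqdm(reader, total=n_chunks, desc="Loading data"))
    return pd.concat(chunks, ignore_index=True)
```

The file is read twice, but the first pass only scans for line endings, so for large files the extra cost is usually small compared with parsing.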

DISCLAIMER: This works only with the xlrd engine and is not thoroughly tested!

How does it work? We monkey-patch the xlrd.xlsx.X12Sheet.own_process_stream method, which is responsible for loading sheets from a file-like stream. We supply our own stream, which contains our progress bar. Each sheet has its own progress bar.

When we want the progress bar, we use the load_with_progressbar() context manager and then call pd.read_excel('<FILE.xlsx>').

import xlrd
from tqdm import tqdm
from io import RawIOBase
from contextlib import contextmanager

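class progress_reader(RawIOBase):
    """Wraps the sheet stream and reports every read to the progress bar."""

    def __init__(self, zf, bar):
        self.bar = bar
        self.zf = zf

    def readinto(self, b):
        n = self.zf.readinto(b)
        self.bar.update(n=n)  # advance the bar by the number of bytes read
        return n


@contextmanager
def load_with_progressbar():

    def my_get_sheet(self, zf, *other, **kwargs):
        with tqdm(total=zf._orig_file_size) as bar:
            sheet = _tmp(self, progress_reader(zf, bar), **kwargs)
        return sheet

    # Keep a reference to the original method so it can be restored.
    _tmp = xlrd.xlsx.X12Sheet.own_process_stream

    try:
        xlrd.xlsx.X12Sheet.own_process_stream = my_get_sheet
        yield
    finally:
        # Always undo the monkey patch, even if reading fails.
        xlrd.xlsx.X12Sheet.own_process_stream = _tmp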

import pandas as pd

with load_with_progressbar():
    df = pd.read_excel('sample2.xlsx')


[Screenshot of the progress bar]


This might help people with a similar problem.

For example:

import pandas as pd
from tqdm import tqdm

sheets = {}
for name in tqdm(["Sheet1", "Sheet2"], ncols=100, desc="Loading data.."):  # one step per sheet
    sheets[name] = pd.read_excel("some_file.xlsx", sheet_name=name, header=None)
LC_data, FC_data = sheets["Sheet1"], sheets["Sheet2"]
print("------Loading is completed ------")

The following is based on user rocksportrocker's excellent answer.

  • I am a Python beginner!
  • Below, please find my first version based on user rocksportrocker's recommendation.

import pandas as pd

print("Info: Loading starting.")

# https://stackoverflow.com/questions/52209290
temp = []
myCounter = 1
myChunksize = 10000
# https://stackoverflow.com/questions/24251219/
for myChunk in pd.read_csv('YourFileName.csv', chunksize=myChunksize, low_memory=False):
    temp.append(myChunk)  # keep each chunk so it can be concatenated below
    # Upper bound: the last chunk may hold fewer than myChunksize rows.
    print('# of rows processed: ', myCounter * myChunksize)
    myCounter = myCounter + 1
print("Info: Loading complete.")

# https://stackoverflow.com/questions/33642951
df = pd.concat(temp, ignore_index=True)


