Optimizing file and line counting in Python

Question

I have a Python project with many folders, files (.css, .py, .yml, etc.) and lines of code. For this project, I made a tool called "statistics" that gives me information about the entire project, such as:

Global statistics:

Entire project :: 32329 lines
Project main files (.py, .yml) :: 8420 lines
Project without vendor part :: 1070 lines
Core (src directory) :: 394 lines
Core compared to project main files :: 5 %
Kraken Framework (vendor/*.py) :: 7350 lines
Main files Python code :: 93 %
Vendor Python code :: 87 %
Entire project size :: 37M

To get all these numbers, I mainly use two functions:

import glob

def count_folder_lines(self, path):
    files = glob.glob(path, recursive=True)
    number = 0
    for file in files:
        # Open in a with-block so each file handle is closed after counting.
        with open(file) as f:
            number += sum(1 for line in f)
    return number

and

def count_number_of_files(self, path):
    files = glob.glob(path, recursive=True)
    return len(files)

The first one is used to count the number of lines in a folder and the second one is used to count the number of specific files (e.g. src/*.py). But getting the project's statistics takes between 4.9 and 5.3 seconds, which is a lot.
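For reference, timings like these can be reproduced with a small harness. This is only a minimal sketch, not the tool itself: count_lines and timed are hypothetical helper names, and "src/**/*.py" is a placeholder pattern rather than one taken from the project.

```python
import glob
import time


def count_lines(pattern):
    """Count lines in all files matching a glob pattern."""
    total = 0
    for name in glob.glob(pattern, recursive=True):
        with open(name, "rb") as handle:
            total += sum(1 for _ in handle)
    return total


def timed(func, *args):
    """Return (result, elapsed_seconds) for a single call of func(*args)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder pattern (assumption); replace it with a real one.
    lines, seconds = timed(count_lines, "src/**/*.py")
    print(f"{lines} lines counted in {seconds:.3f} s")
```

Timing each pattern separately this way would show which part of the project accounts for most of the run time.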

Is there any way to make it faster? Would parallel programming or using Cython make a difference?

Have a nice day, and thank you.

Answer 1

I finally found the most efficient solution for me: I'm using the multiprocessing module to count the lines matching each file pattern in parallel.

import glob
import multiprocessing as mp
from itertools import repeat, takewhile

def count_folder_lines(self, path):
    """
        Use a buffer to count the number of lines of each file matching path.
        :param path: string pattern of a file type
        :return: number of lines in matching files
    """
    files = glob.glob(path, recursive=True)
    number = 0
    for file in files:
        # Read raw 1 MiB chunks until an empty read signals end of file,
        # and close the file once it has been counted.
        with open(file, 'rb') as f:
            bufgen = takewhile(lambda x: x,
                               (f.raw.read(1024 * 1024) for _ in repeat(None)))
            number += sum(buf.count(b'\n') for buf in bufgen)
    return number

def count_number_of_files(self, path):
    """
        Count number of files for a string pattern
        :param path: files string pattern
        :return: number of files matching the pattern
    """
    files = glob.glob(path, recursive=True)
    return len(files)

def multiproc(self):
    """
        Multiprocessing to launch several processes to count number of
        lines of each string pattern in self.files
        :return: List of number of lines per string pattern
                    (list of int).
    """
    # The with-block shuts the pool down once all results are collected.
    with mp.Pool() as pool:
        asyncResult = pool.map_async(self.count_folder_lines, self.files)
        return asyncResult.get()

With this solution, it takes ~1.2s to count versus ~5s before.
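As a self-contained illustration of the same idea, here is a minimal sketch that distributes the work per file rather than per pattern. It assumes standalone functions instead of the methods above; count_newlines, list_files and count_in_parallel are hypothetical names, and the patterns are read from the command line. Note that counting b'\n' bytes does not count a last line that lacks a trailing newline, unlike iterating over the lines of the file.

```python
import glob
import multiprocessing as mp
import os
import sys


def count_newlines(path, chunk_size=1024 * 1024):
    """Count newline bytes in one file by reading fixed-size binary chunks."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:  # an empty bytes object means end of file
                break
            total += chunk.count(b"\n")
    return total


def list_files(patterns):
    """Expand glob patterns into a sorted list of unique regular files."""
    found = set()
    for pattern in patterns:
        for name in glob.glob(pattern, recursive=True):
            if os.path.isfile(name):  # skip directories matched by "**"
                found.add(name)
    return sorted(found)


def count_in_parallel(files, processes=None):
    """Sum per-file newline counts using a pool of worker processes."""
    if not files:
        return 0
    with mp.Pool(processes) as pool:
        return sum(pool.map(count_newlines, files))


if __name__ == "__main__":
    # Hypothetical usage: python count.py "src/**/*.py" "vendor/**/*.py"
    files = list_files(sys.argv[1:])
    print(f"{len(files)} files, {count_in_parallel(files)} lines")
```

The `if __name__ == "__main__":` guard matters on platforms where multiprocessing starts fresh interpreters for its workers, because the module is then imported again in each worker. Whether per-file or per-pattern distribution is faster depends on the project, so it is worth measuring both.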

Have a good day!

Answer 1
0 accepted 2017-12-06 08:15:18
