Optimize file and line counting in Python
I have a Python project with many folders, files (.css, .py, .yml, etc.) and lines of code. For this project, I made a tool called "statistics" that gives me information about the entire project, such as:
Global statistics:
Entire project :: 32329 lines
Project main files (.py, .yml) :: 8420 lines
Project without vendor part :: 1070 lines
Core (src directory) :: 394 lines
Core compared to project main files :: 5 %
Kraken Framework (vendor/*.py) :: 7350 lines
Main files Python code :: 93 %
Vendor Python code :: 87 %
Entire project size :: 37M
To get all these numbers, I mainly use two functions:
def count_folder_lines(self, path):
    files = glob.glob(path, recursive=True)
    number = 0
    for file in files:
        num_lines = sum(1 for line in open(file))
        number += num_lines
    return number
and
def count_number_of_files(self, path):
    files = glob.glob(path, recursive=True)
    return len(files)
The first one counts the number of lines in a folder, and the second one counts the number of files matching a pattern (e.g. src/*.py). But getting the project's statistics takes between 4.9 and 5.3 seconds, which is a lot.
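Before optimizing, it can help to see which pattern accounts for most of those seconds. The following is only a minimal sketch: `count_lines` is a standalone version of the line-counting function above, `timed` is a hypothetical helper, and the patterns in the `__main__` block are placeholders rather than the ones the tool really uses.

```python
import glob
import time


def count_lines(pattern):
    """Count lines in every file matching a glob pattern (text mode)."""
    total = 0
    for name in glob.glob(pattern, recursive=True):
        # errors="replace" keeps non-UTF-8 files from raising while decoding
        with open(name, errors="replace") as handle:
            total += sum(1 for _ in handle)
    return total


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call of func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder patterns (assumption); with recursive=True, "**" also
    # matches nested directories.
    for pattern in ("src/**/*.py", "vendor/**/*.py"):
        lines, seconds = timed(count_lines, pattern)
        print(f"{pattern}: {lines} lines in {seconds:.3f}s")
```

Timing each pattern separately shows whether one large directory dominates the total.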
Is there any way to make it faster? Would parallel programming or using Cython change anything?
Have a nice day, and thank you.
I finally found the most efficient solution for me: I'm using the multiprocessing module to count lines in parallel, with one task per file pattern.
import glob
import multiprocessing as mp
from itertools import repeat, takewhile

def count_folder_lines(self, path):
    """
    Use a buffer to count the number of lines of each file matching path.
    :param path: string pattern of a file type
    :return: number of lines in matching files
    """
    files = glob.glob(path, recursive=True)
    number = 0
    for file in files:
        # Read raw 1 MiB chunks and count newline bytes; "with" closes the file
        with open(file, 'rb') as f:
            bufgen = takewhile(lambda x: x,
                               (f.raw.read(1024 * 1024) for _ in repeat(None)))
            number += sum(buf.count(b'\n') for buf in bufgen)
    return number

def count_number_of_files(self, path):
    """
    Count number of files for a string pattern
    :param path: files string pattern
    :return: number of files matching the pattern
    """
    files = glob.glob(path, recursive=True)
    return len(files)

def multiproc(self):
    """
    Multiprocessing to launch several processes to count number of
    lines of each string pattern in self.files
    :return: List of number of lines per string pattern
    (list of int).
    """
    # The context manager shuts the pool down once the results are in
    with mp.Pool() as pool:
        return pool.map(self.count_folder_lines, self.files)
With this solution, counting takes ~1.2 s, versus ~5 s before.
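For anyone who wants to try the buffered counting outside the class, here is a minimal standalone sketch of the same idea. The function names and the `chunk_size` parameter are my own, and the patterns in the `__main__` block are placeholders. It uses `iter()` with a sentinel instead of `takewhile`/`repeat`, which should behave the same way for regular files.

```python
import glob
import multiprocessing as mp
from functools import partial


def count_newlines(path, chunk_size=1024 * 1024):
    """Count newline bytes in one file, reading fixed-size binary chunks."""
    total = 0
    with open(path, "rb") as handle:
        # iter(callable, sentinel) stops once read() returns b"" (end of file)
        for chunk in iter(partial(handle.read, chunk_size), b""):
            total += chunk.count(b"\n")
    return total


def count_pattern_lines(pattern):
    """Sum the newline counts of every file matching a glob pattern."""
    return sum(count_newlines(name)
               for name in glob.glob(pattern, recursive=True))


def count_patterns_in_parallel(patterns):
    """Count each pattern in a separate worker process."""
    with mp.Pool() as pool:
        return pool.map(count_pattern_lines, patterns)


if __name__ == "__main__":
    # Placeholder patterns (assumption); replace with the project's own.
    print(count_patterns_in_parallel(["src/**/*.py", "vendor/**/*.py"]))
```

One difference from the original text-mode loop: counting `b"\n"` does not count a final line that lacks a trailing newline, so totals can be slightly lower than with `sum(1 for line in open(file))`.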
Have a good day!