简体   繁体   English

有效地将特定行复制到另一个文件(Python 3.3)

[英]Efficiently copying specific lines to another file (Python 3.3)

How to copy all even lines from one file to a new file in Python? 如何将所有偶数行从一个文件复制到Python中的新文件?

The even number is just an illustration when I want a very select though substantive number of lines copied from one file to another, but it should be good as an example. 偶数只是一个例子,当我想要从一个文件复制到另一个文件的非常精选的实际行数时,但它应该是一个很好的例子。

I use this, but it is very inefficient (it takes around 5 minutes): 我使用它,但效率非常低(大约需要5分钟):

# foo.txt holds 200,000 lines with 300 values
list = [0, 2, 4, 6, 8, 10..... 199996, 199998]
newfile = open(savefile, "w")
with open("foo.txt", "r") as file:
    for i, line in enumerate(file):
        if i in list:
            newfile.write(line)
newfile.close()

I would also appreciate it if there's an explanation why this is so slow: reading line by line goes quickly (around 15 seconds), and is also advised by the manual. 我也很感激,如果有一个解释为什么这么慢:逐行读取(大约15秒),并且手册也建议。

EDIT: My apologies; 编辑:我的道歉; I am not looking for specific odd/even examples; 我不是在寻找具体的奇数/偶数例子; it is merely for the effect of how to deal with around 100k out of 200k values in no easy order. 它仅仅是为了解决如何处理200k值中的大约100k而不是简单的顺序。 Is there no general solution to the I/O problem here other than finding more efficient ways to deal with odd/even? 除了找到更有效的方法来处理奇数/偶数之外,这里的I / O问题没有通用解决方案吗? Again apologies for bringing it up. 再次为此提出道歉。

What's taking all the time is searching list . 一直在寻找的是搜索list In order to figure out whether i is in list , it has to scan through the entire list to be sure that it's not there. 为了弄清楚i是否在list ,它必须扫描整个列表以确保它不存在。 If you really only care about even numbers, you can simply use if i % 2 == 0 , but if you have a specific group of line numbers you want, you should use a set , which has O(1) membership testing, eg 如果你真的只关心偶数,你可以简单地使用if i % 2 == 0 ,但是如果你有一个特定的行号,你应该使用一个具有O(1)成员资格测试的set ,例如

keep = {1, 5, 888, 20203} 

and then 接着

if i in keep:

You're spending tons of time creating and then repeatedly searching (on every line!!!) that monstrous list . 你需要花费大量的时间来创建,然后反复搜索(在每一行!!!)这个可怕的list Just read the first file line by line and skip every other. 只需逐行读取第一个文件并跳过其他文件。 You can either do this with a toggling flag, or just check if the line number is divisible by two (clearer, in my opinion). 您可以使用切换标记执行此操作,或者只检查行号是否可被2整除(在我看来更清晰)。

for i, line in enumerate(file):
    if i % 2 == 0:
        newfile.write(line)

EDIT in response to your edit: your question is now "how to copy arbitrary lines from a file?" 编辑以响应您的编辑:您的问题现在是“如何从文件中复制任意行?” That depends an awful lot on how those arbitrary lines are defined. 这很大程度上取决于如何定义这些任意行。 The answer still is definitely not to use a list of "wanted" line numbers, because searching that list takes a long time, and you'll have to search it on every line. 答案仍然绝对不是使用“通缉”行号列表,因为搜索该列表需要很长时间,而且你必须在每一行搜索它。

If the goal is essentially to be able to pick random lines from the file, you could use something similar to your current setup, but using set instead of list to make your lookup fast. 如果目标本质上是能够从文件中选择随机行,则可以使用类似于当前设置的内容,但使用set而不是list来快速查找。 A general-case proof-of-concept solution might look like this: 一般情况下的概念验证解决方案可能如下所示:

import random

# Pick 5000 random lines
wanted_lines = set(random.sample(range(200000), 5000)) # Use a set!
for i, line in enumerate(file):
    if i in wanted_lines: # average-case O(1)
        newfile.write(str(line)+'\n')

I'm assuming that your list is predefined, and can contain any sequence of possible line indices, not necessarily every Nth line for example. 我假设您的list是预定义的,并且可以包含任何可能的行索引序列,例如,不一定每个第N行。

The first probable bottleneck is that you're doing a O(n) list search ( i in list ) 200000 times. 第一个可能的瓶颈是你正在进行O(n)列表搜索( i in list )200000次。 Converting the list to a dictionary should already help: 将列表转换为字典应该已经有所帮助:

listd = dict.fromkeys(list)
.
.
   # this is O(1) instead of O(n)
   if i in listd:

Alternatively, if you know that list is sorted, or you can sort it, simply keep track of the next line index: 或者,如果您知道list已排序,或者您可以对其进行排序,只需跟踪下一行索引:

list = [0, 2, 4, 6, 8, 10..... 199996, 199998]
nextidx = 0
newfile = open(savefile, "w")
with open("foo.txt", "r") as file:
    for i, line in enumerate(file):
        if i == list[nextidx]:
            newfile.write(line)
            nextidx += 1
newfile.close()

something like this? 这样的事情?

flag = False
with open("test_async_db_access.py", "r") as file:
    for line in file:
        if flag:
            print line
        flag = not flag

This avoids having to use the large list 这避免了必须使用大型列表

Edit: If it is an arbitrary list of lines you want then use a map {} like DSM's answer this will perform the 'in' in O(1) time instead of O(n). 编辑:如果它是您想要的任意行列表,那么使用地图{},如DSM的答案,这将在O(1)时间内执行'in'而不是O(n)。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM