
Python: Reading Specific Sections of Huge Text File (Possibly with Itertools)

In short, I'm trying to "extract" certain lines (strings) from a text file. But there's more.

I have a rather large text file (100,000 lines, 60 MB). Some chunks of data are important, and others are not. There are several hundred of these chunks. There is no pattern, and where one chunk stops, the next one does not necessarily begin.

I have already parsed the file to determine which lines are of interest to me. Right now, I have a dictionary which contains "start" line numbers as keys, and the desired number of consecutive lines afterwards as values. Here:

paired_points =
{51: 7,
 69: 67,
...
 870623: 1730,
 872364: 1801}



len(paired_points) = 
783

I can convert this to explicit "start" and "stop" integers instead (e.g., 51 -> 58, 69 -> 136, etc.), but that still doesn't help me.
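The start/stop conversion mentioned above can be written as a dictionary comprehension. A small sketch using only the first two entries shown in the post; the name `ranges` is illustrative and not from the original code.

```python
# First two entries of paired_points from the post: {start_line: line_count}
paired_points = {51: 7, 69: 67}

# Map each start line to its (exclusive) stop line: start + count
ranges = {start: start + count for start, count in paired_points.items()}

print(ranges)  # {51: 58, 69: 136}
```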

I'm trying to use islice from itertools, but it's returning a list of islice objects.

from itertools import islice

file = r'575852.roi'

a = list()

for key in paired_points:
    with open(file) as f:
        # islice is lazy: this appends an islice object, not the lines themselves
        a.append(islice(f, key, key + int(paired_points[key])))  # Start and stop lines

This works in concept, but I need to convert islice objects to strings. I mean, I'm looking for a list of lines (strings) from the text file.
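One way to get actual strings is to consume each islice with list() while the file is still open. The following is a minimal sketch, not the approach the post ends up using. It assumes that the keys of paired_points are 0-based start line indices, that the values are line counts, and that the chunks do not overlap. The function name `extract_lines` is hypothetical. It reads the file once instead of reopening it for every key.

```python
from itertools import islice


def extract_lines(path, paired_points):
    """Return the requested lines as a flat list of strings, reading the file once.

    Assumes keys are 0-based start line indices, values are line counts,
    and the chunks do not overlap.
    """
    lines = []
    position = 0  # index of the next line the file iterator will yield
    with open(path) as f:
        for start in sorted(paired_points):
            count = int(paired_points[start])
            # Skip ahead to `start`, then take `count` lines. list() forces
            # the lazy islice to produce its lines while the file is open.
            skip = start - position
            lines.extend(list(islice(f, skip, skip + count)))
            position = start + count
    return lines
```

Because the file iterator only moves forward, the keys are visited in sorted order and each islice is offset by the number of lines already consumed.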

Any help would be greatly appreciated. Thank you in advance!

SOLUTION

I've solved this myself (converting the lines of interest to strings, then to an array of floats). I actually needed to "sanitize" each line as well, by splitting the text line into three float values (corresponding to (X, Y, Z) coordinates). This is done with the built-in map() function in the last line, after we have built a list of strings.

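import itertools
import numpy as np

# Read the whole file into memory as a list of lines
with open(file, "r") as f:
    a = f.readlines()

# Slice out each chunk of interest: paired_points[key] lines starting at index key
ext_pts = list()
for key in paired_points:
    a1 = a[key : key + paired_points[key]]
    ext_pts.append(a1)

# Flatten the list of chunks into one list of lines, then convert each line
# to three floats; sanitize() is a helper that is not shown in this post
ext_pts2 = list(itertools.chain.from_iterable(ext_pts))
ext_pts2 = np.asarray(list(map(sanitize, ext_pts2)))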

ext_pts2 is now an Nx3 numpy array of (X, Y, Z) points.
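The sanitize helper itself is not shown in the post. Below is a minimal sketch of what such a function might look like, assuming each line holds exactly three numbers separated by whitespace or commas. The actual layout of the .roi file may differ, so both the name `sanitize_line` and the parsing rules are assumptions.

```python
def sanitize_line(line):
    """Split one text line into an (x, y, z) tuple of floats.

    Hypothetical stand-in for the post's sanitize(); assumes exactly three
    numeric fields separated by whitespace and/or commas.
    """
    fields = line.replace(",", " ").split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 values, got {len(fields)}: {line!r}")
    x, y, z = (float(field) for field in fields)
    return (x, y, z)


points = [sanitize_line(s) for s in ["1.0 2.5 -3\n", "4,5,6\n"]]
print(points)  # [(1.0, 2.5, -3.0), (4.0, 5.0, 6.0)]
```

Passing a list of such tuples to np.asarray gives an array with one row per line and three columns.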
