Reading blocks of data from a large text file in Python

Question

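Dear all,

I am trying to read a very large text file (> 100 GB) that contains covariances between different variables. It is arranged so that the first variable relates to all variables, the second variable relates to all variables except the first (e.g. 14766203, or line[0:19] in the figure), and so on (see 1, 2, 3 in the figure). Here is my sample data: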

14766203               -10.254364177  105.401485677     0.0049     0.0119       0.0024       0.0014      88.3946    7.340657124e-06   -7.137818870e-06    1.521836659e-06    
                                                                                                                                       3.367715952e-05   -6.261063214e-06    
                                                                                                                                                          3.105358202e-06    
14766204                                                                                                            6.126218197e-06   -7.264675283e-06    1.508365235e-06    
                                                                                                                   -7.406839249e-06    3.152004956e-05   -6.020433814e-06    
                                                                                                                    1.576663440e-06   -6.131501924e-06    2.813007315e-06    
14766205                                                                                                            4.485532069e-06   -6.601931549e-06    1.508397490e-06    
                                                                                                                   -7.243398379e-06    2.870296214e-05   -5.777139540e-06    
                                                                                                                    1.798277242e-06   -6.343898734e-06    2.291452454e-06    
14766204               -10.254727963  105.401101357     0.0065     0.0147       0.0031       0.0019      87.2542    1.293562659e-05   -1.188084039e-05    1.932569051e-06    
                                                                                                                                       5.177847716e-05   -7.850639841e-06    
                                                                                                                                                          4.963314613e-06    
14766205                                                                                                            6.259830057e-06   -8.072416685e-06    1.785233052e-06    
                                                                                                                   -8.854538457e-06    3.629463550e-05   -6.703120240e-06    
                                                                                                                    2.047196889e-06   -7.229432710e-06    2.917899913e-06    
14766205               -10.254905775  105.400622259     0.0051     0.0149       0.0024       0.0016      88.4723    9.566876325e-06   -1.357014809e-05    2.378290143e-06    
                                                                                                                                       5.210766141e-05   -8.356178456e-06    
                                                                                                                                                          4.016328161e-06

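Now, I would like to extract these as blocks in Python, or at least read one block and then stop reading the file (e.g. 1, 2, 3). I have not succeeded, but here is my attempt: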

with open(inFile, 'rb') as f:
    ListOfData = []
    for line in f:
        MarkNumber = None
        if line[0:19].strip() != '' and line[23:36].strip() != '':
            MarkNumber = str(line[0:19].strip())
        if line[0:19].strip() == MarkNumber and len(line[23:36].strip()) != 0:
            isMark = True
        if line[0:19].strip() != MarkNumber and len(line[23:36].strip()) != 0:
            isMark = False
        if isMark == True:
            ListOfData.append(line)

ListOfData ends up containing every line up to the end of the file, so it does not really help.

Any help with this problem would be greatly appreciated.

Thanks, Nakhap

Answer 1

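Could you use a regular expression to find the lines that start a new block, i.e. lines where a second number appears within 15 characters of the number at the start of the line?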

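import re

inFile = 'C:/path/myData.txt'
# A block header starts with a number that is followed, within 15
# non-word characters, by a second number.
myregex = re.compile(r'(^[.0-9e-]{1,15}[\W]{1,15}[.0-9e-])')
thisBlock = []

# Text mode, so each line is a str and can be matched by the str pattern.
with open(inFile, 'r') as f:
    for line in f:
        if myregex.search(line):
            if thisBlock:
                print("Previous block finished.  Here it is in a chunk:")
                print(thisBlock)

            print("\n\n\nNew block starting")
            thisBlock = [line]
        else:
            thisBlock.append(line)

# The last block has no header line after it, so print it here.
if thisBlock:
    print("Final block:")
    print(thisBlock)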

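The regular expression myregex looks for 1 to 15 ({1,15}) non-word characters ([\W], which includes whitespace) between two numeric sequences [.0-9e-]; that character class matches the digits 0-9 as well as the decimal point, the minus sign, and the exponent e. The first {1,15} in the expression assumes that the number at the start of the line is at least 1 and at most 15 characters long.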
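Since the question also asks about reading just one block and then stopping, the same idea can be wrapped in a generator. This is only a sketch, assuming the header pattern above fits the whole file; the names HEADER, iter_blocks and first_blocks are made up for illustration and are not part of any library. Because the file object is consumed lazily, only the current block is held in memory.

```python
import re
from itertools import islice

# Same pattern as above: a number at the start of the line, followed
# within 15 non-word characters by a second number.
HEADER = re.compile(r'^[.0-9e-]{1,15}\W{1,15}[.0-9e-]')


def iter_blocks(lines):
    """Yield one block (a list of lines) at a time from an iterable of lines."""
    block = []
    for line in lines:
        # A header line closes the block collected so far, if there is one.
        if HEADER.search(line) and block:
            yield block
            block = []
        block.append(line)
    # The last block has no header after it, so emit it at the end.
    if block:
        yield block


def first_blocks(path, n):
    """Return the first n blocks of the file at path (hypothetical helper)."""
    with open(path, 'r') as f:
        # islice stops pulling lines from the file once n blocks are complete.
        return list(islice(iter_blocks(f), n))
```

With this, first_blocks(inFile, 3) would return the blocks marked 1, 2, 3 without reading the rest of the file, apart from the header line of the fourth block that signals the third one is complete.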

Answer 1: 0, accepted, 2018-02-12 04:56:57
