Reading blocks of data from a large text file in Python

Question

Dear all,

I am trying to read a very large text file (>100 GB) containing covariances between different variables. The file is arranged so that the first variable is related to all variables, the second is related to all except the first one (e.g., 14766203 or line[0:19] in the figure), and so on (see 1, 2, 3 in the figure). Here is my sample data:

14766203               -10.254364177  105.401485677     0.0049     0.0119       0.0024       0.0014      88.3946    7.340657124e-06   -7.137818870e-06    1.521836659e-06    
                                                                                                                                       3.367715952e-05   -6.261063214e-06    
                                                                                                                                                          3.105358202e-06    
14766204                                                                                                            6.126218197e-06   -7.264675283e-06    1.508365235e-06    
                                                                                                                   -7.406839249e-06    3.152004956e-05   -6.020433814e-06    
                                                                                                                    1.576663440e-06   -6.131501924e-06    2.813007315e-06    
14766205                                                                                                            4.485532069e-06   -6.601931549e-06    1.508397490e-06    
                                                                                                                   -7.243398379e-06    2.870296214e-05   -5.777139540e-06    
                                                                                                                    1.798277242e-06   -6.343898734e-06    2.291452454e-06    
14766204               -10.254727963  105.401101357     0.0065     0.0147       0.0031       0.0019      87.2542    1.293562659e-05   -1.188084039e-05    1.932569051e-06    
                                                                                                                                       5.177847716e-05   -7.850639841e-06    
                                                                                                                                                          4.963314613e-06    
14766205                                                                                                            6.259830057e-06   -8.072416685e-06    1.785233052e-06    
                                                                                                                   -8.854538457e-06    3.629463550e-05   -6.703120240e-06    
                                                                                                                    2.047196889e-06   -7.229432710e-06    2.917899913e-06    
14766205               -10.254905775  105.400622259     0.0051     0.0149       0.0024       0.0016      88.4723    9.566876325e-06   -1.357014809e-05    2.378290143e-06    
                                                                                                                                       5.210766141e-05   -8.356178456e-06    
                                                                                                                                                          4.016328161e-06

Now I want to extract them as blocks in Python, or at least read one block and then stop reading the file (e.g., 1, 2, 3). I couldn't get it to work, but here is my attempt:

ListOfData = []
isMark = False
with open(inFile, 'r') as f:
    for line in f:
        MarkNumber = None
        # A header line has a mark number in columns 0-19 and a value in columns 23-36.
        if line[0:19].strip() != '' and line[23:36].strip() != '':
            MarkNumber = str(line[0:19].strip())
        if line[0:19].strip() == MarkNumber and len(line[23:36].strip()) != 0:
            isMark = True
        if line[0:19].strip() != MarkNumber and len(line[23:36].strip()) != 0:
            isMark = False
        if isMark == True:
            ListOfData.append(line)

ListOfData ends up containing every line until the end of the file, so it does not really help.

Any help getting this sorted out will be appreciated.

Thanks, Nakhap

Answer 1

Could you use a regex to find blocks that have a second chunk of numbers within, say, 15 characters of the chunk of numbers that begins a line?

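import re

inFile = 'C:/path/myData.txt'
# A line starts a new block when the number at column 0 is followed,
# within 1 to 15 non-word characters, by the start of a second number.
myregex = re.compile(r'(^[.0-9e-]{1,15}[\W]{1,15}[.0-9e-])')
thisBlock = []

# Text mode, so that each line is a str and can be searched with a str pattern.
with open(inFile, 'r') as f:
    for line in f:
        if myregex.search(line):
            print("Previous block finished.  Here it is in a chunk:")
            print(thisBlock)

            print("\n\n\nNew block starting")
            thisBlock = [line]
        else:
            thisBlock.append(line)

# The last block is never followed by a new header line, so print it here.
print(thisBlock)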

The regular expression myregex looks for 1 to 15 ({1,15}) non-word characters ([\W], which includes the whitespace between columns) in between two numeric sequences. The class [.0-9e-] matches the digits 0-9, as well as decimal points, negative signs, and the exponent e. The first {1,15} in the expression assumes that the first numeric expression at the start of the row is at least 1 and at most 15 characters long.
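Since the file is over 100 GB, it may help to wrap the same idea in a generator, so that only one block is held in memory at a time and reading can stop after the first block. The following is a minimal sketch, assuming Python 3 and a file opened in text mode. The names iter_blocks and HEADER and the shortened sample rows are illustrative and do not come from the original post.

```python
import io
import re

# Same pattern as myregex: a number at column 0, then 1 to 15 non-word
# characters, then the first character of a second number.
HEADER = re.compile(r'^[.0-9e-]{1,15}[\W]{1,15}[.0-9e-]')


def iter_blocks(lines):
    """Yield one block (a list of lines) at a time from an iterable of lines."""
    block = []
    for line in lines:
        if HEADER.search(line) and block:
            yield block      # hand over the finished block
            block = []
        block.append(line)
    if block:
        yield block          # the last block has no header line after it


# Hypothetical miniature of the layout in the question (values shortened).
# Header rows put the second number at column 23; the other rows leave a
# gap of more than 15 characters, so the pattern does not match them.
sample = io.StringIO(
    f"{'14766203':<23}-10.25  105.40   7.3e-06  -7.1e-06\n"
    f"{'':<60}3.3e-05\n"
    f"{'14766204':<60}6.1e-06\n"
    f"{'14766204':<23}-10.26  105.41   1.2e-05  -1.1e-05\n"
    f"{'':<60}5.1e-05\n"
)

# Take only the first block. The generator stops reading as soon as it
# has seen the next header line, so the rest of the file is never read.
first_block = next(iter_blocks(sample))
print(len(first_block))
```

With a real file, the same call would look like next(iter_blocks(f)) inside a with open(inFile, 'r') as f: block. Looping over iter_blocks(f) instead processes every block in turn.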

Answer 1 (accepted), 2018-02-12 04:56:57
