data extraction - retrieving numerical / tabular / table data from text

I am looking for a generic method to extract table data from text files for further processing. So far I have been trying regular expressions, but it is difficult to create a generic regular expression to match any type of table.

For example, the expression r'\s*([\d.\w]+)[ \h]+([\d.\w]+)[ \h]+([\d.\w]+)[ \h]+([\d.\w]+)[ \h]+([\d.\w]+)[ \h]+([\d.\w]+)[ \h]+([\d.\w]+)[ \h]*' matches lines made of seven repeated fields, so it may work for some seven-column tables but not for tables of any other shape.
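To illustrate the limitation, here is a minimal sketch that builds the same kind of pattern for an arbitrary column count instead of hard-coding seven. The helper name `row_pattern` is hypothetical, and `\S+` with `[ \t]+` stands in for the character classes above. It still cannot tell a data row from any other line that happens to have the same number of whitespace-separated fields.

```python
import re

def row_pattern(n_columns):
    """Compile a pattern matching a line with exactly n_columns whitespace-separated fields."""
    field = r'(\S+)'
    return re.compile(r'^\s*' + r'[ \t]+'.join([field] * n_columns) + r'\s*$')

seven = row_pattern(7)

# A data row from the sample report matches and yields seven groups.
row = "   1   5.626 BB    0.0285   70.98110   33.85870  0.02974"
print(seven.match(row).groups())

# A line with a different field count is rejected.
print(seven.match("Page  1 of 2"))

# The header line also has seven fields, so it matches too:
# counting fields alone does not identify table rows.
header = "Peak RetTime Type  Width     Area      Height     Area  "
print(seven.match(header) is not None)
```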

I would like this to work with any tabular type of data.

For instance, given the following file, how would we extract only the rows of numbers under the header Peak RetTime ... Area:

Data \CH32\1\TA\C1 25-12-01 113.D
Sale ame: 0.e i12ol td dcane
=====================================================================

Inion Dae  12/2/201522:49 AM        1
        3-40\1201150000013.D)

pA
0cdc0ls,c
d0s00soskdckkksdn   d s s s    c d
wec cd  e   ww    ff 44  33
d00239390 v3920 2914

=====================================================================
                         Report                         
=====================================================================

Peak RetTime Type  Width     Area      Height     Area  
  #   [min]        [min]   [pA*s]      [pA]         %
----|-------|----|-------|----------|----------|--------|
   1   5.626 BB    0.0285   70.98110   33.85870  0.02974
   2   7.668 BV    0.0197    1.27084    1.05425  0.00053
   3   7.705 VB    0.0440  991.41168  295.00864  0.41536
   4  15.050 BB    0.0717   27.99529    5.86073  0.01173
   5  22.741 BB    0.0549   28.72847    7.52583  0.01204
   6  27.772 BB    0.0857 6380.34424 1010.32770  2.67309
   7  32.625 BB    0.0622   53.88815   13.59589  0.02258
   8  33.983 BB    0.0825   32.05646    6.21824  0.01343
   9  39.923 BB    0.0885 5314.40723  810.15796  2.22651
  10  43.925 BB    0.0765   59.07787   11.86150  0.02475
  11  50.097 BB    0.1174   73.53716    8.59922  0.03081

Boer 12/2/2015 2:51:48 PM SYSM ji  uo

Page  1 of 2

Daa M32\1\D50000013.D
Samme: 0.1M C1ne

Peak RetTime Type  Width     Area      Height     Area  
  #   [min]        [min]   [pA*s]      [pA]         %
----|-------|----|-------|----------|----------|--------|
  12  50.559 BB    0.1155  301.26007   38.39135  0.12621
  13  50.987 BB    0.1350  345.99808   34.16363  0.14496
  14  52.104 BB    0.1661  442.23685   34.55222  0.18528
  15  55.379 BV    0.3489 1.53736e5  5236.02783 64.40893
  16  55.579 VV    0.1331 6.97356e4  6460.92188 29.21619
  17  55.660 VB    0.0514  246.26105   65.02493  0.10317
  18  55.912 BB    0.0481  128.64572   40.64377  0.05390
  19  56.579 BB    0.0585    9.56895    2.53396  0.00401
  20  56.816 BB    0.0916   49.91595    7.31901  0.02091
  21  57.096 BV    0.0680   53.82137   11.70772  0.02255
  22  57.206 VV    0.0700   74.57529   16.61059  0.03124
  23  57.308 VV    0.0602   58.06633   14.30510  0.02433
  24  57.394 VB    0.0592   21.84551    5.31062  0.00915
  25  57.884 BV    0.0613   24.52355    6.20524  0.01027
  26  57.977 VB    0.0644   16.60599    3.94051  0.00696
  27  58.588 BV    0.0976   99.51610   14.22009  0.04169
  28  58.776 VV    0.0513   90.90850   28.12324  0.03809
  29  58.880 VV    0.0560   38.78033   10.66278  0.01625
  30  59.027 VB    0.0640   23.14709    5.72642  0.00970
  31  59.474 BB    0.0467   57.09470   19.18639  0.02392
  32  60.475 BB    0.0409   46.53337   17.34933  0.01950
  33  60.824 BB    0.0357   43.52694   19.47348  0.01824
  34  63.154 BB    0.0360    6.17513    2.64891  0.00259
  35  64.077 BB    0.0273    3.35928    1.95091  0.00141

Totals :                  2.38688e5  1.43011e4 


=====================================================================
                          *** End of Report ***

Page  2 of 2

Is there a regex, a pattern recognition package, or some other (preferably Python) package that solves this problem?

    import re

    # One numeric field: plain decimals or scientific notation such as 1.53736e5
    num = r'([\d.]+(?:[eE][+-]?\d+)?)'
    # Peak number, retention time, type code, then width, area, height, area %
    pattern = re.compile(r'^\s*\d+\s+' + num + r'\s+[A-Z]+\s+' + r'\s+'.join([num] * 4) + r'\s*$')

    with open('chem.txt', 'r') as chem:
        for line in chem:
            match = pattern.search(line)
            if match:
                ret_time, width, area_pas, height, area_pct = match.groups()
                # write these to a file?
                print(ret_time, width, area_pas, height, area_pct)

You may need to refactor this and add exception handling.
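For a less table-specific approach, one option is to use the dashed rule line (`----|-------|...`) as the description of the layout. The sketch below assumes every table in the file is introduced by such a rule, that cells are aligned under the runs of dashes, and that a blank line ends the table. These assumptions hold for the sample report but not necessarily for other formats. The names `RULE`, `column_spans` and `extract_tables` are made up for this illustration.

```python
import re

# A rule line is runs of dashes separated by '|', e.g. ----|-------|----|
RULE = re.compile(r'^\s*-+(?:\|-+)+\|?\s*$')

def column_spans(rule_line):
    """Return the (start, end) character span of each run of dashes."""
    return [m.span() for m in re.finditer(r'-+', rule_line)]

def extract_tables(lines):
    """Yield one list of rows per table; each row is a list of stripped cell strings."""
    spans, rows = None, []
    for line in lines:
        line = line.rstrip('\n')
        if RULE.match(line):
            # A new rule starts a new table and defines its column boundaries.
            if rows:
                yield rows
            spans, rows = column_spans(line), []
        elif spans is not None:
            if not line.strip():
                # A blank line ends the current table.
                if rows:
                    yield rows
                spans, rows = None, []
            else:
                # Slice each cell out of the line by character position.
                rows.append([line[a:b].strip() for a, b in spans])
    if rows:
        yield rows
```

Because a file object iterates over its lines, `extract_tables(open('chem.txt'))` can be consumed directly. Cells come back as strings; `float()` accepts both `70.98110` and `1.53736e5`, so numeric columns can be converted afterwards. In the sample report the page break splits the peak list into two tables, which would need to be concatenated.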
