data extraction - retrieving numerical / tabular / table data from text
I am looking for a generic method to extract table data from text files for further processing. So far I have been trying regular expressions, but it is difficult to write a single regular expression that matches any type of table.
For example, the following expression r'\s*([\d.\w]+)[ \h]+([\d.\w]+)[ \h]+([\d.\w]+)[ \h]+([\d.\w]+)[ \h]+([\d.\w]+)[ \h]+([\d.\w]+)[ \h]+([\d.\w]+)[ \h]*'
matches lines made of 7 repeating fields, so it may work for some tables with 7 columns, but not for other tables.
I would like this to work with any tabular type of data.
For instance, given the following file, how would we get only the text associated with the blocks of numbers under the Peak RetTime .. Area header:
Data \CH32\1\TA\C1 25-12-01 113.D
Sale ame: 0.e i12ol td dcane
=====================================================================
Inion Dae 12/2/201522:49 AM 1
3-40\1201150000013.D)
pA
0cdc0ls,c
d0s00soskdckkksdn d s s s c d
wec cd e ww ff 44 33
d00239390 v3920 2914
=====================================================================
Report
=====================================================================
Peak RetTime Type Width Area Height Area
# [min] [min] [pA*s] [pA] %
----|-------|----|-------|----------|----------|--------|
1 5.626 BB 0.0285 70.98110 33.85870 0.02974
2 7.668 BV 0.0197 1.27084 1.05425 0.00053
3 7.705 VB 0.0440 991.41168 295.00864 0.41536
4 15.050 BB 0.0717 27.99529 5.86073 0.01173
5 22.741 BB 0.0549 28.72847 7.52583 0.01204
6 27.772 BB 0.0857 6380.34424 1010.32770 2.67309
7 32.625 BB 0.0622 53.88815 13.59589 0.02258
8 33.983 BB 0.0825 32.05646 6.21824 0.01343
9 39.923 BB 0.0885 5314.40723 810.15796 2.22651
10 43.925 BB 0.0765 59.07787 11.86150 0.02475
11 50.097 BB 0.1174 73.53716 8.59922 0.03081
Boer 12/2/2015 2:51:48 PM SYSM ji uo
Page 1 of 2
Daa M32\1\D50000013.D
Samme: 0.1M C1ne
Peak RetTime Type Width Area Height Area
# [min] [min] [pA*s] [pA] %
----|-------|----|-------|----------|----------|--------|
12 50.559 BB 0.1155 301.26007 38.39135 0.12621
13 50.987 BB 0.1350 345.99808 34.16363 0.14496
14 52.104 BB 0.1661 442.23685 34.55222 0.18528
15 55.379 BV 0.3489 1.53736e5 5236.02783 64.40893
16 55.579 VV 0.1331 6.97356e4 6460.92188 29.21619
17 55.660 VB 0.0514 246.26105 65.02493 0.10317
18 55.912 BB 0.0481 128.64572 40.64377 0.05390
19 56.579 BB 0.0585 9.56895 2.53396 0.00401
20 56.816 BB 0.0916 49.91595 7.31901 0.02091
21 57.096 BV 0.0680 53.82137 11.70772 0.02255
22 57.206 VV 0.0700 74.57529 16.61059 0.03124
23 57.308 VV 0.0602 58.06633 14.30510 0.02433
24 57.394 VB 0.0592 21.84551 5.31062 0.00915
25 57.884 BV 0.0613 24.52355 6.20524 0.01027
26 57.977 VB 0.0644 16.60599 3.94051 0.00696
27 58.588 BV 0.0976 99.51610 14.22009 0.04169
28 58.776 VV 0.0513 90.90850 28.12324 0.03809
29 58.880 VV 0.0560 38.78033 10.66278 0.01625
30 59.027 VB 0.0640 23.14709 5.72642 0.00970
31 59.474 BB 0.0467 57.09470 19.18639 0.02392
32 60.475 BB 0.0409 46.53337 17.34933 0.01950
33 60.824 BB 0.0357 43.52694 19.47348 0.01824
34 63.154 BB 0.0360 6.17513 2.64891 0.00259
35 64.077 BB 0.0273 3.35928 1.95091 0.00141
Totals : 2.38688e5 1.43011e4
=====================================================================
*** End of Report ***
Page 2 of 2
Is there any regex, pattern recognition package, or other type of (preferably Python) package that solves this problem?
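One way to avoid hard-coding the column count is a simple heuristic: treat a table as a run of consecutive lines that have the same number of whitespace-separated tokens, most of which parse as numbers. The sketch below illustrates that idea only. The function name `find_tables` and the thresholds `min_rows` and `min_numeric` are assumptions made for this example, not part of any library, and they would likely need tuning for other file layouts.

```python
import re

# Plain or exponent-form number, e.g. 7, 0.0285, 1.53736e5
NUMBER = re.compile(r'[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?')


def find_tables(lines, min_rows=3, min_numeric=0.6):
    """Return blocks of consecutive lines that look like table rows.

    A line is a candidate row when it has at least two tokens and at
    least `min_numeric` of them are numbers. Consecutive candidates
    with the same token count form one block; blocks with fewer than
    `min_rows` rows are discarded. Both thresholds are assumptions.
    """
    tables, block = [], []

    def flush():
        # Keep the current block only if it is long enough.
        if len(block) >= min_rows:
            tables.append(list(block))
        block.clear()

    for line in lines:
        tokens = line.split()
        numeric = sum(1 for t in tokens if NUMBER.fullmatch(t))
        is_row = len(tokens) >= 2 and numeric / len(tokens) >= min_numeric
        if is_row and block and len(block[0]) != len(tokens):
            flush()  # column count changed, so a new table starts
        if is_row:
            block.append(tokens)
        else:
            flush()
    flush()
    return tables
```

Called as `find_tables(open('chem.txt'))`, this should return one block per page for the report above, because header, separator, Totals, and Page lines fall below the numeric threshold. Each block is a list of token lists, which can then be converted to numbers or written out as CSV.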
    import re

    # Row layout: peak number, RetTime, Type, Width, Area, Height, Area %.
    # Area and Height may use exponent notation (e.g. 1.53736e5).
    pattern = re.compile(
        r'^\s*\d+\s+([\d.]+)\s+[A-Z]+\s+([\d.]+)'
        r'\s+([\d.eE+-]+)\s+([\d.eE+-]+)\s+([\d.]+)\s*$'
    )

    with open('chem.txt') as chem:
        for line in chem:
            match = pattern.match(line)
            if match:
                ret_time, width, area_pas, height, area_pct = match.groups()
                # write these to a file, or collect them in a list
                print(ret_time, width, area_pas, height, area_pct)
You may need to refactor this and add exception handling.
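As a rough sketch of what such a refactoring might look like, the version below moves the matching into a generator, converts the fields to numbers, and skips anything that fails to convert. The names `ROW`, `parse_peaks`, and `read_peaks` are hypothetical, and the assumption that every data row has exactly seven fields comes from the sample report only.

```python
import re

# Same seven-field row layout as the report, using named groups.
# Fields are captured loosely (\S+) and validated by float() below,
# so exponent forms such as 1.53736e5 are accepted.
ROW = re.compile(
    r'^\s*(?P<peak>\d+)\s+(?P<ret_time>\S+)\s+(?P<type>[A-Z]+)'
    r'\s+(?P<width>\S+)\s+(?P<area>\S+)\s+(?P<height>\S+)'
    r'\s+(?P<area_pct>\S+)\s*$'
)
NUMERIC_FIELDS = ('ret_time', 'width', 'area', 'height', 'area_pct')


def parse_peaks(lines):
    """Yield one dict per peak row; silently skip everything else."""
    for line in lines:
        match = ROW.match(line)
        if not match:
            continue
        row = match.groupdict()
        try:
            row['peak'] = int(row['peak'])
            for name in NUMERIC_FIELDS:
                row[name] = float(row[name])
        except ValueError:
            # Shaped like a row, but a field was not a number.
            continue
        yield row


def read_peaks(path):
    """Parse a report file; return an empty list if it cannot be read."""
    try:
        with open(path) as handle:
            return list(parse_peaks(handle))
    except OSError as exc:
        print(f'cannot read {path}: {exc}')
        return []
```

With this in place, `read_peaks('chem.txt')` returns a list of dicts such as `{'peak': 1, 'ret_time': 5.626, 'type': 'BB', ...}`, which can be passed to `csv.DictWriter` or loaded into a DataFrame for further processing.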