How to make a regex for table data in Python
I have a .txt file and I want to write a regex that can parse some of the data in it.
I have tried, but I am not able to get what I am looking for.
This is a kind of table data; the format may be the same for other files.
I am adding the data here; please treat it as a .txt file.
Any help will be appreciated.
Tribhuwan Diagnostic Centre | HOSPITALROAD, Morne)
East Champaran- 845401 (Bihar)
(FULLY AUTOMATED & COMPUTERISED LAB) Mob. :+9162046 29003
Name HAJAN sadshaj Booking Date 22/s/2020
G/A male 18 Yrs Reporting Date 22/05/2020
Lab No. 10203693 Sample Collected At Lab
Ref. By Dr. I.C.U
; UVLO
Test Name Value Unit Biological Ref Interval
COMPLETE BLOOD COUNT (CBC)
TOTAL LEUCOCYTES COUNT (TLC) 23160 cells/cmm 4000 - 11000
DIFFERENTIAL LEUCOCYTES COUNT (DLC)
NEUTROPHILS 93.4 % 45.0 - 65.0
LYMPHOCYTES 3.3 % 20.0 - 45.0
MONOCYTES 3.1 % 4.0 - 10.0
EOSINOPHILS 0.2 % 0.0 - 5.0
BASOPHILS 0.0 % 0.0-1.0
ABSOLUTE NEUTROPHILS 21620.0 3000.0 - 7000.0
ABSOLUTE LYMPHOCYTES 750.0 800.0 - 4000.0
ABSOLUTE MONOCYTES 730.0 0.0 - 1200.0
ABSOLUTE EOSINOPHILS 50.0 0.0 - 500.0
ABSOLUTE BASOPHILS 10.0 0.0 - 100.0
RBC COUNT 4.31 Millions/cmm 3.80 - 5.80
HAEMOGLOBIN (Hb) 13.1 gm/dl 11.0 - 16.5
P.C.V/HCT 41.2 % 35.0 - 50.0
MCV 95.5 fl. 80.0 - 97.0
MCH 30.3 Picogram 26.5 - 35.5
MCHC 31.8 g/dl 31.5-35.5
RDW / SD 49.7 FI 37.0 - 54.0
RDW / CV 12.3 % 10.0 - 15.0
PLATELET COUNT 148000 /cmm 150000 - 450000
PDW 17.0 fl 10.0 - 18.0
MPV 13.3 fl 6.5 - 11.7
PCT 0.198 % 0.108 - 0.282
Le
_
I want to get only the first two columns from this.
The output I want (Test Name, Value):
TOTAL LEUCOCYTES COUNT (TLC) 23160
DIFFERENTIAL LEUCOCYTES COUNT (DLC)
NEUTROPHILS 93.4
LYMPHOCYTES 3.3
MONOCYTES 3.1
EOSINOPHILS 0.2
BASOPHILS 0.0
ABSOLUTE NEUTROPHILS 21620.0
ABSOLUTE LYMPHOCYTES 750.0
ABSOLUTE MONOCYTES 730.0
ABSOLUTE EOSINOPHILS 50.0
ABSOLUTE BASOPHILS 10.0
RBC COUNT 4.31
HAEMOGLOBIN (Hb) 13.1
P.C.V/HCT 41.2
MCV 95.5
MCH 30.3
MCHC 31.8
RDW / SD 49.7
RDW / CV 12.3
PLATELET COUNT 148000
PDW 17.0
MPV 13.3
PCT 0.198
This kind of data is hard to parse with a regex, but you can try this one (it will probably need adjusting for other text files) (regex101):
import re

# variable `txt` holds the contents of the text file from the question
for col1, col2 in re.findall(r'^\s{13}([A-Z.]{2}[^\n\d]*[A-Z)])(?:\s*([\d.]+)|[^$])', txt, flags=re.MULTILINE):
    print('{:<50}{}'.format(col1, col2))
Prints:
TOTAL LEUCOCYTES COUNT (TLC) 23160
DIFFERENTIAL LEUCOCYTES COUNT (DLC)
NEUTROPHILS 93.4
LYMPHOCYTES 3.3
MONOCYTES 3.1
EOSINOPHILS 0.2
BASOPHILS 0.0
ABSOLUTE NEUTROPHILS 21620.0
ABSOLUTE LYMPHOCYTES 750.0
ABSOLUTE MONOCYTES 730.0
ABSOLUTE EOSINOPHILS 50.0
ABSOLUTE BASOPHILS 10.0
RBC COUNT 4.31
HAEMOGLOBIN (Hb) 13.1
P.C.V/HCT 41.2
MCV 95.5
MCH 30.3
MCHC 31.8
RDW / SD 49.7
RDW / CV 12.3
PLATELET COUNT 148000
PDW 17.0
MPV 13.3
PCT 0.198
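The pattern above depends on every table row being indented by exactly 13 spaces. If that cannot be relied on in other files, one alternative is to go through the text line by line and take the first standalone number on each line as the value. The following is only a sketch under that assumption: the names `ROW`, `parse_rows` and `sample` are made up for illustration, and it should be run on the table body only, because header lines such as `Lab No. 10203693 ...` would also match.

```python
import re

# Hypothetical pattern: a test name that starts with an uppercase letter,
# optionally followed by the first standalone number on the line.
# Anything after that number (unit, reference interval) is ignored.
ROW = re.compile(r'^\s*([A-Z][A-Za-z()/. ]*?)(?:\s+(\d+(?:\.\d+)?)\b.*)?\s*$')

def parse_rows(text):
    """Yield (test_name, value) pairs; value is '' for rows without a number."""
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            yield m.group(1).strip(), m.group(2) or ''

# A few rows copied from the question, with mixed indentation.
sample = (
    "TOTAL LEUCOCYTES COUNT (TLC) 23160 cells/cmm 4000 - 11000\n"
    "DIFFERENTIAL LEUCOCYTES COUNT (DLC)\n"
    "    NEUTROPHILS 93.4 % 45.0 - 65.0\n"
    "RDW / SD 49.7 FI 37.0 - 54.0\n"
    "MCHC 31.8 g/dl 31.5-35.5\n"
)

for name, value in parse_rows(sample):
    print('{:<50}{}'.format(name, value))
```

The name group is lazy, so it stops at the first place where a number follows; rows that contain no number at all are kept with an empty value.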
You can use Python's regex library (re) to achieve what you want. I started to write a regex for your problem but have not finished it. I'll update my post when I arrive at something satisfying.
Currently, the regex matches the first and second columns of each line that starts with whitespace and has an alphanumeric first column and a numeric second column. We still need to add a match for lines with only one column.
^\s+([a-zA-Z()\/. ]+)\s+(\d+(?:\.\d+)?)
You can write and test your regexes easily on regex101.com; it lets you visualize what they are doing, which helps with debugging.
[EDIT]
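The value column in the sample holds both integers (23160) and decimals (93.4), so the way the dot is written in the numeric part matters. As a small illustration, the snippet below compares an unescaped dot with an escaped, optional fractional part; the names `LOOSE` and `STRICT` are only for this example.

```python
import re

LOOSE = re.compile(r'\d+.\d+')          # "." matches any character except a newline
STRICT = re.compile(r'\d+(?:\.\d+)?')   # digits, optionally followed by ".digits"

# Tokens taken from, or similar to, the sample report.
for token in ["93.4", "23160", "22/05", "7"]:
    print(token, bool(LOOSE.fullmatch(token)), bool(STRICT.fullmatch(token)))
```

With the unescaped dot, a date fragment such as 22/05 is accepted as a number and a single-digit value is rejected, which is usually not what is wanted for this table.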
This one should do the trick, but you need to clean up your input string a bit before passing it through the regex. Assuming that the title COMPLETE BLOOD COUNT (CBC) will always be present, you can call Python's find function and remove the characters that precede it.
(^\s+([a-zA-Z()\/. ]+)\s+((\d+(?:\.\d+)?)))|(^\s+(([a-zA-Z()\/. ]+))\s*$)
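The clean-up step described above, dropping everything before the COMPLETE BLOOD COUNT (CBC) title, could look roughly like this. It is a sketch only: `table_body`, `TITLE` and `raw` are illustrative names, and it assumes the title itself should be dropped as well, since it does not appear in the desired output.

```python
TITLE = "COMPLETE BLOOD COUNT (CBC)"

def table_body(text, title=TITLE):
    """Return the text that follows the title, or the whole text if the title is absent."""
    start = text.find(title)      # str.find returns -1 when the substring is not found
    if start == -1:
        return text
    # Skip the title and the line break after it, but keep any indentation
    # of the first table row, because the regex expects leading whitespace.
    return text[start + len(title):].lstrip("\r\n")

# Shortened sample based on the report in the question.
raw = (
    "Ref. By Dr. I.C.U\n"
    "Test Name Value Unit Biological Ref Interval\n"
    "COMPLETE BLOOD COUNT (CBC)\n"
    "TOTAL LEUCOCYTES COUNT (TLC) 23160 cells/cmm 4000 - 11000\n"
    "NEUTROPHILS 93.4 % 45.0 - 65.0\n"
)

print(table_body(raw))
```

The result can then be passed to the regex with `re.MULTILINE`, so that `^` and `$` apply to each line rather than to the whole string.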