Python: How to loop through blocks of lines and copy specific text within lines
Input file:
DATE: 07/01/15 @ 0800 HYRULE HOSPITAL PAGE 1
USER: LINK Antibiotic Resistance Report
--------------------------------------------------------------------------------------------
Activity Date Range: 01/01/15 - 02/01/15
--------------------------------------------------------------------------------------------
HH0000000001 LINK,DARK 30/M <DIS IN 01/05> (UJ00000001) A001-01 0A ZELDA,PRINCESS MD
15:M0000001R COMP, Coll: 01/02/15-0800 Recd: 01/02/15-0850 (R#00000001) ZELDA,PRINCESS MD
Source: SPUTUM
PSEUDOMONAS FLUORESCENS LEVOFLOXACIN >=8 R
--------------------------------------------------------------------------------------------
HH0000000002 FAIRY,GREAT 25/F <DIS IN 01/06> (UJ00000002) A002-01 0A ZELDA,PRINCESS MD
15:M0000002R COMP, Coll: 01/03/15-2025 Recd: 01/03/15-2035 (R#00000002) ZELDA,PRINCESS MD
Source: URINE- STRAIGHT CATH
PROTEUS MIRABILIS CEFTRIAXONE-other R
--------------------------------------------------------------------------------------------
HH0000000003 MAN,OLD 85/M <DIS IN 01/07> (UJ00000003) A003-01 0A ZELDA,PRINCESS MD
15:M0000003R COMP, Coll: 01/04/15-1800 Recd: 01/04/15-1800 (R#00000003) ZELDA,PRINCESS MD
Source: URINE-CLEAN VOIDED SPEC
ESCHERICHIA COLI LEVOFLOXACIN >=8 R
--------------------------------------------------------------------------------------------
I am completely new to programming, scripting, and Python. How do you recommend looping through this sample input to grab specific text from the fields?
Each patient has a unique identifier (e.g. HH0000000001). I want to grab specific text from each line.
Output should look like:
Date|Time|Name|Account|Specimen|Source|Antibiotic
01/02/15|0800|LINK, DARK|HH0000000001|PSEUDOMONAS FLUORESCENS|SPUTUM|LEVOFLOXACIN
01/03/15|2025|FAIRY, GREAT|HH0000000002|PROTEUS MIRABILIS|URINE- STRAIGHT CATH|CEFTRIAXONE-other
Edit: My current code looks like this:
(Disclaimer: I am fumbling around in the dark, so the code is not going to be pretty at all.)
input = open('report.txt')
output = open('abx.txt', 'w')

date = ''  # Defining global variables outside of the loop
time = ''
name = ''
name_last = ''
name_first = ''
account = ''
specimen = ''
source = ''

output.write('Date|Time|Name|Account|Specimen|Source\n')
lines = input.readlines()

for index, line in enumerate(lines):
    print index, line

    if last_line_location:
        new_patient = True
        if not first_time_through:
            output.write("{}|{}|{}, {}|{}|{}|{}\n".format(
                'Date',  # temporary placeholder
                'Time',  # temporary placeholder
                name_last.capitalize(),
                name_first.capitalize(),
                account,
                'Specimen',  # temporary placeholder
                'Source'  # temporary placeholder
            ))
        last_line_location = False
        first_time_through = False

    for each in lines:
        if line.startswith('HH'):  # Extract account and name
            account = line.split()[0]
            name = line.split()[1]
            name_last = name.split(',')[0]
            name_first = name.split(',')[1]
            last_line_location = True

input.close()
output.close()
Currently, the output skips the first patient and only displays information for the 2nd and 3rd patients. Output looks like this:
Date|Time|Name|Account|Specimen|Source
Date|Time|Fairy, Great|HH0000000002|Specimen|Source
Date|Time|Man, Old|HH0000000003|Specimen|Source
Please feel free to make suggestions on how to improve any aspect of this, including output style or overall strategy.
Your code actually works if you add...
last_line_location = True
first_time_through = True
...before your for loop.
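To see what those two flags do, here is a reduced sketch of the same loop in Python 3. It runs over a short in-memory list instead of report.txt and collects rows in a list instead of writing a file; the names rows and the shortened sample lines are stand-ins invented for this illustration.

```python
# Shortened stand-in for the report; only the parts the loop looks at.
lines = [
    "HH0000000001 LINK,DARK 30/M",
    "Source: SPUTUM",
    "HH0000000002 FAIRY,GREAT 25/F",
    "Source: URINE- STRAIGHT CATH",
]

rows = []
account = name = ''

# Both flags are assigned before the loop, so the first
# `if last_line_location:` has something to read.
last_line_location = True
first_time_through = True

for line in lines:
    if last_line_location:
        # Nothing has been parsed yet on the very first pass, so skip it.
        if not first_time_through:
            rows.append((account, name))
        last_line_location = False
        first_time_through = False

    if line.startswith('HH'):
        account, name = line.split()[:2]
        last_line_location = True

for row in rows:
    print('|'.join(row))
```

Each patient is emitted on the line that follows its HH line, so in this arrangement a patient whose HH line is the very last line of the input would not be written.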
You asked for pointers as well, though.
As has been suggested in the comments, you could look at the re module.
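As a small taste of what re offers for this kind of input, the sketch below pulls the account, name, and collection date and time out of the first two lines of a record. The helper parse_header and both patterns are assumptions made for this illustration, guessed from the three sample records only.

```python
import re

# Patterns guessed from the sample records; real reports may need adjusting.
ACCOUNT_NAME = re.compile(r'^(HH\d+)\s+(\w+),(\w+)')
COLLECTED = re.compile(r'Coll:\s*(\d{2}/\d{2}/\d{2})-(\d{4})')

def parse_header(patient_line, specimen_line):
    """Return (account, last, first, date, time), or None if a line does not match."""
    m1 = ACCOUNT_NAME.search(patient_line)
    m2 = COLLECTED.search(specimen_line)
    if not (m1 and m2):
        return None
    return m1.groups() + m2.groups()

patient = "HH0000000001 LINK,DARK 30/M <DIS IN 01/05> (UJ00000001) A001-01 0A ZELDA,PRINCESS MD"
specimen = "15:M0000001R COMP, Coll: 01/02/15-0800 Recd: 01/02/15-0850 (R#00000001) ZELDA,PRINCESS MD"

print(parse_header(patient, specimen))
# ('HH0000000001', 'LINK', 'DARK', '01/02/15', '0800')
```

Capturing groups return the pieces directly, which avoids a second split() on the matched text.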
I've knocked something together that shows this. It may not be suitable for all data, because three records is a very small sample and I've made some assumptions.
The last item is also quite contrived, because there's nothing definite to search for (such as Coll, Source). It will fail if there are no spaces at the start of the final line, for example.
This code is merely a suggestion of another way of doing things:
import re

startflag = False
with open('report.txt', 'r') as infile:
    with open('abx.txt', 'w') as outfile:
        outfile.write('Date|Time|Name|Account|Specimen|Source|Antibiotic\n')
        for line in infile:
            # A line of dashes closes the current record (or opens the first one).
            if '---------------' in line:
                if startflag:
                    outfile.write('|'.join((date, time, name, account, spec, source, anti)) + '\n')
                else:
                    startflag = True
                continue
            # The dashes after the "Activity Date Range" header must not trigger a write.
            if 'Activity' in line:
                startflag = False
            acc_name = re.findall(r'HH\d+ \w+,\w+', line)
            if acc_name:
                account, name = acc_name[0].split(' ')
            date_time = re.findall(r'(?<=Coll: ).+(?= Recd:)', line)
            if date_time:
                date, time = date_time[0].split('-')
            source_re = re.findall(r'(?<=Source: ).+', line)
            if source_re:
                source = source_re[0].strip()
            # Contrived: relies on leading spaces and a wider gap before the antibiotic.
            anti_spec = re.findall(r'^ +(?!Source)\w+ *\w+ + \S+', line)
            if anti_spec:
                stripped_list = anti_spec[0].strip().split()
                anti = stripped_list[-1]
                spec = ' '.join(stripped_list[:-1])
Output
Date|Time|Name|Account|Specimen|Source|Antibiotic
01/02/15|0800|LINK,DARK|HH0000000001|PSEUDOMONAS FLUORESCENS|SPUTUM|LEVOFLOXACIN
01/03/15|2025|FAIRY,GREAT|HH0000000002|PROTEUS MIRABILIS|URINE- STRAIGHT CATH|CEFTRIAXONE-other
01/04/15|1800|MAN,OLD|HH0000000003|ESCHERICHIA COLI|URINE-CLEAN VOIDED SPEC|LEVOFLOXACIN
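The last pattern in the code above depends on leading spaces. If the report really is laid out in columns separated by runs of two or more spaces (an assumption; three records are too few to confirm it), splitting on those runs might be a less fragile alternative. split_result_line is a hypothetical helper written for this sketch, not part of the code above.

```python
import re

def split_result_line(line):
    """Split a result line into (specimen, antibiotic).

    Assumes columns are separated by two or more spaces.
    Returns None when the line does not have at least two columns.
    """
    fields = re.split(r'\s{2,}', line.strip())
    if len(fields) < 2 or fields[0].startswith('Source:'):
        return None
    return fields[0], fields[1]

print(split_result_line('PSEUDOMONAS FLUORESCENS      LEVOFLOXACIN     >=8     R'))
print(split_result_line('   PROTEUS MIRABILIS      CEFTRIAXONE-other      R'))
print(split_result_line('   Source: SPUTUM'))
```

This works with or without leading spaces, but it should only be applied to lines that were not already recognised as a patient, collection, or Source line, since any multi-column line would otherwise be split the same way.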
Edit:
Obviously, the variables should be reset to some dummy value between writes in case of a corrupt record. Also, if there is no line of dashes after the last record, it won't get written as it stands.
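One possible way to handle both points is sketched below: each record is collected in a dict that is reset to empty strings at every line of dashes, and whatever is left is flushed after the loop. The names parse_report and FIELDS are invented for this sketch, the patterns are adapted from the code above, and the in-memory sample stands in for report.txt.

```python
import re

FIELDS = ('date', 'time', 'name', 'account', 'spec', 'source', 'anti')

def parse_report(lines):
    """Yield one dict per record; fields that were not found stay empty."""
    record = dict.fromkeys(FIELDS, '')
    for line in lines:
        if '---------------' in line:
            if record['account']:               # emit only if a patient line was seen
                yield record
            record = dict.fromkeys(FIELDS, '')  # reset between records
            continue
        m = re.search(r'(HH\d+) (\w+,\w+)', line)
        if m:
            record['account'], record['name'] = m.groups()
        m = re.search(r'Coll: (\S+)-(\d{4})', line)
        if m:
            record['date'], record['time'] = m.groups()
        m = re.search(r'Source: (.+)', line)
        if m:
            record['source'] = m.group(1).strip()
        m = re.match(r' +(?!Source)(\w+ *\w+) {2,}(\S+)', line)
        if m:
            record['spec'], record['anti'] = m.groups()
    if record['account']:                       # flush a last record with no trailing dashes
        yield record

# The second record is deliberately incomplete and has no closing dashes.
sample = [
    '-' * 40,
    'HH0000000001 LINK,DARK 30/M <DIS IN 01/05>',
    '15:M0000001R COMP, Coll: 01/02/15-0800 Recd: 01/02/15-0850',
    '   Source: SPUTUM',
    '   PSEUDOMONAS FLUORESCENS   LEVOFLOXACIN   >=8   R',
    '-' * 40,
    'HH0000000003 MAN,OLD 85/M <DIS IN 01/07>',
    '   Source: URINE-CLEAN VOIDED SPEC',
]

for rec in parse_report(sample):
    print('|'.join(rec[f] for f in FIELDS))
```

Because every record starts from empty strings, a corrupt record shows up as blank columns rather than silently reusing values from the previous patient.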