
Python: How to loop through blocks of lines and copy specific text within lines

Input file:

DATE: 07/01/15 @ 0800                 HYRULE HOSPITAL                         PAGE 1
USER: LINK                      Antibiotic Resistance Report
--------------------------------------------------------------------------------------------
Activity Date Range: 01/01/15 - 02/01/15
--------------------------------------------------------------------------------------------
HH0000000001 LINK,DARK 30/M <DIS IN 01/05> (UJ00000001) A001-01 0A ZELDA,PRINCESS MD
15:M0000001R    COMP, Coll: 01/02/15-0800 Recd: 01/02/15-0850 (R#00000001) ZELDA,PRINCESS MD
    Source: SPUTUM                                  
       PSEUDOMONAS FLUORESCENS            LEVOFLOXACIN   >=8   R                            
--------------------------------------------------------------------------------------------
HH0000000002 FAIRY,GREAT   25/F <DIS IN 01/06> (UJ00000002) A002-01 0A ZELDA,PRINCESS MD    
15:M0000002R    COMP, Coll: 01/03/15-2025 Recd: 01/03/15-2035 (R#00000002) ZELDA,PRINCESS MD
    Source: URINE- STRAIGHT CATH                    
   PROTEUS MIRABILIS                  CEFTRIAXONE-other      R                          
--------------------------------------------------------------------------------------------
HH0000000003 MAN,OLD   85/M <DIS IN 01/07> (UJ00000003) A003-01 0A ZELDA,PRINCESS MD 
15:M0000003R    COMP, Coll: 01/04/15-1800 Recd: 01/04/15-1800 (R#00000003) ZELDA,PRINCESS MD
    Source: URINE-CLEAN VOIDED SPEC                 
   ESCHERICHIA COLI                   LEVOFLOXACIN   >=8   R                            
--------------------------------------------------------------------------------------------

I am completely new to programming, scripting, and Python. How would you recommend looping through this sample input to grab specific text from the fields?

Each patient has a unique identifier (e.g. HH0000000001). I want to grab specific text from each line.

Output should look like:

Date|Time|Name|Account|Specimen|Source|Antibiotic
01/02/15|0800|LINK, DARK|HH0000000001|PSEUDOMONAS FLUORESCENS|SPUTUM|LEVOFLOXACIN
01/03/15|2025|FAIRY, GREAT|HH0000000002|PROTEUS MIRABILIS|URINE- STRAIGHT CATH|CEFTRIAXONE-other

Edit: My current code looks like this:

(Disclaimer: I am fumbling around in the dark, so the code is not going to be pretty at all.)

input = open('report.txt')
output = open('abx.txt', 'w')

date = ''  # Defining global variables outside of the loop
time = ''
name = ''
name_last = ''
name_first = ''
account = ''
specimen = ''
source = ''

output.write('Date|Time|Name|Account|Specimen|Source\n')
lines = input.readlines()

for index, line in enumerate(lines):
    print index, line

    if last_line_location:
        new_patient = True
        if not first_time_through:
            output.write("{}|{}|{}, {}|{}|{}|{}\n".format(
                'Date', # temporary placeholder
                'Time', # temporary placeholder
                name_last.capitalize(),
                name_first.capitalize(),
                account,
                'Specimen', # temporary placeholder
                'Source' # temporary placeholder
                ) )
        last_line_location = False
        first_time_through = False

    for each in lines:
        if line.startswith('HH'):  # Extract account and name
            account = line.split()[0]
            name = line.split()[1]
            name_last = name.split(',')[0]
            name_first = name.split(',')[1]
            last_line_location = True

input.close()
output.close()

Currently, the output skips the first patient and only displays information for the 2nd and 3rd patients. The output looks like this:

Date|Time|Name|Account|Specimen|Source
Date|Time|Fairy, Great|HH0000000002|Specimen|Source
Date|Time|Man, Old|HH0000000003|Specimen|Source

Please feel free to make suggestions on how to improve any aspect of this, including output style or overall strategy.

Your code actually works if you add...

last_line_location = True
first_time_through = True

...before your for loop.
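As a hedged sketch of that fix, the loop below is a Python 3 rendering of the question's logic with both flags initialized first. It is wrapped in a hypothetical function, extract_patients, that takes a list of lines and returns rows instead of writing a file; the function name and the trimmed "Last, First|account" row format are assumptions made for illustration, not part of the original code.

```python
def extract_patients(lines):
    """Return 'Last, First|account' rows, one per line that starts with 'HH'."""
    rows = []
    # Both flags must exist before the loop reads them.
    last_line_location = True
    first_time_through = True
    name_last = name_first = account = ''

    for line in lines:
        # The row for a patient is emitted on the line after the 'HH' line.
        if last_line_location:
            if not first_time_through:
                rows.append('{}, {}|{}'.format(
                    name_last.capitalize(), name_first.capitalize(), account))
            last_line_location = False
            first_time_through = False

        if line.startswith('HH'):  # account and name sit on this line
            fields = line.split()
            account = fields[0]
            name_last, name_first = fields[1].split(',')
            last_line_location = True
    return rows
```

Note that a row is only emitted on the line following an HH line, so a report that ends on an HH line would lose that last patient.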

You asked for pointers as well, though...

As has been suggested in the comments, you could look at the re module.

I've knocked something together that shows this. It may not be suitable for all data, because three records is a very small sample and I've made some assumptions.
The last item is also quite contrived, because there's nothing definite to search for (nothing like the Coll: or Source: labels). It will fail if there are no spaces at the start of the final line, for example.

This code is merely a suggestion of another way of doing things:

import re

startflag = False  # True once the report header has been passed
with open('report.txt', 'r') as infile, open('abx.txt', 'w') as outfile:
    outfile.write('Date|Time|Name|Account|Specimen|Source|Antibiotic\n')
    for line in infile:
        # A line of dashes closes the previous record (or part of the header).
        if '---------------' in line:
            if startflag:
                outfile.write('|'.join((date, time, name, account, spec, source, anti)) + '\n')
            else:
                startflag = True
            continue
        # "Activity Date Range" is still header, so wait for the next dashes.
        if 'Activity' in line:
            startflag = False

        # Account number and name, e.g. "HH0000000001 LINK,DARK"
        acc_name = re.findall(r'HH\d+ \w+,\w+', line)
        if acc_name:
            account, name = acc_name[0].split(' ')

        # Collection date and time sit between "Coll: " and " Recd:"
        date_time = re.findall(r'(?<=Coll: ).+(?= Recd:)', line)
        if date_time:
            date, time = date_time[0].split('-')

        # Everything after "Source: "
        source_re = re.findall(r'(?<=Source: ).+', line)
        if source_re:
            source = source_re[0].strip()

        # Indented line: two-word specimen, a run of spaces, then the antibiotic
        anti_spec = re.findall(r'^ +(?!Source)\w+ *\w+ + \S+', line)
        if anti_spec:
            stripped_list = anti_spec[0].strip().split()
            anti = stripped_list[-1]
            spec = ' '.join(stripped_list[:-1])

Output:

Date|Time|Name|Account|Specimen|Source|Antibiotic
01/02/15|0800|LINK,DARK|HH0000000001|PSEUDOMONAS FLUORESCENS|SPUTUM|LEVOFLOXACIN
01/03/15|2025|FAIRY,GREAT|HH0000000002|PROTEUS MIRABILIS|URINE- STRAIGHT CATH|CEFTRIAXONE-other
01/04/15|1800|MAN,OLD|HH0000000003|ESCHERICHIA COLI|URINE-CLEAN VOIDED SPEC|LEVOFLOXACIN
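To make the fragility of the specimen/antibiotic pattern concrete, here is a small sketch that runs that pattern on its own. The helper name split_result_line is hypothetical, and the test strings are shortened copies of lines from the sample report.

```python
import re

# Same pattern as in the answer code, written as a raw string.
ANTI_SPEC = re.compile(r'^ +(?!Source)\w+ *\w+ + \S+')

def split_result_line(line):
    """Return (specimen, antibiotic), or None if the line does not match."""
    match = ANTI_SPEC.search(line)
    if match is None:
        return None
    words = match.group(0).split()
    # The last word is the antibiotic; everything before it is the specimen.
    return ' '.join(words[:-1]), words[-1]

indented = '   PROTEUS MIRABILIS                  CEFTRIAXONE-other      R'
print(split_result_line(indented))           # indented result line
print(split_result_line(indented.lstrip()))  # same line, leading spaces removed
print(split_result_line('    Source: SPUTUM'))
```

The first call returns the specimen and antibiotic; the other two return None, because the pattern relies on leading spaces and explicitly skips the Source line.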

Edit:
Obviously, the variables should be reset to some dummy value between writes, in case of a corrupt record. Also, if there is no line of dashes after the last record, that record won't get written as the code stands.
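The two caveats in this edit could be handled along the following lines. This is only a sketch under stated assumptions: parse_report, FIELDS and MISSING are hypothetical names, an empty string is used as the dummy value, and a record is treated as "in progress" as soon as an account number has been seen. It swaps the startflag bookkeeping for a per-record dict that is reset at every line of dashes and flushed once more at the end of the input.

```python
import re

FIELDS = ('date', 'time', 'name', 'account', 'spec', 'source', 'anti')
MISSING = ''  # assumed dummy value for a field that was never found

def parse_report(lines):
    """Yield one pipe-delimited row per record in an iterable of report lines."""
    record = {}
    for line in lines:
        if '---------------' in line:
            if 'account' in record:   # a record was in progress
                yield '|'.join(record.get(f, MISSING) for f in FIELDS)
            record = {}               # reset so nothing leaks into the next record
            continue

        acc_name = re.search(r'HH\d+ \w+,\w+', line)
        if acc_name:
            record['account'], record['name'] = acc_name.group(0).split(' ')

        date_time = re.search(r'(?<=Coll: ).+(?= Recd:)', line)
        if date_time:
            record['date'], record['time'] = date_time.group(0).split('-')

        source = re.search(r'(?<=Source: ).+', line)
        if source:
            record['source'] = source.group(0).strip()

        anti_spec = re.search(r'^ +(?!Source)\w+ *\w+ + \S+', line)
        if anti_spec:
            words = anti_spec.group(0).split()
            record['spec'], record['anti'] = ' '.join(words[:-1]), words[-1]

    if 'account' in record:           # last record had no trailing dashes
        yield '|'.join(record.get(f, MISSING) for f in FIELDS)
```

Because the function accepts any iterable of lines, it can be fed an open file object directly, for example by writing each yielded row plus a newline to the output file inside a with block.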
