
Parsing text with Python 2.7

Text File

• I.D.: AN000015544 
DESCRIPTION: 6 1/2 DIGIT DIGITAL MULTIMETER 
MANUFACTURER: HEWLETT-PACKARDMODEL NUM.: 34401A CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY45027398 
• I.D.: AN000016955 
DESCRIPTION: TEMPERATURE CALIBRATOR 
MANUFACTURER: FLUKE MODEL NUM.: 724 CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: 1189063 
• I.D.: AN000017259 
DESCRIPTION: TRUE RMS MULTIMETER 
MANUFACTURER: AGILENT MODEL NUM.: U1253A CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY49420076 
• I.D.: AN000032766                         
DESCRIPTION: TRUE RMS MULTIMETER                            
MANUFACTURER: AGILENT MODEL NUM.: U1253B CALIBRATION    -   DUE DATE:6/1/2016   SERIAL  NUMBER: MY5048  9036

Objective

I'm looking for a more efficient way to parse the manufacturer name and model number, i.e. 'HEWLETT-PACKARDMODEL NUM.: 34401A', 'AGILENT MODEL NUM.: U1253B', etc., from the text file above.

Data Structure

parts_data = {'Model_Number': []}

Code

with open("textfile", 'r') as parts_info:
    linearray = parts_info.readlines()
    for line in linearray:
        model_number = ''
        model_name = ''
        if "MANUFACTURER:" in line:
            model_name = line.split(':')[1]
        if "NUM.:" in line:
            model_number = line.split(':')[2]
            model_number = model_number.split()[0]
            model_number = model_name + ' ' + model_number
            parts_data['Model_Number'].append(model_number.rstrip())

My code does exactly what I want, but I think there is a faster or cleaner way to do it. Let's increase efficiency!

Your code already looks fine, and unless you're parsing gigabytes of data I'm not sure optimizing it is worthwhile. Still, a few things came to mind.

If you remove the linearray = parts_info.readlines() line, Python can iterate over the open file directly in the for loop, which streams the file line by line. As written, readlines() reads the entire file into memory at once, so a file larger than your available memory could exhaust it.
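As a minimal sketch of the streaming idea: the generator below loops over the stream itself instead of calling readlines(). The name iter_manufacturer_lines is made up for illustration, and io.StringIO stands in for the open file so the example runs without "textfile" on disk.

```python
import io

def iter_manufacturer_lines(stream):
    """Yield the lines that carry manufacturer data, one at a time."""
    for line in stream:  # file objects are iterable, so no readlines() is needed
        if "MANUFACTURER:" in line:
            yield line.rstrip("\n")

# Assumption: io.StringIO replaces open("textfile", 'r') for this demo only.
sample = io.StringIO(
    u"DESCRIPTION: TEMPERATURE CALIBRATOR\n"
    u"MANUFACTURER: FLUKE MODEL NUM.: 724 CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: 1189063\n"
)
matches = list(iter_manufacturer_lines(sample))
```

With a real file, the same function can be fed the object returned by open(), and only one line is held in memory at a time.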

You can also combine the two if statements into a single conditional, since you only seem to care about lines that have both fields. Once they are combined, you also don't need the model_number = ''; model_name = '' initializations.

Saving the result of repeated calls like line.split(':') in a variable can help too.
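Those two suggestions might look roughly like this. The helper name parse_model is hypothetical, and the sketch keeps the question's field positions (index 1 for the name, index 2 for the number), so it produces the same strings as the original loop, leading space included.

```python
def parse_model(line):
    """Return 'name number' for a manufacturer line, or None for any other line."""
    # One conditional: only lines with both markers are of interest.
    if "MANUFACTURER:" not in line or "NUM.:" not in line:
        return None
    fields = line.split(':')              # split once and reuse the result
    model_name = fields[1]                # e.g. ' FLUKE MODEL NUM.'
    model_number = fields[2].split()[0]   # e.g. '724'
    return (model_name + ' ' + model_number).rstrip()

line = "MANUFACTURER: FLUKE MODEL NUM.: 724 CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: 1189063"
result = parse_model(line)   # ' FLUKE MODEL NUM. 724'
```

Like the original, this raises an IndexError if nothing follows "NUM.:", so it assumes every manufacturer line is well formed.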

Alternatively, you could try a regex. It's impossible to tell which approach will perform better without testing both, which brings me back to what I said at the beginning: optimizing code is tricky and really shouldn't be done unless it's necessary. If you really, really cared about efficiency, you would use a tool like awk, which is written in C.
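One way to test both is the standard timeit module. The sketch below is only a rough micro-benchmark on a single sample line; the function names and the iteration count are arbitrary choices, and the timings will differ between machines and Python versions.

```python
import re
import timeit

LINE = ("MANUFACTURER: AGILENT MODEL NUM.: U1253A CALIBRATION - "
        "DUE DATE:6/1/2016 SERIAL NUMBER: MY49420076")
PATTERN = re.compile(r'[A-Z ]+ NUM\.: [A-Z\d]+')  # compiled once, outside the timed code

def with_split(line=LINE):
    fields = line.split(':')
    return fields[1] + ' ' + fields[2].split()[0]

def with_regex(line=LINE):
    m = PATTERN.search(line)
    return m.group(0) if m else None

# Time 20000 calls of each approach on the same line.
split_seconds = timeit.timeit(with_split, number=20000)
regex_seconds = timeit.timeit(with_regex, number=20000)
print("split: %.4fs  regex: %.4fs" % (split_seconds, regex_seconds))
```

Note that the two functions do not return identical strings (the regex keeps the colon after "NUM."), so compare speed only after deciding which output format you actually need.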

One straightforward way is to use a regex:

import re

with open("textfile", 'r') as parts_info:
    for line in parts_info:
        # Capitals and spaces, then ' NUM.: ', then the model number
        m = re.search(r'[A-Z ]+ NUM\.: [A-Z\d]+', line)
        if m:
            print(m.group(0))

Result:

'PACKARDMODEL NUM.: 34401A', 
' FLUKE MODEL NUM.: 724', 
' AGILENT MODEL NUM.: U1253A', 
' AGILENT MODEL NUM.: U1253B'
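Note that the first entry above lost "HEWLETT-": the character class [A-Z ] does not include a hyphen, so the match can only begin after it. A possible variant, assuming manufacturer names contain only capital letters, spaces and hyphens, adds '-' to the class and strips the leading space:

```python
import re

# Hypothetical variant of the pattern above: '-' added to the character class.
PATTERN = re.compile(r'[A-Z -]+ NUM\.: [A-Z\d]+')

lines = [
    "MANUFACTURER: HEWLETT-PACKARDMODEL NUM.: 34401A CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY45027398",
    "MANUFACTURER: FLUKE MODEL NUM.: 724 CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: 1189063",
]

found = []
for line in lines:
    m = PATTERN.search(line)
    if m:
        found.append(m.group(0).strip())  # drop the space left after 'MANUFACTURER:'

# found == ['HEWLETT-PACKARDMODEL NUM.: 34401A', 'FLUKE MODEL NUM.: 724']
```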

A few things come to mind:

  • You could call split(':') once and reuse the result
  • If the number of colons is always the same, drop the ifs and check the length of the split result once

I ended up with something like this:

parts_data = {'Model_Number': []}
with open("textfile.txt", 'r') as parts_info:
    linearray = parts_info.readlines()

for line in linearray:
    linesp = line.split(':')
    if len(linesp)>2:
        model_name = linesp[1]
        model_number = linesp[2]
        model_number = model_number.split()[0]
        model_number = model_name + ' ' + model_number
        parts_data['Model_Number'].append(model_number.rstrip())


 