
Parsing text with Python 2.7

Text File

• I.D.: AN000015544 
DESCRIPTION: 6 1/2 DIGIT DIGITAL MULTIMETER 
MANUFACTURER: HEWLETT-PACKARDMODEL NUM.: 34401A CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY45027398 
• I.D.: AN000016955 
DESCRIPTION: TEMPERATURE CALIBRATOR 
MANUFACTURER: FLUKE MODEL NUM.: 724 CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: 1189063 
• I.D.: AN000017259 
DESCRIPTION: TRUE RMS MULTIMETER 
MANUFACTURER: AGILENT MODEL NUM.: U1253A CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY49420076 
• I.D.: AN000032766                         
DESCRIPTION: TRUE RMS MULTIMETER                            
MANUFACTURER: AGILENT MODEL NUM.: U1253B CALIBRATION    -   DUE DATE:6/1/2016   SERIAL  NUMBER: MY5048  9036

Objective

I am looking for a more efficient algorithm for parsing the manufacturer name and model number (i.e. 'HEWLETT-PACKARDMODEL NUM.: 34401A', 'AGILENT MODEL NUM.: U1253B', etc.) from the text file above.

Data Structure

parts_data = {'Model_Number': []}

Code

with open("textfile", 'r') as parts_info:
    linearray = parts_info.readlines()
    for line in linearray:
        model_number = ''
        model_name = ''
        if "MANUFACTURER:" in line:
            model_name = line.split(':')[1]
        if "NUM.:" in line:
            model_number = line.split(':')[2]
            model_number = model_number.split()[0]
            model_number = model_name + ' ' + model_number
            parts_data['Model_Number'].append(model_number.rstrip())

My code does exactly what I want, but I think there is a faster or cleaner way to do it. Let's increase efficiency!

Your code already looks fine, and unless you're parsing gigabytes of data I don't see much point in optimizing it. Still, I thought of a few things.

If you remove the linearray = parts_info.readlines() line and loop over the open file directly with a for loop, Python reads it one line at a time, so the whole thing streams even if your file is huge. As written, that line reads the entire file into memory at once rather than going line by line, so you can run out of memory if the file is bigger than your RAM.
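A minimal sketch of the streaming variant, keeping the question's parsing logic unchanged. The function name collect_model_numbers and the in-memory sample are made up for illustration; io.StringIO only stands in for the open file here.

```python
import io


def collect_model_numbers(parts_info):
    """Hypothetical helper: parts_info is any iterable of lines, e.g. an open file."""
    parts_data = {'Model_Number': []}
    for line in parts_info:  # no readlines(): only one line is held at a time
        model_name = ''
        if "MANUFACTURER:" in line:
            model_name = line.split(':')[1]
        if "NUM.:" in line:
            model_number = line.split(':')[2].split()[0]
            parts_data['Model_Number'].append(
                (model_name + ' ' + model_number).rstrip())
    return parts_data


# Stand-in for the text file; with a real file you would pass the
# object returned by open("textfile", 'r') instead.
sample = io.StringIO(
    u"I.D.: AN000016955 \n"
    u"DESCRIPTION: TEMPERATURE CALIBRATOR \n"
    u"MANUFACTURER: FLUKE MODEL NUM.: 724 CALIBRATION - DUE DATE:6/1/2016 "
    u"SERIAL NUMBER: 1189063 \n"
)
parts_data = collect_model_numbers(sample)
```

Because the function only iterates, memory use depends on the longest line and the collected results, not on the file size.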

You can also combine the two if statements into a single conditional, since you only seem to care about lines that have both fields. In the interest of cleaner code, you also don't need model_number = ''; model_name = ''.

Saving the result of calls like line.split(':') instead of repeating them can help.
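These two suggestions could look roughly like this. It is only a sketch: parse_line is a hypothetical helper name, and it assumes every manufacturer line has the same colon layout as the sample data.

```python
def parse_line(line):
    """Hypothetical helper: return 'name number' for a manufacturer line, else None."""
    # One combined condition; no need to pre-initialise the variables.
    if "MANUFACTURER:" in line and "NUM.:" in line:
        fields = line.split(':')  # split once and reuse the result
        model_name = fields[1]
        model_number = fields[2].split()[0]
        return (model_name + ' ' + model_number).rstrip()
    return None
```

In the loop you would then append the return value to parts_data['Model_Number'] whenever it is not None.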

Alternatively, you could try a regex. It's impossible to tell which one will perform better without testing both, which brings me back to what I said at the beginning: optimizing code is tricky and really shouldn't be done unless it's necessary. If you really, really cared about efficiency, you would use a program like awk, which is written in C.
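One way to test both is the standard timeit module. The sketch below is an assumption-laden micro-benchmark, not a verdict: the pattern is just one illustrative regex, by_split and by_regex are made-up names, and the two functions do not return identical strings. Timings vary by machine, so no numbers are shown.

```python
import re
import timeit

LINE = ("MANUFACTURER: FLUKE MODEL NUM.: 724 CALIBRATION - "
        "DUE DATE:6/1/2016 SERIAL NUMBER: 1189063 ")
# Illustrative pattern, compiled once outside the timed code.
PATTERN = re.compile(r'[A-Z ]+ NUM\.: [A-Z\d]+')


def by_split(line):
    fields = line.split(':')
    return fields[1] + ' ' + fields[2].split()[0]


def by_regex(line):
    m = PATTERN.search(line)
    return m.group(0) if m else None


# Run each approach 10000 times on the same line and compare total time.
split_time = timeit.timeit(lambda: by_split(LINE), number=10000)
regex_time = timeit.timeit(lambda: by_regex(LINE), number=10000)
print("split: %.4fs  regex: %.4fs" % (split_time, regex_time))
```

For a meaningful comparison you would run this against a realistic sample of your own file rather than a single line.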

One straightforward way is to use a regex:

import re

with open("textfile", 'r') as parts_info:
    for line in parts_info:
        # capitals/spaces, then " NUM.: ", then the alphanumeric model number
        m = re.search(r'[A-Z ]+ NUM\.: [A-Z\d]+', line)
        if m:
            print(m.group(0))

Result:

'PACKARDMODEL NUM.: 34401A', 
' FLUKE MODEL NUM.: 724', 
' AGILENT MODEL NUM.: U1253A', 
' AGILENT MODEL NUM.: U1253B'
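Note that the first result loses "HEWLETT-" because the character class [A-Z ] does not include a hyphen. If you want the manufacturer and the model number as separate values, a variant with capture groups might look like the sketch below. The pattern and the name extract are my own assumptions, and it relies on every such line containing "MANUFACTURER:" followed by "MODEL NUM.:".

```python
import re

# Assumed layout: "MANUFACTURER: <name>MODEL NUM.: <number> ..."
# The space before MODEL is optional because the sample data has
# "HEWLETT-PACKARDMODEL" with no space.
PART_RE = re.compile(r'MANUFACTURER:\s*(.+?)\s*MODEL\s+NUM\.:\s*(\S+)')


def extract(line):
    """Hypothetical helper: return (manufacturer, model_number) or None."""
    m = PART_RE.search(line)
    return m.groups() if m else None


line = ("MANUFACTURER: HEWLETT-PACKARDMODEL NUM.: 34401A CALIBRATION - "
        "DUE DATE:6/1/2016 SERIAL NUMBER: MY45027398 ")
print(extract(line))  # ('HEWLETT-PACKARD', '34401A')
```

The lazy group (.+?) stops at the first "MODEL NUM.:", so a manufacturer name that itself contains "MODEL" followed by "NUM.:" would be cut short.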

A few things come to mind:

  • You could do the split(':') once and reuse the result
  • If the number of ':' characters is always the same, drop the ifs and check the length once

I end up with something like this:

parts_data = {'Model_Number': []}
with open("textfile.txt", 'r') as parts_info:
    linearray = parts_info.readlines()

for line in linearray:
    linesp = line.split(':')
    if len(linesp)>2:
        model_name = linesp[1]
        model_number = linesp[2]
        model_number = model_number.split()[0]
        model_number = model_name + ' ' + model_number
        parts_data['Model_Number'].append(model_number.rstrip())
