

Using regex with Python to find unique number format

I have a very large text file (45000 lines) containing ID numbers in the format of 4 or 5 digits, a hyphen, 2 or 3 digits, and sometimes one or more trailing letters.

Sample formats:

XXXX-XX, XXXXX-XX, XXXXX-XXw, XXXXX-XXww

(where w is a letter and X is a digit)

Most of the values are in the format #####-## or ####-##, but a large chunk have one or more letters trailing at the end.

What I want to do: whenever a value has letters at the end, I want to store it in a dictionary, keep track of all the unique letter suffixes that diverge from the normal format, and then print that dictionary.

So for values like 11111-12s, 1111-12a, or 11234-24b, I want to store the letter suffixes (s, a, b) and see how they differ. What I have currently just prints every match, including repeats:

import re

# 4-5 digits, a hyphen, 2 digits, then 1-4 word characters
sample = re.compile(r'(\d{4,5}-\d\d\w{1,4})')

with open("Sample.txt", "r") as sampleFile:
    for line in sampleFile:
        # findall returns every match on the line
        sampleNum = sample.findall(line)
        for word in sampleNum:
            print(word)

How would I go about targeting the unique values of the \w{1,4} portion of the regex and storing them in a dict?

EDIT: When I run the code above, this is a sample of the numbers I get:

12647-01a 12627-02R 12606-01a 12588-02a 12583-01S 12583-01R

So the letters at the end vary, and I just want to store the trailing letters (sometimes there are 2 or more) in a dict or set. Hope this helps.

A simple set filled from the trailing part of your regex should do what your clarification describes:

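import re

uniq = set()  # a set keeps each distinct suffix only once
with open('Sample.txt') as fin:
    for line in fin:
        # [A-Za-z] rather than \w: \w also matches digits, so it would
        # capture the digits before the letters (e.g. '01R' instead of 'R')
        ma = re.search(r'([A-Za-z]{1,4})$', line)
        if not ma:
            continue
        uniq.add(ma.group(1))

print(uniq)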
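
If a line can hold more than one ID, or you want the dict you asked for rather than a set, a variant along these lines may help. It is only a sketch: the name `collect_suffixes`, the pattern `ID_WITH_SUFFIX`, and the inline sample lines are made up for illustration, and the pattern assumes the suffix is 1 to 4 ASCII letters directly after an ID of 4-5 digits, a hyphen, and 2-3 digits.

```python
import re
from collections import defaultdict

# Assumed format: 4-5 digits, a hyphen, 2-3 digits, then 1-4 letters.
# The letters get their own capture group so digits are never collected.
ID_WITH_SUFFIX = re.compile(r'\b(\d{4,5}-\d{2,3})([A-Za-z]{1,4})\b')

def collect_suffixes(lines):
    """Map each distinct letter suffix to the set of full IDs carrying it."""
    suffixes = defaultdict(set)
    for line in lines:
        # findall yields one (number, letters) tuple per ID on the line
        for number, letters in ID_WITH_SUFFIX.findall(line):
            suffixes[letters].add(number + letters)
    return dict(suffixes)

# Illustrative input; IDs without letters (1111-12) are ignored.
sample_lines = [
    "12647-01a 12627-02R 12606-01a",
    "12588-02a 12583-01S 12583-01R 1111-12",
]
result = collect_suffixes(sample_lines)
print(sorted(result))  # ['R', 'S', 'a']
```

To run it on the real file, something like `with open('Sample.txt') as fin: result = collect_suffixes(fin)` should work, since a file object iterates line by line. Suffixes are kept case-sensitive here, so `a` and `A` would count as different values.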


 