Using regex to extract two elements from txt file and rename (python)

I'm trying to rename a bunch of payslip txt files in Python using regex. The elements that I want to use for this are personnummer (social security number) and datum (date). Personnummer is formatted like this \d\d\d\d\d\d-\d\d\d\d and works fine by itself using the code below.

But when I try to add datum as well as personnummer, which is formatted like this GFROM:\d\d\d\d\d\d\d\d (I only want the numbers, not the GFROM part), I run into a syntax error.

Do you have any suggestions? I've looked through the previous posts but haven't really found anything there.
Many thanks in advance.

/Andrew

import os
import re

mydir = 'C:/Users/atutt-wi/Desktop/USB/Matrikelkort/matrikelkort prov'
personnummer = "(\d\d\d\d\d\d\-\d\d\d\d)"
datum = "(GFROM:(\d\d\d\d\d\d\d\d))"

for arch in os.listdir(mydir):
    archpath = os.path.join(mydir, arch)
    with open(archpath) as f:
        txt = f.read()
    s = re.search(personnummer, txt)
    t = re.search(datum, txt)

    name = '19' + s.group() + '  ' + '20' + t.group() + ' Matrikelkort'+ '.txt'
    newpath = os.path.join(mydir, name)
    os.rename(archpath, newpath)


**The input files look like this:**

                                 DATUM: 010122               KUND:20290  
XXX KOMMUN              SIDA:    23   70677
PERSONS NAME                                UTB-KOD                                  ANS.DAT:                               010206-3008


                                  BOK/                  G T       ARBETS-   ARB   ARB   L L P B BRUT L                   FAST
GÄLLER GÄLLER AVG        LÖP AV   CAK/   BEFATTNINGS    R Y ANST  TIDS      TID   TID   P G L L AVDR K BLPP       BELOPP LÖNE   UPP DEL
FR O M T O M  KOD FÖR DB NR  TAL  BSK    -BENÄMNING     P P FORM  VILLKOR   %     HEL   L R G G FROM L FROM FIP*A lÖN    TIML   OMF PEN
----------------------------------------------------------------------------------------------------------------------------------------
760701 790630 110  83 20 5070LOK         HEMSAMARIT     5 1 4     10004000              Ö              7607 000000   800 000000
790701 800108 970  76 21 5017ANA-T       HEMSAMARIT     5T1 4     00004000              K            077907 000000000000 000000
KUNDNR:20290     SIDA:   023     70677    GFROM:19760701    GTOM:19800108           PERSONS NAME                            010206-3008
000001L 2   000001010122 33399CMT011MATRIKELKORT        Matrikelkort        000001CMZ029050330-7118 01-01-22    CMZ02901
                                                                                   120290                                   

**The errors I got**

runfile('C:/Users/atutt-wi/Desktop/USB/regex personnummer och datum matrikelkort tool.py', wdir='C:/Users/atutt-wi/Desktop/USB')
Traceback (most recent call last):

  File "<ipython-input-21-f7cd01adb9a3>", line 1, in <module>
    runfile('C:/Users/atutt-wi/Desktop/USB/regex personnummer och datum matrikelkort tool.py', wdir='C:/Users/atutt-wi/Desktop/USB')

  File "C:\Users\atutt-wi\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", 
line 827, in runfile
    execfile(filename, namespace)

  File "C:\Users\atutt-wi\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", 
line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/atutt-wi/Desktop/USB/regex personnummer och datum matrikelkort tool.py", line 24, in <module>
    os.rename(archpath, newpath)

OSError: [WinError 123] Incorrect syntax for file name, 
directory name or volume label: 'C:/Users/atutt-wi/Desktop/USB/Matrikelkort/matrikelkort prov\\File17.txt' -> 
'C:/Users/atutt-wi/Desktop/USB/Matrikelkort/matrikelkort prov\\010206-3008  20GFROM:19760701 Matrikelkort.txt'


**Update: When I removed the ':' from GFROM I got the following error**

  File "C:\Users\atutt-wi\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
    execfile(filename, namespace)

  File "C:\Users\atutt-wi\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/atutt-wi/Desktop/USB/regex personnummer och datum matrikelkort tool.py", line 22, in <module>
    name = '19' + s.group() + '  ' + '20' + t.group() + ' Matrikelkort'+ '.txt'

AttributeError: 'NoneType' object has no attribute 'group'

Here is a snippet you could try:

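import os
import re

# Folder holding the payslip files (path taken from the question)
mydir = 'C:/Users/atutt-wi/Desktop/USB/Matrikelkort/matrikelkort prov'

# Raw strings keep the backslashes intact; the parentheses capture only the digits
rx_num = re.compile(r"\s(\d{6}-\d{4})\s", re.M)
rx_dat = re.compile(r"GFROM:(\d{8})\s", re.M)

for arch in os.listdir(mydir):
    archpath = os.path.join(mydir, arch)
    with open(archpath) as f:
        txt = f.read()

    # group(1) returns just the captured digits, without the surrounding
    # whitespace or the 'GFROM:' prefix (':' is not allowed in Windows file names)
    s_match = rx_num.search(txt)
    s = s_match.group(1) if s_match is not None else "[Missing]"

    t_match = rx_dat.search(txt)
    t = t_match.group(1) if t_match is not None else "[Missing]"

    # In the sample input GFROM already carries a four-digit year,
    # so the '20' prefix may not be needed here
    name = '19' + s + '  ' + '20' + t + ' Matrikelkort'+ '.txt'

    newpath = os.path.join(mydir, name)
    os.rename(archpath, newpath)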

The use of compile is optional, but I find it clearer. I also added re.M, which is the flag for 'Multiline' (it only changes how ^ and $ behave, so it has no effect on these particular patterns). Lastly, I added those \s before and after the groups to ensure a string like 'abd123456-7890def' would not match. Also, keep in mind that you will only get the first match with this code. If you want every match, try using findall instead.
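
To make the points above concrete, here is a minimal sketch that runs the two patterns against one line from the sample input in the question. It is an illustration under assumptions, not the original poster's final script: `build_name` and `FORBIDDEN` are hypothetical names, and the file name layout is simplified (no '19' or '20' prefix). It shows the difference between `group()` and `group(1)`, what `findall` returns, and a check for characters that Windows rejects in file names, which is most likely what triggered the WinError 123 (the ':' in 'GFROM:').

```python
import re

# Patterns as in the answer, written as raw strings; the parentheses mark
# the part we want to keep (group 1).
RX_NUM = re.compile(r"\s(\d{6}-\d{4})\s")
RX_DAT = re.compile(r"GFROM:(\d{8})\s")

# Characters Windows does not allow in file names.
FORBIDDEN = '<>:"/\\|?*'


def build_name(txt):
    """Return a target file name for one payslip text (hypothetical helper)."""
    num = RX_NUM.search(txt)
    dat = RX_DAT.search(txt)
    # search() returns None when nothing matches, so check before .group().
    s = num.group(1) if num else "Missing"
    t = dat.group(1) if dat else "Missing"
    name = f"{s} {t} Matrikelkort.txt"
    # Fail early with a readable message instead of WinError 123.
    bad = [c for c in name if c in FORBIDDEN]
    if bad:
        raise ValueError(f"invalid characters in file name: {bad}")
    return name


# One line taken from the sample input in the question.
sample = ("KUNDNR:20290     SIDA:   023     70677    GFROM:19760701    "
          "GTOM:19800108           PERSONS NAME                            010206-3008\n")

match = RX_DAT.search(sample)
whole = match.group()    # 'GFROM:19760701 ' (whole match, prefix and trailing space included)
digits = match.group(1)  # '19760701' (only the captured digits)

# findall returns every match; with one capture group it returns the captured parts.
all_dates = re.findall(r"G(?:FROM|TOM):(\d{8})", sample)

name = build_name(sample)
print(whole, digits, all_dates, name, sep="\n")
```

With `group()` the colon from 'GFROM:' ends up in the file name, while `group(1)` keeps only the digits. The `if num else` checks cover the AttributeError from the update: once the pattern no longer matches, `search` returns None and there is nothing to call `.group()` on.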
