Using regex to extract two elements from txt file and rename (python)

Question

I'm trying to rename a bunch of payslip txt files in Python using regex. The elements that I want to use for this are personnummer (social security number) and datum (date). Personnummer is formatted like this: \d\d\d\d\d\d-\d\d\d\d, and works fine by itself using the code below.

But when I try to add datum as well as personnummer, which is formatted like this: GFROM:\d\d\d\d\d\d\d\d (I only want the numbers, not the GFROM part), I run into a syntax error.

Do you have any suggestions? I've looked through the previous posts but haven't really found anything there.
Many thanks in advance.

/Andrew

import os
import re

mydir = 'C:/Users/atutt-wi/Desktop/USB/Matrikelkort/matrikelkort prov'
personnummer = "(\d\d\d\d\d\d\-\d\d\d\d)"
datum = "(GFROM:(\d\d\d\d\d\d\d\d))"

for arch in os.listdir(mydir):
    archpath = os.path.join(mydir, arch)
    with open(archpath) as f:
        txt = f.read()
    s = re.search(personnummer, txt)
    t = re.search(datum, txt)

    name = '19' + s.group() + '  ' + '20' + t.group() + ' Matrikelkort'+ '.txt'
    newpath = os.path.join(mydir, name)
    os.rename(archpath, newpath)


**The input files look like this:**

                                 DATUM: 010122               KUND:20290  
XXX KOMMUN              SIDA:    23   70677
PERSONS NAME                                UTB-KOD                                  ANS.DAT:                               010206-3008


                                  BOK/                  G T       ARBETS-   ARB   ARB   L L P B BRUT L                   FAST
GÄLLER GÄLLER AVG        LÖP AV   CAK/   BEFATTNINGS    R Y ANST  TIDS      TID   TID   P G L L AVDR K BLPP       BELOPP LÖNE   UPP DEL
FR O M T O M  KOD FÖR DB NR  TAL  BSK    -BENÄMNING     P P FORM  VILLKOR   %     HEL   L R G G FROM L FROM FIP*A lÖN    TIML   OMF PEN
----------------------------------------------------------------------------------------------------------------------------------------
760701 790630 110  83 20 5070LOK         HEMSAMARIT     5 1 4     10004000              Ö              7607 000000   800 000000
790701 800108 970  76 21 5017ANA-T       HEMSAMARIT     5T1 4     00004000              K            077907 000000000000 000000
KUNDNR:20290     SIDA:   023     70677    GFROM:19760701    GTOM:19800108           PERSONS NAME                            010206-3008
000001L 2   000001010122 33399CMT011MATRIKELKORT        Matrikelkort        000001CMZ029050330-7118 01-01-22    CMZ02901
                                                                                   120290                                   

**The errors I got**

runfile('C:/Users/atutt-wi/Desktop/USB/regex personnummer och datum matrikelkort tool.py', wdir='C:/Users/atutt-wi/Desktop/USB')
Traceback (most recent call last):

  File "<ipython-input-21-f7cd01adb9a3>", line 1, in <module>
    runfile('C:/Users/atutt-wi/Desktop/USB/regex personnummer och datum matrikelkort tool.py', wdir='C:/Users/atutt-wi/Desktop/USB')

  File "C:\Users\atutt-wi\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
    execfile(filename, namespace)

  File "C:\Users\atutt-wi\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/atutt-wi/Desktop/USB/regex personnummer och datum matrikelkort tool.py", line 24, in <module>
    os.rename(archpath, newpath)

OSError: [WinError 123] Incorrect syntax for file name, directory name or volume label: 'C:/Users/atutt-wi/Desktop/USB/Matrikelkort/matrikelkort prov\\File17.txt' -> 'C:/Users/atutt-wi/Desktop/USB/Matrikelkort/matrikelkort prov\\010206-3008  20GFROM:19760701 Matrikelkort.txt'


**Update: When I removed the ':' from GFROM, I got the following error**

  File "C:\Users\atutt-wi\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
    execfile(filename, namespace)

  File "C:\Users\atutt-wi\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/atutt-wi/Desktop/USB/regex personnummer och datum matrikelkort tool.py", line 22, in <module>
    name = '19' + s.group() + '  ' + '20' + t.group() + ' Matrikelkort'+ '.txt'

AttributeError: 'NoneType' object has no attribute 'group'

Answer 1

Here is a snippet you could try:

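import os
import re

# Directory taken from the question
mydir = 'C:/Users/atutt-wi/Desktop/USB/Matrikelkort/matrikelkort prov'

rx_num = re.compile(r"\s(\d{6}-\d{4})\s", re.M)
rx_dat = re.compile(r"GFROM:(\d{8})\s", re.M)

for arch in os.listdir(mydir):
    archpath = os.path.join(mydir, arch)
    with open(archpath) as f:
        txt = f.read()

    # group(1) returns only the captured digits, without the surrounding
    # whitespace or the 'GFROM:' prefix (':' is not allowed in Windows file names)
    s_match = rx_num.search(txt)
    s = s_match.group(1) if s_match is not None else "[Missing]"

    t_match = rx_dat.search(txt)
    t = t_match.group(1) if t_match is not None else "[Missing]"

    # Prefixes kept from the question; GFROM already carries a four-digit year,
    # so drop the '20' if it is not wanted
    name = '19' + s + '  ' + '20' + t + ' Matrikelkort'+ '.txt'

    newpath = os.path.join(mydir, name)
    os.rename(archpath, newpath)

Compiling the patterns is optional, but I find it clearer. I also added re.M, the 'Multiline' flag (it only changes how ^ and $ match, so it is not strictly required for these patterns). Lastly, I added \s before and after the groups to ensure a string like 'abd123456-7890def' would not match. Because the patterns now match more than the digits themselves, group(1) is used to pull out only the captured part; a plain group() would return the whole match, including 'GFROM:' and the whitespace. Checking the match against None before calling group avoids the AttributeError when a file contains no match. Also, keep in mind that you will only get the first match with this code. If you want every match, try using findall instead.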
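To make the difference between the whole match and the captured group concrete, here is a minimal sketch that runs on a shortened sample string instead of real files. The `sample` text is a hypothetical excerpt modelled on the input shown in the question, and `build_name` is an illustrative helper, not part of the original script. It leaves out the '19' and '20' prefixes from the question and returns None instead of a "[Missing]" placeholder, which is just one possible design choice.

```python
import re

# Hypothetical sample, shortened from the input file shown in the question.
sample = (
    "PERSONS NAME      ANS.DAT:      010206-3008\n"
    "KUNDNR:20290  GFROM:19760701    GTOM:19800108   PERSONS NAME   010206-3008\n"
    "000001CMZ029050330-7118 01-01-22\n"
)

rx_num = re.compile(r"\s(\d{6}-\d{4})\s")
rx_dat = re.compile(r"GFROM:(\d{8})\s")

m = rx_dat.search(sample)
whole = m.group()     # entire match: 'GFROM:' plus digits plus one whitespace char
digits = m.group(1)   # only the text inside the parentheses

# With a single capture group, findall returns that group for every match.
# '050330-7118' is skipped because it is not preceded by whitespace.
all_numbers = rx_num.findall(sample)


def build_name(text):
    """Hypothetical helper: build a file name, or return None if a field is missing."""
    num = rx_num.search(text)
    dat = rx_dat.search(text)
    if num is None or dat is None:
        return None  # caller can skip the rename instead of crashing
    return f"{num.group(1)}  {dat.group(1)} Matrikelkort.txt"


print(repr(whole))
print(digits)
print(all_numbers)
print(build_name(sample))
```

Returning None lets the calling loop skip a file that lacks either field, so files are never renamed to a placeholder name that could collide with another file.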

solution1
0 ACCEPTED 2019-10-03 10:17:53
