Using regex to extract two elements from txt file and rename (python)

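I'm trying to rename a bunch of payroll txt files in Python using regex. The elements I want to use for this are personnummer (social security number) and datum (date). Personnummer has the format \d\d\d\d\d\d-\d\d\d\d and works fine on its own with the code below.

But when I try to add datum, which has the format GFROM:\d\d\d\d\d\d\d\d (I only want the digits, not the GFROM part), I get a syntax error.

Do you have any suggestions? I've looked through previous posts but didn't really find anything.
Thanks in advance.

/Andrew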

import os
import re

mydir = 'C:/Users/atutt-wi/Desktop/USB/Matrikelkort/matrikelkort prov'
personnummer = "(\d\d\d\d\d\d\-\d\d\d\d)"
datum = "(GFROM:(\d\d\d\d\d\d\d\d))"

for arch in os.listdir(mydir):
    archpath = os.path.join(mydir, arch)
    with open(archpath) as f:
        txt = f.read()
    s = re.search(personnummer, txt)
    t = re.search(datum, txt)

    name = '19' + s.group() + '  ' + '20' + t.group() + ' Matrikelkort'+ '.txt'
    newpath = os.path.join(mydir, name)
    os.rename(archpath, newpath)


**The input files look like this:**

                                 DATUM: 010122               KUND:20290  
XXX KOMMUN              SIDA:    23   70677
PERSONS NAME                                UTB-KOD                                  ANS.DAT:                               010206-3008


                                  BOK/                  G T       ARBETS-   ARB   ARB   L L P B BRUT L                   FAST
GÄLLER GÄLLER AVG        LÖP AV   CAK/   BEFATTNINGS    R Y ANST  TIDS      TID   TID   P G L L AVDR K BLPP       BELOPP LÖNE   UPP DEL
FR O M T O M  KOD FÖR DB NR  TAL  BSK    -BENÄMNING     P P FORM  VILLKOR   %     HEL   L R G G FROM L FROM FIP*A lÖN    TIML   OMF PEN
----------------------------------------------------------------------------------------------------------------------------------------
760701 790630 110  83 20 5070LOK         HEMSAMARIT     5 1 4     10004000              Ö              7607 000000   800 000000
790701 800108 970  76 21 5017ANA-T       HEMSAMARIT     5T1 4     00004000              K            077907 000000000000 000000
KUNDNR:20290     SIDA:   023     70677    GFROM:19760701    GTOM:19800108           PERSONS NAME                            010206-3008
000001L 2   000001010122 33399CMT011MATRIKELKORT        Matrikelkort        000001CMZ029050330-7118 01-01-22    CMZ02901
                                                                                   120290                                   

**The errors I got**

runfile('C:/Users/atutt-wi/Desktop/USB/regex personnummer och datum matrikelkort tool.py', wdir='C:/Users/atutt-wi/Desktop/USB')
Traceback (most recent call last):

  File "<ipython-input-21-f7cd01adb9a3>", line 1, in <module>
    runfile('C:/Users/atutt-wi/Desktop/USB/regex personnummer och datum matrikelkort tool.py', wdir='C:/Users/atutt-wi/Desktop/USB')

  File "C:\Users\atutt-wi\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
    execfile(filename, namespace)

  File "C:\Users\atutt-wi\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/atutt-wi/Desktop/USB/regex personnummer och datum matrikelkort tool.py", line 24, in <module>
    os.rename(archpath, newpath)

OSError: [WinError 123] Incorrect syntax for file name, directory name or volume label: 'C:/Users/atutt-wi/Desktop/USB/Matrikelkort/matrikelkort prov\\File17.txt' -> 'C:/Users/atutt-wi/Desktop/USB/Matrikelkort/matrikelkort prov\\010206-3008  20GFROM:19760701 Matrikelkort.txt'


**Update: When I removed the ':' from GFROM, I got the following error**

  File "C:\Users\atutt-wi\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
    execfile(filename, namespace)

  File "C:\Users\atutt-wi\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/atutt-wi/Desktop/USB/regex personnummer och datum matrikelkort tool.py", line 22, in <module>
    name = '19' + s.group() + '  ' + '20' + t.group() + ' Matrikelkort'+ '.txt'

AttributeError: 'NoneType' object has no attribute 'group'

Here is a snippet you can try:

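import os
import re

# Folder containing the payroll files (path taken from the question).
mydir = 'C:/Users/atutt-wi/Desktop/USB/Matrikelkort/matrikelkort prov'

# Raw strings keep the backslashes intact; each pattern captures only the digits.
rx_num = re.compile(r"\s(\d{6}-\d{4})\s", re.M)
rx_dat = re.compile(r"GFROM:(\d{8})\s", re.M)

for arch in os.listdir(mydir):
    archpath = os.path.join(mydir, arch)
    with open(archpath) as f:
        txt = f.read()

    # group(1) returns just the captured digits, without the surrounding
    # whitespace or the "GFROM:" prefix (':' is not allowed in Windows file names).
    s_match = rx_num.search(txt)
    s = s_match.group(1) if s_match is not None else "[Missing]"

    t_match = rx_dat.search(txt)
    t = t_match.group(1) if t_match is not None else "[Missing]"

    name = '19' + s + '  ' + '20' + t + ' Matrikelkort'+ '.txt'

    newpath = os.path.join(mydir, name)
    os.rename(archpath, newpath)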
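
The WinError 123 in the question comes from the ':' that ends up in the new file name, which happens when the whole match is used instead of the captured digits. A minimal sketch of the difference between `group()` and `group(1)`, assuming the line below (copied from the sample input) is representative; `sample`, `m` and `n` are illustrative names only:

```python
import re

# One line copied from the sample input in the question.
sample = (
    "KUNDNR:20290     SIDA:   023     70677    GFROM:19760701    GTOM:19800108"
    "           PERSONS NAME                            010206-3008\n"
)

m = re.search(r"GFROM:(\d{8})", sample)
print(m.group())    # whole match, including the "GFROM:" prefix
print(m.group(1))   # only the captured digits

n = re.search(r"\s(\d{6}-\d{4})\s", sample)
print(repr(n.group()))  # whole match, including the surrounding whitespace
print(n.group(1))       # only the captured number
```

Only the `group(1)` values are safe to put into a file name here, since the whole match can contain ':' or a newline.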

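Using compile is optional, but I find it clearer. I also added re.M, the "multiline" flag (it only changes how ^ and $ match, so it is not strictly needed for these patterns). Finally, I added the \s before and after the group to make sure that strings like 'abd123456-7890def' don't match. Also, keep in mind that this code only gives you the first match. If you want every match, try findall.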
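
As a rough illustration of the last two points, the sketch below runs `re.findall` over two lines copied from the sample input, once without and once with the surrounding `\s`. The names `txt`, `loose` and `strict` are made up for this example:

```python
import re

# Two lines copied from the sample input in the question.
txt = (
    "KUNDNR:20290     SIDA:   023     70677    GFROM:19760701    GTOM:19800108"
    "           PERSONS NAME                            010206-3008\n"
    "000001L 2   000001010122 33399CMT011MATRIKELKORT        Matrikelkort"
    "        000001CMZ029050330-7118 01-01-22    CMZ02901\n"
)

# Without the surrounding \s, digits embedded in a longer token also match.
loose = re.findall(r"\d{6}-\d{4}", txt)

# With \s on both sides, only the standalone number is returned.
strict = re.findall(r"\s(\d{6}-\d{4})\s", txt)

print(loose)   # two hits, one of them cut out of "CMZ029050330-7118"
print(strict)  # one hit
```

When the pattern contains a single capturing group, `findall` returns a list of the captured strings rather than match objects, so there is no `.group()` call to make on the results.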
