Using regex to extract two elements from txt file and rename (python)
I'm trying to rename a bunch of payslip txt files in Python using regex. The elements that I want to use for this are personnummer (social security number) and datum (date). Personnummer is formatted like this: `\d\d\d\d\d\d-\d\d\d\d`, and it works fine by itself using the code below.
But when I try to add datum as well as personnummer, which is formatted like this: `GFROM:\d\d\d\d\d\d\d\d` (I only want the numbers, not the GFROM part), I run into a syntax error.
Do you have any suggestions? I've looked through the previous posts but haven't really found anything there.
Many thanks in advance.
/Andrew
```python
import os
import re

mydir = 'C:/Users/atutt-wi/Desktop/USB/Matrikelkort/matrikelkort prov'
personnummer = "(\d\d\d\d\d\d\-\d\d\d\d)"
datum = "(GFROM:(\d\d\d\d\d\d\d\d))"

for arch in os.listdir(mydir):
    archpath = os.path.join(mydir, arch)
    with open(archpath) as f:
        txt = f.read()
    s = re.search(personnummer, txt)
    t = re.search(datum, txt)
    name = '19' + s.group() + ' ' + '20' + t.group() + ' Matrikelkort'+ '.txt'
    newpath = os.path.join(mydir, name)
    os.rename(archpath, newpath)
```
**The input files look like this:**
DATUM: 010122 KUND:20290
XXX KOMMUN SIDA: 23 70677
PERSONS NAME UTB-KOD ANS.DAT: 010206-3008
BOK/ G T ARBETS- ARB ARB L L P B BRUT L FAST
GÄLLER GÄLLER AVG LÖP AV CAK/ BEFATTNINGS R Y ANST TIDS TID TID P G L L AVDR K BLPP BELOPP LÖNE UPP DEL
FR O M T O M KOD FÖR DB NR TAL BSK -BENÄMNING P P FORM VILLKOR % HEL L R G G FROM L FROM FIP*A lÖN TIML OMF PEN
----------------------------------------------------------------------------------------------------------------------------------------
760701 790630 110 83 20 5070LOK HEMSAMARIT 5 1 4 10004000 Ö 7607 000000 800 000000
790701 800108 970 76 21 5017ANA-T HEMSAMARIT 5T1 4 00004000 K 077907 000000000000 000000
KUNDNR:20290 SIDA: 023 70677 GFROM:19760701 GTOM:19800108 PERSONS NAME 010206-3008
000001L 2 000001010122 33399CMT011MATRIKELKORT Matrikelkort 000001CMZ029050330-7118 01-01-22 CMZ02901
120290
**The errors I got**
runfile('C:/Users/atutt-wi/Desktop/USB/regex personnummer och datum matrikelkort tool.py', wdir='C:/Users/atutt-wi/Desktop/USB')
Traceback (most recent call last):
File "<ipython-input-21-f7cd01adb9a3>", line 1, in <module>
runfile('C:/Users/atutt-wi/Desktop/USB/regex personnummer och datum matrikelkort tool.py', wdir='C:/Users/atutt-wi/Desktop/USB')
File "C:\Users\atutt-wi\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py",
line 827, in runfile
execfile(filename, namespace)
File "C:\Users\atutt-wi\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py",
line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/atutt-wi/Desktop/USB/regex personnummer och datum matrikelkort tool.py", line 24, in <module>
os.rename(archpath, newpath)
OSError: [WinError 123] Incorrect syntax for file name,
directory name or volume label: 'C:/Users/atutt-wi/Desktop/USB/Matrikelkort/matrikelkort prov\\File17.txt' ->
'C:/Users/atutt-wi/Desktop/USB/Matrikelkort/matrikelkort prov\\010206-3008 20GFROM:19760701 Matrikelkort.txt'
**Update: When I removed the ':' from GFROM I got the following error**
File "C:\Users\atutt-wi\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)
File "C:\Users\atutt-wi\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/atutt-wi/Desktop/USB/regex personnummer och datum matrikelkort tool.py", line 22, in <module>
name = '19' + s.group() + ' ' + '20' + t.group() + ' Matrikelkort'+ '.txt'
AttributeError: 'NoneType' object has no attribute 'group'
Here is a snippet you could try:

    import os
    import re

    mydir = 'C:/Users/atutt-wi/Desktop/USB/Matrikelkort/matrikelkort prov'

    rx_num = re.compile(r"\s(\d{6}-\d{4})\s", re.M)
    rx_dat = re.compile(r"GFROM:(\d{8})\s", re.M)

    for arch in os.listdir(mydir):
        archpath = os.path.join(mydir, arch)
        with open(archpath) as f:
            txt = f.read()
        s_match = rx_num.search(txt)
        # group(1) is only the captured digits; group() would also include
        # the surrounding whitespace and the "GFROM:" prefix with its colon,
        # which Windows rejects in a file name (WinError 123).
        s = s_match.group(1) if s_match is not None else "[Missing]"
        t_match = rx_dat.search(txt)
        t = t_match.group(1) if t_match is not None else "[Missing]"
        name = '19' + s + ' ' + '20' + t + ' Matrikelkort' + '.txt'
        newpath = os.path.join(mydir, name)
        os.rename(archpath, newpath)

The use of `compile` is optional, but I find it clearer. I also added `re.M`, which is the flag for 'Multiline'; it only changes how `^` and `$` behave, so it is harmless here but not strictly needed for these patterns. Lastly, I added those `\s` before and after the groups to ensure a string like 'abd123456-7890def' would not match. Also, keep in mind that you will only get the first match with this code. If you want every match, try using `findall` instead.
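To make the difference between the whole match and the captured group concrete, here is a minimal sketch that runs on an in-memory string instead of real files. The helper `build_name`, the `FORBIDDEN` set and the `sample` text are illustrative assumptions, not part of the original code. The sketch also leaves out the `'19'` and `'20'` prefixes, since `GFROM` in the sample already holds a four-digit year; add them back if your file names need them.

```python
import re

# Same idea as the patterns above: one capture group holding only the digits.
RX_NUM = re.compile(r"\s(\d{6}-\d{4})\s")
RX_DAT = re.compile(r"GFROM:(\d{8})\s")

# Characters that Windows does not allow in file names.
FORBIDDEN = set('<>:"/\\|?*')


def build_name(txt):
    """Return a new file name for txt, or None if either field is missing.

    build_name is a hypothetical helper used only for this illustration.
    """
    num = RX_NUM.search(txt)
    dat = RX_DAT.search(txt)
    if num is None or dat is None:
        # Let the caller skip the file instead of failing on None.group().
        return None
    # group(1) is the first capture group; group() would be the whole match,
    # including the whitespace and the "GFROM:" prefix with its colon.
    name = f"{num.group(1)} {dat.group(1)} Matrikelkort.txt"
    if FORBIDDEN & set(name):
        raise ValueError(f"invalid characters in file name: {name!r}")
    return name


# Two lines modelled on the input file shown in the question.
sample = (
    "PERSONS NAME UTB-KOD ANS.DAT: 010206-3008\n"
    "KUNDNR:20290 SIDA: 023 70677 GFROM:19760701 GTOM:19800108 "
    "PERSONS NAME 010206-3008\n"
)

print(repr(RX_DAT.search(sample).group()))    # whole match, colon included
print(repr(RX_DAT.search(sample).group(1)))   # captured digits only
print(build_name(sample))                     # usable file name
print(build_name("no matching fields here"))  # None, nothing to rename
print(RX_NUM.findall(sample))                 # every personnummer found
```

With a single capture group in the pattern, `findall` returns a list of the captured strings, so you can check whether a file contains more than one distinct personnummer before deciding how to rename it.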