
How to cut part of the text and replace each line with Python and RegEx

Hello, I'm a complete beginner with Python; I have just started learning it and using RegEx for text manipulation. I apologize in advance if I have broken any StackOverflow rules.

I am writing a Python script that takes (cuts) the date and times from the first line and uses them to replace "Date", "TimeWindowStart" and "TimeWindowEnd" on each following line.

ReportDate=03/24/2019, TimeWindowStart=18:00:00, TimeWindowEnd=20:59:59

Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000

I know how to select the date with a regex:

([0-9][0-9]|2[0-9])/[0-9][0-9](/[0-9][0-9][0-9][0-9])?

And how to select the time:

([0-9][0-9]|2[0-9]):[0-9][0-9](:[0-9][0-9])?

But I'm stuck on how to select part of the text, copy it, and then find the text I want to replace using the re.sub function.

So the final output would look like this:

ReportDate=, TimeWindowStart=, TimeWindowEnd=

03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000 
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000 
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000 
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000

This is a partial answer, because I don't know the Python APIs for manipulating text files particularly well. You can read the first line of the file and extract the values for the report date and the start/end window times.

import re

first = "ReportDate=03/24/2019, TimeWindowStart=18:00:00, TimeWindowEnd=20:59:59"
# Each pattern matches the whole line and keeps only the captured value
ReportDate = re.sub(r'ReportDate=([^,]+),.*', r'\1', first)
TimeWindowStart = re.sub(r'.*TimeWindowStart=([^,]+),.*', r'\1', first)
TimeWindowEnd = re.sub(r'.*TimeWindowEnd=(.*)', r'\1', first)

Write out the first line with the values for the three variables removed.

Then, all you need to do is read in each subsequent line and do the following replacements:

line = "Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000"
# \b anchors each name at word boundaries, so the separating spaces are kept
line = re.sub(r'\bDate\b', ReportDate, line)
line = re.sub(r'\bTimeWindowStart\b', TimeWindowStart, line)
line = re.sub(r'\bTimeWindowEnd\b', TimeWindowEnd, line)

After processing each line in this way, you can write it to the output file.
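The steps above could be put together roughly as follows. This is only a sketch: the function name fill_report is made up for illustration, and it works on a list of lines so that the file handling (which this answer leaves open) stays separate.

```python
import re

def fill_report(lines):
    """Blank the header values and copy them into every following line."""
    first = lines[0]
    # Pull the three values out of the header line
    m = re.search(
        r'ReportDate=([^,]+), TimeWindowStart=([^,]+), TimeWindowEnd=(\S+)',
        first,
    )
    if m is None:
        raise ValueError("header line is not in the expected format")
    report_date, window_start, window_end = m.groups()

    # Header with the three values removed (everything after each "=")
    header = re.sub(r'=[^,\s]+', '=', first)

    out = [header]
    for line in lines[1:]:
        line = re.sub(r'\bDate\b', report_date, line)
        line = re.sub(r'\bTimeWindowStart\b', window_start, line)
        line = re.sub(r'\bTimeWindowEnd\b', window_end, line)
        out.append(line)
    return out

lines = [
    "ReportDate=03/24/2019, TimeWindowStart=18:00:00, TimeWindowEnd=20:59:59",
    "",
    "Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000",
]
for line in fill_report(lines):
    print(line)
```

With a real file, the list of lines could come from something like open(path).read().splitlines(), and the result could be written back with '\n'.join(...).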

First, you can specify a quantifier in a regex, so if you want four digits you don't need [0-9][0-9][0-9][0-9]; you can write [0-9]{4} instead. To capture part of an expression, wrap it in parentheses: value=([0-9]{4}) will give you only the digits.
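As a small illustration of both points (the sample string value=2019 is made up for this example):

```python
import re

sample = "value=2019"

# The long form and the quantifier form match the same text
long_form = re.search(r'[0-9][0-9][0-9][0-9]', sample).group(0)
short_form = re.search(r'[0-9]{4}', sample).group(0)
print(long_form == short_form)

# Parentheses capture only the part inside them
captured = re.search(r'value=([0-9]{4})', sample).group(1)
print(captured)
```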

If you want to use re.sub, you just need to give it a pattern, a replacement string and your input string, e.g. re.sub(pattern, replacement, string)

Therefore:

import re

txt = """ReportDate=03/24/2019, TimeWindowStart=18:00:00, TimeWindowEnd=20:59:59

Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
"""

pattern_date = 'ReportDate=([0-9]{2}/[0-9]{2}/[0-9]{4})'
report_date = re.findall(pattern_date, txt)[0]

pattern_time_start = 'TimeWindowStart=([0-9]{2}:[0-9]{2}:[0-9]{2})'
start_time = re.findall(pattern_time_start, txt)[0]

pattern_time_end = 'TimeWindowEnd=([0-9]{2}:[0-9]{2}:[0-9]{2})'
end_time = re.findall(pattern_time_end, txt)[0]

splitted = txt.split('\n')  # Split the txt so that we skip the first line

txt2 = '\n'.join(splitted[1:])  # text to perform the sub 

# substitution of your values
txt2 = re.sub('Date', report_date, txt2)
txt2 = re.sub('TimeWindowStart', start_time, txt2)
txt2 = re.sub('TimeWindowEnd', end_time, txt2)

txt_final = splitted[0] + '\n' + txt2
print(txt_final)

Output:

ReportDate=03/24/2019, TimeWindowStart=18:00:00, TimeWindowEnd=20:59:59

03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000

Try this:

import re

# Open the file and read it line by line
with open("a") as file:
    # Get and process the first line
    first_line = file.readline()
    m = re.search(r"ReportDate=(?P<ReportDate>[0-9/]+), TimeWindowStart=(?P<TimeWindowStart>[0-9:]+), TimeWindowEnd=(?P<TimeWindowEnd>[0-9:]+)", first_line)
    # Remove the three captured values from the header
    first_line = re.sub(m.group('ReportDate'), "", first_line)
    first_line = re.sub(m.group('TimeWindowStart'), "", first_line)
    first_line = re.sub(m.group('TimeWindowEnd'), "", first_line)
    # rstrip() drops the trailing newline so print() does not add a second one
    print(first_line.rstrip())

    # Process the rest of the lines
    for line in file:
        line = re.sub(r'\bDate\b', m.group('ReportDate'), line)
        line = re.sub(r'\bTimeWindowStart\b', m.group('TimeWindowStart'), line)
        line = re.sub(r'\bTimeWindowEnd\b', m.group('TimeWindowEnd'), line)
        print(line.rstrip())

Output:

ReportDate=, TimeWindowStart=, TimeWindowEnd=

03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
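
One thing to keep in mind with this approach: the first argument of re.sub is treated as a pattern, so passing a matched value straight back in works here only because these dates and times contain no regex metacharacters. A hedged sketch of a safer variant using re.escape follows; the header string and the value containing a dot are invented purely for illustration.

```python
import re

header = "Version=1.0, Build=1x0"
value = "1.0"  # made-up value; "." is a regex metacharacter

# Unescaped: "." matches any character, so "1x0" is removed as well
unescaped = re.sub(value, "", header)
print(unescaped)

# Escaped: only the literal text "1.0" is removed
escaped = re.sub(re.escape(value), "", header)
print(escaped)
```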

Here is my code:

import re

s = """ReportDate=03/24/2019, TimeWindowStart=18:00:00, TimeWindowEnd=20:59:59

Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000"""

datereg = r'(\d{2}/\d{2}/\d{4})'
timereg = r'(\d{2}:\d{2}:\d{2})'

dates = re.findall(datereg, s)
times = re.findall(timereg, s)

# replacing one thing at a time
result = re.sub(r'\bDate\b', dates[0],
            re.sub(r'\bTimeWindowEnd\b,', times[1] + ',',
                re.sub(r'\bTimeWindowStart\b,', times[0] + ',',
                    re.sub(timereg, '', 
                        re.sub(datereg, '', s)))))

print(result)

Output:

ReportDate=, TimeWindowStart=, TimeWindowEnd=

03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000

Here is a clear solution:

import re

input_str = """
ReportDate=03/24/2019, TimeWindowStart=18:00:00, TimeWindowEnd=20:59:59
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
"""

# Divide input string into two parts: header, body
header = input_str.split('\n')[1]
body = '\n'.join(input_str.split('\n')[2:])

# Find elements to be replaced
# Raw strings keep \d from being treated as a string escape sequence
ri = re.findall(r'\d{2}/\d{2}/\d{4}', header)
ri.extend(re.findall(r'\d{2}:\d{2}:\d{2}', header))

# Replace elements
new_header = header.replace(ri[0],'')\
                   .replace(ri[1],'')\
                   .replace(ri[2],'')

new_body = body.replace('Date',ri[0])\
               .replace('TimeWindowStart',ri[1])\
               .replace('TimeWindowEnd',ri[2])

# Construct the result string
full_string = new_header + '\n\n' + new_body

Just find the items to be replaced with a regex and perform an ordinary string replace. I think this works well as long as you have only a few elements.
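
If the number of placeholders grows, chaining one replace call per name gets unwieldy. One possible alternative, sketched here under the assumption that the values are held in a dict (the name values is illustrative, not taken from the answer above), is a single re.sub call with a function as the replacement:

```python
import re

# Hypothetical mapping from placeholder name to the value extracted earlier
values = {
    "Date": "03/24/2019",
    "TimeWindowStart": "18:00:00",
    "TimeWindowEnd": "20:59:59",
}

# One alternation of all placeholder names, anchored at word boundaries
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, values)) + r')\b')

body = "Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000"
# The callback looks up each matched name in the dict
new_body = pattern.sub(lambda m: values[m.group(1)], body)
print(new_body)
```

Adding another placeholder then only means adding one entry to the dict.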


 