Python multiple regular expression replace

I'm a Python newbie. I've been searching for days but have found only bits and pieces of what I need. I'm on Python 2.7 on Windows (I chose Python because it's multiplatform and the result can be portable on Windows).

I'd like to write a script that searches a folder for *.txt UTF-8 text files, loads the content (one file after another), changes non-ASCII characters to HTML entities, and then adds HTML tags at the start and end of each line. There are two variations of the tags: one for the head of the file and one for the tail of the file, and the head and tail are separated by an empty line. After that, all the results have to be written out to other text files, like *.htm. To illustrate:

unicode1.txt:

űnícődé text line1
űnícődé text line2
[empty line]
űnícődé text line3
űnícődé text line4

The result has to be in unicode1.htm:

<p class='aaa'>&#369;n&iacute;c&#337;d&eacute; text line1</p>
<p class='aaa'>&#369;n&iacute;c&#337;d&eacute; text line2</p>
[empty line]
<p class='bbb'>&#369;n&iacute;c&#337;d&eacute; text line3</p>
<p class='bbb'>&#369;n&iacute;c&#337;d&eacute; text line4</p>

I started to develop the core of my solution, but I got stuck. See the script versions below (for simplicity I chose to encode with xmlcharrefreplace).

V1:

import re, cgi, fileinput
file="_utf8.txt"
text=""
for line in fileinput.input(file, inplace=0):
  line=cgi.escape(line.decode('utf8'),1).encode('ascii', 'xmlcharrefreplace')
  line=re.sub(r"^", "<p>", line, 1)
  text=text+re.sub(r"$", "</p>", line, 1)
print text

It worked and gave a good result, but I don't think fileinput is a usable approach for this task.

V2:

import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
  line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
  line=re.sub(r"^", "<p>", line, 1)
  text=text+re.sub(r"$", "</p>", line, 1)
f.close()
print text

It messed up the result: the closing tag appears at the start of the line, replacing the first letter, etc.

V3 (tried the multiline flag):

import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
  line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
  line=re.sub(r"^", "<p>", line, 1, flags=re.M)
  text=text+re.sub(r"$", "</p>", line, 1, flags=re.M)
f.close()
print text

Same result.

V4 (tried 1 regex instead of 2):

import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
  line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
  text=text+re.sub(r"^(.*)$", r"<p>\1</p>", line, 1)
f.close()
print text

Same result. Please help.

Edit: I just checked the result file with a hex editor, and there is a 0x0D byte before each closing tag! Why?

Edit2: changed to a more logical approach:

text+=re.sub(r"^(.*)$", r"<p>\1</p>", line, 1)

Edit3: with a hex editor I saw the reason for the messed-up result: an extra CR (0x0D) byte before each CRLF. I tracked the CR problem down to what seems to cause it, the concatenation with +:

# -*- coding: utf-8 -*-
text=""
f=u"unicode text line1\r\n unicode text line2"
for line in f:
  text+=line
print text

This results in:

unicode text line1\r\r\n unicode text line2

Any idea how to fix this?

There's no need for regular expressions at all here; just do this:

with open('utf8.txt') as f:
    class_name = 'aaa'  # class used for the head of the file
    for line in f:
        if line == '\n':
            # the empty line separates head from tail: switch class
            class_name = 'bbb'
        else:
            # decode / convert line
            line = '<p class="{0}">{1}</p>\n'.format(class_name, line.rstrip())
        # write line to file
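
The `# decode / convert line` step above is left open. As a rough sketch only, it could be filled in as below. This is written for Python 3, where the entity table lives in `html.entities` (on Python 2.7 the module is `htmlentitydefs`), and `to_entities` is a hypothetical helper name, not something from the question. It uses a named entity where HTML 4 defines one and a numeric reference otherwise, which is the mix shown in the expected output; `xmlcharrefreplace` alone gives numeric references only. Note that it does not escape `&`, `<` or `>`.

```python
from html.entities import codepoint2name  # Python 2.7: htmlentitydefs


def to_entities(text):
    """Replace every non-ASCII character with an HTML entity (hypothetical helper)."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            parts.append(ch)  # plain ASCII passes through unchanged
        elif code in codepoint2name:
            parts.append("&%s;" % codepoint2name[code])  # named entity, e.g. &iacute;
        else:
            parts.append("&#%d;" % code)  # numeric fallback, e.g. &#369;
    return "".join(parts)


print(to_entities("űnícődé text line1"))
# For comparison, the codec error handler used in the question:
print("űnícődé".encode("ascii", "xmlcharrefreplace").decode("ascii"))
```

If numeric references everywhere are acceptable, the `xmlcharrefreplace` approach from the question is simpler and needs no helper.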

The results you are getting do not appear to be caused by the regular expressions, which look correct. The problem is most likely in the line where you do your encoding / converting. Print that line without adding the tags to see whether it is as expected.
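
One way to run that check is to look at the `repr()` of a line. The sketch below assumes the input file has Windows (CRLF) line endings and that each line therefore still ends in `\r\n` when it reaches the regex; `codecs.open` is documented to open files in binary mode, so it does no newline translation. Under that assumption, `$` matches just before the final `\n`, and the closing tag lands between the `\r` and the `\n`. That would match the 0x0D byte seen in the hex editor.

```python
import re

# Assumed input: one line as read from a CRLF file without newline translation
raw = "text line1\r\n"

# Same pattern as V4: '.' matches '\r' but not '\n', and '$' matches before the final '\n'
tagged = re.sub(r"^(.*)$", r"<p>\1</p>", raw, count=1)
print(repr(tagged))   # '<p>text line1\r</p>\n'

# Stripping the line ending first keeps the CR out of the paragraph
clean = raw.rstrip("\r\n")
print(repr("<p>%s</p>" % clean))   # '<p>text line1</p>'
```

A terminal treats the stray `\r` as "return to column 0", which is why the closing tag appears to overwrite the start of the line when the text is printed.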

#!/usr/bin/env python
import cgi
import fileinput
import os
import shutil
import sys

def textfiles(rootdir, extensions=('.txt',)):
    # walk rootdir recursively and yield every file with a matching extension
    for dirpath, dirs, files in os.walk(rootdir):
        for f in files:
            if f.lower().endswith(extensions):
                yield os.path.join(dirpath, f)

def htmlfiles(files):
    # copy each text file to a sibling .html file and yield the copy's path
    for f in files:
        root, _ = os.path.splitext(f)
        newf = root + '.html'
        shutil.copy2(f, newf)
        yield newf

# inplace=True redirects print output into the file currently being read
for line in fileinput.input(htmlfiles(textfiles(sys.argv[1])), inplace=True):
    if fileinput.isfirstline():
        klass = 'aaa'  # start head part
    line = cgi.escape(line.decode('utf-8').strip())
    line = line.encode('ascii', 'xmlcharrefreplace')
    if not line:  # empty line
        klass = 'bbb'  # start tail part
        print(line)
    else:
        print('<p class="%s">%s</p>' % (klass, line))

Example

$ python txt2html.py c:\root\dir
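
The script above targets Python 2.7; `cgi.escape` was later removed from Python 3. For readers on Python 3, the per-line logic can be sketched as a pure function that is easy to test without touching files. The function name `to_paragraphs` and its parameters are made up for this illustration, and `html.escape` stands in for `cgi.escape`.

```python
import html


def to_paragraphs(text, head_class="aaa", tail_class="bbb"):
    """Wrap each non-empty line in <p>; switch class after the first empty line."""
    klass = head_class
    out = []
    for line in text.splitlines():  # splitlines() drops \n, \r\n and \r alike
        line = line.strip()
        if not line:
            klass = tail_class  # the empty line starts the tail part
            out.append("")
            continue
        escaped = html.escape(line, quote=True)
        # Non-ASCII characters become numeric references such as &#369;
        escaped = escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")
        out.append('<p class="%s">%s</p>' % (klass, escaped))
    return "\n".join(out) + "\n"


print(to_paragraphs("űní line1\r\n\r\nline3 <b>\r\n"), end="")
```

Because the line endings are removed before any tags are added, this version is not affected by whether the input uses LF or CRLF.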
