Python regex for Greek words

Question

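I want to create a Python script that uses regular expressions to filter the lines of a source text I provide that contain certain Greek words, and then writes those lines to 3 different files depending on the word encountered.

Here is my code so far: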

import regex

source=open('source.txt', 'r')
oti=open('results_oti.txt', 'w')
tis=open('results_tis.txt', 'w')
ton=open('results_ton.txt', 'w')

regex_oti='^.*\b(ότι|ό,τι)\b.*$'
regex_tis='^.*\b(της|τις)\b.*$'
regex_ton='^.*\b(τον|των)\b.*$'

for line in source.readlines():
    if regex.match(regex_oti, line):
        oti.write(line)
    if regex.match(regex_tis, line):
        tis.write(line)
    if regex.match(regex_ton, line):
        ton.write(line)
source.close()
oti.close()
tis.close()
ton.close()
quit()

The words I am checking for are ότι | ό,τι | της | τις | τον | των.

The problem is that the 3 regular expressions (regex_oti, regex_tis, regex_ton) do not match anything, so the 3 text files I create are empty.

Maybe it is an encoding problem (Unicode)?

Answer 1

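You are trying to match encoded values (as bytes) against a regular expression that most likely won't match unless your Python source encoding exactly matches the encoding of the input file, and even then only if you are not using a multi-byte encoding such as UTF-8.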
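
To make the mismatch concrete, here is a minimal Python 3 sketch. The sample sentence and the choice of ISO-8859-7 as the "other" codec are assumptions for illustration only; your file may use a different encoding.

```python
import re

word = "της"
sentence = "άρθρο της εφημερίδας"  # made-up sample text

# The same text yields different byte sequences under different codecs.
utf8_bytes = word.encode("utf-8")        # two bytes per Greek letter
greek_bytes = word.encode("iso-8859-7")  # one byte per Greek letter

# A byte pattern built from one encoding only finds the word in data
# that was stored with that same encoding.
pattern = re.compile(re.escape(utf8_bytes))
found_same_codec = pattern.search(sentence.encode("utf-8")) is not None
found_other_codec = pattern.search(sentence.encode("iso-8859-7")) is not None

print(len(utf8_bytes), len(greek_bytes), found_same_codec, found_other_codec)
```

Decoding the input to text first removes this dependency on how the bytes happen to be laid out.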

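You need to decode the input file to Unicode values and use a Unicode regular expression. This means you need to know the codec used for the input file. It is easiest to let io.open() handle the decoding and encoding: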

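# -*- coding: utf-8 -*-
# Python 2 code (ur'...' is a syntax error in Python 3); save this script as UTF-8.
import io
import re

# re.UNICODE makes \b use Unicode word rules; without it, Python 2 treats
# only ASCII letters, digits and underscore as word characters.
regex_oti = re.compile(ur'^.*\b(ότι|ό,τι)\b.*$', re.UNICODE)
regex_tis = re.compile(ur'^.*\b(της|τις)\b.*$', re.UNICODE)
regex_ton = re.compile(ur'^.*\b(τον|των)\b.*$', re.UNICODE)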

with io.open('source.txt', 'r', encoding='utf8') as source, \
     io.open('results_oti.txt', 'w', encoding='utf8') as oti, \
     io.open('results_tis.txt', 'w', encoding='utf8') as tis, \
     io.open('results_ton.txt', 'w', encoding='utf8') as ton:

    for line in source:
        if regex_oti.match(line):
            oti.write(line)
        if regex_tis.match(line):
            tis.write(line)
        if regex_ton.match(line):
            ton.write(line)

Note the ur'...' raw unicode strings used to define the regular expression patterns; these are now Unicode patterns and match code points, not bytes.
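
As a rough illustration of both points (raw strings, and code points versus bytes), here is a small Python 3 sketch. In Python 3 every str pattern is already a Unicode pattern, so a plain r'...' prefix plays the role of ur'...'. The sample strings are made up.

```python
import re

# In a non-raw string, "\b" is a single backspace character (0x08);
# in a raw string it stays as backslash + "b", which re reads as a word boundary.
backspace = "\b"
boundary = r"\b"

# On str, "." consumes one code point; on UTF-8 bytes it consumes one byte.
chars = re.findall(r".", "ότι")
raw = re.findall(rb".", "ότι".encode("utf-8"))

# For str patterns, \b follows Unicode word rules, so it works around Greek letters.
has_word = re.search(r"\b(ότι|ό,τι)\b", "Είπε ότι θα έρθει") is not None
# A longer made-up string that merely starts with the same letters is rejected.
inside_longer_word = re.search(r"\bότι\b", "ότιδήποτε") is not None

print(len(chars), len(raw), has_word, inside_longer_word)
```

The first two assignments also suggest why the patterns in the question could not match even on decoded text: they are written without the r prefix, so each \b is a backspace character rather than a word boundary.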

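The io.open() calls make sure you read unicode, and when you write unicode values to the output files the data is automatically encoded to UTF-8. I picked UTF-8 for the input file as well, but you need to check what the correct codec is for that file and stick to it.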
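
One rough way to check a codec guess is simply to try decoding the file with it. The sketch below (Python 3) writes a throwaway file as ISO-8859-7, which is only an assumed example of a non-UTF-8 Greek encoding, and shows that reading it back as UTF-8 fails while the matching codec succeeds.

```python
import io
import os
import tempfile

text = "της πόλης\n"  # made-up sample content

# Hypothetical sample file, deliberately written as ISO-8859-7 instead of UTF-8.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="iso-8859-7") as f:
    f.write(text)

# Reading with the wrong codec raises an error for this data.
try:
    with io.open(path, "r", encoding="utf8") as f:
        f.read()
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

# Reading with the codec the file was written in round-trips the text.
with io.open(path, "r", encoding="iso-8859-7") as f:
    decoded = f.read()

os.remove(path)
print(wrong_codec_failed, decoded == text)
```

A clean decode does not prove the guess is right, since some single-byte codecs accept any byte sequence, so it is still worth looking at the decoded text.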

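I used a with statement here to have the files closed automatically, used source as an iterable (no need to read all lines into memory in one go), and pre-compiled the regular expressions.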
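
For readers on Python 3, the same approach might look roughly like the sketch below. It is not the code from this answer: the function name split_lines, the PATTERNS mapping, and the sample text are hypothetical, and it uses search() with a bare \b...\b pattern in place of match() with ^.*...$, which selects the same lines.

```python
import re
import tempfile
from contextlib import ExitStack
from pathlib import Path

# Python 3: str patterns are Unicode by default, so no ur'' prefix is needed.
PATTERNS = {
    "results_oti.txt": re.compile(r"\b(ότι|ό,τι)\b"),
    "results_tis.txt": re.compile(r"\b(της|τις)\b"),
    "results_ton.txt": re.compile(r"\b(τον|των)\b"),
}


def split_lines(source_path, out_dir, encoding="utf8"):
    """Copy each source line into every output file whose pattern it contains."""
    out_dir = Path(out_dir)
    with ExitStack() as stack:  # closes all files, like the chained with statement
        source = stack.enter_context(open(source_path, "r", encoding=encoding))
        outputs = {
            name: stack.enter_context(open(out_dir / name, "w", encoding=encoding))
            for name in PATTERNS
        }
        for line in source:  # iterate lazily, one line at a time
            for name, pattern in PATTERNS.items():
                if pattern.search(line):
                    outputs[name].write(line)


# Small demonstration on made-up sample text.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source.txt"
    src.write_text(
        "Είπε ότι θα έρθει\nΤο βιβλίο της Μαρίας\nΚαλημέρα\n", encoding="utf8"
    )
    split_lines(src, tmp)
    oti_lines = (Path(tmp) / "results_oti.txt").read_text(encoding="utf8").splitlines()
    tis_lines = (Path(tmp) / "results_tis.txt").read_text(encoding="utf8").splitlines()
    ton_lines = (Path(tmp) / "results_ton.txt").read_text(encoding="utf8").splitlines()

print(oti_lines, tis_lines, ton_lines)
```

The encoding parameter defaults to UTF-8 here only as an assumption; pass whichever codec your source file actually uses.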

Solution 1
1 accepted, 2013-11-13 21:38:06
