
How do I parse only foreign characters from the text in an HTML file with regular expressions

I'm trying to parse HTML and automatically change the font of any foreign characters, and I'm having some issues. There are a few different hackish ways I'm trying to accomplish this, but none work really well, and I'm wondering if anyone has any ideas. Is there any easy way with Python to match all the foreign characters (specifically, Japanese Kanji/Hiragana/Katakana) with regular expressions? What I've been using is the complement of a set of non-foreign characters ([^A-Za-z0-9 <>'"=]), but this isn't working well, and I'm worried it will match things enclosed in <...>, which I don't want to do.

I wouldn't use just regular expressions for this. Down that path lies an angry Tony the Pony.

I'd use an HTML parser in conjunction with regular expressions, though. That way you can distinguish the markup from the non-markup.
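To illustrate what "distinguishing the markup from the non-markup" buys you, here is a minimal sketch. It assumes the standard library's html.parser rather than BeautifulSoup, purely to keep the example dependency-free; the class and function names (TextCollector, text_nodes) are made up for illustration. The parser hands over only the text between tags, so attribute values and tag names never reach the regular expression.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes only (hypothetical helper, not part of any library)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for content between tags, never for anything inside <...>.
        self.chunks.append(data)


def text_nodes(html):
    """Return the text nodes of an HTML string, in document order."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = '<p title="タイトル">Hello <b>世界</b> and かな</p>'
print(text_nodes(sample))
```

Note that the Japanese text in the title attribute is not among the collected chunks, which is exactly the case the question was worried about.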

Use BeautifulSoup to get the content that you need, then use a variation on this code to match your characters.

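import re

# Unicode block ranges; range() excludes its end point, hence the +1.
kataLetters = range(0x30A0, 0x30FF + 1)      # Katakana
hiraLetters = range(0x3040, 0x309F + 1)      # Hiragana
kataPunctuation = range(0x31F0, 0x31FF + 1)  # Katakana Phonetic Extensions

# In Python 3, range objects cannot be added directly, so convert to lists first.
myLetters = list(kataLetters) + list(kataPunctuation) + list(hiraLetters)

# chr() replaces Python 2's unichr().
myLetters = ''.join(chr(aLetter) for aLetter in myLetters)

# Matches one or more consecutive kana characters.
myRe = re.compile('[' + myLetters + ']+', re.UNICODE)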

Use the Unicode code charts to get the ranges for your characters.
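As one possible variation, the sketch below adds the CJK Unified Ideographs block (U+4E00 to U+9FFF) so that Kanji are matched too, and writes each block as a start-end range inside the character class instead of listing every character. The names JAPANESE_RANGES, japanese_re and wrap_japanese, and the CSS class "jp", are assumptions made for illustration. It should be applied only to text that an HTML parser has already separated from the markup.

```python
import re

# Inclusive code point ranges, taken from the Unicode block charts.
JAPANESE_RANGES = [
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x31F0, 0x31FF),  # Katakana Phonetic Extensions
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs (Kanji)
]

# Build a character class such as [\u3040-\u309f...] from the ranges.
char_class = ''.join('{}-{}'.format(chr(lo), chr(hi)) for lo, hi in JAPANESE_RANGES)
japanese_re = re.compile('[' + char_class + ']+')


def wrap_japanese(text, css_class='jp'):
    """Wrap each run of Japanese characters in a <span> (class name is hypothetical)."""
    return japanese_re.sub(
        lambda m: '<span class="{}">{}</span>'.format(css_class, m.group(0)),
        text,
    )


print(wrap_japanese('Hello 世界 and かな'))
```

The font can then be changed with a stylesheet rule for that class. The CJK block also covers Chinese text, so this cannot tell Kanji apart from Chinese characters.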


 