
How to compare unicode strings with entity ref to non-unicode string

I am evaluating hundreds of thousands of HTML files, looking for particular parts of each file. There can be small variations in the way the files were created.

For example, in one file I can have a section heading like this (after I converted it to upper case, then split and re-joined the text to get rid of possibly inconsistent white space):

u'KEY1A\x97RISKFACTORS'

In another file I could have:

'KEY1ARISKFACTORS'

I am trying to create a dictionary of possible responses, and I want to compare these two and conclude that they are equal. But every substitution I run on the first string to remove the '\x97' does not seem to work.

There are a fair number of variations of keys with various representations of entities, so I would really like to create a dictionary more or less automatically, so that I have something like:

key_dict = {u'KEY1A\x97RISKFACTORS': 'KEY1ARISKFACTORS', 'KEY1ARISKFACTORS': 'KEY1ARISKFACTORS', . . .}

I am assuming that, since when I run

S1='A'
S2=u'A'
S1==S2

I get

True

I should be able to compare these once the HTML entities are handled.

What I specifically tried to do is

new_string=u'KEY1A\x97RISKFACTORS'.replace('|','')

I got an error.

Sorry, I have been at this since last night. SLott pointed out something and I see I used the wrong label. I hope this makes more sense.

You are correct that if S1 = 'A' and S2 = u'A', then S1 == S2. Instead of assuming this, though, you can do a simple test:

key_dict= {u'A':'Value1',
        'A':'Value2'}

print key_dict
print u'A' == 'A'

This outputs:

{u'A': 'Value2'}
True

That resolved, let's look at:

new_string=u'KEY1A\x97DEMOGRAPHICRESPONSES'.replace('|','')

There's a problem here: \x97 is the value you're trying to replace in the target string. However, your search string is '|', which is hex value 0x7C (in both ASCII and Unicode) and clearly not the value you need to replace. Even if the target and search strings were both ASCII or both unicode, you'd still not find the '\x97'. The second problem is that you are trying to search for a non-unicode string in a unicode string. In Python 2 that mix only works while the byte string is pure ASCII; a non-ASCII byte string such as '\x97' raises a UnicodeDecodeError. The easiest solution, and the one that makes the most sense, is to simply search for u'\x97':

print u'KEY1A\x97DEMOGRAPHICRESPONSES'
print u'KEY1A\x97DEMOGRAPHICRESPONSES'.replace(u'\x97', u'')

Outputs:

KEY1A\x97DEMOGRAPHICRESPONSES
KEY1ADEMOGRAPHICRESPONSES
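Since the question mentions many key variants, replacing one character at a time may not scale. Below is a minimal sketch of building the lookup dictionary automatically. It assumes that every character outside A-Z and 0-9 can safely be discarded from a heading; `normalize_key` and the `variants` list are hypothetical names and sample data, not part of the original post.

```python
# -*- coding: utf-8 -*-
import re

# Assumption: only A-Z and 0-9 carry meaning in a heading key.
_NON_ALNUM = re.compile(u'[^A-Z0-9]')


def normalize_key(text):
    """Upper-case the text and drop every character that is not A-Z or 0-9."""
    return _NON_ALNUM.sub(u'', text.upper())


# Hypothetical sample headings as they might come out of different files.
variants = [
    u'KEY1A\x97RISKFACTORS',
    u'KEY1ARISKFACTORS',
    u'Key 1A \u2014 Risk Factors',
]

# Map each raw variant to its canonical form.
key_dict = dict((raw, normalize_key(raw)) for raw in variants)

for raw in variants:
    print(key_dict[raw])
```

Every variant maps to the same canonical key, so the comparison reduces to `normalize_key(a) == normalize_key(b)`. If some punctuation is significant in your headings, the character class would need to be widened accordingly.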

Why not the obvious .replace(u'\x97', '')? Where does the idea of that '|' come from?

>>> s = u'KEY1A\x97DEMOGRAPHICRESPONSES'
>>> s.replace(u'\x97', '')
u'KEY1ADEMOGRAPHICRESPONSES'
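As for where a u'\x97' might come from in the first place: byte 0x97 is the em dash in the Windows-1252 encoding, so one plausible explanation (an assumption about the asker's files, not something stated in the question) is that Windows-1252 text was decoded as Latin-1. The sketch below uses a hypothetical byte string `raw` to show the difference between the two decodings.

```python
# -*- coding: utf-8 -*-
# Hypothetical bytes as they might sit in one of the HTML files.
raw = b'KEY1A\x97RISKFACTORS'

# Latin-1 maps byte 0x97 to the control character U+0097.
as_latin1 = raw.decode('latin-1')

# Windows-1252 maps byte 0x97 to the em dash, U+2014.
as_cp1252 = raw.decode('cp1252')

print(repr(as_latin1))
print(repr(as_cp1252))
```

If this is what happened, decoding the files as cp1252 would turn the stray character into a real dash, which is easier to recognise and strip.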
