How to compare a unicode string containing entity refs with a non-unicode string

Question

I am evaluating hundreds of thousands of HTML files, looking for particular parts of each file. There can be small variations in the way the files were created.

For example, in one file I can have a section heading (after converting it to upper case, then splitting and rejoining the text to get rid of possibly inconsistent whitespace):

u'KEY1A\x97RISKFACTORS'

In another file I could have:

'KEY1ARISKFACTORS'

I am trying to create a dictionary of possible responses, and I want to compare these two and conclude that they are equal. But every substitution I run on the first string to remove the '\x97' does not seem to work.

There are a fair number of key variations with various representations of entities, so I would really like to create a dictionary more or less automatically, so that I have something like:

key_dict = {u'KEY1A\x97RISKFACTORS': 'KEY1ARISKFACTORS', 'KEY1ARISKFACTORS': 'KEY1ARISKFACTORS', ...}

I am assuming that, since when I run

S1='A'
S2=u'A'
S1==S2

I get

True

I should be able to compare these once the HTML entities are handled.

What I specifically tried to do is

new_string=u'KEY1A\x97RISKFACTORS'.replace('|','')

I got an error.

Sorry, I have been at this since last night. SLott pointed out something and I see I used the wrong label. I hope this makes more sense.

Answer 1

You are correct that if S1 = 'A' and S2 = u'A', then S1 == S2. Rather than assuming this, though, you can run a simple test:

key_dict= {u'A':'Value1',
        'A':'Value2'}

print key_dict
print u'A' == 'A'

This outputs:

{u'A': 'Value2'}
True

With that resolved, let's look at:

new_string=u'KEY1A\x97DEMOGRAPHICRESPONSES'.replace('|','')

There's a problem here: \x97 is the value you're trying to replace in the target string. However, your search string is '|', which is hex value 0x7C (in both ASCII and Unicode) and clearly not the value you need to replace. Even if the target and search string were both ASCII or both Unicode, you still would not find the '\x97'. The second problem is that you are searching for a non-unicode string in a unicode string. The easiest solution, and the one that makes the most sense, is simply to search for u'\x97':

print u'KEY1A\x97DEMOGRAPHICRESPONSES'
print u'KEY1A\x97DEMOGRAPHICRESPONSES'.replace(u'\x97', u'')

Outputs:

KEY1A\x97DEMOGRAPHICRESPONSES
KEY1ADEMOGRAPHICRESPONSES
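
Since the question asks for a dictionary built more or less automatically, the single replace can be generalized. The following is only a minimal sketch, assuming that every heading variant differs solely by characters outside A-Z and 0-9 (stray entities, dashes, whitespace). The helper name `normalize_heading` and the third sample heading are hypothetical and not taken from the question.

```python
import re

# Assumption: anything that is not an ASCII capital letter or digit is noise.
_NOISE = re.compile(u'[^A-Z0-9]')


def normalize_heading(text):
    """Upper-case a heading and strip every character except A-Z and 0-9."""
    return _NOISE.sub(u'', text.upper())


# Sample variants; the last one is made up for illustration.
headings = [
    u'KEY1A\x97RISKFACTORS',
    u'KEY1ARISKFACTORS',
    u'Key 1A - Risk Factors',
]

# Map each raw variant to its canonical form.
key_dict = dict((h, normalize_heading(h)) for h in headings)
```

With this mapping, two raw headings can be treated as equal whenever `key_dict[a] == key_dict[b]`. If some headings legitimately contain non-ASCII letters, the pattern would need to be widened accordingly.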

Answer 2

Why not the obvious .replace(u'\x97', '')? Where does the idea of that '|' come from?

>>> s = u'KEY1A\x97DEMOGRAPHICRESPONSES'
>>> s.replace(u'\x97', '')
u'KEY1ADEMOGRAPHICRESPONSES'
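
As a side note on where the `\x97` may come from: byte 0x97 is the em dash (U+2014) in Windows-1252, so a numeric entity such as `&#151;` often ends up as the code point U+0097 after conversion. The sketch below assumes that is what happened in these files, which the question does not confirm; it re-interprets the string as Windows-1252 and then removes the dash.

```python
raw = u'KEY1A\x97RISKFACTORS'

# Assumption: U+0097 here really stands for the Windows-1252 byte 0x97.
# latin-1 maps code points 0-255 straight to bytes, so this round trip
# re-reads that byte using the Windows-1252 table.
repaired = raw.encode('latin-1').decode('cp1252')

# The heading now contains a real em dash (U+2014), which can be removed.
cleaned = repaired.replace(u'\u2014', u'')
```

The round trip only works when every character in the string is below U+0100 and maps to a byte that Windows-1252 defines; otherwise `encode` or `decode` raises an error.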
