Parsing "–" with Erlang re

I've parsed an HTML page with mochiweb_html and want to parse the following text fragment:

0 – 1

Basically, I want to split the string on the spaces and the dash character and extract the numbers.

Now the string above is represented as the following Erlang list:

[48,32,226,128,147,32,49]

I'm trying to split it using the following regex:

{ok, P}=re:compile("\\xD2\\x80\\x93"), %% characters 226, 128, 147

re:split([48,32,226,128,147,32,49], P, [{return, list}])

But this doesn't work; it seems the \\xD2 character is the problem [if I remove it from the regex, the split occurs].

Could someone possibly explain:

  • what I'm doing wrong here?
  • why the '–' character seemingly requires three integers for representation [226, 128, 147]?

Thanks.

226,128,147 is E2,80,93 in hex, so the first escape in your pattern should be \xE2, not \xD2.

> {ok, P} = re:compile("\xE2\x80\x93").
...
> re:split([48,32,226,128,147,32,49], P, [{return, list}]).  
["0 "," 1"]
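As an alternative to spelling out the raw bytes, the re module can be asked to treat the subject as UTF-8 via its unicode option, so the pattern can name the code point U+2013 directly. The following is only a sketch under assumptions: the module name dash_split and the function scores/1 are made up for illustration, and it assumes the input is always two integers separated by an en dash with optional whitespace.

```erlang
-module(dash_split).
-export([scores/1]).

%% Hypothetical helper, not part of mochiweb or OTP.
%% Bytes is the byte list from the question, e.g. [48,32,226,128,147,32,49].
scores(Bytes) ->
    %% Pack the bytes into a binary so that re interprets them as UTF-8
    %% when the unicode option is given. (A plain list would be read as
    %% one code point per element instead.)
    Bin = list_to_binary(Bytes),
    %% Split on an en dash (U+2013) together with any surrounding whitespace.
    Parts = re:split(Bin, "\\s*\\x{2013}\\s*", [unicode, {return, list}]),
    %% Convert each remaining piece to an integer.
    [list_to_integer(Part) || Part <- Parts].
```

Calling dash_split:scores([48,32,226,128,147,32,49]) should give [0,1]. Folding the whitespace into the separator avoids having to trim "0 " and " 1" afterwards.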

As to your second question, about why a dash takes 3 bytes to encode: the dash in your input isn't an ASCII hyphen (hex 2D), but a Unicode en dash (hex 2013). Your code is receiving this in UTF-8 encoding, rather than the more obvious UCS-2 encoding. Hex 2013 comes out to hex E28093 in UTF-8 encoding.
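To make the mapping from code point 2013 to the bytes E2 80 93 concrete, here is a small sketch using the standard unicode module, plus the same result worked out by hand from the three-byte UTF-8 layout. The module name endash_utf8 and its function names are hypothetical, chosen only for this illustration.

```erlang
-module(endash_utf8).
-export([encode/0, by_hand/0, decode/0]).

%% Encode the en dash code point (U+2013) as UTF-8 and return the bytes.
encode() ->
    binary_to_list(unicode:characters_to_binary([16#2013], unicode, utf8)).

%% The same bytes derived manually. Code points in U+0800..U+FFFF use
%% the three-byte layout 1110xxxx 10xxxxxx 10xxxxxx.
by_hand() ->
    CP = 16#2013,
    [16#E0 bor (CP bsr 12),               % top 4 bits    -> 16#E2 (226)
     16#80 bor ((CP bsr 6) band 16#3F),   % middle 6 bits -> 16#80 (128)
     16#80 bor (CP band 16#3F)].          % low 6 bits    -> 16#93 (147)

%% Going the other way: the three bytes decode to a single code point.
decode() ->
    unicode:characters_to_list(<<226,128,147>>, utf8).
```

Both encode() and by_hand() return [226,128,147], which is the middle of the list in the question, and decode() returns [8211], i.e. [16#2013].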

If your next question is "why UTF-8", it's because it's far easier to retrofit an old system using 8-bit characters and null-terminated C-style strings to use Unicode via UTF-8 than to widen everything to UCS-2 or UCS-4. UTF-8 remains compatible with ASCII and C strings, so the conversion can be done piecemeal over the course of years, or decades if need be. Wide characters require a "Big Bang" one-time conversion effort, where everything has to move to the new system at once. UTF-8 is therefore far more popular on systems with legacies dating back to before the early 90s, when Unicode was created.
