How to capture Hebrew with a regular expression in Java?

Question

I'm trying to capture a section of Hebrew text (the source is comments on a news site) using the following regex:

[\u0590-\u05FF \\p{Graph} \\s]+

It works for most comments, but some comments are missed.

I've tried to debug this, and it seems there's a Hebrew letter that doesn't match the pattern.

When I extract this letter and print its integer value, it seems to be correct, but the regex still doesn't match it...

Ideas?
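One way to narrow this down is to print, for each character of a missed comment, its code point, Unicode block and script, and whether the character class from the question accepts it. The sketch below is only an illustration: the class and method names are made up, and the sample code points (a presentation-form letter and a right-to-left mark) are guesses at the kind of character that might slip through, not something the question confirms.

```java
import java.util.regex.Pattern;

final class CodePointDump {
    // The character class from the question, matched against one character at a time.
    static final Pattern QUESTION_CLASS =
            Pattern.compile("[\\u0590-\\u05FF \\p{Graph} \\s]");

    // True if the single code point is accepted by the question's character class.
    static boolean matchesQuestionClass(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return QUESTION_CLASS.matcher(s).matches();
    }

    // Human-readable summary: code point, Unicode block, script, match result.
    static String describe(int codePoint) {
        return String.format("U+%04X block=%s script=%s matches=%b",
                codePoint,
                Character.UnicodeBlock.of(codePoint),
                Character.UnicodeScript.of(codePoint),
                matchesQuestionClass(codePoint));
    }

    public static void main(String[] args) {
        // Hypothetical sample: alef, shin with shin dot (presentation form),
        // right-to-left mark, no-break space.
        String sample = "\u05D0\uFB2A\u200F\u00A0";
        sample.codePoints().forEach(cp -> System.out.println(describe(cp)));
    }
}
```

Running this over a comment that fails to match should show which code point falls outside the class, even when the glyph looks like an ordinary Hebrew letter.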

Answer 1

It would be more semantically correct to use \\p{InHebrew} instead of \u0590-\u05FF.

You also need to match punctuation, digits (at least the commonly used ones) and different kinds of spaces. I don't know what \\p{Graph} covers or whether there are any Hebrew-specific punctuation symbols, but it seems you missed some of these.
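As a rough sketch of this suggestion: in Java, \\p{Graph}, \\p{Punct}, \\p{Digit} and \\s are ASCII-only by default, so non-ASCII spaces or punctuation are not covered unless Pattern.UNICODE_CHARACTER_CLASS (Java 7+) is set. The class name, method names and sample strings below are invented for illustration, and using \\p{IsHebrew} (the script) in the second pattern is one possible choice, not necessarily what the original poster needs.

```java
import java.util.regex.Pattern;

final class HebrewRegexSketch {
    // Hebrew block plus ASCII punctuation, ASCII digits and ASCII whitespace.
    static final Pattern ASCII_EXTRAS =
            Pattern.compile("[\\p{InHebrew}\\p{Punct}\\p{Digit}\\s]+");

    // Hebrew script plus Unicode-aware punctuation, digits and whitespace.
    static final Pattern UNICODE_EXTRAS =
            Pattern.compile("[\\p{IsHebrew}\\p{Punct}\\p{Digit}\\s]+",
                    Pattern.UNICODE_CHARACTER_CLASS);

    static boolean matchesAscii(String text) {
        return ASCII_EXTRAS.matcher(text).matches();
    }

    static boolean matchesUnicode(String text) {
        return UNICODE_EXTRAS.matcher(text).matches();
    }

    public static void main(String[] args) {
        // "shalom, olam 123!" in Hebrew letters with ASCII punctuation and digits.
        String plain = "\u05E9\u05DC\u05D5\u05DD, \u05E2\u05D5\u05DC\u05DD 123!";
        // Same two words separated by a no-break space (U+00A0).
        String withNbsp = "\u05E9\u05DC\u05D5\u05DD\u00A0\u05E2\u05D5\u05DC\u05DD";

        System.out.println(matchesAscii(plain));       // true
        System.out.println(matchesAscii(withNbsp));    // false: \s is ASCII-only here
        System.out.println(matchesUnicode(withNbsp));  // true: \s follows White_Space
    }
}
```

Note that with UNICODE_CHARACTER_CLASS, \\p{Punct} means Unicode punctuation, so ASCII symbols such as $ or + are no longer included and would have to be added explicitly if comments contain them.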

Answer 1 (1, 2012-01-24 13:00:02)
