
How to capture Hebrew with regex in Java?

I'm trying to capture a section of Hebrew text (the source is comments on a news site) using the following regex:

[\u0590-\u05FF \\p{Graph} \\s]+

It works for most comments, but some comments are missed.

I've tried to debug this, and it seems there's a Hebrew letter that doesn't match the pattern.

When I extract this letter and print its integer value, it looks correct, but the regex still doesn't catch it...

Any ideas?
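One way to narrow this down is to test every code point of a missed comment against the pattern individually and print the ones that fail. The following is only a sketch: the class name, the helper `unmatched`, and the sample string are made up for illustration, and the no-break space and right-to-left mark in the sample are merely examples of characters that often appear in web text, not a confirmed cause of the problem described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class CodePointDump {
    // The pattern from the question, written as a Java string literal.
    static final Pattern QUESTION_PATTERN =
            Pattern.compile("[\\u0590-\\u05FF \\p{Graph} \\s]+");

    /** Returns one entry per code point that the pattern does not match on its own. */
    static List<String> unmatched(String text) {
        List<String> report = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (!QUESTION_PATTERN.matcher(ch).matches()) {
                // Hex value plus the Unicode name makes look-alike characters easy to spot.
                report.add(String.format("U+%04X %s", cp, Character.getName(cp)));
            }
        });
        return report;
    }

    public static void main(String[] args) {
        // Hypothetical sample: a Hebrew word, a no-break space, a right-to-left mark.
        String sample = "\u05E9\u05DC\u05D5\u05DD\u00A0\u200F";
        unmatched(sample).forEach(System.out::println);
    }
}
```

Running the helper on a comment that the regex misses should show exactly which characters fall outside the character class.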

It would be more semantically correct to use \\p{InHebrew} instead of \u0590-\u05FF.
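As a rough illustration, the sketch below compares the named block with the Hebrew script property. `\p{InHebrew}` selects the Hebrew block (U+0590 to U+05FF), so it accepts the same characters as the explicit range. `\p{IsHebrew}` (Java 7 and later) selects by script and therefore also accepts Hebrew presentation forms such as U+FB2A, which live in a different block. Whether such characters occur in the comments from the question is an assumption, and the class and method names are illustrative only.

```java
import java.util.regex.Pattern;

class HebrewBlockDemo {
    // Named Unicode block: same coverage as the explicit range in the question.
    static final Pattern BY_BLOCK = Pattern.compile("\\p{InHebrew}+");
    // Unicode script: also covers Hebrew letters encoded outside the Hebrew block.
    static final Pattern BY_SCRIPT = Pattern.compile("\\p{IsHebrew}+");

    static boolean inBlock(String s) {
        return BY_BLOCK.matcher(s).matches();
    }

    static boolean inScript(String s) {
        return BY_SCRIPT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String shalom = "\u05E9\u05DC\u05D5\u05DD"; // four letters from the Hebrew block
        String shinWithDot = "\uFB2A";              // presentation form, outside the block

        System.out.println("block  matches shalom:      " + inBlock(shalom));
        System.out.println("script matches shalom:      " + inScript(shalom));
        System.out.println("block  matches shinWithDot: " + inBlock(shinWithDot));
        System.out.println("script matches shinWithDot: " + inScript(shinWithDot));
    }
}
```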

You also need to match punctuation, digits (at least the commonly used ones), and the different kinds of spaces. I don't know what \\p{Graph} is, or whether there are any Hebrew-specific punctuation symbols, but it seems you missed some of these.
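For reference, Java's `\p{Graph}` is by default a POSIX class limited to US-ASCII letters, digits, and punctuation, and the default `\s` covers only ASCII whitespace. Hebrew-specific punctuation such as maqaf (U+05BE) and geresh (U+05F3) sits inside the Hebrew block, so the block already covers it. A minimal sketch of a broader pattern, assuming that Unicode punctuation, numbers, and separators are all acceptable in a comment, could use general categories instead. The class name and sample text are hypothetical.

```java
import java.util.regex.Pattern;

class HebrewCommentPattern {
    // Hebrew block, plus Unicode punctuation (P), numbers (N), separators (Z),
    // plus ASCII whitespace such as tabs and newlines (\s).
    // Symbols such as '$' or '+' belong to category S and are NOT included here.
    static final Pattern COMMENT =
            Pattern.compile("[\\p{InHebrew}\\p{P}\\p{N}\\p{Z}\\s]+");

    static boolean isHebrewComment(String s) {
        return COMMENT.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Hypothetical comment: Hebrew words, a comma, a no-break space, '!', digits.
        String comment = "\u05E9\u05DC\u05D5\u05DD,\u00A0\u05E2\u05D5\u05DC\u05DD! 123";
        System.out.println(isHebrewComment(comment));
        System.out.println(isHebrewComment("hello"));
    }
}
```

If Latin letters should be accepted as well, as the ASCII part of `\p{Graph}` in the original pattern allows, `\p{L}` or `a-zA-Z` would need to be added to the class.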

Note: technical posts on this site follow the CC BY-SA 4.0 license. If you reproduce this post, please cite this site's URL or the original source.