Split a string that contains English and Hebrew in C#

I have this string:

string str = "לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות ל moshecohen@gmail.com";

and I'm trying to split it the following way:

split[0] = "לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות ל "
split[1] = "moshecohen@gmail.com"

I'm using this split method:

string[] split =  Regex.Split(str, @"^[א-ת]+$");

I want to split between Hebrew and English words, but if the current word is in the same language as the previous one, it should be appended to the previous segment.

But I can't make it work. What am I doing wrong?

Thanks

Try this:

string[] split = Regex.Split(str, @"(?<=[א-ת]) (?=[A-Za-z])");

(?<=...) - lookbehind - asserts what immediately PRECEDES the current position

(?=...) - lookahead - asserts what immediately FOLLOWS the current position

This makes the delimiter the single space that sits between a Hebrew letter and a Latin letter, so the letters themselves are kept in the results.
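To see why the original attempt fails and the lookaround version works, here is a minimal sketch. The original pattern ^[א-ת]+$ is anchored at both ends, so it can only match a string made up entirely of Hebrew letters; it never matches here, and Regex.Split returns the whole input as one element. The class and method names below (LookaroundSplitDemo, SplitHebrewLatin) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundSplitDemo
{
    // Hypothetical helper: splits on a single space that has a Hebrew letter
    // right before it and a Latin letter right after it.
    public static string[] SplitHebrewLatin(string input)
    {
        return Regex.Split(input, @"(?<=[א-ת]) (?=[A-Za-z])");
    }

    public static void Main()
    {
        string str = "לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות ל moshecohen@gmail.com";

        // The anchored pattern finds no delimiter, so the input comes back unsplit.
        string[] unsplit = Regex.Split(str, @"^[א-ת]+$");
        Console.WriteLine(unsplit.Length); // 1

        // The lookarounds consume only the space between the two scripts.
        string[] parts = SplitHebrewLatin(str);
        Console.WriteLine(parts.Length);   // 2
        Console.WriteLine(parts[1]);       // moshecohen@gmail.com
    }
}
```

Note that the space itself is the delimiter, so it is dropped: the first element ends with the Hebrew letter rather than with a trailing space.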

Here's one approach:

[\p{IsHebrew}\P{L}]+|\P{IsHebrew}+

Use this pattern with Regex.Matches:

var matches = Regex.Matches(input, @"[\p{IsHebrew}\P{L}]+|\P{IsHebrew}+");

The pattern has two parts. It either matches:

  • [\p{IsHebrew}\P{L}]+ - a block containing Hebrew characters and non-letters,

OR

  • \P{IsHebrew}+ - a block of non-Hebrew characters (including non-Hebrew letters and other non-letter characters).

These patterns use Unicode named blocks such as \p{IsHebrew} and \p{IsBasicLatin}.

A similar option is [\p{IsHebrew}\P{L}]+|[\p{IsBasicLatin}\P{L}]+, where the second alternative specifically matches a block with Latin (English) letters.

Working examples: Regex Storm, C# example
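For reference, a small self-contained sketch of the Regex.Matches approach on the string from the question. The names BlockMatchDemo and SplitByScript are hypothetical, chosen only for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BlockMatchDemo
{
    // Hypothetical helper: returns consecutive runs of "Hebrew plus non-letters"
    // or "anything that is not Hebrew", in the order they appear.
    public static string[] SplitByScript(string input)
    {
        return Regex.Matches(input, @"[\p{IsHebrew}\P{L}]+|\P{IsHebrew}+")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string str = "לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות ל moshecohen@gmail.com";
        string[] blocks = SplitByScript(str);

        Console.WriteLine(blocks.Length); // 2
        Console.WriteLine(blocks[1]);     // moshecohen@gmail.com
    }
}
```

Because the first alternative also accepts non-letters, the space before the email address stays attached to the Hebrew block, which is the trailing space the question asked for.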

Why not approach this differently? The real question here is how to extract the email addresses from the text.

There are many posts that answer this question.

For example, this one:

// Requires: using System; using System.Text.RegularExpressions;
public static void emas(string text)
{
    const string MatchEmailPattern =
        @"(([\w-]+\.)+[\w-]+|([a-zA-Z]{1}|[\w-]{2,}))@"
        + @"((([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\."
        + @"([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])){1}|"
        + @"([a-zA-Z]+[\w-]+\.)+[a-zA-Z]{2,4})";
    Regex rx = new Regex(MatchEmailPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Find all email-like matches in the text.
    MatchCollection matches = rx.Matches(text);

    // Report the number of matches found.
    Console.WriteLine("Matches found: " + matches.Count);

    // Report each match.
    foreach (Match match in matches)
    {
        Console.WriteLine(match.Value);
    }
}

Judging from your input string, we can assume it consists of Hebrew text followed by an email address at the end of the string.

Then the regex could be (just an example):

\w*@gmail\.com$

You can test the regex here: https://regexr.com/
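A minimal sketch of that idea, assuming the email address is always the last whitespace-free token in the string. The pattern \S+@\S+$ is a deliberately loose stand-in rather than a real email validator, and the names TrailingEmailDemo and SplitTrailingEmail are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingEmailDemo
{
    // Hypothetical helper: returns { text before the email, email }.
    // If no trailing email-like token is found, returns the input and an empty string.
    public static string[] SplitTrailingEmail(string input)
    {
        // Assumption: the address is a run of non-whitespace containing '@' at the very end.
        Match m = Regex.Match(input, @"\S+@\S+$");
        if (!m.Success)
        {
            return new[] { input, string.Empty };
        }
        return new[] { input.Substring(0, m.Index), m.Value };
    }

    public static void Main()
    {
        string str = "לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות ל moshecohen@gmail.com";
        string[] parts = SplitTrailingEmail(str);

        Console.WriteLine(parts[0]);
        Console.WriteLine(parts[1]); // moshecohen@gmail.com
    }
}
```

Cutting the string at Match.Index keeps everything before the address untouched, including the trailing space after the last Hebrew word.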

The pattern passed to Regex.Split matches the delimiter, and the delimiter isn't included in the results. It looks like you want to split between the last Hebrew and the first non-Hebrew character, e.g.:

Regex.Split(str,@"\p{IsHebrew} \P{IsHebrew}")

\p{} matches a character that belongs to a specific Unicode general category or named block, while \P{} matches any character that does not belong to it.

Unfortunately, this pattern treats the last Hebrew and the first non-Hebrew character as part of the delimiter, so both are dropped and the result is:

לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות   
oshecohen@gmail.com 

Capture groups can be used to include characters matched by the delimiter pattern in the results. Simply wrapping each side in a group, though, as in (\p{IsHebrew}) (\P{IsHebrew}), returns each captured group as a separate result:

לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות  
ל 
m 
oshecohen@gmail.com 

Vladi Pavelka's use of lookbehind and lookahead assertions fixes this. (?<=\p{IsHebrew}) (?=\P{IsHebrew}) matches only the space, so it returns the expected results:

Regex.Split(str,@"(?<=\p{IsHebrew}) (?=\P{IsHebrew})")

will return:

לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות ל 
moshecohen@gmail.com 

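Why not simply use \p{IsHebrew}?

Something like this:

 // Requires: using System.Linq; using System.Text.RegularExpressions;
 string str = "לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות ל moshecohen@gmail.com";
 string pattern = @"[\p{IsHebrew}]+";
 var hebrewMatchCollection = Regex.Matches(str, pattern);
 // Join the Hebrew runs with spaces; punctuation between runs (such as the comma) is dropped.
 string hebrewPart = string.Join(" ", hebrewMatchCollection.Cast<Match>().Select(m => m.Value));
 // Whatever follows the last Hebrew run is the non-Hebrew part; Trim removes its leading space.
 var englishPart = Regex.Split(str, pattern).Last().Trim();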
