
Regular Expression to find separate words?

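This is a quick one for you RegEx wizards. I need a regular expression that will find groups of words. Any group of words. For instance, I'd like it to find the first two words in any sentence.

Ex: "Hi there, how are you?" - Return would be "Hi there"

Ex: "How are you doing?" - Return would be "How are"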

Try this:

^\w+\s+\w+

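Explanation: anchored at the start of the string (^), one or more word characters, then one or more whitespace characters, then one or more word characters.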
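As a quick check, here is a minimal Perl sketch that applies the pattern to a few sample sentences. The helper name `first_two_words` is hypothetical and not part of the answer above. Note that the pattern requires whitespace directly between the two words, so a sentence that opens with "Hi, there" does not match at all.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the leading two-word run, or undef if the
# sentence does not start with word, whitespace, word.
sub first_two_words {
    my ($sentence) = @_;
    return $sentence =~ /^(\w+\s+\w+)/ ? $1 : undef;
}

for my $s ('Hi there, how are you?', 'How are you doing?', 'Hi, there!') {
    my $pair = first_two_words($s);
    print "$s => ", defined $pair ? $pair : '(no match)', "\n";
}
```

If punctuation between the first two words should be tolerated, the separator class has to be widened beyond `\s+`.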

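On using regular expressions to parse language: regular expressions are the more natural tool for the job. After collecting the words, use a dictionary to check whether they are actually words in a particular language.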
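The dictionary step could look roughly like the sketch below. It assumes a tiny in-memory word list in place of a real dictionary, and the helper name `known_words` is made up for illustration.

```perl
use strict;
use warnings;

# Assumption: a tiny in-memory word list stands in for a real dictionary.
my %dictionary = map { $_ => 1 } qw(the cat sat on mat);

# Hypothetical helper: keep only the candidates found in the dictionary,
# compared case-insensitively.
sub known_words {
    my ($dict, @candidates) = @_;
    return grep { exists $dict->{ lc $_ } } @candidates;
}

# Collect candidate words with a simple pattern, then filter them.
my @candidates = 'The cat sat on teh mat' =~ /(\w+)/g;
my @known      = known_words(\%dictionary, @candidates);
print "@known\n";
```

The regex only proposes candidates; the lookup decides which of them count as words, so the misspelled "teh" is dropped.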

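The premise is to define a regular expression that will split out 99.9% of possible words, "word" being the key definition.

I assume C# will use a PCRE based on Perl 5.8.
Here is my ASCII definition of how to split out a word (expanded):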

regex = '[\\s[:punct:]]* (\\w (?: \\w | [[:punct:]](?=[\\w[:punct:]]) )* )'

And Unicode (more must be added or subtracted to suit specific encodings):

regex = '[\\s\\pP]* ([\\pL\\pN_-] (?: [\\pL\\pN_-] | \\pP(?=[\\pL\\pN\\pP_-]) )* )'

To find all the words, put the regex string into a regex (I don't know C#):

@matches = $text =~ /$regex/xg;

/xg are the extended and global modifiers. Note that the regex string contains only capture group 1, so the text in between is not captured.
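To make the effect of the modifiers concrete, here is a small sketch that runs the ASCII pattern over a made-up sample sentence. The sample text is an assumption for illustration; the pattern is the ASCII definition given above.

```perl
use strict;
use warnings;

# ASCII word pattern from the text; /x makes the spaces inside it insignificant.
my $regex = '[\s[:punct:]]* (\w (?: \w | [[:punct:]](?=[\w[:punct:]]) )* )';

# Made-up sample sentence with quotes, dashes and apostrophes.
my $text = q{Well, it's "done" -- isn't it?};

# In list context, /g returns what capture group 1 matched on each pass,
# so the skipped whitespace and punctuation never appear in @matches.
my @matches = $text =~ /$regex/xg;
print join('|', @matches), "\n";
```

Punctuation inside a word, such as the apostrophe in "it's", is kept only when another word or punctuation character follows it, which is what the lookahead enforces.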

To find only the first two:

@matches = $text =~ /(?:$regex)(?:$regex)/x;

A Perl sample is below. Anyway, play around with it. Cheers!

use strict;
use warnings;
use utf8;  # assumes this file is saved as UTF-8; needed for the non-ASCII literal (é) below

binmode (STDOUT,':utf8');

# Unicode
my $regex = qr/ [\s\pP]* ([\pL\pN_-] (?: [\pL\pN_-] | \pP(?=[\pL\pN\pP_-]) )* ) /x;

# Ascii
# my $regex = qr/ [\s[:punct:]]* (\w (?: \w | [[:punct:]](?=[\w[:punct:]]) )* ) /x;


my $text = q(
  I confirm that sufficient information and detail have been
  reported in this technical report, that it's "scientifically" sound,
  and that appropriate conclusion's have been included
);
print "\n**\n$text\n"; 

my @matches = $text =~ /$regex/g;
print "\nTotal ".scalar(@matches)." words\n",'-'x20,"\n";
for (@matches) {
    print "$_\n";
}

# =======================================

my $junk = q(
Hi, there, A écafé and Horse d'oeuvre 
hasn't? 'n? '? a-b? -'a-? 
);
print "\n\n**\n$junk\n"; 

# First 2 words
@matches = $junk =~ /(?:$regex)(?:$regex)/;
print "\nFirst 2 words\n",'-'x20,"\n";
for (@matches) {
    print "$_\n";
}

# All words
@matches = $junk =~ /$regex/g;
print "\nTotal ".scalar(@matches)." words\n",'-'x20,"\n";
for (@matches) {
    print "$_\n";
}

Output:
**

I confirm that sufficient information and detail have been
reported in this technical report, that it's "scientifically" sound,
and that appropriate conclusion's have been included


Total 25 words
--------------------
I
confirm
that
sufficient
information
and
detail
have
been
reported
in
this
technical
report
that
it's
scientifically
sound
and
that
appropriate
conclusion's
have
been
included


**

Hi, there, A écafé and Horse d'oeuvre
hasn't? 'n? '? a-b? -'a-?

First 2 words
--------------------
Hi
there

Total 11 words
--------------------
Hi
there
A
écafé
and
Horse
d'oeuvre
hasn't
n
a-b
a-

@Rubens Farias

Per my comment, here's the code I used:

public int startAt = 0;

private void btnGrabWordPairs_Click(object sender, EventArgs e)
{
    // Start at a word boundary, match one or more word chars, one or more
    // whitespace chars, one or more word chars, and end at a word boundary.
    Regex regex = new Regex(@"\b\w+\s+\w+\b");

    if (startAt <= txtTest.Text.Length)
    {
        Match match = regex.Match(txtTest.Text, startAt);
        if (match.Success)
        {
            MessageBox.Show(match.Value);
            // Resume after this match. The match may begin past startAt,
            // so use its index plus its length rather than the length alone.
            startAt = match.Index + match.Length;
        }
    }
}

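Each time the button is clicked, it grabs a pair of words nicely, working through the text in the txtTest TextBox and finding pairs in sequence until it reaches the end of the string.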
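For comparison with the Perl code earlier in the thread, the same stepping behaviour can be sketched in Perl, where a scalar-context /g match resumes from the end of the previous match and so plays the role of the startAt counter. The helper name `word_pairs` and the sample text are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: collect successive, non-overlapping word pairs.
# In scalar context, /g resumes from where the previous match ended.
sub word_pairs {
    my ($text) = @_;
    my @pairs;
    while ($text =~ /\b(\w+\s+\w+)\b/g) {
        push @pairs, $1;
    }
    return @pairs;
}

my @pairs = word_pairs('one two three four five');
print "$_\n" for @pairs;
```

With an odd number of words, the last word has no partner and is left out.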

// @sln: Thank you very much for the detailed answer!


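Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you republish, please credit this site's URL or the original source. For any questions, contact: yoyou2525@163.com.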
