How to get all words of a string in C#?

I have a paragraph in a single string and I'd like to get all the words in that paragraph.

My problem is that I don't want words to keep trailing punctuation marks such as (',', '.', ''', '"', ';', ':', '!', '?') or whitespace characters such as \n and \t.

I also don't want suffixes such as 's and 'm: for example, world's should return only world.

In the example he said. "My dog's bone, toy, are missing!"

the list should be: he said my dog bone toy are missing

Expanding on Shan's answer, I would consider something like this as a starting point:

MatchCollection matches = Regex.Matches(input, @"\b[\w']*\b");

Why include the ' character? Because this prevents words like "we're" from being split into two words. After capturing the whole word, you can strip the suffix yourself; if the word had already been split, you could not tell that "re" is not a word in its own right and should be ignored.

So:

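using System.Linq;
using System.Text.RegularExpressions;

static string[] GetWords(string input)
{
    // Match runs of word characters and apostrophes so contractions stay in one piece.
    MatchCollection matches = Regex.Matches(input, @"\b[\w']*\b");

    // The pattern can also match the empty string at word boundaries, so filter those out.
    var words = from m in matches.Cast<Match>()
                where !string.IsNullOrEmpty(m.Value)
                select TrimSuffix(m.Value);

    return words.ToArray();
}

static string TrimSuffix(string word)
{
    // Cut the word at the first apostrophe: "dog's" -> "dog", "What're" -> "What".
    int apostropheLocation = word.IndexOf('\'');
    if (apostropheLocation != -1)
    {
        word = word.Substring(0, apostropheLocation);
    }

    return word;
}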

Example input:

he said. "My dog's bone, toy, are missing!" What're you doing tonight, by the way?

Example output:

[he, said, My, dog, bone, toy, are, missing, What, you, doing, tonight, by, the, way]

One limitation of this approach is that it will not handle acronyms written with periods well; e.g., "Y.M.C.A." would be treated as four words. I think that could also be handled by including . as a character to match in a word and then stripping it out afterwards if it is a full stop (i.e., by checking that it is the only period in the word as well as the last character).
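
The period-handling idea above could be sketched roughly as follows. This is only an illustration, not the answerer's code: the class name AcronymAwareWords and the pattern \b[\w'.]+ are assumptions. The trailing \b is dropped so that a final period stays in the match and can be inspected afterwards.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical variant of GetWords in which '.' may appear inside a word.
static class AcronymAwareWords
{
    public static string[] GetWords(string input)
    {
        // No trailing \b here: a final '.' must stay in the match so TrimSuffix can judge it.
        return Regex.Matches(input, @"\b[\w'.]+")
                    .Cast<Match>()
                    .Select(m => TrimSuffix(m.Value))
                    .Where(w => w.Length > 0)
                    .ToArray();
    }

    static string TrimSuffix(string word)
    {
        // Drop an apostrophe suffix such as 's or 're.
        int apostrophe = word.IndexOf('\'');
        if (apostrophe != -1)
            word = word.Substring(0, apostrophe);

        // Treat a period as a full stop only if it is the sole period and the last character.
        if (word.EndsWith(".", StringComparison.Ordinal) && word.IndexOf('.') == word.Length - 1)
            word = word.Substring(0, word.Length - 1);

        return word;
    }

    static void Main()
    {
        string input = "He joined the Y.M.C.A. last year. My dog's bone is missing!";
        Console.WriteLine(string.Join(", ", GetWords(input)));
    }
}
```

With this input, "Y.M.C.A." survives as one word while "year." loses its full stop. Two gaps remain under this rule: an ellipsis such as "end..." keeps its periods, and a single-period abbreviation such as "etc." loses its period.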

Hope this is helpful for you:

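        // "'s" must come before "'": Split tries the separators in array order at each
        // position, so a leading "'" would match first and leave a stray "s" behind.
        string[] separators = new string[] {",", ".", "!", "'s", "'", " "};
        string text = "My dog's bone, toy, are missing!";

        foreach (string word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
            Console.WriteLine(word);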

See "Regex word boundary expressions" and "What is the most efficient way to count all of the words in a richtextbox?". The moral of the story is that there are many ways to approach the problem, but regular expressions are probably the simplest.

Split on whitespace, then trim anything that isn't a letter from the resulting strings.
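
A minimal sketch of that suggestion, assuming "letter" means char.IsLetter. The names SplitAndTrim and Clean are hypothetical. The cut at the first apostrophe is an extra step borrowed from the TrimSuffix idea in the earlier answer, not something this answer spells out.

```csharp
using System;
using System.Linq;

// Hypothetical helper: split on whitespace, then trim non-letters from both ends.
static class SplitAndTrim
{
    public static string[] GetWords(string input)
    {
        return input
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries) // null separator = any whitespace
            .Select(Clean)
            .Where(w => w.Length > 0)
            .ToArray();
    }

    static string Clean(string token)
    {
        // Move inwards from both ends until a letter is found.
        int start = 0;
        int end = token.Length - 1;
        while (start <= end && !char.IsLetter(token[start])) start++;
        while (end >= start && !char.IsLetter(token[end])) end--;
        string word = token.Substring(start, end - start + 1);

        // Extra step (assumption): drop suffixes such as 's by cutting at the first apostrophe.
        int apostrophe = word.IndexOf('\'');
        return apostrophe == -1 ? word : word.Substring(0, apostrophe);
    }

    static void Main()
    {
        string input = "he said. \"My dog's bone, toy, are missing!\"";
        Console.WriteLine(string.Join(" ", GetWords(input)));
    }
}
```

For the question's example this prints he said My dog bone toy are missing. Typographic apostrophes (’) are not handled, and tokens made only of digits are dropped because digits are not letters.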

Here's a looping replace method... not fast, but a way to solve it...

string result = "string to cut ' stuff. ! out of";

".',!@".ToCharArray().ToList().ForEach(a => result = result.Replace(a.ToString(),""));

This assumes you want the result written back to the original string variable rather than to a new string or a list.
