How would you count occurrences of a string (actually a char) within a string?

I am doing something where I realised I wanted to count how many /s I could find in a string, and then it struck me that there were several ways to do it, but I couldn't decide which was the best (or easiest).

At the moment I'm going with something like:

string source = "/once/upon/a/time/";
int count = source.Length - source.Replace("/", "").Length;

But I don't like it at all. Any takers?

I don't really want to dig out RegEx for this, do I?

I know my string is going to have the term I'm searching for, so you can assume that...

Of course, for search strings where length > 1:

string haystack = "/once/upon/a/time";
string needle = "/";
int needleCount = ( haystack.Length - haystack.Replace(needle,"").Length ) / needle.Length;

If you're using .NET 3.5 you can do this in a one-liner with LINQ:

int count = source.Count(f => f == '/');

If you don't want to use LINQ you can do it with:

int count = source.Split('/').Length - 1;

You might be surprised to learn that your original technique seems to be about 30% faster than either of these! I've just done a quick benchmark with "/once/upon/a/time/" and the results are as follows:

Your original = 12s
source.Count = 19s
source.Split = 17s
foreach (from bobwienholt's answer) = 10s

(The times are for 50,000,000 iterations, so you're unlikely to notice much difference in the real world.)
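For anyone who wants to repeat this kind of comparison, here is a minimal sketch of a timing harness. It is not the code used for the numbers above; the class name CountBenchmark, the helper TimeIt and the iteration count are assumptions made for illustration, and absolute timings will vary with machine, runtime and build mode.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class CountBenchmark
{
    // Hypothetical helper: calls `counter` on `source` `iterations` times
    // and returns the elapsed wall-clock time in milliseconds.
    public static double TimeIt(Func<string, int> counter, string source, int iterations)
    {
        int sink = 0; // accumulate results so the calls are not trivially discarded
        var timer = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            sink += counter(source);
        timer.Stop();
        GC.KeepAlive(sink);
        return timer.Elapsed.TotalMilliseconds;
    }

    // The four techniques discussed above, each counting '/' in s.
    public static int ByReplace(string s) => s.Length - s.Replace("/", "").Length;
    public static int ByLinq(string s) => s.Count(c => c == '/');
    public static int BySplit(string s) => s.Split('/').Length - 1;

    public static int ByForeach(string s)
    {
        int count = 0;
        foreach (char c in s)
            if (c == '/') count++;
        return count;
    }

    public static void Main()
    {
        const string source = "/once/upon/a/time/";
        const int iterations = 1000000; // assumed value; the post used 50,000,000

        Console.WriteLine("Replace: " + TimeIt(ByReplace, source, iterations) + " ms");
        Console.WriteLine("LINQ:    " + TimeIt(ByLinq, source, iterations) + " ms");
        Console.WriteLine("Split:   " + TimeIt(BySplit, source, iterations) + " ms");
        Console.WriteLine("foreach: " + TimeIt(ByForeach, source, iterations) + " ms");
    }
}
```

Build in Release mode before trusting the output; Debug builds can change the ranking noticeably.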

string source = "/once/upon/a/time/";
int count = 0;
foreach (char c in source) 
  if (c == '/') count++;

This has to be faster than source.Replace() by itself.

int count = new Regex(Regex.Escape(needle)).Matches(haystack).Count;

If you want to be able to search for whole strings, and not just characters:

src.Select((c, i) => src.Substring(i))
    .Count(sub => sub.StartsWith(target))

Read as "for each character in the string, take the rest of the string starting from that character as a substring; count it if it starts with the target string."
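One consequence of testing every start position is that overlapping matches are all counted. The sketch below contrasts that with an IndexOf loop that skips past each match. The class and method names are made up for illustration, and the ordinal comparison is an assumption (the snippet above uses the default, culture-sensitive StartsWith).

```csharp
using System;
using System.Linq;

public static class OverlapDemo
{
    // Counts every start index at which target occurs, so overlapping
    // matches are included.
    public static int CountOverlapping(string src, string target)
    {
        return src.Select((c, i) => src.Substring(i))
                  .Count(sub => sub.StartsWith(target, StringComparison.Ordinal));
    }

    // Counts matches left to right, jumping past each match (no overlaps).
    public static int CountNonOverlapping(string src, string target)
    {
        if (string.IsNullOrEmpty(target)) return 0;

        int count = 0, n = 0;
        while ((n = src.IndexOf(target, n, StringComparison.Ordinal)) != -1)
        {
            n += target.Length;
            count++;
        }
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountOverlapping("aaaa", "aa"));    // 3 (start indexes 0, 1, 2)
        Console.WriteLine(CountNonOverlapping("aaaa", "aa")); // 2 (start indexes 0, 2)
    }
}
```

Note that the Substring-per-character version allocates a new string for every position, so it is mainly of interest for short inputs.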

I did some research and found that Richard Watson's solution is fastest in most cases. Here is a table with the results of every solution in the post (except those that use Regex, because it throws exceptions while parsing strings like "test{test"):

    Name          | Short/char |  Long/char | Short/short| Long/short |  Long/long |
    Inspite       |         134|        1853|          95|        1146|         671|
    LukeH_1       |         346|        4490|         N/A|         N/A|         N/A|
    LukeH_2       |         152|        1569|         197|        2425|        2171|
    Bobwienholt   |         230|        3269|         N/A|         N/A|         N/A|
    Richard Watson|          33|         298|         146|         737|         543|
    StefanosKargas|         N/A|         N/A|         681|       11884|       12486|

You can see that when counting occurrences of short substrings (1-5 characters) in short strings (10-50 characters), the original algorithm is preferred.

Also, for multi-character substrings you should use the following code (based on Richard Watson's solution):

int count = 0, n = 0;

if(substring != "")
{
    while ((n = source.IndexOf(substring, n, StringComparison.InvariantCulture)) != -1)
    {
        n += substring.Length;
        ++count;
    }
}

LINQ works on all collections, and since strings are just a collection of characters, how about this nice little one-liner:

var count = source.Count(c => c == '/');

Make sure you have using System.Linq; at the top of your code file, as .Count is an extension method from that namespace.

string source = "/once/upon/a/time/";
int count = 0;
int n = 0;

while ((n = source.IndexOf('/', n)) != -1)
{
   n++;
   count++;
}

On my computer it's about 2 seconds faster than the for-every-character solution for 50 million iterations.

2013 revision:

Change the string to a char[] and iterate through that. This cuts a further second or two off the total time for 50m iterations!

char[] testchars = source.ToCharArray();
foreach (char c in testchars)
{
     if (c == '/')
         count++;
}

This is quicker still:

char[] testchars = source.ToCharArray();
int length = testchars.Length;
for (int n = 0; n < length; n++)
{
    if (testchars[n] == '/')
        count++;
}

For good measure, iterating from the end of the array to 0 seems to be the fastest, by about 5%.

int length = testchars.Length;
for (int n = length-1; n >= 0; n--)
{
    if (testchars[n] == '/')
        count++;
}

I was wondering why this could be and was Googling around (I recall something about reverse iterating being quicker), and came upon this SO question, which annoyingly uses the string to char[] technique already. I think the reversal trick is new in this context, though.

What is the fastest way to iterate through individual characters in a string in C#?

These both only work for single-character search terms...

countOccurences("the", "the answer is the answer");

int countOccurences(string needle, string haystack)
{
    return (haystack.Length - haystack.Replace(needle,"").Length) / needle.Length;
}

may turn out to be better for longer needles...

But there has to be a more elegant way. :)

Edit:

source.Split('/').Length-1

In C#, a nice String SubString counter is this unexpectedly tricky fellow:

public static int CCount(String haystack, String needle)
{
    return haystack.Split(new[] { needle }, StringSplitOptions.None).Length - 1;
}

Regex.Matches(input, Regex.Escape("stringToMatch")).Count

private int CountWords(string text, string word) {
    int count = (text.Length - text.Replace(word, "").Length) / word.Length;
    return count;
}

Because the original solution was the fastest for chars, I suppose it will also be for strings. So here is my contribution.

For context: I was looking for words like 'failed' and 'succeeded' in a log file.
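Log files often mix cases ("failed", "FAILED"), which the Replace-based count treats as different words. As a hedged alternative, not the method above, here is a sketch that swaps in an IndexOf loop with StringComparison.OrdinalIgnoreCase. The class name, method name and sample log text are invented for illustration.

```csharp
using System;

public static class LogWordCount
{
    // Counts non-overlapping, case-insensitive occurrences of word in text.
    public static int CountWord(string text, string word)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(word)) return 0;

        int count = 0, pos = 0;
        while ((pos = text.IndexOf(word, pos, StringComparison.OrdinalIgnoreCase)) != -1)
        {
            pos += word.Length; // continue searching after this match
            count++;
        }
        return count;
    }

    public static void Main()
    {
        // Made-up sample log, for illustration only.
        string log = "Job 1 failed\nJob 2 succeeded\nJob 3 FAILED\n";
        Console.WriteLine(CountWord(log, "failed"));    // 2
        Console.WriteLine(CountWord(log, "succeeded")); // 1
    }
}
```

Unlike the Replace approach, this does not allocate a copy of the whole log, which may matter for large files.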

Gr, Ben

string s = "65 fght 6565 4665 hjk";
int count = 0;
foreach (Match m in Regex.Matches(s, "65"))
  count++;

public static int GetNumSubstringOccurrences(string text, string search)
{
    int num = 0;
    int pos = 0;

    if (!string.IsNullOrEmpty(text) && !string.IsNullOrEmpty(search))
    {
        while ((pos = text.IndexOf(search, pos)) > -1)
        {
            num ++;
            pos += search.Length;
        }
    }
    return num;
}

For anyone wanting a ready-to-use String extension method, here is what I use, which was based on the best of the posted answers:

public static class StringExtension
{    
    /// <summary> Returns the number of occurrences of a string within a string, optional comparison allows case and culture control. </summary>
    public static int Occurrences(this System.String input, string value, StringComparison stringComparisonType = StringComparison.Ordinal)
    {
        if (String.IsNullOrEmpty(value)) return 0;

        int count    = 0;
        int position = 0;

        while ((position = input.IndexOf(value, position, stringComparisonType)) != -1)
        {
            position += value.Length;
            count    += 1;
        }

        return count;
    }

    /// <summary> Returns the number of occurrences of a single character within a string. </summary>
    public static int Occurrences(this System.String input, char value)
    {
        int count = 0;
        foreach (char c in input) if (c == value) count += 1;
        return count;
    }
}

I think the easiest way to do this is to use regular expressions. This way you can get the same split count as you could using myVar.Split('x'), but in a multiple-character setting.

string myVar = "do this to count the number of words in my wording so that I can word it up!";
// Regex.Split returns the pieces around each match, so the number of matches is one less.
int count = Regex.Split(myVar, "word").Length - 1;

string search = "/string";
// Regex.Matches returns a MatchCollection, which exposes Count.
var occurrences = Regex.Matches(search, @"\/").Count;

This counts each time the program finds "/" exactly (case-sensitive), and the number of occurrences is stored in the variable "occurrences".
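When the search term comes from user input, regex metacharacters such as "(" or "+" need escaping before they are used as a pattern. Here is a small sketch, assuming a literal (non-pattern) search is what is wanted; the class name RegexCount and its Count method are illustrative names, while Regex.Escape and Regex.Matches are the standard .NET calls.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexCount
{
    // Counts non-overlapping literal occurrences of needle in haystack.
    public static int Count(string haystack, string needle)
    {
        if (string.IsNullOrEmpty(needle)) return 0;

        // Regex.Escape turns metacharacters such as '(' or '+' into literals.
        return Regex.Matches(haystack, Regex.Escape(needle)).Count;
    }

    public static void Main()
    {
        Console.WriteLine(Count("/string/with/slashes", "/")); // 3
        Console.WriteLine(Count("f(x) + g(x)", "(x)"));        // 2
        Console.WriteLine(Count("1+1+1", "+"));                // 2
    }
}
```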

I felt that we were lacking certain kinds of substring counting, like unsafe byte-by-byte comparisons. I put together the original poster's method and any methods I could think of.

These are the string extensions I made.

namespace Example
{
    using System;
    using System.Text;

    public static class StringExtensions
    {
        public static int CountSubstr(this string str, string substr)
        {
            return (str.Length - str.Replace(substr, "").Length) / substr.Length;
        }

        public static int CountSubstr(this string str, char substr)
        {
            return (str.Length - str.Replace(substr.ToString(), "").Length);
        }

        public static int CountSubstr2(this string str, string substr)
        {
            int substrlen = substr.Length;
            int lastIndex = str.IndexOf(substr, 0, StringComparison.Ordinal);
            int count = 0;
            while (lastIndex != -1)
            {
                ++count;
                lastIndex = str.IndexOf(substr, lastIndex + substrlen, StringComparison.Ordinal);
            }

            return count;
        }

        public static int CountSubstr2(this string str, char substr)
        {
            int lastIndex = str.IndexOf(substr, 0);
            int count = 0;
            while (lastIndex != -1)
            {
                ++count;
                lastIndex = str.IndexOf(substr, lastIndex + 1);
            }

            return count;
        }

        public static int CountChar(this string str, char substr)
        {
            int length = str.Length;
            int count = 0;
            for (int i = 0; i < length; ++i)
                if (str[i] == substr)
                    ++count;

            return count;
        }

        public static int CountChar2(this string str, char substr)
        {
            int count = 0;
            foreach (var c in str)
                if (c == substr)
                    ++count;

            return count;
        }

        public static unsafe int CountChar3(this string str, char substr)
        {
            int length = str.Length;
            int count = 0;
            fixed (char* chars = str)
            {
                for (int i = 0; i < length; ++i)
                    if (*(chars + i) == substr)
                        ++count;
            }

            return count;
        }

        public static unsafe int CountChar4(this string str, char substr)
        {
            int length = str.Length;
            int count = 0;
            fixed (char* chars = str)
            {
                for (int i = length - 1; i >= 0; --i)
                    if (*(chars + i) == substr)
                        ++count;
            }

            return count;
        }

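        // Caveat: this single-pass scan (and CountSubstr4 below) does not backtrack or
        // re-test the current character after a partial match fails, so it can miss
        // occurrences, for example "and" in "aand".
        public static unsafe int CountSubstr3(this string str, string substr)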
        {
            int length = str.Length;
            int substrlen = substr.Length;
            int count = 0;
            fixed (char* strc = str)
            {
                fixed (char* substrc = substr)
                {
                    int n = 0;

                    for (int i = 0; i < length; ++i)
                    {
                        if (*(strc + i) == *(substrc + n))
                        {
                            ++n;
                            if (n == substrlen)
                            {
                                ++count;
                                n = 0;
                            }
                        }
                        else
                            n = 0;
                    }
                }
            }

            return count;
        }

        public static int CountSubstr3(this string str, char substr)
        {
            return CountSubstr3(str, substr.ToString());
        }

        public static unsafe int CountSubstr4(this string str, string substr)
        {
            int length = str.Length;
            int substrLastIndex = substr.Length - 1;
            int count = 0;
            fixed (char* strc = str)
            {
                fixed (char* substrc = substr)
                {
                    int n = substrLastIndex;

                    for (int i = length - 1; i >= 0; --i)
                    {
                        if (*(strc + i) == *(substrc + n))
                        {
                            if (--n == -1)
                            {
                                ++count;
                                n = substrLastIndex;
                            }
                        }
                        else
                            n = substrLastIndex;
                    }
                }
            }

            return count;
        }

        public static int CountSubstr4(this string str, char substr)
        {
            return CountSubstr4(str, substr.ToString());
        }
    }
}

Followed by the test code...

// Stopwatch requires: using System.Diagnostics;
static void Main()
{
    const char matchA = '_';
    const string matchB = "and";
    const string matchC = "muchlongerword";
    const string testStrA = "_and_d_e_banna_i_o___pfasd__and_d_e_banna_i_o___pfasd_";
    const string testStrB = "and sdf and ans andeians andano ip and and sdf and ans andeians andano ip and";
    const string testStrC =
        "muchlongerword amuchlongerworsdfmuchlongerwordsdf jmuchlongerworijv muchlongerword sdmuchlongerword dsmuchlongerword";
    const int testSize = 1000000;
    Console.WriteLine(testStrA.CountSubstr('_'));
    Console.WriteLine(testStrA.CountSubstr2('_'));
    Console.WriteLine(testStrA.CountSubstr3('_'));
    Console.WriteLine(testStrA.CountSubstr4('_'));
    Console.WriteLine(testStrA.CountChar('_'));
    Console.WriteLine(testStrA.CountChar2('_'));
    Console.WriteLine(testStrA.CountChar3('_'));
    Console.WriteLine(testStrA.CountChar4('_'));
    Console.WriteLine(testStrB.CountSubstr("and"));
    Console.WriteLine(testStrB.CountSubstr2("and"));
    Console.WriteLine(testStrB.CountSubstr3("and"));
    Console.WriteLine(testStrB.CountSubstr4("and"));
    Console.WriteLine(testStrC.CountSubstr("muchlongerword"));
    Console.WriteLine(testStrC.CountSubstr2("muchlongerword"));
    Console.WriteLine(testStrC.CountSubstr3("muchlongerword"));
    Console.WriteLine(testStrC.CountSubstr4("muchlongerword"));
    var timer = new Stopwatch();
    timer.Start();
    for (int i = 0; i < testSize; ++i)
        testStrA.CountSubstr(matchA);
    timer.Stop();
    Console.WriteLine("CS1 chr: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrB.CountSubstr(matchB);
    timer.Stop();
    Console.WriteLine("CS1 and: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrC.CountSubstr(matchC);
    timer.Stop();
    Console.WriteLine("CS1 mlw: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrA.CountSubstr2(matchA);
    timer.Stop();
    Console.WriteLine("CS2 chr: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrB.CountSubstr2(matchB);
    timer.Stop();
    Console.WriteLine("CS2 and: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrC.CountSubstr2(matchC);
    timer.Stop();
    Console.WriteLine("CS2 mlw: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrA.CountSubstr3(matchA);
    timer.Stop();
    Console.WriteLine("CS3 chr: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrB.CountSubstr3(matchB);
    timer.Stop();
    Console.WriteLine("CS3 and: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrC.CountSubstr3(matchC);
    timer.Stop();
    Console.WriteLine("CS3 mlw: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrA.CountSubstr4(matchA);
    timer.Stop();
    Console.WriteLine("CS4 chr: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrB.CountSubstr4(matchB);
    timer.Stop();
    Console.WriteLine("CS4 and: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrC.CountSubstr4(matchC);
    timer.Stop();
    Console.WriteLine("CS4 mlw: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrA.CountChar(matchA);
    timer.Stop();
    Console.WriteLine("CC1 chr: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrA.CountChar2(matchA);
    timer.Stop();
    Console.WriteLine("CC2 chr: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrA.CountChar3(matchA);
    timer.Stop();
    Console.WriteLine("CC3 chr: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrA.CountChar4(matchA);
    timer.Stop();
    Console.WriteLine("CC4 chr: " + timer.Elapsed.TotalMilliseconds + "ms");
}

Results: CSX corresponds to CountSubstrX and CCX corresponds to CountCharX. "chr" searches a string for '_', "and" searches a string for "and", and "mlw" searches a string for "muchlongerword".

CS1 chr: 824.123ms
CS1 and: 586.1893ms
CS1 mlw: 486.5414ms
CS2 chr: 127.8941ms
CS2 and: 806.3918ms
CS2 mlw: 497.318ms
CS3 chr: 201.8896ms
CS3 and: 124.0675ms
CS3 mlw: 212.8341ms
CS4 chr: 81.5183ms
CS4 and: 92.0615ms
CS4 mlw: 116.2197ms
CC1 chr: 66.4078ms
CC2 chr: 64.0161ms
CC3 chr: 65.9013ms
CC4 chr: 65.8206ms

And finally, I had a file with 3.6 million characters. It was "derp adfderdserp dfaerpderp deasderp" repeated 100,000 times. I searched for "derp" inside the file with the above methods 100 times, with these results:

CS1Derp: 1501.3444ms
CS2Derp: 1585.797ms
CS3Derp: 376.0937ms
CS4Derp: 271.1663ms

So my 4th method is definitely the winner, but realistically, if searching a 3.6 million character file 100 times took only 1586ms in the worst case, then all of this is quite negligible.

By the way, I also scanned for the 'd' char in the 3.6 million character file, running the CountSubstr and CountChar methods 100 times each. Results...

CS1  d : 2606.9513ms
CS2  d : 339.7942ms
CS3  d : 960.281ms
CS4  d : 233.3442ms
CC1  d : 302.4122ms
CC2  d : 280.7719ms
CC3  d : 299.1125ms
CC4  d : 292.9365ms

According to this, the original poster's method is very bad for single-character needles in a large haystack.

Note: All values were updated to Release version output. I accidentally forgot to build in Release mode the first time I posted this. Some of my statements have been amended.

string source = "/once/upon/a/time/";
int count = 0, n = 0;
while ((n = source.IndexOf('/', n) + 1) != 0) count++;

A variation on Richard Watson's answer: slightly faster, with the gain growing the more times the char occurs in the string, and less code!

Though I must say, without extensively testing every scenario, I did see a very significant speed improvement by using:

int count = 0;
for (int n = 0; n < source.Length; n++) if (source[n] == '/') count++;

            var conditionalStatement = conditionSetting.Value;

            // Order of replace matters: remove == before =, in case of ===, and replace
            // >= and <= before the single characters, or each would become two placeholders.
            conditionalStatement = conditionalStatement.Replace("==", "~").Replace("!=", "~").Replace(">=", "~").Replace("<=", "~").Replace('=', '~').Replace('!', '~').Replace('>', '~').Replace('<', '~');

            var listOfValidConditions = new List<string>() { "!=", "==", ">", "<", ">=", "<=" };

            if (conditionalStatement.Count(x => x == '~') != 1)
            {
                result.InvalidFieldList.Add(new KeyFieldData(batch.DECurrentField, "The IsDoubleKeyCondition does not contain a supported conditional statement. Contact System Administrator."));
                result.Status = ValidatorStatus.Fail;
                return result;
            }

I needed to do something similar to test conditional statements from a string.

I replaced what I was looking for with a single character and counted the instances of that character.

Obviously, you need to check beforehand that the single character you're using does not already exist in the string, to avoid incorrect counts.
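The placeholder check described above could look something like the following sketch. The class and method names are hypothetical, and throwing an ArgumentException is just one possible way to handle input that already contains the placeholder.

```csharp
using System;
using System.Linq;

public static class OperatorCount
{
    // Hypothetical helper: counts comparison operators in a condition string
    // by replacing each operator with a placeholder and counting placeholders.
    public static int CountComparisonOperators(string condition)
    {
        const char placeholder = '~';

        // Guard: a pre-existing placeholder would inflate the count.
        if (condition.IndexOf(placeholder) >= 0)
            throw new ArgumentException(
                "Input already contains the placeholder character.", nameof(condition));

        // Two-character operators first, so ">=" is not split into '>' and '='.
        string marked = condition
            .Replace("==", "~").Replace("!=", "~")
            .Replace(">=", "~").Replace("<=", "~")
            .Replace('=', '~').Replace('!', '~')
            .Replace('>', '~').Replace('<', '~');

        return marked.Count(c => c == placeholder);
    }

    public static void Main()
    {
        Console.WriteLine(CountComparisonOperators("a >= b"));  // 1
        Console.WriteLine(CountComparisonOperators("a === b")); // 2, so it would be rejected
    }
}
```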

String in string:

Find "etc" in " .. JD JD JD JD etc. and etc. JDJDJDJDJDJDJDJD and etc."

var strOrigin = " .. JD JD JD JD etc. and etc. JDJDJDJDJDJDJDJD and etc.";
var searchStr = "etc";
int count = (strOrigin.Length - strOrigin.Replace(searchStr, "").Length) / searchStr.Length;

Check performance before discarding this one as unsound/clumsy...

Thought I would throw my extension method into the ring (see comments for more info). I have not done any formal benchmarking, but I think it has to be very fast for most scenarios.

EDIT: OK, so this SO question got me wondering how the performance of our current implementation would stack up against some of the solutions presented here. I decided to do a little benchmarking and found that our solution was very much in line with the performance of the solution provided by Richard Watson, up until you are doing aggressive searching with large strings (100 Kb +), large substrings (32 Kb +) and many embedded repetitions (10K +). At that point our solution was around 2X to 4X slower. Given this and the fact that we really like the solution presented by Richard Watson, we have refactored our solution accordingly. I just wanted to make this available for anyone who might benefit from it.

Our original solution:

    /// <summary>
    /// Counts the number of occurrences of the specified substring within
    /// the current string.
    /// </summary>
    /// <param name="s">The current string.</param>
    /// <param name="substring">The substring we are searching for.</param>
    /// <param name="aggressiveSearch">Indicates whether or not the algorithm 
    /// should be aggressive in its search behavior (see Remarks). Default 
    /// behavior is non-aggressive.</param>
    /// <remarks>This algorithm has two search modes - aggressive and 
    /// non-aggressive. When in aggressive search mode (aggressiveSearch = 
    /// true), the algorithm will try to match at every possible starting 
    /// character index within the string. When false, all subsequent 
    /// character indexes within a substring match will not be evaluated. 
    /// For example, if the string was 'abbbc' and we were searching for 
    /// the substring 'bb', then aggressive search would find 2 matches 
    /// with starting indexes of 1 and 2. Non aggressive search would find 
    /// just 1 match with starting index at 1. After the match was made, 
    /// the non aggressive search would attempt to make its next match 
    /// starting at index 3 instead of 2.</remarks>
    /// <returns>The count of occurrences of the substring within the string.</returns>
    public static int CountOccurrences(this string s, string substring, 
        bool aggressiveSearch = false)
    {
        // if s or substring is null or empty, substring cannot be found in s
        if (string.IsNullOrEmpty(s) || string.IsNullOrEmpty(substring))
            return 0;

        // if the length of substring is greater than the length of s,
        // substring cannot be found in s
        if (substring.Length > s.Length)
            return 0;

        var sChars = s.ToCharArray();
        var substringChars = substring.ToCharArray();
        var count = 0;
        var sCharsIndex = 0;

        // substring cannot start in s beyond following index
        var lastStartIndex = sChars.Length - substringChars.Length;

        while (sCharsIndex <= lastStartIndex)
        {
            if (sChars[sCharsIndex] == substringChars[0])
            {
                // potential match checking
                var match = true;
                var offset = 1;
                while (offset < substringChars.Length)
                {
                    if (sChars[sCharsIndex + offset] != substringChars[offset])
                    {
                        match = false;
                        break;
                    }
                    offset++;
                }
                if (match)
                {
                    count++;
                    // if aggressive, just advance to next char in s, otherwise, 
                    // skip past the match just found in s
                    sCharsIndex += aggressiveSearch ? 1 : substringChars.Length;
                }
                else
                {
                    // no match found, just move to next char in s
                    sCharsIndex++;
                }
            }
            else
            {
                // no match at current index, move along
                sCharsIndex++;
            }
        }

        return count;
    }

And here is our revised solution:

    /// <summary>
    /// Counts the number of occurrences of the specified substring within
    /// the current string.
    /// </summary>
    /// <param name="s">The current string.</param>
    /// <param name="substring">The substring we are searching for.</param>
    /// <param name="aggressiveSearch">Indicates whether or not the algorithm 
    /// should be aggressive in its search behavior (see Remarks). Default 
    /// behavior is non-aggressive.</param>
    /// <remarks>This algorithm has two search modes - aggressive and 
    /// non-aggressive. When in aggressive search mode (aggressiveSearch = 
    /// true), the algorithm will try to match at every possible starting 
    /// character index within the string. When false, all subsequent 
    /// character indexes within a substring match will not be evaluated. 
    /// For example, if the string was 'abbbc' and we were searching for 
    /// the substring 'bb', then aggressive search would find 2 matches 
    /// with starting indexes of 1 and 2. Non aggressive search would find 
    /// just 1 match with starting index at 1. After the match was made, 
    /// the non aggressive search would attempt to make its next match 
    /// starting at index 3 instead of 2.</remarks>
    /// <returns>The count of occurrences of the substring within the string.</returns>
    public static int CountOccurrences(this string s, string substring, 
        bool aggressiveSearch = false)
    {
        // if s or substring is null or empty, substring cannot be found in s
        if (string.IsNullOrEmpty(s) || string.IsNullOrEmpty(substring))
            return 0;

        // if the length of substring is greater than the length of s,
        // substring cannot be found in s
        if (substring.Length > s.Length)
            return 0;

        int count = 0, n = 0;
        while ((n = s.IndexOf(substring, n, StringComparison.InvariantCulture)) != -1)
        {
            if (aggressiveSearch)
                n++;
            else
                n += substring.Length;
            count++;
        }

        return count;
    }
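To make the two search modes concrete, here is a condensed, self-contained sketch that runs the 'abbbc' / 'bb' example from the remarks. It is a trimmed copy for illustration: the class name is made up, it is a plain static method rather than an extension method, and it uses StringComparison.Ordinal instead of InvariantCulture, which is an assumption.

```csharp
using System;

public static class AggressiveSearchDemo
{
    // Condensed variant of the revised method above (ordinal comparison assumed).
    public static int CountOccurrences(string s, string substring,
        bool aggressiveSearch = false)
    {
        if (string.IsNullOrEmpty(s) || string.IsNullOrEmpty(substring))
            return 0;

        int count = 0, n = 0;
        while ((n = s.IndexOf(substring, n, StringComparison.Ordinal)) != -1)
        {
            // Aggressive: retry from the very next character (overlaps allowed).
            // Non-aggressive: skip past the whole match.
            n += aggressiveSearch ? 1 : substring.Length;
            count++;
        }
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountOccurrences("abbbc", "bb", aggressiveSearch: true));  // 2
        Console.WriteLine(CountOccurrences("abbbc", "bb", aggressiveSearch: false)); // 1
    }
}
```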

My initial take gave me something like:

public static int CountOccurrences(string original, string substring)
{
    if (string.IsNullOrEmpty(substring))
        return 0;
    if (substring.Length == 1)
        return CountOccurrences(original, substring[0]);
    if (string.IsNullOrEmpty(original) ||
        substring.Length > original.Length)
        return 0;
    int substringCount = 0;
    for (int charIndex = 0; charIndex < original.Length; charIndex++)
    {
        for (int subCharIndex = 0, secondaryCharIndex = charIndex; subCharIndex < substring.Length && secondaryCharIndex < original.Length; subCharIndex++, secondaryCharIndex++)
        {
            if (substring[subCharIndex] != original[secondaryCharIndex])
                goto continueOuter;
        }
        if (charIndex + substring.Length > original.Length)
            break;
        charIndex += substring.Length - 1;
        substringCount++;
    continueOuter:
        ;
    }
    return substringCount;
}

public static int CountOccurrences(string original, char @char)
{
    if (string.IsNullOrEmpty(original))
        return 0;
    int substringCount = 0;
    for (int charIndex = 0; charIndex < original.Length; charIndex++)
        if (@char == original[charIndex])
            substringCount++;
    return substringCount;
}

The needle-in-a-haystack approach using replace and division takes 21+ seconds, whereas this takes about 15.2.

Edit: after adding a bit which adds substring.Length - 1 to the charIndex (like it should), it's at 11.6 seconds.

Edit 2: I used a string which had 26 two-character strings; here are the times updated to the same sample texts:

Needle in a haystack (OP's version): 7.8 seconds

Suggested mechanism: 4.6 seconds.

Edit 3: Adding the single character corner-case, it went to 1.2 seconds.

Edit 4: For context: 50 million iterations were used.

A generic function for occurrences of strings:

public int getNumberOfOccurencies(String inputString, String checkString)
{
    if (checkString.Length > inputString.Length || checkString.Equals("")) { return 0; }
    int lengthDifference = inputString.Length - checkString.Length;
    int occurencies = 0;
    // <= so that a match ending on the last character is also counted
    for (int i = 0; i <= lengthDifference; i++)
    {
        if (inputString.Substring(i, checkString.Length).Equals(checkString))
        {
            occurencies++;
            i += checkString.Length - 1; // skip past the match (non-overlapping count)
        }
    }
    return occurencies;
}
string Name = "Very good nice one is very good but is very good nice one this is called the term";
bool valid=true;
int count = 0;
int k=0;
int m = 0;
while (valid)
{
    k = Name.Substring(m,Name.Length-m).IndexOf("good");
    if (k != -1)
    {
        count++;
        m = m + k + 4;
    }
    else
        valid = false;
}
Console.WriteLine(count + " occurrences");
string s = "HOWLYH THIS ACTUALLY WORKSH WOWH";
int count = 0;
for (int i = 0; i < s.Length; i++)
   if (s[i] == 'H') count++;

It just checks every character in the string; if the character is the one you are searching for, it adds one to count.

If you check out this webpage, 15 different ways of doing this are benchmarked, including using parallel loops.

The fastest way appears to be using either a single-threaded for loop (if you have .NET version < 4.0) or a Parallel.For loop (if using .NET > 4.0 with thousands of checks).

Assuming "ss" holds your search strings (the loop below indexes it as an array of strings) and "ch" is your character array (if you have more than one char you're looking for), here's the basic gist of the code that had the fastest single-threaded run time:

for (int x = 0; x < ss.Length; x++)
{
    for (int y = 0; y < ch.Length; y++)
    {
        for (int a = 0; a < ss[x].Length; a++)
        {
            if (ss[x][a] == ch[y])
            {
                // It's found. Do what you need to here.
            }
        }
    }
}

The benchmark source code is provided too, so you can run your own tests.
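The Parallel.For variant mentioned above is not shown in this answer. A minimal sketch of what it could look like, assuming the work is counting one character across an array of strings (the class and method names are illustrative and not taken from the benchmark source):

```csharp
using System.Threading;
using System.Threading.Tasks;

public static class ParallelCountSketch
{
    // Counts occurrences of 'target' across all strings, one string per loop iteration.
    public static int CountInAll(string[] strings, char target)
    {
        int total = 0;
        Parallel.For(0, strings.Length,
            () => 0,                                   // per-thread running subtotal
            (index, loopState, subtotal) =>
            {
                foreach (char c in strings[index])
                    if (c == target) subtotal++;
                return subtotal;
            },
            subtotal => Interlocked.Add(ref total, subtotal)); // merge subtotals safely
        return total;
    }
}
```

Each thread keeps its own subtotal and only touches the shared total once at the end, which keeps synchronisation off the hot path. For a handful of short strings the scheduling overhead is likely to outweigh any gain.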

For the case of a string delimiter (not for the char case, as the subject says):

string source = "@@@once@@@upon@@@a@@@time@@@";
// StringSplitOptions.None keeps empty entries, so leading and trailing delimiters are counted too
int count = source.Split(new[] { "@@@" }, StringSplitOptions.None).Length - 1;

The poster's original source value ("/once/upon/a/time/") has the char '/' as its natural delimiter, and other responses do explain the source.Split(char[]) option, though...
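One thing to watch with the Split approach: the StringSplitOptions value changes the result whenever delimiters sit at the start or end of the string or directly next to each other. A small sketch comparing the two options on the same input (class and method names are illustrative):

```csharp
using System;

public static class SplitCountSketch
{
    // Keeps empty entries, so every delimiter contributes exactly one split point.
    public static int CountKeepingEmpty(string source, string delimiter) =>
        source.Split(new[] { delimiter }, StringSplitOptions.None).Length - 1;

    // Drops empty entries, so leading, trailing, and adjacent delimiters are not counted.
    public static int CountRemovingEmpty(string source, string delimiter) =>
        source.Split(new[] { delimiter }, StringSplitOptions.RemoveEmptyEntries).Length - 1;
}
```

For "@@@once@@@upon@@@a@@@time@@@", the first method returns 5 (the actual number of "@@@" delimiters) and the second returns 3.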

As of .NET 5 (and .NET Core 2.1+ / .NET Standard 2.1), we have a new iteration speed king.

"Span<T>" https://docs.microsoft.com/en-us/dotnet/api/system.span-1?view=net-5.0

String has an AsSpan() method that gives us a ReadOnlySpan<char>:

int count = 0;
foreach( var c in source.AsSpan())
{
    if (c == '/')
        count++;
}

My tests show it is 62% faster than a straight foreach. I also compared it to a for() loop on a Span<T>[i], as well as a few others posted here. Note that the reverse for() iteration on a String now seems to run slower than a straight foreach.

Starting test, 10000000 iterations
(base) foreach =   673 ms

fastest to slowest
foreach Span =   252 ms   62.6%
  Span [i--] =   282 ms   58.1%
  Span [i++] =   402 ms   40.3%
   for [i++] =   454 ms   32.5%
   for [i--] =   867 ms  -28.8%
     Replace =  1905 ms -183.1%
       Split =  2109 ms -213.4%
  Linq.Count =  3797 ms -464.2%

Code Link
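The indexed Span variants from the results table are not shown above. A sketch of what the forward and reverse loops might look like (names are illustrative; this is not the benchmark code behind the link):

```csharp
using System;

public static class SpanCountSketch
{
    // Forward indexed loop over the span ("Span [i++]" in the table).
    public static int CountForward(string source, char target)
    {
        ReadOnlySpan<char> span = source.AsSpan();
        int count = 0;
        for (int i = 0; i < span.Length; i++)
            if (span[i] == target) count++;
        return count;
    }

    // Reverse indexed loop over the span ("Span [i--]" in the table).
    public static int CountReverse(string source, char target)
    {
        ReadOnlySpan<char> span = source.AsSpan();
        int count = 0;
        for (int i = span.Length - 1; i >= 0; i--)
            if (span[i] == target) count++;
        return count;
    }
}
```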

Split (may) win over IndexOf (for strings).

The benchmark above seems to indicate that Richard Watson's method is the fastest for strings, which is wrong (maybe the difference comes from our test data, but it seems strange anyway for the reasons below).

If we look a bit deeper into the implementation of these methods in .NET (for the Luke H and Richard Watson methods):

  • IndexOf is culture-dependent: it tries to retrieve or create a ReadOnlySpan, checks whether it has to ignore case, etc., and then finally makes the unsafe/native call.
  • Split can handle several separators and some StringSplitOptions, and it has to create the string[] array and fill it with the split results (so it does some substring work). Depending on the number of occurrences of the string, Split may be faster than IndexOf.
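If the culture-sensitive path is the concern, IndexOf can be asked for an ordinal comparison explicitly. A minimal sketch of an IndexOf-based counter along those lines (the names are illustrative, and whether it beats Split still depends on the data):

```csharp
using System;

public static class OrdinalIndexOfSketch
{
    // Counts non-overlapping occurrences of 'needle' using ordinal (culture-free) comparison.
    public static int Count(string haystack, string needle)
    {
        if (string.IsNullOrEmpty(haystack) || string.IsNullOrEmpty(needle))
            return 0;

        int count = 0;
        int position = 0;
        while ((position = haystack.IndexOf(needle, position, StringComparison.Ordinal)) != -1)
        {
            count++;
            position += needle.Length; // continue after the match just found
        }
        return count;
    }
}
```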

By the way, I made a simplified version of IndexOf (it could be faster still if I used pointers and unsafe code, but unchecked should be OK for most), which is about 50 times faster than the slowest method on the long texts in the benchmark below.

Benchmark (source on GitHub)

Done by searching for either a common word (the) or a small sentence in Shakespeare's Richard III.

| Method               |       Mean |     Error |    StdDev | Ratio |
|----------------------|-----------:|----------:|----------:|------:|
| Richard_LongInLong   |  67.721 us | 1.0278 us | 0.9614 us |  1.00 |
| Luke_LongInLong      |   1.960 us | 0.0381 us | 0.0637 us |  0.03 |
| Fab_LongInLong       |   1.198 us | 0.0160 us | 0.0142 us |  0.02 |
| Richard_ShortInLong  | 104.771 us | 2.8117 us | 7.9304 us |  1.00 |
| Luke_ShortInLong     |   2.971 us | 0.0594 us | 0.0813 us |  0.03 |
| Fab_ShortInLong      |   2.206 us | 0.0419 us | 0.0411 us |  0.02 |
| Richard_ShortInShort |  115.53 ns |  1.359 ns |  1.135 ns |  1.00 |
| Luke_ShortInShort    |   52.46 ns |  0.970 ns |  0.908 ns |  0.45 |
| Fab_ShortInShort     |   28.47 ns |  0.552 ns |  0.542 ns |  0.25 |
public int GetOccurrences(string input, string needle)
{
    int count = 0;
    unchecked
    {
        if (string.IsNullOrEmpty(input) || string.IsNullOrEmpty(needle))
        {
            return 0;
        }

        for (var i = 0; i < input.Length - needle.Length + 1; i++)
        {
            var c = input[i];
            if (c == needle[0])
            {
                for (var index = 0; index < needle.Length; index++)
                {
                    c = input[i + index];
                    var n = needle[index];

                    if (c != n)
                    {
                        break;
                    }
                    else if (index == needle.Length - 1)
                    {
                        count++;
                    }
                }
            }
        }
    }

    return count;
}
string str = "aaabbbbjjja";
int count = 0;
int size = str.Length;

string[] strarray = new string[size];
for (int i = 0; i < str.Length; i++)
{
    strarray[i] = str.Substring(i, 1);
}
Array.Sort(strarray);
str = "";
for (int i = 0; i < strarray.Length - 1; i++)
{

    if (strarray[i] == strarray[i + 1])
    {

        count++;
    }
    else
    {
        count++;
        str = str + strarray[i] + count;
        count = 0;
    }

}
count++;
str = str + strarray[strarray.Length - 1] + count;

This counts the occurrences of each character. For this example, the output will be "a4b4j3".
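The same per-character tally can be written more compactly with LINQ. A sketch that, for the input above, produces the same "a4b4j3" string (the class and method names are made up for illustration):

```csharp
using System.Linq;

public static class CharTallySketch
{
    // Groups equal characters, orders the groups by character, and emits "<char><count>" pairs.
    public static string Tally(string text) =>
        string.Concat(
            text.GroupBy(c => c)
                .OrderBy(g => g.Key)
                .Select(g => $"{g.Key}{g.Count()}"));
}
```

Note that this orders by char value, whereas Array.Sort on strings uses the current culture, so the two can order characters differently for inputs with mixed case or non-ASCII text.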

using System.Linq;

int CountOf => "A::BC::D".Split("::").Length - 1;

To count a char or a string:

string st = "asdfasdfasdfsadfasdf/asdfasdfas/dfsdfsdafsdfsd/fsadfasdf/dff";
int count = 0;
// Start from the first match so that a "/" at index 0 is not skipped
int location = st.IndexOf("/");

while (location >= 0)
{
    count++;
    location = st.IndexOf("/", location + 1);
}
MessageBox.Show(count.ToString());

Looking for char counts is quite different from looking for string counts. It also depends on whether you want to be able to check more than one. If you want to check a variety of different char counts, something like this can work:

var charCounts =
   haystack
   .GroupBy(c => c)
   .ToDictionary(g => g.Key, g => g.Count());

var needleCount = charCounts.ContainsKey(needle) ? charCounts[needle] : 0;

Note 1: grouping into a dictionary is useful enough that it makes a lot of sense to write a GroupToDictionary extension method for it.

Note 2: it can also be useful to have your own implementation of a dictionary that allows for default values, so that you get 0 for non-existent keys automatically.
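Both notes can be sketched briefly. GroupToDictionary below is a hypothetical extension method of the kind Note 1 describes, and CountOf (also a made-up name) uses TryGetValue to return 0 for missing keys instead of a custom dictionary type:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class GroupingSketch
{
    // Hypothetical helper: counts how often each distinct item occurs in the sequence.
    public static Dictionary<T, int> GroupToDictionary<T>(this IEnumerable<T> items) =>
        items.GroupBy(item => item)
             .ToDictionary(group => group.Key, group => group.Count());

    // Returns the stored count, or 0 when the key was never seen.
    public static int CountOf<T>(this Dictionary<T, int> counts, T key) =>
        counts.TryGetValue(key, out int count) ? count : 0;
}
```

With these in scope, "/once/upon/a/time/".GroupToDictionary().CountOf('/') gives 5, and CountOf('z') on the same dictionary gives 0.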

As of .NET 7, we have allocation-free (and highly optimized) Regex APIs. Counting is especially easy and efficient.

    var input = "abcd abcabc ababc";
    var result = Regex.Count(input: input, pattern: "abc"); // 4

When matching dynamic patterns, remember to escape them:

public static int CountOccurences(string input, string pattern)
{
    pattern = Regex.Escape(pattern); // Aww, no way to avoid heap allocations here

    var result = Regex.Count(input: input, pattern: pattern);
    return result;
}
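To see why escaping matters, consider counting a literal "." in a string. A short sketch, assuming .NET 7 or later for Regex.Count (the class and method names are illustrative):

```csharp
using System.Text.RegularExpressions;

public static class EscapeSketch
{
    // Unescaped: "." is a regex wildcard, so it matches every character except a newline.
    public static int CountUnescaped(string input, string pattern) =>
        Regex.Count(input, pattern);

    // Escaped: the pattern is treated as literal text.
    public static int CountEscaped(string input, string pattern) =>
        Regex.Count(input, Regex.Escape(pattern));
}
```

For the input "a.b.c" and the pattern ".", the unescaped call counts 5 matches while the escaped call counts 2.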

And, as a bonus for fixed patterns, .NET 7 introduces analyzers that help convert the regex string to source-generated code. Not only does this avoid the runtime compilation overhead for the regex, but it also provides very readable code that shows how it is implemented. In fact, that code is generally at least as efficient as any alternative you would have written manually.

If your regex call is eligible, the analyzer will give a hint. Simply choose "Convert to 'GeneratedRegexAttribute'" and enjoy the result:

[GeneratedRegex("abc")]
private static partial Regex MyRegex(); // Go To Definition to see the generated code
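A sketch of how the generated method might then be used for counting, assuming .NET 7 or later with the regex source generator available; the class and method names here are placeholders:

```csharp
using System.Text.RegularExpressions;

public static partial class GeneratedRegexSketch
{
    // The source generator supplies the body of this partial method at compile time.
    [GeneratedRegex("abc")]
    private static partial Regex AbcRegex();

    // Uses the generated Regex instance; Count is an instance method as of .NET 7.
    public static int CountAbc(string input) => AbcRegex().Count(input);
}
```

The containing type has to be declared partial, otherwise the generator cannot add its half of the method.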
