How would you count occurrences of a string (actually a char) within a string?
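I am doing something where I realised I wanted to count how many /s I could find in a string, and then it struck me that there were several ways to do it, but I couldn't decide which was the best (or easiest).
At the moment I'm going with something like: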
string source = "/once/upon/a/time/";
int count = source.Length - source.Replace("/", "").Length;
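However, I don't like it at all. Any takers?
I don't really want to dig out RegEx for this, do I?
I know my string is going to contain the term I'm searching for, so you can assume that...
Of course, for search strings where length > 1: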
string haystack = "/once/upon/a/time";
string needle = "/";
int needleCount = ( haystack.Length - haystack.Replace(needle,"").Length ) / needle.Length;
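A minimal sketch of the replace-and-divide idea wrapped in a helper. The class name ReplaceCounter is hypothetical and not part of the original post. The guard is there because string.Replace throws an ArgumentException when the search string is empty, and note that the technique only counts non-overlapping occurrences.

```csharp
using System;

public static class ReplaceCounter
{
    // Hypothetical helper: counts non-overlapping occurrences of needle by
    // comparing the string length before and after removing every occurrence.
    public static int Count(string haystack, string needle)
    {
        // string.Replace throws on an empty search string, so bail out early.
        if (string.IsNullOrEmpty(haystack) || string.IsNullOrEmpty(needle))
            return 0;

        int removedChars = haystack.Length - haystack.Replace(needle, "").Length;
        return removedChars / needle.Length;
    }
}
```

For example, ReplaceCounter.Count("/once/upon/a/time", "/") gives 4, and ReplaceCounter.Count("aaa", "aa") gives 1 because the second "aa" overlaps the first.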
If you're using .NET 3.5 you can do this in a one-liner with LINQ:
int count = source.Count(f => f == '/');
If you don't want to use LINQ, you can do it with:
int count = source.Split('/').Length - 1;
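You might be surprised to learn that your original technique seems to be about 30% faster than either of these! I've just done a quick benchmark with "/once/upon/a/time/" and the results are as follows:
Your original = 12s
source.Count = 19s
source.Split = 17s
foreach (from bobwienholt's answer) = 10s
(The times are for 50,000,000 iterations, so you're unlikely to notice much difference in the real world.)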
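The benchmark code itself was not posted, so the following is only a rough sketch of how such a comparison could be timed with Stopwatch. The class and method names (CountBenchmark, TimeMs, Run) are assumptions for illustration, and the numbers it prints will differ from those quoted above depending on runtime, hardware and build configuration.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class CountBenchmark
{
    // Runs counter(source) the given number of times and returns elapsed milliseconds.
    public static long TimeMs(Func<string, int> counter, string source, int iterations)
    {
        long sink = 0; // accumulate the results so each call's value is actually used
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            sink += counter(source);
        sw.Stop();
        GC.KeepAlive(sink);
        return sw.ElapsedMilliseconds;
    }

    // Times the four techniques discussed above on the sample string.
    public static void Run(int iterations = 1000000)
    {
        const string source = "/once/upon/a/time/";
        Console.WriteLine("Replace: " + TimeMs(s => s.Length - s.Replace("/", "").Length, source, iterations) + " ms");
        Console.WriteLine("Count:   " + TimeMs(s => s.Count(c => c == '/'), source, iterations) + " ms");
        Console.WriteLine("Split:   " + TimeMs(s => s.Split('/').Length - 1, source, iterations) + " ms");
        Console.WriteLine("foreach: " + TimeMs(s => { int n = 0; foreach (char c in s) if (c == '/') n++; return n; }, source, iterations) + " ms");
    }
}
```

Build in Release mode before comparing numbers; Debug builds can change the ranking.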
string source = "/once/upon/a/time/";
int count = 0;
foreach (char c in source)
if (c == '/') count++;
This has to be faster than source.Replace() by itself.
int count = new Regex(Regex.Escape(needle)).Matches(haystack).Count;
If you want to be able to search for whole strings, and not just characters:
src.Select((c, i) => src.Substring(i))
.Count(sub => sub.StartsWith(target))
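Read as "for each character in the string, take the rest of the string starting from that character as a substring; count it if it starts with the target string."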
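The same "does the target start at this index" idea can be sketched without allocating a substring per character. This is an illustrative variant, not the answerer's code: the name OverlappingCounter is hypothetical, and it compares ordinally via string.CompareOrdinal, whereas StartsWith(string) is culture-sensitive. Like the LINQ version, it counts overlapping matches.

```csharp
using System;
using System.Linq;

public static class OverlappingCounter
{
    // Hypothetical helper: counts every index at which target begins,
    // so overlapping matches are included.
    public static int Count(string src, string target)
    {
        if (string.IsNullOrEmpty(src) || string.IsNullOrEmpty(target))
            return 0;

        // CompareOrdinal compares at most target.Length characters starting at
        // index i of src; a shorter remainder of src compares as not equal.
        return Enumerable.Range(0, src.Length)
            .Count(i => string.CompareOrdinal(src, i, target, 0, target.Length) == 0);
    }
}
```

With this sketch, OverlappingCounter.Count("abbbc", "bb") gives 2, because matches start at indexes 1 and 2.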
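I did some research and found that Richard Watson's solution is the fastest in most cases. Here is a table with the results for every solution in this post (except those using Regex, because it throws exceptions while parsing strings like "test{test"):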
Name | Short/char | Long/char | Short/short| Long/short | Long/long |
Inspite | 134| 1853| 95| 1146| 671|
LukeH_1 | 346| 4490| N/A| N/A| N/A|
LukeH_2 | 152| 1569| 197| 2425| 2171|
Bobwienholt | 230| 3269| N/A| N/A| N/A|
Richard Watson| 33| 298| 146| 737| 543|
StefanosKargas| N/A| N/A| 681| 11884| 12486|
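You can see that when counting occurrences of short substrings (1-5 characters) in a short string (10-50 characters), the original algorithm is preferred.
Also, for multi-character substrings you should use the following code (based on Richard Watson's solution):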
int count = 0, n = 0;
if(substring != "")
{
while ((n = source.IndexOf(substring, n, StringComparison.InvariantCulture)) != -1)
{
n += substring.Length;
++count;
}
}
LINQ works on all collections, and since a string is just a collection of characters, how about this nice little one-liner:
var count = source.Count(c => c == '/');
Make sure you have using System.Linq; at the top of your code file, as .Count is an extension method from that namespace.
string source = "/once/upon/a/time/";
int count = 0;
int n = 0;
while ((n = source.IndexOf('/', n)) != -1)
{
n++;
count++;
}
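On my computer it's about 2 seconds faster than the for-every-character solution over 50 million iterations.
2013 revision:
Change the string to a char[] and iterate through that. This cuts a further second or two off the total time for 50m iterations!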
char[] testchars = source.ToCharArray();
foreach (char c in testchars)
{
if (c == '/')
count++;
}
This is quicker still:
char[] testchars = source.ToCharArray();
int length = testchars.Length;
for (int n = 0; n < length; n++)
{
if (testchars[n] == '/')
count++;
}
For good measure, iterating from the end of the array to 0 seems to be the fastest, by about 5%.
int length = testchars.Length;
for (int n = length-1; n >= 0; n--)
{
if (testchars[n] == '/')
count++;
}
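I was wondering why this could be and was Googling around (I recall something about reverse iteration being quicker), and came upon this SO question, which annoyingly already uses the string-to-char[] technique. I think the reversing trick is new in this context, though.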
These only work for single-character search terms...
countOccurences("the", "the answer is the answer");
int countOccurences(string needle, string haystack)
{
return (haystack.Length - haystack.Replace(needle,"").Length) / needle.Length;
}
may turn out to be better for longer needles...
But there has to be a more elegant way. :)
Edit:
source.Split('/').Length-1
In C#, a nice string substring counter is this unexpectedly tricky fellow:
public static int CCount(String haystack, String needle)
{
return haystack.Split(new[] { needle }, StringSplitOptions.None).Length - 1;
}
Regex.Matches(input, Regex.Escape("stringToMatch")).Count
private int CountWords(string text, string word) {
int count = (text.Length - text.Replace(word, "").Length) / word.Length;
return count;
}
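As the original solution was the fastest for chars, I suppose it will be for strings too. So here is my contribution.
For context: I was looking for words like "failed" and "succeeded" in a log file.
Gr, Ben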
string s = "65 fght 6565 4665 hjk";
int count = 0;
foreach (Match m in Regex.Matches(s, "65"))
count++;
public static int GetNumSubstringOccurrences(string text, string search)
{
int num = 0;
int pos = 0;
if (!string.IsNullOrEmpty(text) && !string.IsNullOrEmpty(search))
{
while ((pos = text.IndexOf(search, pos)) > -1)
{
num ++;
pos += search.Length;
}
}
return num;
}
For anyone wanting a ready-to-use String extension method,
here is what I use, based on the best of the posted answers:
public static class StringExtension
{
/// <summary> Returns the number of occurrences of a string within a string, optional comparison allows case and culture control. </summary>
public static int Occurrences(this System.String input, string value, StringComparison stringComparisonType = StringComparison.Ordinal)
{
if (String.IsNullOrEmpty(value)) return 0;
int count = 0;
int position = 0;
while ((position = input.IndexOf(value, position, stringComparisonType)) != -1)
{
position += value.Length;
count += 1;
}
return count;
}
/// <summary> Returns the number of occurrences of a single character within a string. </summary>
public static int Occurrences(this System.String input, char value)
{
int count = 0;
foreach (char c in input) if (c == value) count += 1;
return count;
}
}
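I think the easiest way to do this is to use regular expressions. This way you can get the same split count as you could using myVar.Split('x'), but in a multi-character setting.
string myVar = "do this to count the number of words in my wording so that I can word it up!";
int count = Regex.Split(myVar, "word").Length - 1; // number of pieces minus one = number of matches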
string search = "/string";
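var occurrences = Regex.Matches(search, @"/").Count; // Matches (plural) returns a collection that has a Count
This counts every case-sensitive match of the pattern, and the number of occurrences is stored in the variable "occurrences". As written the pattern matches "/"; use @"/s" to match "/s" exactly.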
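When the search term is not a fixed pattern, a small wrapper helps avoid regex metacharacters being interpreted. The following is only a sketch with a hypothetical class name (RegexCounter); it relies on Regex.Escape and Regex.Matches, which are standard members of System.Text.RegularExpressions.

```csharp
using System.Text.RegularExpressions;

public static class RegexCounter
{
    // Hypothetical helper: counts non-overlapping occurrences of a literal
    // string, optionally ignoring case.
    public static int Count(string input, string literal, bool ignoreCase = false)
    {
        if (string.IsNullOrEmpty(input) || string.IsNullOrEmpty(literal))
            return 0;

        var options = ignoreCase ? RegexOptions.IgnoreCase : RegexOptions.None;
        // Escape the literal so characters such as '.' or '{' are matched as themselves.
        return Regex.Matches(input, Regex.Escape(literal), options).Count;
    }
}
```

For instance, RegexCounter.Count("a.b.c", ".") gives 2, whereas an unescaped "." pattern would match every character.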
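I felt that we were lacking certain kinds of substring counting, like unsafe byte-by-byte comparisons. I put together the original poster's method and any methods I could think of.
These are the string extensions I made.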
namespace Example
{
using System;
using System.Text;
public static class StringExtensions
{
public static int CountSubstr(this string str, string substr)
{
return (str.Length - str.Replace(substr, "").Length) / substr.Length;
}
public static int CountSubstr(this string str, char substr)
{
return (str.Length - str.Replace(substr.ToString(), "").Length);
}
public static int CountSubstr2(this string str, string substr)
{
int substrlen = substr.Length;
int lastIndex = str.IndexOf(substr, 0, StringComparison.Ordinal);
int count = 0;
while (lastIndex != -1)
{
++count;
lastIndex = str.IndexOf(substr, lastIndex + substrlen, StringComparison.Ordinal);
}
return count;
}
public static int CountSubstr2(this string str, char substr)
{
int lastIndex = str.IndexOf(substr, 0);
int count = 0;
while (lastIndex != -1)
{
++count;
lastIndex = str.IndexOf(substr, lastIndex + 1);
}
return count;
}
public static int CountChar(this string str, char substr)
{
int length = str.Length;
int count = 0;
for (int i = 0; i < length; ++i)
if (str[i] == substr)
++count;
return count;
}
public static int CountChar2(this string str, char substr)
{
int count = 0;
foreach (var c in str)
if (c == substr)
++count;
return count;
}
public static unsafe int CountChar3(this string str, char substr)
{
int length = str.Length;
int count = 0;
fixed (char* chars = str)
{
for (int i = 0; i < length; ++i)
if (*(chars + i) == substr)
++count;
}
return count;
}
public static unsafe int CountChar4(this string str, char substr)
{
int length = str.Length;
int count = 0;
fixed (char* chars = str)
{
for (int i = length - 1; i >= 0; --i)
if (*(chars + i) == substr)
++count;
}
return count;
}
public static unsafe int CountSubstr3(this string str, string substr)
{
int length = str.Length;
int substrlen = substr.Length;
int count = 0;
fixed (char* strc = str)
{
fixed (char* substrc = substr)
{
int n = 0;
for (int i = 0; i < length; ++i)
{
if (*(strc + i) == *(substrc + n))
{
++n;
if (n == substrlen)
{
++count;
n = 0;
}
}
else
n = 0;
}
}
}
return count;
}
public static int CountSubstr3(this string str, char substr)
{
return CountSubstr3(str, substr.ToString());
}
public static unsafe int CountSubstr4(this string str, string substr)
{
int length = str.Length;
int substrLastIndex = substr.Length - 1;
int count = 0;
fixed (char* strc = str)
{
fixed (char* substrc = substr)
{
int n = substrLastIndex;
for (int i = length - 1; i >= 0; --i)
{
if (*(strc + i) == *(substrc + n))
{
if (--n == -1)
{
++count;
n = substrLastIndex;
}
}
else
n = substrLastIndex;
}
}
}
return count;
}
public static int CountSubstr4(this string str, char substr)
{
return CountSubstr4(str, substr.ToString());
}
}
}
Followed by the test code...
static void Main()
{
const char matchA = '_';
const string matchB = "and";
const string matchC = "muchlongerword";
const string testStrA = "_and_d_e_banna_i_o___pfasd__and_d_e_banna_i_o___pfasd_";
const string testStrB = "and sdf and ans andeians andano ip and and sdf and ans andeians andano ip and";
const string testStrC =
"muchlongerword amuchlongerworsdfmuchlongerwordsdf jmuchlongerworijv muchlongerword sdmuchlongerword dsmuchlongerword";
const int testSize = 1000000;
Console.WriteLine(testStrA.CountSubstr('_'));
Console.WriteLine(testStrA.CountSubstr2('_'));
Console.WriteLine(testStrA.CountSubstr3('_'));
Console.WriteLine(testStrA.CountSubstr4('_'));
Console.WriteLine(testStrA.CountChar('_'));
Console.WriteLine(testStrA.CountChar2('_'));
Console.WriteLine(testStrA.CountChar3('_'));
Console.WriteLine(testStrA.CountChar4('_'));
Console.WriteLine(testStrB.CountSubstr("and"));
Console.WriteLine(testStrB.CountSubstr2("and"));
Console.WriteLine(testStrB.CountSubstr3("and"));
Console.WriteLine(testStrB.CountSubstr4("and"));
Console.WriteLine(testStrC.CountSubstr("muchlongerword"));
Console.WriteLine(testStrC.CountSubstr2("muchlongerword"));
Console.WriteLine(testStrC.CountSubstr3("muchlongerword"));
Console.WriteLine(testStrC.CountSubstr4("muchlongerword"));
var timer = new Stopwatch();
timer.Start();
for (int i = 0; i < testSize; ++i)
testStrA.CountSubstr(matchA);
timer.Stop();
Console.WriteLine("CS1 chr: " + timer.Elapsed.TotalMilliseconds + "ms");
timer.Restart();
for (int i = 0; i < testSize; ++i)
testStrB.CountSubstr(matchB);
timer.Stop();
Console.WriteLine("CS1 and: " + timer.Elapsed.TotalMilliseconds + "ms");
timer.Restart();
for (int i = 0; i < testSize; ++i)
testStrC.CountSubstr(matchC);
timer.Stop();
Console.WriteLine("CS1 mlw: " + timer.Elapsed.TotalMilliseconds + "ms");
timer.Restart();
for (int i = 0; i < testSize; ++i)
testStrA.CountSubstr2(matchA);
timer.Stop();
Console.WriteLine("CS2 chr: " + timer.Elapsed.TotalMilliseconds + "ms");
timer.Restart();
for (int i = 0; i < testSize; ++i)
testStrB.CountSubstr2(matchB);
timer.Stop();
Console.WriteLine("CS2 and: " + timer.Elapsed.TotalMilliseconds + "ms");
timer.Restart();
for (int i = 0; i < testSize; ++i)
testStrC.CountSubstr2(matchC);
timer.Stop();
Console.WriteLine("CS2 mlw: " + timer.Elapsed.TotalMilliseconds + "ms");
timer.Restart();
for (int i = 0; i < testSize; ++i)
testStrA.CountSubstr3(matchA);
timer.Stop();
Console.WriteLine("CS3 chr: " + timer.Elapsed.TotalMilliseconds + "ms");
timer.Restart();
for (int i = 0; i < testSize; ++i)
testStrB.CountSubstr3(matchB);
timer.Stop();
Console.WriteLine("CS3 and: " + timer.Elapsed.TotalMilliseconds + "ms");
timer.Restart();
for (int i = 0; i < testSize; ++i)
testStrC.CountSubstr3(matchC);
timer.Stop();
Console.WriteLine("CS3 mlw: " + timer.Elapsed.TotalMilliseconds + "ms");
timer.Restart();
for (int i = 0; i < testSize; ++i)
testStrA.CountSubstr4(matchA);
timer.Stop();
Console.WriteLine("CS4 chr: " + timer.Elapsed.TotalMilliseconds + "ms");
timer.Restart();
for (int i = 0; i < testSize; ++i)
testStrB.CountSubstr4(matchB);
timer.Stop();
Console.WriteLine("CS4 and: " + timer.Elapsed.TotalMilliseconds + "ms");
timer.Restart();
for (int i = 0; i < testSize; ++i)
testStrC.CountSubstr4(matchC);
timer.Stop();
Console.WriteLine("CS4 mlw: " + timer.Elapsed.TotalMilliseconds + "ms");
timer.Restart();
for (int i = 0; i < testSize; ++i)
testStrA.CountChar(matchA);
timer.Stop();
Console.WriteLine("CC1 chr: " + timer.Elapsed.TotalMilliseconds + "ms");
timer.Restart();
for (int i = 0; i < testSize; ++i)
testStrA.CountChar2(matchA);
timer.Stop();
Console.WriteLine("CC2 chr: " + timer.Elapsed.TotalMilliseconds + "ms");
timer.Restart();
for (int i = 0; i < testSize; ++i)
testStrA.CountChar3(matchA);
timer.Stop();
Console.WriteLine("CC3 chr: " + timer.Elapsed.TotalMilliseconds + "ms");
timer.Restart();
for (int i = 0; i < testSize; ++i)
testStrA.CountChar4(matchA);
timer.Stop();
Console.WriteLine("CC4 chr: " + timer.Elapsed.TotalMilliseconds + "ms");
}
Results: CSX corresponds to CountSubstrX and CCX corresponds to CountCharX. "chr" searches a string for '_', "and" searches a string for "and", and "mlw" searches a string for "muchlongerword".
CS1 chr: 824.123ms
CS1 and: 586.1893ms
CS1 mlw: 486.5414ms
CS2 chr: 127.8941ms
CS2 and: 806.3918ms
CS2 mlw: 497.318ms
CS3 chr: 201.8896ms
CS3 and: 124.0675ms
CS3 mlw: 212.8341ms
CS4 chr: 81.5183ms
CS4 and: 92.0615ms
CS4 mlw: 116.2197ms
CC1 chr: 66.4078ms
CC2 chr: 64.0161ms
CC3 chr: 65.9013ms
CC4 chr: 65.8206ms
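Finally, I had a file with 3.6 million characters. It was "derp adfderdserp dfaerpderp deasderp" repeated 100,000 times. I searched for "derp" inside the file with the above methods 100 times each, with these results: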
CS1Derp: 1501.3444ms
CS2Derp: 1585.797ms
CS3Derp: 376.0937ms
CS4Derp: 271.1663ms
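So my 4th method is definitely the winner, but realistically, if scanning a 3.6-million-character file 100 times takes only 1586ms in the worst case, then all of this is quite negligible.
By the way, I also scanned for the 'd' char in the 3.6-million-character file 100 times with the CountSubstr and CountChar methods. Results...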
CS1 d : 2606.9513ms
CS2 d : 339.7942ms
CS3 d : 960.281ms
CS4 d : 233.3442ms
CC1 d : 302.4122ms
CC2 d : 280.7719ms
CC3 d : 299.1125ms
CC4 d : 292.9365ms
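According to this, the original poster's method is very bad for single-character needles in a large haystack.
Note: All values were updated to Release build output. I accidentally forgot to build in Release mode the first time I posted this. Some of my statements have been amended.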
string source = "/once/upon/a/time/";
int count = 0, n = 0;
while ((n = source.IndexOf('/', n) + 1) != 0) count++;
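A variation on Richard Watson's answer: less code, and slightly faster, with the gain growing the more often the char occurs in the string!
Though I must say, without extensively testing every scenario, I did see a very significant speed improvement by using: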
int count = 0;
for (int n = 0; n < source.Length; n++) if (source[n] == '/') count++;
var conditionalStatement = conditionSetting.Value;
//order of replace matters: replace the two-character operators before the single characters, in case of ===
conditionalStatement = conditionalStatement.Replace("==", "~").Replace("!=", "~").Replace(">=", "~").Replace("<=", "~").Replace('=', '~').Replace('!', '~').Replace('>', '~').Replace('<', '~');
var listOfValidConditions = new List<string>() { "!=", "==", ">", "<", ">=", "<=" };
if (conditionalStatement.Count(x => x == '~') != 1)
{
result.InvalidFieldList.Add(new KeyFieldData(batch.DECurrentField, "The IsDoubleKeyCondition does not contain a supported conditional statement. Contact System Administrator."));
result.Status = ValidatorStatus.Fail;
return result;
}
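I needed to do something similar to test conditional statements from a string.
I replaced what I was looking for with a single character and counted the instances of that character.
Obviously, before doing this you need to check that the single character you are using does not already exist in the string, to avoid incorrect counts.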
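One way to sidestep the placeholder-character check is to count the operators directly. The sketch below is an alternative technique, not the answerer's code: the class name OperatorCounter and the pattern are assumptions, chosen so that both two-character operators and stray single characters are counted as one token each.

```csharp
using System.Text.RegularExpressions;

public static class OperatorCounter
{
    // Two-character operators come first in the alternation so that ">=" is
    // matched as one token rather than as ">" followed by "=".
    private static readonly Regex Operators = new Regex("==|!=|>=|<=|[=!<>]");

    // Hypothetical helper: returns how many operator tokens the statement contains.
    public static int Count(string conditionalStatement)
    {
        return string.IsNullOrEmpty(conditionalStatement)
            ? 0
            : Operators.Matches(conditionalStatement).Count;
    }
}
```

A statement would then be accepted only when OperatorCounter.Count(statement) == 1; "a === b" yields 2 and is rejected.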
String in string:
Find "etc" in " .. JD JD JD JD etc. and etc. JDJDJDJDJDJDJDJD and etc."
var strOrigin = " .. JD JD JD JD etc. and etc. JDJDJDJDJDJDJDJD and etc.";
var searchStr = "etc";
int count = (strOrigin.Length - strOrigin.Replace(searchStr, "").Length)/searchStr.Length;
Check the performance before discarding this one as unsound/clumsy...
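Thought I would throw my extension method into the ring (see the comments for more info). I have not done any formal benchmarking, but I think it has to be very fast for most scenarios.
EDIT: OK - so this SO question got me wondering how the performance of our current implementation would stack up against some of the solutions presented here. I decided to do a little benchmarking and found that our solution was very much in line with the performance of the solution provided by Richard Watson, up until you are doing aggressive searching with large strings (100 Kb +), large substrings (32 Kb +) and many embedded repetitions (10K +). At that point our solution was around 2X to 4X slower. Given this, and the fact that we really like the solution presented by Richard Watson, we have refactored our solution accordingly. I just wanted to make this available for anyone who might benefit from it.
Our original solution: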
/// <summary>
/// Counts the number of occurrences of the specified substring within
/// the current string.
/// </summary>
/// <param name="s">The current string.</param>
/// <param name="substring">The substring we are searching for.</param>
/// <param name="aggressiveSearch">Indicates whether or not the algorithm
/// should be aggressive in its search behavior (see Remarks). Default
/// behavior is non-aggressive.</param>
/// <remarks>This algorithm has two search modes - aggressive and
/// non-aggressive. When in aggressive search mode (aggressiveSearch =
/// true), the algorithm will try to match at every possible starting
/// character index within the string. When false, all subsequent
/// character indexes within a substring match will not be evaluated.
/// For example, if the string was 'abbbc' and we were searching for
/// the substring 'bb', then aggressive search would find 2 matches
/// with starting indexes of 1 and 2. Non aggressive search would find
/// just 1 match with starting index at 1. After the match was made,
/// the non aggressive search would attempt to make its next match
/// starting at index 3 instead of 2.</remarks>
/// <returns>The count of occurrences of the substring within the string.</returns>
public static int CountOccurrences(this string s, string substring,
bool aggressiveSearch = false)
{
// if s or substring is null or empty, substring cannot be found in s
if (string.IsNullOrEmpty(s) || string.IsNullOrEmpty(substring))
return 0;
// if the length of substring is greater than the length of s,
// substring cannot be found in s
if (substring.Length > s.Length)
return 0;
var sChars = s.ToCharArray();
var substringChars = substring.ToCharArray();
var count = 0;
var sCharsIndex = 0;
// substring cannot start in s beyond following index
var lastStartIndex = sChars.Length - substringChars.Length;
while (sCharsIndex <= lastStartIndex)
{
if (sChars[sCharsIndex] == substringChars[0])
{
// potential match checking
var match = true;
var offset = 1;
while (offset < substringChars.Length)
{
if (sChars[sCharsIndex + offset] != substringChars[offset])
{
match = false;
break;
}
offset++;
}
if (match)
{
count++;
// if aggressive, just advance to next char in s, otherwise,
// skip past the match just found in s
sCharsIndex += aggressiveSearch ? 1 : substringChars.Length;
}
else
{
// no match found, just move to next char in s
sCharsIndex++;
}
}
else
{
// no match at current index, move along
sCharsIndex++;
}
}
return count;
}
Here is our revised solution:
/// <summary>
/// Counts the number of occurrences of the specified substring within
/// the current string.
/// </summary>
/// <param name="s">The current string.</param>
/// <param name="substring">The substring we are searching for.</param>
/// <param name="aggressiveSearch">Indicates whether or not the algorithm
/// should be aggressive in its search behavior (see Remarks). Default
/// behavior is non-aggressive.</param>
/// <remarks>This algorithm has two search modes - aggressive and
/// non-aggressive. When in aggressive search mode (aggressiveSearch =
/// true), the algorithm will try to match at every possible starting
/// character index within the string. When false, all subsequent
/// character indexes within a substring match will not be evaluated.
/// For example, if the string was 'abbbc' and we were searching for
/// the substring 'bb', then aggressive search would find 2 matches
/// with starting indexes of 1 and 2. Non aggressive search would find
/// just 1 match with starting index at 1. After the match was made,
/// the non aggressive search would attempt to make its next match
/// starting at index 3 instead of 2.</remarks>
/// <returns>The count of occurrences of the substring within the string.</returns>
public static int CountOccurrences(this string s, string substring,
bool aggressiveSearch = false)
{
// if s or substring is null or empty, substring cannot be found in s
if (string.IsNullOrEmpty(s) || string.IsNullOrEmpty(substring))
return 0;
// if the length of substring is greater than the length of s,
// substring cannot be found in s
if (substring.Length > s.Length)
return 0;
int count = 0, n = 0;
while ((n = s.IndexOf(substring, n, StringComparison.InvariantCulture)) != -1)
{
if (aggressiveSearch)
n++;
else
n += substring.Length;
count++;
}
return count;
}
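The original solution compared characters one by one (an ordinal comparison), while the revised one passes StringComparison.InvariantCulture to IndexOf, so the two can disagree on culture-sensitive input. The following sketch shows the same loop with StringComparison.Ordinal. It is a variation under that assumption, with a hypothetical class name and an "overlapping" flag standing in for aggressiveSearch.

```csharp
using System;

public static class OrdinalOccurrences
{
    // Hypothetical variant: overlapping = true advances one character after
    // each match; false skips past the whole match.
    public static int Count(string s, string substring, bool overlapping = false)
    {
        if (string.IsNullOrEmpty(s) || string.IsNullOrEmpty(substring))
            return 0;

        int count = 0, n = 0;
        while ((n = s.IndexOf(substring, n, StringComparison.Ordinal)) != -1)
        {
            n += overlapping ? 1 : substring.Length;
            count++;
        }
        return count;
    }
}
```

Using the example from the remarks, OrdinalOccurrences.Count("abbbc", "bb", true) gives 2 and OrdinalOccurrences.Count("abbbc", "bb") gives 1.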
My initial take gave me something like:
public static int CountOccurrences(string original, string substring)
{
if (string.IsNullOrEmpty(substring))
return 0;
if (substring.Length == 1)
return CountOccurrences(original, substring[0]);
if (string.IsNullOrEmpty(original) ||
substring.Length > original.Length)
return 0;
int substringCount = 0;
for (int charIndex = 0; charIndex < original.Length; charIndex++)
{
for (int subCharIndex = 0, secondaryCharIndex = charIndex; subCharIndex < substring.Length && secondaryCharIndex < original.Length; subCharIndex++, secondaryCharIndex++)
{
if (substring[subCharIndex] != original[secondaryCharIndex])
goto continueOuter;
}
if (charIndex + substring.Length > original.Length)
break;
charIndex += substring.Length - 1;
substringCount++;
continueOuter:
;
}
return substringCount;
}
public static int CountOccurrences(string original, char @char)
{
if (string.IsNullOrEmpty(original))
return 0;
int substringCount = 0;
for (int charIndex = 0; charIndex < original.Length; charIndex++)
if (@char == original[charIndex])
substringCount++;
return substringCount;
}
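The needle-in-a-haystack approach using replace and division takes 21+ seconds, whereas this takes about 15.2.
Edit: after adding a bit which adds substring.Length - 1 to charIndex (like it should), it's at 11.6 seconds.
Edit 2: I used a string which had 26 two-character strings; here are the times updated for the same sample text:
Needle in a haystack (OP's version): 7.8 seconds
Suggested mechanism: 4.6 seconds.
Edit 3: Adding the single-character corner case, it went to 1.2 seconds.
Edit 4: For context: 50 million iterations were used.
A generic function for counting occurrences of strings: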
public int getNumberOfOccurencies(String inputString, String checkString)
{
if (checkString.Length > inputString.Length || checkString.Equals("")) { return 0; }
int lengthDifference = inputString.Length - checkString.Length;
int occurencies = 0;
for (int i = 0; i <= lengthDifference; i++) { // <= so a match ending at the last character is counted
if (inputString.Substring(i, checkString.Length).Equals(checkString)) { occurencies++; i += checkString.Length - 1; } }
return occurencies;
}
string Name = "Very good nice one is very good but is very good nice one this is called the term";
bool valid=true;
int count = 0;
int k=0;
int m = 0;
while (valid)
{
k = Name.Substring(m,Name.Length-m).IndexOf("good");
if (k != -1)
{
count++;
m = m + k + 4;
}
else
valid = false;
}
Console.WriteLine(count + " occurrences found");
string s = "HOWLYH THIS ACTUALLY WORKSH WOWH";
int count = 0;
for (int i = 0; i < s.Length; i++)
if (s[i] == 'H') count++;
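It just checks every character in the string; if the character is the one you are searching for, it adds one to count.
If you check out this webpage, 15 different ways of doing this are benchmarked, including using parallel loops.
The fastest way appears to be either a single-threaded for loop (if your .Net version is < 4.0) or a Parallel.For loop (if using .Net > 4.0 with thousands of checks).
Assuming "ss" is your search string (indexed as an array of strings in the code below) and "ch" is your character array (if you are looking for more than one char), here is the basic gist of the code that had the fastest single-threaded run time: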
for (int x = 0; x < ss.Length; x++)
{
for (int y = 0; y < ch.Length; y++)
{
for (int a = 0; a < ss[x].Length; a++ )
{
if (ss[x][a] == ch[y])
{
//it's found. DO what you need to here.
}
}
}
}
The benchmark source code is provided too, so you can run your own tests.
For the case of a string delimiter (not for the char case, as the subject says):
string source = "@@@once@@@upon@@@a@@@time@@@";
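int count = source.Split(new[] { "@@@" }, StringSplitOptions.None).Length - 1; // None keeps empty entries, so leading and trailing delimiters are counted
The natural delimiter of the poster's original source value ("/once/upon/a/time/") is the char '/', and other responses do explain the source.Split(char[]) option, though...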
Starting with .NET 5 (and .NET Core 2.1+ / .NET Standard 2.1) we have a new iteration speed king:
"Span<T>" https://docs.microsoft.com/en-us/dotnet/api/system.span-1?view=net-5.0
and String has an AsSpan() extension method that returns a ReadOnlySpan<char>:
int count = 0;
foreach( var c in source.AsSpan())
{
if (c == '/')
count++;
}
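My testing shows it is 62% faster than a straight foreach. I also compared it to a for() loop over Span<T>[i], as well as a few others posted here. Note that the reverse for() iteration on a String now seems to run slower than a straight foreach.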
Starting test, 10000000 iterations
(base) foreach = 673 ms
fastest to slowest
foreach Span = 252 ms 62.6%
Span [i--] = 282 ms 58.1%
Span [i++] = 402 ms 40.3%
for [i++] = 454 ms 32.5%
for [i--] = 867 ms -28.8%
Replace = 1905 ms -183.1%
Split = 2109 ms -213.4%
Linq.Count = 3797 ms -464.2%
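The span approach is not limited to single characters. As a hedged sketch (the class name SpanCounter is hypothetical and this was not part of the benchmark above), substrings can be counted by repeatedly calling the span IndexOf extension and slicing past each match. The comparison is ordinal and the count is non-overlapping.

```csharp
using System;

public static class SpanCounter
{
    // Hypothetical helper: counts non-overlapping occurrences of needle
    // without allocating substrings.
    public static int Count(ReadOnlySpan<char> haystack, ReadOnlySpan<char> needle)
    {
        if (needle.IsEmpty)
            return 0;

        int count = 0;
        int index;
        while ((index = haystack.IndexOf(needle)) >= 0)
        {
            count++;
            // Continue searching in the part of the span after this match.
            haystack = haystack.Slice(index + needle.Length);
        }
        return count;
    }
}
```

Strings convert implicitly to ReadOnlySpan<char>, so SpanCounter.Count("/once/upon/a/time/", "/") can be called directly and gives 5.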
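Split (may) win over IndexOf (for strings). The benchmark above seems to indicate that Richard Watson's approach is the fastest for strings, which is wrong (maybe the difference comes from our test data, but it seems strange anyway for the reasons below).
If we look a bit deeper into the implementation of these methods in .NET (for the Luke H and Richard Watson methods):
IndexOf depends on the culture; it will attempt to retrieve/create a ReadOnlySpan, check whether it has to ignore case, and so on, and then finally do the unsafe/native call.
Split is able to handle several separators and has some StringSplitOptions; it has to create the string[] array and fill it with the split results (so it does some substring work as well). Depending on the number of occurrences, Split may be faster than IndexOf.
By the way, I made a simplified version of IndexOf (it could be faster if I used pointers and unsafe, but unchecked should be OK for most people) that is at least 4 times faster in the benchmark below.
The benchmark was done by searching for either a common word ("the") or a short sentence in Shakespeare's Richard III.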
Method | Mean | Error | StdDev | Ratio |
---|---|---|---|---|
Richard_LongInLong | 67.721 us | 1.0278 us | 0.9614 us | 1.00 |
Luke_LongInLong | 1.960 us | 0.0381 us | 0.0637 us | 0.03 |
Fab_LongInLong | 1.198 us | 0.0160 us | 0.0142 us | 0.02 |
---------------------- | ----------: | ----------: | ----------: | ------: |
Richard_ShortInLong | 104.771 us | 2.8117 us | 7.9304 us | 1.00 |
Luke_ShortInLong | 2.971 us | 0.0594 us | 0.0813 us | 0.03 |
Fab_ShortInLong | 2.206 us | 0.0419 us | 0.0411 us | 0.02 |
---------------------- | ----------: | ----------: | ----------: | ------: |
Richard_ShortInShort | 115.53 ns | 1.359 ns | 1.135 ns | 1.00 |
Luke_ShortInShort | 52.46 ns | 0.970 ns | 0.908 ns | 0.45 |
Fab_ShortInShort | 28.47 ns | 0.552 ns | 0.542 ns | 0.25 |
public int GetOccurrences(string input, string needle)
{
int count = 0;
unchecked
{
if (string.IsNullOrEmpty(input) || string.IsNullOrEmpty(needle))
{
return 0;
}
for (var i = 0; i < input.Length - needle.Length + 1; i++)
{
var c = input[i];
if (c == needle[0])
{
for (var index = 0; index < needle.Length; index++)
{
c = input[i + index];
var n = needle[index];
if (c != n)
{
break;
}
else if (index == needle.Length - 1)
{
count++;
}
}
}
}
}
return count;
}
string str = "aaabbbbjjja";
int count = 0;
int size = str.Length;
string[] strarray = new string[size];
for (int i = 0; i < str.Length; i++)
{
strarray[i] = str.Substring(i, 1);
}
Array.Sort(strarray);
str = "";
for (int i = 0; i < strarray.Length - 1; i++)
{
if (strarray[i] == strarray[i + 1])
{
count++;
}
else
{
count++;
str = str + strarray[i] + count;
count = 0;
}
}
count++;
str = str + strarray[strarray.Length - 1] + count;
This counts the occurrences of each character. For this example the output will be "a4b4j3".
using System.Linq;
int CountOf => "A::BC::D".Split("::").Length - 1;
**Counting a char or a string**
string st = "asdfasdfasdfsadfasdf/asdfasdfas/dfsdfsdafsdfsd/fsadfasdf/dff";
int count = 0;
int location = -1; // start before index 0 so a leading "/" is counted too
while ((location = st.IndexOf("/", location + 1, StringComparison.Ordinal)) >= 0)
{
count++;
}
MessageBox.Show(count.ToString());
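Finding char counts is quite different from finding string counts. It also depends on whether you want to be able to check for more than one. If you want to check counts for a variety of different chars, something like this can work: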
var charCounts =
haystack
.GroupBy(c => c)
.ToDictionary(g => g.Key, g => g.Count());
var needleCount = charCounts.ContainsKey(needle) ? charCounts[needle] : 0;
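Note 1: grouping into a dictionary is useful enough that it makes a lot of sense to write a GroupToDictionary extension method for it.
Note 2: it could also be useful to implement your own dictionary that allows for default values; then you would automatically get 0 for non-existent keys.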
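The two notes can be sketched roughly as follows. GroupToDictionary and CountOrZero are hypothetical names that do not exist in the framework, and a TryGetValue lookup stands in here for a custom default-value dictionary.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class CharCountExtensions
{
    // Hypothetical helper (Note 1): groups equal items and stores how often each occurs.
    public static Dictionary<T, int> GroupToDictionary<T>(this IEnumerable<T> source)
    {
        return source.GroupBy(x => x).ToDictionary(g => g.Key, g => g.Count());
    }

    // Hypothetical helper (Note 2): returns 0 instead of throwing for a missing key.
    public static int CountOrZero<T>(this Dictionary<T, int> counts, T key)
    {
        int n;
        return counts.TryGetValue(key, out n) ? n : 0;
    }
}
```

Usage could look like var counts = "/once/upon/a/time/".GroupToDictionary(); followed by counts.CountOrZero('/') for each char of interest. Building the dictionary only pays off when several different chars are looked up in the same haystack.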
Starting with .NET 7, we have allocation-free (and highly optimized) Regex APIs. Counting is especially easy and efficient.
var input = "abcd abcabc ababc";
var result = Regex.Count(input: input, pattern: "abc"); // 4
When matching dynamic patterns, remember to escape them:
public static int CountOccurences(string input, string pattern)
{
pattern = Regex.Escape(pattern); // Aww, no way to avoid heap allocations here
var result = Regex.Count(input: input, pattern: pattern);
return result;
}
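And, as a bonus for fixed patterns, .NET 7 introduces analyzers that help convert the regex string into source-generated code. Not only does this avoid the runtime compilation overhead for the regex, but it also provides very readable code that shows how it is implemented. In fact, that code is generally at least as efficient as any alternative you could have written by hand.
If your regex call is eligible, the analyzer will give a hint. Simply choose "Convert to 'GeneratedRegexAttribute'" and enjoy the result: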
[GeneratedRegex("abc")]
private static partial Regex MyRegex(); // Go To Definition to see the generated code