Replacing repetitive characters in a string

Is it possible to find and replace any repetitive characters in a string using C#? I'm trying to reduce the size of a base64 string, which is converted from a JPEG image. I've noticed that the base64 strings contain many repeated characters, such as:

6qdQAUUxJA7uuCGQ8g/wA6fQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFYXiFL5b7TrmwtzM8Xmr7KWUAE+ 6qdQAUUxJA7uuCGQ8g / wA6fQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFYXiFL5b7TrmwtzM8Xmr7KWUAE +

If there were a way to replace the repetitive characters with something like this, the string would be much smaller overall:

[QAUUUUAFFFFABRRR, 18]

This is in the format of [REPEATED-CHARACTERS, NUMBER-OF-TIMES].
Would this be possible to do? Thanks for the help. :)

You would essentially have to create a search and replace function. It really depends on whether or not the repetitive strings are of a constant length. In your example, the repetitive string is 16 characters long, so you could write a routine that grabs the first 16 characters, compares them to the next 16 characters, and so on until it finds a string that is different. It would then replace the repeated strings with your syntax to represent them.
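The fixed-length routine described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `FixedBlockRle`, the method `Encode`, and the `blockLength` parameter are hypothetical names, and the sketch slides forward one character at a time when no repeat is found so that runs not aligned to the start of the string are still caught.

```csharp
using System;
using System.Text;

static class FixedBlockRle
{
    // Collapses consecutive repeats of blockLength-character blocks into "[block,count]".
    // Characters that are not part of a repeated block are copied through unchanged.
    public static string Encode(string input, int blockLength)
    {
        var output = new StringBuilder();
        int i = 0;
        while (i + blockLength <= input.Length)
        {
            string block = input.Substring(i, blockLength);
            int count = 1;
            int next = i + blockLength;
            // count how many times the same block follows immediately
            while (next + blockLength <= input.Length &&
                   string.CompareOrdinal(input, next, block, 0, blockLength) == 0)
            {
                count++;
                next += blockLength;
            }
            if (count > 1)
            {
                output.Append('[').Append(block).Append(',').Append(count).Append(']');
                i = next;
            }
            else
            {
                output.Append(input[i]); // no repeat here: emit one character and slide forward
                i++;
            }
        }
        output.Append(input, i, input.Length - i); // tail shorter than one block
        return output.ToString();
    }
}
```

For example, `FixedBlockRle.Encode("xyABABABz", 2)` yields `xy[AB,3]z`. The markers `[`, `,` and `]` do not occur in standard base64, so they are unambiguous there; for arbitrary text they would need escaping, and for very short blocks the marker can be longer than the run it replaces.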

If the length of the repetitive string is variable, then it's a little more complex. You would essentially have to start with a short string and keep growing it, comparing it to the next set of characters of the same length; if they repeat, check the next ones, and so on. This could be hit and miss, though.
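A minimal sketch of the variable-length idea, assuming an upper bound on the unit length is acceptable. `RepeatFinder`, `BestRunAt`, and `maxUnitLength` are hypothetical names introduced for illustration. At a given start position it grows the candidate unit one character at a time and keeps the unit whose consecutive repeats cover the most characters.

```csharp
using System;

static class RepeatFinder
{
    // Returns the unit length and repeat count of the run starting at 'start'
    // that covers the most characters, trying unit lengths 1..maxUnitLength.
    // Returns (1, 1) when nothing repeats at that position.
    public static (int UnitLength, int Count) BestRunAt(string input, int start, int maxUnitLength)
    {
        int bestLength = 1, bestCount = 1;
        for (int len = 1; len <= maxUnitLength && start + len <= input.Length; len++)
        {
            int count = 1;
            int next = start + len;
            // compare the unit with each following chunk of the same length
            while (next + len <= input.Length &&
                   string.CompareOrdinal(input, start, input, next, len) == 0)
            {
                count++;
                next += len;
            }
            // keep the candidate that covers more characters; the shorter unit wins a tie
            if (count > 1 && len * count > bestLength * bestCount)
            {
                bestLength = len;
                bestCount = count;
            }
        }
        return (bestLength, bestCount);
    }
}
```

Calling `RepeatFinder.BestRunAt("abcabcabcx", 0, 5)` gives a unit length of 3 repeated 3 times. Running this at every position is quadratic or worse in the string length, which is one reason the approach can be hit and miss on large inputs.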

Do a search on compression algorithms, as many of them work on similar principles.

You can find the longest string with the maximum number of repeats.

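string str = "AUUUUAFFFFABRRRQAUUUUAFFFFABRRRQ"; // the input string (sample value)
int mx = -1;
string best = null; // most frequently repeated substring found so far
for (int i = 0; i < str.Length; i++)
    for (int j = i + 1; j <= str.Length; j++)
    {
        string sub = str.Substring(i, j - i);
        int tmp = countAll(str, sub); // write countAll() yourself: number of occurrences of sub in str
        // prefer more occurrences; on a tie, prefer the longer substring
        if (tmp > mx || (tmp == mx && sub.Length > best.Length)) { mx = tmp; best = sub; }
    }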
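The answer leaves `countAll()` to the reader. One possible version, given here only as a sketch: the class name `SubstringCounter` and the choice to count overlapping occurrences are assumptions, not something the answer specifies.

```csharp
using System;

static class SubstringCounter
{
    // Counts occurrences of 'sub' in 'text', including overlapping ones.
    public static int CountAll(string text, string sub)
    {
        if (string.IsNullOrEmpty(sub)) return 0;
        int count = 0;
        int index = text.IndexOf(sub, StringComparison.Ordinal);
        while (index >= 0)
        {
            count++;
            // resume one character later so overlapping matches are found too
            index = text.IndexOf(sub, index + 1, StringComparison.Ordinal);
        }
        return count;
    }
}
```

With this definition, `SubstringCounter.CountAll("aaaa", "aa")` returns 3. Note that the nested loops enumerate every substring and count each one with a full scan, so the overall search grows at least cubically with the input length and is likely impractical for a base64 string of any real size.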

Or, better, use a Dictionary.

Dictionary<char, int> rep = new Dictionary<char, int>();
for (int i = 0; i < str.Length; i++)
  if (rep.ContainsKey(str[i])) rep[str[i]]++;
  else rep.Add(str[i], 1);

Each character will then be associated with its number of occurrences:

string total = "";
foreach (var item in rep) total += item.Key;

ADD:

If you really want to find the longest repeated substring, then you should use dynamic programming to solve this problem instead.
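One common dynamic-programming formulation, sketched here under the assumption that the two occurrences must not overlap (the answer does not say which variant it has in mind). The names `LongestRepeat` and `Find` are hypothetical. It tracks, for each pair of end positions, the length of the longest common suffix, using two rolling rows to keep memory linear.

```csharp
using System;

static class LongestRepeat
{
    // Longest substring that occurs at least twice without overlapping itself.
    // prev[j] holds the length of the longest common suffix of input[0..i-1) and input[0..j),
    // restricted to matches whose two occurrences stay disjoint.
    public static string Find(string input)
    {
        int n = input.Length;
        int[] prev = new int[n + 1];
        int[] curr = new int[n + 1];
        int bestLength = 0, bestEnd = 0;
        for (int i = 1; i <= n; i++)
        {
            for (int j = i + 1; j <= n; j++)
            {
                // extend the match only while the first occurrence ends before the second begins
                if (input[i - 1] == input[j - 1] && prev[j - 1] < j - i)
                {
                    curr[j] = prev[j - 1] + 1;
                    if (curr[j] > bestLength) { bestLength = curr[j]; bestEnd = i; }
                }
                else
                {
                    curr[j] = 0;
                }
            }
            (prev, curr) = (curr, prev); // the row just filled becomes the previous row
        }
        return input.Substring(bestEnd - bestLength, bestLength);
    }
}
```

For instance, `LongestRepeat.Find("abcabc")` returns `abc`. The running time is quadratic in the input length; suffix-array or suffix-tree approaches exist for larger inputs but are outside the scope of this sketch.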

You're essentially trying to come up with your own lossless compression algorithm - algorithms like zip work by doing exactly what you're asking for, except that they work on bytes rather than characters in a string.

Popular compression algorithms are virtually guaranteed to be more efficient than something you can design and implement in a reasonable amount of time. For one, they will probably see patterns that aren't evident in the base64 string due to byte alignment issues.

So why not just use one of them to compress the binary data before base64-encoding it, instead of the other way around?
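A minimal sketch of that order of operations, using `GZipStream` from `System.IO.Compression` as one readily available choice (the answer does not name a specific algorithm). The class name `CompressedBase64` and its methods are hypothetical.

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class CompressedBase64
{
    // Compresses the raw bytes first, then base64-encodes the compressed result.
    public static string Encode(byte[] data)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
            {
                gzip.Write(data, 0, data.Length);
            } // disposing the GZipStream flushes the final compressed block
            return Convert.ToBase64String(buffer.ToArray());
        }
    }

    // Reverses Encode: base64-decodes, then decompresses back to the original bytes.
    public static byte[] Decode(string base64)
    {
        using (var input = new MemoryStream(Convert.FromBase64String(base64)))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

Typical use would be `string packed = CompressedBase64.Encode(File.ReadAllBytes(path));`, where `path` stands for wherever the image lives. Keep in mind that JPEG data is already compressed, so the gain may be modest unless the file contains repetitive regions like the one shown in the question.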

