
C# Removing separator characters from quoted strings

I'm writing a program that has to remove separator characters from quoted strings in text files.

For example:

"Hello, my name is world"

Has to become:

"Hello my name is world"

This sounds quite easy at first (I thought it would be), but you need to detect where the quote starts and where it ends, then search that specific substring for separator characters. How?

I've experimented with some regexes, but I just keep getting myself confused!

Any ideas? Even just something to get the ball rolling would help; I'm completely stumped.

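// Requires: using System.Text.RegularExpressions;
string pattern = "\"([^\"]+)\"";
string[] removalCharacters = {",", ";"}; // or any other characters

// Rewrite every quoted match in place; text outside the quotes is left untouched.
string result = Regex.Replace(textToSearch, pattern, m =>
{
    string value = m.Value;
    foreach (string character in removalCharacters)
    {
        value = value.Replace(character, "");
    }
    return value;
});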

Why not try doing it with LINQ?

var x = @" this is a great whatever ""Hello, my name is world"" and all that";

var result = string.Join(@"""", x.Split('"').
Select((val, index) => index%2 == 1 ? 
val.Replace(",", "") : val).ToArray());

Using a regex pattern with a look-ahead, the pattern would be: "\"(?=[^\"]+,)[^\"]+\""

The \" matches the opening double-quote. The look-ahead (?=[^\"]+,) will try to match a comma within the quoted text. Next we match the rest of the string as long as it's not a double-quote, [^\"]+ , then we match the closing double-quote \" .

Using Regex.Replace allows for a compact approach to altering the result and removing the unwanted commas.

string input = "\"Hello, my name, is world\"";
string pattern = "\"(?=[^\"]+,)[^\"]+\"";
string result = Regex.Replace(input, pattern, m => m.Value.Replace(",", ""));
Console.WriteLine(result);
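
One caveat worth checking before using the look-ahead on longer text: when a quoted string without a comma is followed by unquoted text that does contain one, the closing quote can be picked up as an opening quote, and the comma outside the quotes is removed instead. A minimal sketch of a variant that matches every quoted run (with or without a comma), so quotes are always consumed in pairs, is shown here. The class and method names are made up for illustration, and escaped quotes are not handled.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedCommaRemover
{
    // Hypothetical helper: matches each complete "..." run and strips commas inside it.
    // Because every quoted run is matched, opening and closing quotes stay paired.
    public static string RemoveCommas(string input)
    {
        return Regex.Replace(input, "\"[^\"]*\"", m => m.Value.Replace(",", ""));
    }

    public static void Main()
    {
        string input = "\"z\" c, d \"w, v\"";
        Console.WriteLine(RemoveCommas(input)); // "z" c, d "w v"
    }
}
```

With the look-ahead pattern above, the same input would lose the comma between c and d and keep the one between w and v.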

What you want to write is called a "lexer" (or alternatively a "tokenizer"), which reads the input character by character and breaks it up into tokens. That's generally how parsing in a compiler works (as a first step). A lexer will break text up into a stream of tokens (string literal, identifier, "(", etc). The parser then takes those tokens and uses them to produce a parse tree.

In your case, you only need a lexer. You will have 2 types of tokens: "quoted strings" and "everything else".

You then just need to write code to break the input up into tokens. By default something is an "everything else" token. A string token starts when you see a ", and ends when you see the next ". If you are reading source code you may have to deal with things like \" or "" as special cases.

Once you have done that, you can just iterate over the tokens and do whatever processing you need on the "string" tokens.
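
As a rough illustration of the two-token lexer described above, here is a minimal sketch. The names TokenKind, Token, QuoteLexer, Tokenize and RemoveSeparators are all invented for this example, and it assumes there are no escaped quotes (\" or "") in the input.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public enum TokenKind { Other, QuotedString }

public sealed class Token
{
    public TokenKind Kind { get; }
    public string Text { get; }
    public Token(TokenKind kind, string text) { Kind = kind; Text = text; }
}

public static class QuoteLexer
{
    // Splits the input into "everything else" tokens and quoted-string tokens.
    // A quoted token keeps its surrounding quotes; escapes are not handled.
    public static List<Token> Tokenize(string input)
    {
        var tokens = new List<Token>();
        var buffer = new StringBuilder();
        bool insideQuotes = false;

        foreach (char c in input)
        {
            if (c == '"' && !insideQuotes)
            {
                // Opening quote: flush any pending "everything else" text first.
                if (buffer.Length > 0) tokens.Add(new Token(TokenKind.Other, buffer.ToString()));
                buffer.Clear();
                buffer.Append(c);
                insideQuotes = true;
            }
            else if (c == '"')
            {
                // Closing quote: the quoted-string token is complete.
                buffer.Append(c);
                tokens.Add(new Token(TokenKind.QuotedString, buffer.ToString()));
                buffer.Clear();
                insideQuotes = false;
            }
            else
            {
                buffer.Append(c);
            }
        }

        // Trailing text; an unterminated quote is passed through as "everything else".
        if (buffer.Length > 0) tokens.Add(new Token(TokenKind.Other, buffer.ToString()));
        return tokens;
    }

    // Rebuilds the text, removing the given separators only inside quoted-string tokens.
    public static string RemoveSeparators(string input, params char[] separators)
    {
        var result = new StringBuilder();
        foreach (Token token in Tokenize(input))
        {
            string text = token.Text;
            if (token.Kind == TokenKind.QuotedString)
            {
                foreach (char separator in separators)
                    text = text.Replace(separator.ToString(), "");
            }
            result.Append(text);
        }
        return result.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(RemoveSeparators("a, b \"Hello, my name; is world\" c, d", ',', ';'));
        // a, b "Hello my name is world" c, d
    }
}
```

Keeping the tokens as a list, rather than rewriting the text in one pass, makes it easy to add other per-token processing later.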

So I guess you have some long text with a lot of quotes inside? I would write a method that does something like this:

  1. Run through the string until you encounter the first "
  2. Then take the substring up to the next ", and do a str.Replace(",","") on it, and also replace any other characters that you want to remove.
  3. Then go on without replacing until you encounter the next " and continue like this until the end.
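
The steps above can be sketched roughly as follows, using IndexOf to jump from quote to quote. The class and method names are hypothetical, only commas are removed, and an unterminated quote is assumed to mean "leave the rest of the text alone".

```csharp
using System;
using System.Text;

public static class QuoteScanner
{
    public static string RemoveCommasInsideQuotes(string text)
    {
        var result = new StringBuilder();
        int position = 0;

        while (position < text.Length)
        {
            // Step 1: find the next opening quote.
            int open = text.IndexOf('"', position);
            if (open < 0) break;

            // Step 2: find the matching closing quote.
            int close = text.IndexOf('"', open + 1);
            if (close < 0) break; // unterminated quote: leave the rest untouched

            // Copy the text before the quote unchanged, then the cleaned quoted part.
            result.Append(text, position, open - position);
            result.Append(text.Substring(open, close - open + 1).Replace(",", ""));

            // Step 3: carry on after the closing quote.
            position = close + 1;
        }

        // Whatever is left after the last complete quoted part is copied as-is.
        result.Append(text, position, text.Length - position);
        return result.ToString();
    }

    public static void Main()
    {
        string input = "This is a string\"containing, a quote\"and some, more text";
        Console.WriteLine(RemoveCommasInsideQuotes(input));
        // This is a string"containing a quote"and some, more text
    }
}
```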

EDIT

I just got a better idea. What about this:

  string mycompletestring = "This is a string\"containing, a quote\"and some more text";
  string[] splitstring = mycompletestring.Split('"');
  for (int i = 1; i < splitstring.Length; i += 2) {
    splitstring[i] = splitstring[i].Replace(",", "");
  }
  StringBuilder builder = new StringBuilder();
  foreach (string s in splitstring) {
    builder.Append(s + '"');
  }
  mycompletestring = builder.ToString().Substring(0, builder.ToString().Length - 1);

I think there should be a better way of joining the pieces back into one string with a " between them at the end, but I don't know of one, so feel free to suggest a good method here :)

I've had to do something similar in an application I use to translate flat files. This is the approach I took (just a copy/paste from my application):

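    // Splits one CSV line into fields; delimiters inside quoted fields are ignored.
    // Assumes every field is wrapped in double quotes, and that FieldDelimiter is a
    // member of the containing class (not shown here).
    protected virtual string[] delimitCVSBuffer(string inputBuffer) {
        List<string> output       = new List<string>();
        bool insideQuotes         = false;
        StringBuilder fieldBuffer = new StringBuilder();
        foreach (char c in inputBuffer) {
            if (c == FieldDelimiter && !insideQuotes) {
                // End of a field: drop its first and last character (the quotes) and store it.
                output.Add(fieldBuffer.Remove(0, 1).Remove(fieldBuffer.Length - 1, 1).ToString().Trim());
                fieldBuffer.Clear();
                continue;
            } else if (c == '\"')
                insideQuotes = !insideQuotes;
            fieldBuffer.Append(c);
        }
        // The last field has no trailing delimiter, so add it after the loop.
        output.Add(fieldBuffer.Remove(0, 1).Remove(fieldBuffer.Length - 1, 1).ToString().Trim());
        return output.ToArray();
    }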

OK, this is a bit wacky, but it works.

So first off you split your string up into parts, based on the " character:

string msg = "this string should have a comma here,\"but, there should be no comma in this bit\", and there should be a comma back at that and";

var parts = msg.Split('"');

Then you need to join the string back together on the " character, after removing each comma in every other part:

string result = string.Join("\"", RemoveCommaFromEveryOther(parts));

The removal function looks like this:

IEnumerable<string> RemoveCommaFromEveryOther(IEnumerable<string> parts)
{
    using (var partenum = parts.GetEnumerator())
    {
        bool replace = false;
        while (partenum.MoveNext())
        {
            if(replace)
            {
                yield return partenum.Current.Replace(",","");
                replace = false;
            }
            else
            {
                yield return partenum.Current;
                replace = true;
            }
        }
    }
}

This does require that you include a using directive for System.Collections.Generic .

There are many ways to do this: look at the functions string.Split() and string.IndexOfAny()

You can use string.Split(new char[] {',',' '}, StringSplitOptions.RemoveEmptyEntries) to split the phrase into words, then use the StringBuilder class to put the words together.

Calling string.Replace("[char to remove goes here]", "") multiple times, once for each char you want to remove, will also work.

EDIT:

Call string.Split(new char[] {'"'}) to split the text at every quote ( " ). The pieces at odd indexes are the ones that were between quotes, so call Replace on each of those, then put the strings back together with StringBuilder , re-inserting the quotes between them. Don't pass StringSplitOptions.RemoveEmptyEntries here: dropping empty pieces would shift the indexes whenever the text starts with a quote or has two quotes in a row.
