C# - Splitting on a pipe with an escaped pipe in the data?

I've got a pipe-delimited file that I would like to split (I'm using C#). For example:

This|is|a|test

However, some of the data can contain a pipe. If it does, the pipe will be escaped with a backslash:

This|is|a|pip\|ed|test (this is a pip|ed test)

I'm wondering if there is a regexp or some other method to split this apart on just the "pure" pipes (that is, pipes that have no backslash in front of them). My current method is to replace the escaped pipes with a custom bit of text, split on pipes, and then replace my custom text with a pipe. Not very elegant, and I can't help but think there's a better way. Thanks for any help.

Just use String.IndexOf() to find the next pipe. If the previous character is not a backslash, then use String.Substring() to extract the word. Alternatively, you could use String.IndexOfAny() to find the next occurrence of either the pipe or backslash.

I do a lot of parsing like this, and it is really pretty straightforward. Done correctly, this approach will also tend to run faster.

EDIT

In fact, maybe something like this. It would be interesting to see how it compares performance-wise to a Regex solution.

public List<string> ParseWords(string s)
{
    List<string> words = new List<string>();

    int pos = 0;
    while (pos < s.Length)
    {
        // Get word start
        int start = pos;

        // Get word end
        pos = s.IndexOf('|', pos);
        while (pos > 0 && s[pos - 1] == '\\')
        {
            pos++;
            pos = s.IndexOf('|', pos);
        }

        // Adjust for pipe not found
        if (pos < 0)
            pos = s.Length;

        // Extract this word
        words.Add(s.Substring(start, pos - start));

        // Skip over pipe
        if (pos < s.Length)
            pos++;
    }
    return words;
}
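The String.IndexOfAny() alternative mentioned above could look roughly like the sketch below. It is not the code from this answer: the class name PipeParser and the method ParseAndUnescape are made up for illustration, and it rests on the assumption that a backslash escapes whatever single character follows it (so `\|` becomes `|` and `\\` becomes `\`). Unlike ParseWords, it removes the escaping backslashes, keeps a trailing empty field, and returns one empty field for an empty string.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PipeParser
{
    // Hypothetical helper: splits on unescaped pipes and removes the escaping
    // backslash, so "\|" becomes "|" and "\\" becomes "\".
    public static List<string> ParseAndUnescape(string s)
    {
        var words = new List<string>();
        var current = new StringBuilder();
        char[] special = { '|', '\\' };

        int pos = 0;
        while (true)
        {
            // Find the next character that needs special handling
            int next = s.IndexOfAny(special, pos);
            if (next < 0)
            {
                // No more pipes or backslashes: the rest is the last field
                current.Append(s, pos, s.Length - pos);
                words.Add(current.ToString());
                return words;
            }

            // Copy the ordinary characters before the special one
            current.Append(s, pos, next - pos);

            if (s[next] == '\\' && next + 1 < s.Length)
            {
                // Escape sequence: keep the escaped character literally
                current.Append(s[next + 1]);
                pos = next + 2;
            }
            else if (s[next] == '\\')
            {
                // Lone backslash at the end of the input: kept as-is (an assumption)
                current.Append('\\');
                pos = next + 1;
            }
            else
            {
                // Unescaped pipe: the current field is complete
                words.Add(current.ToString());
                current.Clear();
                pos = next + 1;
            }
        }
    }
}
```

Calling PipeParser.ParseAndUnescape(@"This|is|a|pip\|ed|test") yields the five fields This, is, a, pip|ed and test.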

This oughta do it:

string test = @"This|is|a|pip\|ed|test (this is a pip|ed test)";
string[] parts = Regex.Split(test, @"(?<!(?<!\\)*\\)\|");

The regular expression basically says: split on pipes that aren't preceded by an escape character. I shouldn't take any credit for this though; I just hijacked the regular expression from this post and simplified it.
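Note that the split leaves the backslash in front of each escaped pipe. A minimal sketch of splitting and then unescaping follows; it uses a plain single lookbehind rather than the pattern above, and assumes backslashes are only ever used to escape pipes. The class name RegexPipeSplit and the method SplitAndUnescape are illustrative names, not part of any library.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexPipeSplit
{
    // Matches a pipe that is not immediately preceded by a backslash.
    // Assumption: backslashes only appear as escapes for pipes.
    private static readonly Regex UnescapedPipe = new Regex(@"(?<!\\)\|");

    public static string[] SplitAndUnescape(string input)
    {
        return UnescapedPipe.Split(input)
            // Turn each escaped pipe back into a plain pipe
            .Select(part => part.Replace(@"\|", "|"))
            .ToArray();
    }
}
```

For the input `This|is|a|pip\|ed|test` this gives This, is, a, pip|ed and test.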

EDIT

In terms of performance, I compared this Regex implementation with the manual parsing method provided in this thread. Using the longer test string provided by the OP, the Regex implementation is 3 to 5 times slower than Jonathon Wood's implementation.

With that said, if you don't instantiate or add the words to a List<string> and return void instead, Jon's method comes in at about 5 times faster than the Regex.Split() method (0.01ms for the regex vs. 0.002ms) for purely splitting up the string. If you add back the overhead of managing and returning a List<string>, it was about 3.6 times faster (0.01ms vs. 0.00275ms), averaged over a few sets of a million iterations. I did not use the static Regex.Split() for this test; I instead created a new Regex instance with the expression above outside of my test loop and then called its Split method.

UPDATE

Using the static Regex.Split() function is actually a lot faster than reusing an instance of the expression. With this implementation, the regex is only about 1.6 times slower than Jon's implementation (0.0043ms vs. 0.00275ms).

The results were the same using the extended regular expression from the post I linked to.
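For anyone who wants to repeat the comparison, a timing harness along these lines might do. This is only a sketch under assumptions: the class SplitBenchmark and its methods are hypothetical names, it is not the harness used for the numbers above, and results will vary with machine, runtime and input.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class SplitBenchmark
{
    // Returns the average time per call in milliseconds.
    public static double AverageMs(Action action, int iterations)
    {
        action(); // Warm-up call so one-time setup costs are not timed

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            action();
        sw.Stop();

        return sw.Elapsed.TotalMilliseconds / iterations;
    }

    // Times a reused Regex instance against the static Regex.Split() call.
    public static void Run(string input, string pattern, int iterations)
    {
        var instance = new Regex(pattern); // Created once, outside the timed loop

        double instanceMs = AverageMs(() => instance.Split(input), iterations);
        double staticMs = AverageMs(() => Regex.Split(input, pattern), iterations);

        Console.WriteLine($"instance: {instanceMs:F5} ms, static: {staticMs:F5} ms");
    }
}
```

The same AverageMs helper can wrap a call to a hand-written parser so that all candidates are measured the same way.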

I came across a similar scenario. For me the number of pipes was fixed (not counting the escaped "\|" pipes). This is how I handled it.

string sPipeSplit = "This|is|a|pip\\|ed|test (this is a pip|ed test)";
string sTempString = sPipeSplit.Replace("\\|", "¬"); // replace \| with a placeholder character that must not occur in the data
string[] sSplitString = sTempString.Split('|');
//string sFirstString = sSplitString[0].Replace("¬", "\\|"); // If you have a fixed number of fields and copy each one elsewhere, do the replace while copying.
/* Or you could use a loop to replace everything at once.
   Strings are immutable, so the result of Replace must be assigned back:
for (int i = 0; i < sSplitString.Length; i++)
{
    sSplitString[i] = sSplitString[i].Replace("¬", "\\|");
}
*/

Here is another solution.

One of the most beautiful things about programming is that there are several ways to solve the same problem:

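string text = @"This|is|a|pip\|ed|test"; // The original text
string parsed = ""; // Receives the fields, separated by spaces

bool flag = false; // True when the previous piece ended with a backslash
foreach (var x in text.Split('|')) {
    bool endsWithBackslash = x.EndsWith(@"\");
    // Drop the escaping backslash; otherwise the field is complete, so add a space
    string piece = endsWithBackslash ? x.Substring(0, x.Length - 1) : x + " ";
    // Put back the pipe that the split removed after an escaped piece
    parsed += flag ? "|" + piece : piece;
    flag = endsWithBackslash;
}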

Cory's solution is pretty good. But if you prefer not to work with Regex, you can simply search for "\|" and replace it with some other character, then do your split, then replace that character with "\|" again.

Another option is to do the split, then examine all the strings, and if the last character of one is a \, join it with the next string.
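A rough sketch of that split-then-rejoin idea is shown below. The class name SplitAndJoin and its Split method are hypothetical, and the code assumes a backslash only ever escapes a pipe, so it does not handle an escaped backslash in front of a pipe.

```csharp
using System;
using System.Collections.Generic;

public static class SplitAndJoin
{
    // Hypothetical helper illustrating the split-then-rejoin idea.
    public static List<string> Split(string input)
    {
        var fields = new List<string>();
        string pending = null; // Field still being assembled across escaped pipes

        foreach (string piece in input.Split('|'))
        {
            // Re-attach this piece to an unfinished field, restoring the pipe
            string current = pending == null ? piece : pending + "|" + piece;

            if (piece.EndsWith(@"\", StringComparison.Ordinal))
            {
                // The pipe after this piece was escaped: drop the backslash and keep going
                pending = current.Substring(0, current.Length - 1);
            }
            else
            {
                fields.Add(current);
                pending = null;
            }
        }

        // Input ended with a lone backslash: emit what was collected, backslash restored
        if (pending != null)
            fields.Add(pending + @"\");

        return fields;
    }
}
```

With the input `This|is|a|pip\|ed|test` the result is This, is, a, pip|ed and test.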

Of course, all this ignores what happens if you need an escaped backslash before a pipe, like "\\|".

Overall, I lean towards regex though.

Frankly, I prefer to use FileHelpers because, even though this isn't comma-delimited, it's basically the same thing. And they have a great story about why you shouldn't write this stuff yourself.

You can do this with a regex. Once you decide to use a backslash as your escape character, you have two escape cases to account for:

  • Escaping a pipe: \|
  • Escaping a backslash that you want interpreted literally.

Both of these can be done in the same regex. Escaped backslashes will always be two \ characters together. Consecutive escaped backslashes will always be an even number of \ characters. If you find an odd-numbered sequence of \ before a pipe, it means you have several escaped backslashes, followed by an escaped pipe. So you want to use something like this:

/^(?:((?:[^|\\]|(?:\\{2})|\\\|)+)(?:\||$))*/

Confusing, perhaps, but it should work. Explanation:

^              #The start of a line
(?:...
    [^|\\]     #A character other than | or \ OR
    (?:\\{2})  #An escaped backslash (two \ characters) OR
    \\\|       #A literal \ followed by a literal |
...)+          #Repeat the preceding at least once
(?:\||$)       #Either a literal | or the end of a line
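The odd/even counting rule can also be expressed as a split instead of a match. The sketch below is an alternative formulation, not the pattern above: it relies on .NET's support for variable-length lookbehind to treat a pipe as a delimiter only when an even number of backslashes (possibly zero) precedes it, and then unescapes each field. The class name EscapeAwareSplit is made up, and the code assumes a backslash escapes whatever single character follows it.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EscapeAwareSplit
{
    // A pipe is a delimiter only when the run of backslashes before it has
    // even length (zero included); an odd run means the pipe itself is escaped.
    private static readonly Regex Delimiter = new Regex(@"(?<=(?<!\\)(?:\\\\)*)\|");

    // Matches a backslash followed by any one character (the escaped character).
    private static readonly Regex Escape = new Regex(@"\\(.)");

    public static string[] Split(string input)
    {
        return Delimiter.Split(input)
            // Replace each escape sequence with the character it stands for
            .Select(field => Escape.Replace(field, "$1"))
            .ToArray();
    }
}
```

For example, `a\\|b` (an escaped backslash followed by a real delimiter) splits into `a\` and `b`, while `a\\\|b` (an escaped backslash followed by an escaped pipe) stays a single field, `a\|b`.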
