
Split String in C#

I thought this would be trivial, but I can't get it to work.

Assume a line in a CSV file: "Barack Obama", 48, "President", "1600 Penn Ave, Washington DC"

string[] tokens = line.Split(',');

I expect this:

 "Barack Obama"
 48
 "President"
 "1600 Penn Ave, Washington DC"

but the last token is 'Washington DC', not "1600 Penn Ave, Washington DC".

Is there an easy way to get the split function to ignore the comma within quotes?

I have no control over the CSV file, and it doesn't get sent to me. Customer A will be using the app to read files provided by an external individual.

You might have to write your own split function.

  • Iterate through each char in the string
  • When you hit a " character, toggle a boolean
  • When you hit a comma, if the bool is true, ignore it; otherwise, you have your token

Here's an example:

using System.Collections.Generic;
using System.Text;

public static class StringExtensions
{
    public static string[] SplitQuoted(this string input, char separator, char quotechar)
    {
        List<string> tokens = new List<string>();

        StringBuilder sb = new StringBuilder();
        bool escaped = false;
        foreach (char c in input)
        {
            if (c.Equals(separator) && !escaped)
            {
                // we have a token
                tokens.Add(sb.ToString().Trim());
                sb.Clear();
            }
            else if (c.Equals(separator) && escaped)
            {
                // ignore but add to string
                sb.Append(c);
            }
            else if (c.Equals(quotechar))
            {
                escaped = !escaped;
                sb.Append(c);
            }
            else
            {
                sb.Append(c);
            }
        }
        tokens.Add(sb.ToString().Trim());

        return tokens.ToArray();
    }
}

Then just call:

string[] tokens = line.SplitQuoted(',','\"');

Benchmarks

Results of benchmarking my code and Dan Tao's code are below. I'm happy to benchmark any other solutions if people want them.

Code:

string input = "\"Barak Obama\", 48, \"President\", \"1600 Penn Ave, Washington DC\""; // Console.ReadLine()
string[] tokens = null;

// run tests
DateTime start = DateTime.Now;
for (int i = 0; i < 1000000; i++)
    tokens = input.SplitWithQualifier(',', '\"', false);
Console.WriteLine("1,000,000 x SplitWithQualifier = {0}ms", DateTime.Now.Subtract(start).TotalMilliseconds);

start = DateTime.Now;
for (int i = 0; i < 1000000; i++)
    tokens = input.SplitQuoted(',', '\"');
Console.WriteLine("1,000,000 x SplitQuoted =        {0}ms", DateTime.Now.Subtract(start).TotalMilliseconds);

Output:

1,000,000 x SplitWithQualifier = 8156.25ms
1,000,000 x SplitQuoted =        2406.25ms
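One caveat on the timing method: DateTime.Now usually advances only in steps of several milliseconds on Windows, which is probably why both totals above are exact multiples of 15.625 ms. Over a million iterations that hardly changes the comparison, but System.Diagnostics.Stopwatch is the more usual tool for this. A minimal sketch, assuming a hypothetical helper named TimeIt in a hypothetical class BenchmarkSketch (neither is part of the original benchmark):

```csharp
using System;
using System.Diagnostics;

public static class BenchmarkSketch
{
    // Hypothetical helper: runs `action` once as a warm-up, then
    // `iterations` more times, and returns the elapsed milliseconds
    // for the timed runs only.
    public static double TimeIt(Action action, int iterations)
    {
        action(); // warm-up call so JIT compilation is not counted

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            action();
        watch.Stop();

        return watch.Elapsed.TotalMilliseconds;
    }
}
```

With the extension methods from this thread in scope, a call could look like BenchmarkSketch.TimeIt(() => input.SplitQuoted(',', '"'), 1000000).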

I have a SplitWithQualifier extension method that I use here and there, which utilizes Regex.

I make no claim as to the robustness of this code, but it has worked all right for me for a while.

// mangled code horribly to fit without scrolling
using System.Linq;
using System.Text.RegularExpressions;

public static class CsvSplitter
{
    public static string[] SplitWithQualifier(this string text,
                                              char delimiter,
                                              char qualifier,
                                              bool stripQualifierFromResult)
    {
        string pattern = string.Format(
            @"{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
            Regex.Escape(delimiter.ToString()),
            Regex.Escape(qualifier.ToString())
        );

        string[] split = Regex.Split(text, pattern);

        if (stripQualifierFromResult)
            return split.Select(s => s.Trim().Trim(qualifier)).ToArray();
        else
            return split;
    }
}

Usage:

string csv = "\"Barak Obama\", 48, \"President\", \"1600 Penn Ave, Washington DC\"";
string[] values = csv.SplitWithQualifier(',', '\"', true);

foreach (string value in values)
    Console.WriteLine(value);

Output:

Barak Obama
48
President
1600 Penn Ave, Washington DC

Looking at the bigger picture, you are actually trying to parse CSV input. So instead of advising on how to split the string properly, I would recommend using a CSV parser for this kind of thing.

A Fast CSV Reader

One that I would recommend is the library (source code available) that you can get from this CodeProject page: http://www.codeproject.com/KB/database/CsvReader.aspx

I use it myself and like it. It's native .NET code and a lot faster than using OLEDB (which can also do the CSV parsing for you, but believe me, it's slow).

You should be using Microsoft.VisualBasic.FileIO.TextFieldParser for that. It will handle all the CSV stuff correctly for you; see the similar question with an example using the TextFieldParser.

PS: Do not fear using the Microsoft.VisualBasic dll in a C# project, it's all .NET :-)
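As a rough illustration of that suggestion, here is a minimal sketch that reads CSV text with TextFieldParser. The class name CsvLineReader and the method ReadRows are hypothetical names chosen for this example; the project is assumed to reference the Microsoft.VisualBasic assembly.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class CsvLineReader
{
    // Hypothetical helper: parses CSV text and returns one string[] per row.
    public static List<string[]> ReadRows(string csvText)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(new StringReader(csvText)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            // Commas inside a quoted field stay part of that field.
            parser.HasFieldsEnclosedInQuotes = true;
            // Strip padding around field values.
            parser.TrimWhiteSpace = true;

            while (!parser.EndOfData)
                rows.Add(parser.ReadFields());
        }
        return rows;
    }
}
```

To read straight from disk, TextFieldParser also has a constructor that takes a file path. For the line in the question, the address should come back as a single field with its comma intact and the surrounding quotes removed.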

That is the expected behavior, as quotes are just another character in a C# string. It looks like what you are after is the quoted tokens or numeric tokens.

I think you might need to use Regex to split the strings, unless someone else knows a better way.

Or you could just loop through the string one character at a time, building up each token as you go. It's old school, but it may be the most reliable way in your case.

You can't parse a CSV line with a simple Split on commas, because some cell contents will contain commas that aren't meant to delimit data but are actually part of the cell contents themselves.
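To make the failure concrete, here is a small sketch of the naive approach. NaiveSplitDemo and SplitNaively are hypothetical names used only for this illustration.

```csharp
public static class NaiveSplitDemo
{
    // Hypothetical helper: splits on every comma, with no awareness of quotes.
    public static string[] SplitNaively(string line)
    {
        return line.Split(',');
    }
}
```

For the line in the question this returns five elements rather than four: the address is cut at its inner comma into ` "1600 Penn Ave` and ` Washington DC"`, and the leading spaces and quote characters are left in place.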

Here is a link to a simple regex-based C# method that will convert your CSV into a handy DataTable:

http://www.hotblue.com/article0000.aspx?a=0006

Working with DataTables is very easy; let me know if you need a code sample for that.

I would recommend using a regular expression instead. It will allow you to extract more complicated substrings in a much more versatile manner (precisely as you want).
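One way this could look is to match the fields themselves instead of splitting on the separator. The sketch below is only an illustration: the pattern and the names RegexFieldExtractor and ExtractFields are assumptions of this example, not taken from the linked articles. It skips empty fields and does not handle doubled quotes inside a quoted field.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexFieldExtractor
{
    // Optional leading whitespace, then either a quoted run (commas allowed
    // inside) or a run of characters that contains no comma.
    private static readonly Regex FieldPattern =
        new Regex(@"\s*(""[^""]*""|[^,]+)");

    // Hypothetical helper: returns each field, with quotes kept on quoted fields.
    public static List<string> ExtractFields(string line)
    {
        var fields = new List<string>();
        foreach (Match m in FieldPattern.Matches(line))
            fields.Add(m.Groups[1].Value.Trim());
        return fields;
    }
}
```

Because the quoted alternative is tried first, the address in the question's line is captured as one field, quotes included.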

http://www.c-sharpcorner.com/uploadfile/prasad_1/regexppsd12062005021717am/regexppsd.aspx

http://oreilly.com/windows/archive/csharp-regular-expressions.html

Can't you change how the CSV is generated? Using OpenOffice, you can set the field separator (use ;) and how strings are delimited (using " or ').

It would look like this: 'President';'1600 Penn Ave, Washington DC'

string temp = line.Replace("\"", "");

string[] tokens = temp.Split(',');
