How do I determine a delimiter in a text file?

I have 2 types of input files:
1. comma delimited (e.g. lastName, firstName, Address)
2. space delimited (e.g. lastName firstName Address)

The comma delimited file HAS spaces between the ',' and the next word.

How do I go about determining which file I am dealing with? I am using C#, by the way.

I've done tons of work with various delimited file types and, as everyone else is saying, without normalization you can't really handle the whole thing programmatically.

Generally (and it seems like it would be totally necessary for space-delim) a delimited file will have a text qualifier character (often double quotes). A couple of examples of this point:

Space Delimited:

A lastName such as "Von Marshall" is impossible to parse without qualifiers.

Addresses would be altogether impossible as well.

Comma Delimited:

Addresses are generally unworkable unless they are broken into separate fields or a single solid string is acceptable for your use case.

So the space delim should be easy enough to determine, since you're looking for " " (a closing quote, a space, and an opening quote). If this is the case I'd (personally) replace all " " with "," to change it to comma-delim. That way you'd only have to build a single method for handling the text; otherwise I imagine you'll need separate methods for spaces and commas.
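The replacement described above could look roughly like the following. This is only a sketch: the class and method names are made up for illustration, and it assumes every field is wrapped in double quotes and that the sequence quote-space-quote never occurs inside a field.

```csharp
using System;

public static class QualifiedLineNormalizer
{
    // Hypothetical helper, not from the original answer.
    // Turns   "Von Marshall" "John" "12 Main St"
    // into    "Von Marshall","John","12 Main St"
    // Assumption: every field is quoted, so the three characters
    // quote-space-quote only ever appear between two fields.
    public static string ToCommaDelimited(string line)
    {
        // Space-delimited input: swap the separating space for a comma.
        string result = line.Replace("\" \"", "\",\"");
        // Comma-delimited input with a space after each comma: drop that space.
        return result.Replace("\", \"", "\",\"");
    }
}
```

After this step both file types have the same shape, so a single comma-based parser can handle them.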

If your comma-delim file does not have a text qualifier, you're in a really tricky spot. I haven't found any "perfect" way of addressing this without some human work, but that work can be minimized. I've used Notepad++ a lot to do batch replacements with its regular expression functions.

However, you can also use C#'s regex abilities. Here's what MSDN says on that. I have no experience doing this as yet (though I've just started a project using it), so I can't give an exact example, but for anyone to build a perfect example it would be best if you showed example strings for each file.

So, to answer your question to the best of my ability: unless you can establish something unique about each of the 2 file types, there's no way. However, if the text has proper text qualifiers, the files have different file extensions, or they are generated in different directories, you could use any of those qualities, or a mix of them, to decide what type of file it is.
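As an illustration of deciding by file extension, here is a minimal sketch. The extensions .csv and .txt are assumptions made up for this example; the question does not say how the files are actually named.

```csharp
using System;
using System.IO;

public static class DelimiterByExtension
{
    // Hypothetical mapping: comma files end in .csv, space files in .txt.
    public static char FromPath(string path)
    {
        // Path.GetExtension returns the extension including the leading dot.
        string ext = Path.GetExtension(path).ToLowerInvariant();
        switch (ext)
        {
            case ".csv": return ',';
            case ".txt": return ' ';
            default:
                throw new NotSupportedException("Unknown file type: " + ext);
        }
    }
}
```

The same idea works for directories: compare Path.GetDirectoryName(path) against the folders each producer writes to.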

As other users have said, without some guarantee that the space-delimited version contains no commas, you cannot do this with 100% accuracy.

With some additional information, say that every record always has exactly three fields when parsed correctly, you could simply parse the file both ways and check which result has the correct number of fields. Address is a big stumbling block here, though, since we do not know what its format could be. These rules also seem odd at best when talking about addresses. Is

1111somestreest.houston,tx11111 or
1111 somestreet st. Houston, Tx 11111

a valid format?
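The "parse it both ways and check the field count" idea could be sketched like this. The names are hypothetical, and the sketch tries the comma split first, because a comma file whose fields are all single words also splits into three pieces on spaces.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FieldCountProbe
{
    // Returns ',' or ' ' when that delimiter yields the expected number
    // of fields on every non-blank line, or null when neither does.
    public static char? Detect(IEnumerable<string> lines, int expectedFields)
    {
        List<string> rows = lines.Where(l => !string.IsNullOrWhiteSpace(l)).ToList();
        if (rows.Count == 0) return null;

        // Comma is checked first (see the note above about single-word fields).
        if (rows.All(l => l.Split(',').Length == expectedFields))
            return ',';

        // Runs of spaces count as one separator.
        if (rows.All(l => l.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                           .Length == expectedFields))
            return ' ';

        return null; // neither split fits, e.g. an unqualified address with spaces
    }
}
```

For the second address format shown above, neither split produces three fields, so the probe returns null. That is the case where a person has to look at the file.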

You could count the number of commas on each line of the file. If every line has at least 2 commas (considering your info is last name, first name, address), you probably have a comma-separated file. If at least one line has fewer than 2 commas, you should treat the file as space separated.
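A minimal sketch of that comma count, with made-up names, might be:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CommaCounter
{
    // True when every non-blank line contains at least minCommas commas.
    // minCommas defaults to 2 because the question has three fields.
    public static bool LooksCommaDelimited(IEnumerable<string> lines, int minCommas = 2)
    {
        List<string> rows = lines.Where(l => !string.IsNullOrWhiteSpace(l)).ToList();
        return rows.Count > 0
            && rows.All(l => l.Count(c => c == ',') >= minCommas);
    }
}
```

It could be called as CommaCounter.LooksCommaDelimited(File.ReadLines(path)), which needs using System.IO. Keep in mind that a space-delimited file whose addresses happen to contain two or more commas on every line would be misclassified.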

I, however, would skip this step and ignore the commas when evaluating the input: replace all of them with spaces and implement a single read/grab-information procedure that only has to consider space-separated files.
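A rough sketch of such a single procedure follows. It assumes, beyond what the question states, that lastName and firstName are each one word and that everything after them belongs to the address.

```csharp
using System;

public static class UnifiedRecordReader
{
    // Hypothetical parser: returns { lastName, firstName, address }.
    public static string[] Parse(string line)
    {
        // Treat commas as spaces, then split on runs of spaces.
        string[] tokens = line.Replace(',', ' ')
                              .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (tokens.Length < 3)
            throw new FormatException("Expected at least three tokens: " + line);

        // Assumption: the first two tokens are the names; the rest is the address.
        string address = string.Join(" ", tokens, 2, tokens.Length - 2);
        return new[] { tokens[0], tokens[1], address };
    }
}
```

Both "Smith, John, 12 Main St" and "Smith John 12 Main St" come out as the same three fields. The cost is that commas inside an address are lost, and a two-word last name such as Von Marshall would break the assumption.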
