
How to identify whether a file is a text file or something else using C# .NET

I need to read a file as a text file and process it afterwards. Before I open it, how can I verify that the file I am taking really is a text file? If the file is in another format, my whole code interprets it incorrectly. I want to access and process text files only.

Currently I am using:

StreamReader objReader = new StreamReader(filePath);

How can I do this in C# .NET?

Well, there are heuristics you could apply:

  • Use the file extension. If it's ".txt" then it's probably a text file, if it's ".jpg" it probably isn't, etc.
  • If you know what encoding the file should be in, check whether its contents are valid in that encoding.
  • Check for common "magic numbers" at the start of the file to identify various well-known binary file types.
  • If it's meant to be a Western document, check that when you read the file as text, most characters have relatively low Unicode values (typically less than U+0100, but you might want to look at the various Unicode code charts to decide for yourself).
  • Text files tend not to have many characters below U+0020 other than carriage return, line feed and tab.

But it's all heuristic, basically. At the end of the day, a file is a name and some bytes, along with some metadata about access permissions. Some file systems make more metadata available, but it's typically hard to get at and often not preserved when files are copied around, so it shouldn't be relied on for this.
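Several of the heuristics listed above can be sketched in one helper. This is a minimal illustration rather than a definitive detector: the class name `TextFileHeuristics`, the method `LooksLikeText`, the short signature list and the 5% control-character threshold are all assumptions made for this example, and it assumes the expected encoding is UTF-8 (a UTF-16 file would be rejected).

```csharp
using System;
using System.Linq;
using System.Text;

public static class TextFileHeuristics
{
    // A few well-known binary signatures ("magic numbers").
    // Illustrative only; real-world lists are much longer.
    private static readonly byte[][] BinarySignatures =
    {
        new byte[] { 0x89, 0x50, 0x4E, 0x47 }, // PNG
        new byte[] { 0xFF, 0xD8, 0xFF },       // JPEG
        new byte[] { 0x25, 0x50, 0x44, 0x46 }, // PDF ("%PDF")
        new byte[] { 0x50, 0x4B, 0x03, 0x04 }, // ZIP (also .docx, .xlsx, .jar)
    };

    // Returns true if the bytes pass all three heuristics. Still only a guess.
    // maxControlRatio is an assumed threshold, not a standard value.
    public static bool LooksLikeText(byte[] content, double maxControlRatio = 0.05)
    {
        if (content.Length == 0)
            return true; // an empty file is trivially valid text

        // 1. Reject content that starts with a known binary signature.
        bool hasBinarySignature = BinarySignatures.Any(sig =>
            content.Length >= sig.Length && content.Take(sig.Length).SequenceEqual(sig));
        if (hasBinarySignature)
            return false;

        // 2. Check that the bytes are valid in the expected encoding (UTF-8 assumed).
        //    throwOnInvalidBytes: true makes the decoder throw instead of
        //    silently substituting U+FFFD.
        string text;
        try
        {
            var strictUtf8 = new UTF8Encoding(
                encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
            text = strictUtf8.GetString(content);
        }
        catch (DecoderFallbackException)
        {
            return false;
        }

        // 3. Count characters below U+0020 other than CR, LF and tab.
        int controlChars = text.Count(c => c < '\u0020' && c != '\r' && c != '\n' && c != '\t');
        return (double)controlChars / text.Length <= maxControlRatio;
    }
}
```

A caller might pass `File.ReadAllBytes(filePath)` to `LooksLikeText` before constructing the `StreamReader`. For large files it may be preferable to test only a leading sample, keeping in mind that cutting a sample in the middle of a multi-byte UTF-8 sequence would make the strict decode fail.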

If you want to get the extension of the file, you can use the Path.GetExtension method.
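As a rough sketch of how that could look, the following compares the extension against an allow-list. The class `ExtensionCheck`, the method `HasTextExtension` and the particular extensions in the set are hypothetical choices for illustration; only `Path.GetExtension` comes from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ExtensionCheck
{
    // Hypothetical allow-list; adjust it to the file types you actually accept.
    private static readonly HashSet<string> TextExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".txt", ".csv", ".log" };

    public static bool HasTextExtension(string filePath)
    {
        // Path.GetExtension returns the extension including the leading dot,
        // or an empty string when the path has no extension.
        string extension = Path.GetExtension(filePath);
        return !string.IsNullOrEmpty(extension) && TextExtensions.Contains(extension);
    }
}
```

Keep in mind that the extension is only part of the file name, so a renamed binary file would still pass this check.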

"If the file is in another format, my whole code interprets it wrongly."

Sure, if you expect a text file and end up getting a binary file, your code will interpret it wrongly. But the same is true for any invalid text file: what if it's not comma-separated when you expect that? Or not JSON, when that's what you want? Or in an encoding you can't handle?

The point is, unless you're just copying the text or doing something very low-level with it, you'll need more checking than text vs. binary anyway. You should (probably) check that the entire file conforms to your needs. And that will also catch any non-text files that are passed in to your program!
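To illustrate the idea of validating the whole file against the format you expect, here is a minimal sketch for the comma-separated case. The names `CsvLoader` and `TryReadRows` and the `expectedColumns` parameter are assumptions for this example. The split on `','` is deliberately naive (no quoted fields), and UTF-8 is assumed as the only encoding the program can handle.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class CsvLoader
{
    // Reads the file and returns true only if every line decodes as UTF-8
    // and has exactly expectedColumns comma-separated fields.
    public static bool TryReadRows(string filePath, int expectedColumns, out List<string[]> rows)
    {
        rows = new List<string[]>();

        // Strict decoder: invalid byte sequences throw instead of being replaced.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

        try
        {
            foreach (string line in File.ReadLines(filePath, strictUtf8))
            {
                string[] fields = line.Split(',');
                if (fields.Length != expectedColumns)
                {
                    rows.Clear();
                    return false; // wrong shape: not the format we expect
                }
                rows.Add(fields);
            }
            return true;
        }
        catch (DecoderFallbackException)
        {
            rows.Clear();
            return false; // not valid UTF-8, so probably not the text we expect
        }
    }
}
```

A binary file passed to `TryReadRows` will usually fail either the decoding step or the column count, so no separate text-vs-binary test is needed in this scenario.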

