
How to get rid of special characters at the beginning, while using File.ReadAllLines in C#

I tried string[] file = File.ReadAllLines(file_name) to read a Word file.

In debug mode I found that the first few elements of the string array file have values like

" ࡱ 0\\0\\0\\0>\\0\\0 \\t\\0\\0\\0\\0\\0". How can I get rid of these?

In some files the first three elements of file[] are filled with these unreadable characters, while in others only the first element is.

What is the problem, and how can I fix it? My Word file does not even have a blank line at the beginning.

The problem is you're not opening the file with the correct encoding. Here is a guide to opening and creating Word documents from C#.

File.ReadAllLines is intended for text files. Word files are not text files. To read Word files you might need a library.
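To see why the first "lines" look like garbage, it can help to inspect the first bytes of the file instead of decoding them as text. The following is a minimal sketch, not part of the original answer: the class and method names (FileSniffer, DescribeHeader, DescribeFile) are made up for illustration. It relies on two well-known signatures: legacy .doc files are OLE2 compound files starting with D0 CF 11 E0 A1 B1 1A E1, and .docx files are ZIP archives starting with 50 4B 03 04.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper: reports which binary container a file appears to use.
public static class FileSniffer
{
    // Signature of OLE2 compound files, the container used by legacy .doc files.
    private static readonly byte[] OleMagic = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Signature of ZIP archives, the container used by .docx files.
    private static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };

    // Classifies a buffer holding the first bytes of a file.
    public static string DescribeHeader(byte[] header)
    {
        if (StartsWith(header, OleMagic)) return "ole";
        if (StartsWith(header, ZipMagic)) return "zip";
        return "unknown";
    }

    // Reads up to eight bytes from the start of the file and classifies them.
    public static string DescribeFile(string path)
    {
        var header = new byte[8];
        using (FileStream stream = File.OpenRead(path))
        {
            int read = stream.Read(header, 0, header.Length);
            return DescribeHeader(header.Take(read).ToArray());
        }
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        return data.Length >= prefix.Length
            && data.Take(prefix.Length).SequenceEqual(prefix);
    }
}
```

If DescribeFile(file_name) returns "ole" or "zip", the file is a binary container and File.ReadAllLines will not give meaningful lines, whatever encoding is passed to it. A result of "unknown" only means neither signature matched; it does not prove the file is plain text.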

If you are using .NET 3.5 then I'd suggest that you use a LINQ where clause to return only the lines that you're interested in.

string[] file = File.ReadAllLines(file_name).Where(line => !line.StartsWith("��")).ToArray();

You could also use some form of regular expression instead of the line.StartsWith() method.

Note: If you are reading Microsoft Office Word files, I'd recommend that you use COM Interop or a third-party library to read the MS Word document (you'll find it much easier than trying to parse the file yourself).
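As a rough sketch of the COM Interop route (an illustration, not the answerer's code): this assumes Microsoft Word is installed on the machine, the project references Microsoft.Office.Interop.Word, and the compiler is C# 4 or later so that optional and named arguments work on COM calls. The class name WordInterop and the method ReadLines are hypothetical.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper: lets Word itself open the document and hand back its text.
public static class WordInterop
{
    public static string[] ReadLines(string path)
    {
        // Using the underscore interfaces avoids the method/event name
        // ambiguity warning on Close and Quit.
        Word._Application app = new Word.Application();
        Word._Document doc = null;
        try
        {
            // Word resolves relative paths against its own working directory,
            // so pass an absolute path.
            doc = app.Documents.Open(Path.GetFullPath(path), ReadOnly: true);

            // Content.Text separates paragraphs with carriage returns.
            string text = doc.Content.Text;
            return text.Split(new[] { '\r' }, StringSplitOptions.RemoveEmptyEntries);
        }
        finally
        {
            if (doc != null) doc.Close(false); // false: do not save changes
            app.Quit();
            Marshal.ReleaseComObject(app);
        }
    }
}
```

This starts a Word process in the background, so it is slow and not well suited to server-side use, but it handles both .doc and .docx because Word does the parsing.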

Word files are not simple text files, so they contain additional embedded binary information.

You should use a library that will read Word documents if you want to extract the text properly, instead of File.ReadAllLines.

Here are a couple of such libraries.
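For the newer .docx format only, plain paragraph text can also be pulled out with the base class library, because a .docx file is a ZIP archive whose main body lives in word/document.xml. Below is a minimal sketch under stated assumptions: .NET Framework 4.5 or later (for System.IO.Compression.ZipArchive), a .docx input rather than a legacy .doc, and hypothetical names (DocxText, ReadParagraphs). It ignores tabs, line breaks, headers, footers, and tables layout, so a dedicated library remains the safer choice for anything beyond simple text.

```csharp
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: extracts one string per paragraph from a .docx file.
public static class DocxText
{
    // WordprocessingML main namespace used by word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static string[] ReadParagraphs(Stream docx)
    {
        using (var archive = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; is this a .docx file?");

            using (Stream xml = entry.Open())
            {
                XDocument body = XDocument.Load(xml);

                // Each w:p element is a paragraph; its text is spread over w:t elements.
                return body.Descendants(W + "p")
                           .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
                           .ToArray();
            }
        }
    }

    public static string[] ReadParagraphs(string path)
    {
        using (FileStream stream = File.OpenRead(path))
        {
            return ReadParagraphs(stream);
        }
    }
}
```

With this in place, string[] file = DocxText.ReadParagraphs(file_name) plays the role that File.ReadAllLines was meant to play in the question, with one array element per paragraph. A legacy .doc file (which the garbled header in the question suggests) is not a ZIP archive and will make ZipArchive throw InvalidDataException.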

