How to detect the text encoding of a file

Today I needed a way to tell ANSI (Windows-1252) files from UTF-8 files in a directory filled with files of these two types. I was surprised not to find a simple way of doing this via a property or method somewhere under the System.IO namespace.

Not that it's that hard to identify the encoding programmatically, but it's always better when you don't need to write a method yourself. Anyway, here's what I came up with. It detects UTF-8 encoding based on the encoding signature added to the beginning of the file.

The code below is specific to UTF-8, but it shouldn't be too hard to extend the example to detect more encodings.

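// Requires: using System.IO; using System.Linq; using System.Text;
public static bool IsUtf8(string fname){
  var preamble = Encoding.UTF8.GetPreamble(); // EF BB BF
  // OpenRead asks for read access only, so read-only files work too.
  using(var f = File.OpenRead(fname)){
    var sig = new byte[preamble.Length];
    // A file shorter than the preamble cannot start with it.
    if(f.Read(sig, 0, sig.Length) < sig.Length) return false;
    return sig.SequenceEqual(preamble);
  }
}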
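
As a rough sketch of how the example might be extended to more encodings, the same preamble comparison can be run against each Unicode encoding the framework knows about. The names BomSniffer, DetectFromBytes and DetectFromFile are made up for this illustration; only the Encoding classes and GetPreamble() come from the framework. Like IsUtf8 above, it can only recognize files that actually start with a signature.

```csharp
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical helper, not part of the framework.
public static class BomSniffer
{
    // Longest preambles first: the UTF-32 LE preamble (FF FE 00 00)
    // begins with the UTF-16 LE one (FF FE), so order matters.
    private static readonly Encoding[] Candidates =
    {
        new UTF32Encoding(true, true), // UTF-32 BE: 00 00 FE FF
        Encoding.UTF32,                // UTF-32 LE: FF FE 00 00
        Encoding.UTF8,                 // UTF-8:     EF BB BF
        Encoding.BigEndianUnicode,     // UTF-16 BE: FE FF
        Encoding.Unicode               // UTF-16 LE: FF FE
    };

    // Returns the first candidate whose preamble starts the buffer,
    // or null when no known preamble matches.
    public static Encoding DetectFromBytes(byte[] head)
    {
        foreach (var enc in Candidates)
        {
            var bom = enc.GetPreamble();
            if (head.Length >= bom.Length && head.Take(bom.Length).SequenceEqual(bom))
                return enc;
        }
        return null;
    }

    // Reads at most the first four bytes of the file and checks them.
    public static Encoding DetectFromFile(string fname)
    {
        var head = new byte[4];
        int read;
        using (var f = File.OpenRead(fname))
        {
            read = f.Read(head, 0, head.Length);
        }
        return DetectFromBytes(head.Take(read).ToArray());
    }
}
```

A null result only means "no signature found"; it says nothing about whether the file is ANSI or BOM-less UTF-8.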

Maybe I just looked in the wrong places. Does anyone know a simpler way in the framework to accomplish this?


Posted 01-26-2010 7:18 PM by sergiopereira

Comments

Nicholas Piasecki wrote re: How to detect the text encoding of a file
on 01-26-2010 10:37 PM

You could P/Invoke to IsTextUnicode().

msdn.microsoft.com/.../dd318672%28VS.85%29.aspx
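
For reference, a P/Invoke declaration for that suggestion might look roughly like the sketch below. IsTextUnicode is exported by advapi32.dll, so this only works on Windows, and the API is aimed at UTF-16 text rather than UTF-8. The wrapper names NativeTextCheck and LooksLikeUtf16 are invented for this illustration.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper; only IsTextUnicode itself is a real Win32 API.
public static class NativeTextCheck
{
    // Native signature:
    //   BOOL IsTextUnicode(const VOID *lpv, int iSize, LPINT lpiResult);
    [DllImport("advapi32.dll")]
    private static extern bool IsTextUnicode(byte[] lpv, int iSize, IntPtr lpiResult);

    // Windows only. Passing IntPtr.Zero (NULL) for lpiResult asks the API
    // to apply all of its available tests to the buffer.
    public static bool LooksLikeUtf16(byte[] buffer)
    {
        return IsTextUnicode(buffer, buffer.Length, IntPtr.Zero);
    }
}
```

The answer is a statistical guess made by the operating system, so it should be treated as a hint and not as proof.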

Krzysztof Koźmic wrote re: How to detect the text encoding of a file
on 01-27-2010 3:00 AM

a. there is no simpler way to do it

b. it won't always work, as preamble is not mandatory
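
One way to cope with files that have no preamble, sketched here under the assumption that every file is either UTF-8 or Windows-1252 as in the original post, is to check whether the bytes decode as strict UTF-8. A UTF8Encoding created with throwOnInvalidBytes set to true throws a DecoderFallbackException on malformed input. The names Utf8Validator, IsValidUtf8 and LooksLikeUtf8 are invented for this illustration.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, not part of the framework.
public static class Utf8Validator
{
    // First argument: do not emit a BOM when encoding.
    // Second argument: throw on invalid bytes instead of substituting U+FFFD.
    private static readonly UTF8Encoding Strict = new UTF8Encoding(false, true);

    // True when the whole buffer is well-formed UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        try
        {
            Strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Reads the whole file into memory, which is fine for small files only.
    public static bool LooksLikeUtf8(string fname)
    {
        return IsValidUtf8(File.ReadAllBytes(fname));
    }
}
```

Pure ASCII content passes this check and is also valid Windows-1252, but both encodings decode it identically, so the ambiguity is harmless. Windows-1252 text with accented characters usually fails, because a lone byte such as 0xE9 is not a complete UTF-8 sequence. This remains a heuristic: some Windows-1252 byte sequences happen to be valid UTF-8.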

Anders Lybecker wrote re: How to detect the text encoding of a file
on 01-27-2010 3:02 AM

Nice one. :-)

It's not possible to detect the code page, ANSI or otherwise :-(

blippe wrote re: How to detect the text encoding of a file
on 01-27-2010 4:02 AM

The BOM in utf8 is not only "not mandatory" as Kozmic points out, it is "not recommended" and kind of defeats one of the main points of using utf8.

What you've got here is a quick way of checking whether a file is "broken" utf8.

Pablo Alarcón wrote re: How to detect the text encoding of a file
on 01-27-2010 4:58 AM

Krzysztof and Anders are right, the preamble (AKA BOM, Byte Order Mark) is not always present and it is a Unicode-only thing.

So there's no way to detect which ANSI code page was used to write the text.

PS: I read once that Internet Explorer tries some heuristics based on the frequency of each character in the page.

x wrote re: How to detect the text encoding of a file
on 01-27-2010 7:58 AM

Heuristics is the only thing you can do - basically guessing.

BTW, these heuristics in IE (also for content type) were the cause of many security flaws in it.

Daniel Weck wrote re: How to detect the text encoding of a file
on 01-27-2010 8:11 AM

The Java port of Mozilla Charset Detector ("chardet") detects a broad set of text encodings using various heuristic patterns. It shouldn't be hard to port to C# (or generally-speaking to .NET):

www.mozilla.org/.../chardet.html

jchardet.sourceforge.net

Cheers, Dan

Surya wrote re: How to detect the text encoding of a file
on 01-29-2010 12:04 AM

There is one available for Python:

http://chardet.feedparser.org/

Enjoy.

