If you need to read binary files or binary streams, you need to know whether the file you’re reading is big-endian or little-endian.
It’s about the order that the bytes are stored in the file. For example the number 1, stored as a 32 bit integer, is 00000001 in hex, occupying 4 bytes. That means you need a sequence of four bytes in the file to store the number. If you put the byte that holds the most significant bits first, then the bytes look like this:
00 00 00 01
The official term for that is big-endian. On the other hand, if you put the least significant byte first in the file, then the bytes look like this:
01 00 00 00
That’s known in the trade as little-endian. Obviously, if you are writing code to read a binary file, you’d better make sure you know which convention is being used in the file. If your program tries to read the bytes the wrong way round, it’ll probably read the number 1 as if it were hex 01000000, which works out at about 16 million (16 777 216). If the number being read is the number of bugs assigned to your intern to fix before lunchtime today, and your (perfectly healthy) intern suddenly dies of a heart attack then – well, don’t say I didn’t warn you that you need to check the endian-ness!
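If you’d rather see that than take my word for it, here is a minimal sketch in Python (my choice for illustration, not the language of the app mentioned in this post). It uses the standard `struct` module to write the number 1 both ways, then deliberately reads the big-endian bytes with the wrong convention. The variable names are made up for the example.

```python
import struct

value = 1

# ">" selects big-endian, "<" selects little-endian.
# "I" is an unsigned 32-bit integer (4 bytes).
big = struct.pack(">I", value)
little = struct.pack("<I", value)

print(big.hex(" "))     # 00 00 00 01
print(little.hex(" "))  # 01 00 00 00

# Read the big-endian bytes as if they were little-endian:
# this is the "wrong way round" mistake described above.
misread = struct.unpack("<I", big)[0]
print(misread)          # 16777216
```

Same four bytes, two very different numbers, depending purely on which convention the reader assumes.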
As it happens I’m currently working with binary files. This is one of the files that the app I’m working on needs to read – viewed using the free Hex Editor Neo from HHD.
Each byte of the file is displayed as a 2-digit hex number, so the first two bytes in the file both contain zero, the third byte contains the value 0x27, the fourth byte contains the value 0x0a, and so on.
(If you must know, the file is an ESRI shape file that contains geographical data about UK railway lines. And the Hex Editor Neo is very good by the way. I highly recommend it if you want to view binary files.)
But back to the point, look at the four bytes I’ve circled in red. That is actually the number 1, stored in 32-bit big-endian format. On the other hand, the number I’ve circled in blue is actually the number 3, stored in 32-bit little-endian format – with the least significant byte first. What, both formats mixed up in the same file? Yep, sometimes, stuff in computing is just weird. Don’t blame me – I didn’t write the ESRI shape file format, I just have to read the darn thing.
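To show what reading a mixed-endian file looks like in practice, here is a rough sketch in Python. It pulls a few fields out of the start of a shape file header. The field offsets (file code at byte 0, file length at byte 24, version at byte 28, shape type at byte 32) are my reading of the published ESRI format description, so treat them as an assumption and check the spec before relying on them. The function name `read_shapefile_header` and the sample bytes are invented for this example; the sample is a synthetic header, not the railway file shown above.

```python
import struct


def read_shapefile_header(header: bytes) -> dict:
    """Parse a few fields from the start of a .shp main file header."""
    if len(header) < 36:
        raise ValueError("need at least the first 36 bytes of the header")
    # Assumed layout: file code and file length are big-endian...
    (file_code,) = struct.unpack_from(">i", header, 0)
    (file_length_words,) = struct.unpack_from(">i", header, 24)
    # ...while version and shape type are little-endian.
    version, shape_type = struct.unpack_from("<ii", header, 28)
    return {
        "file_code": file_code,
        "file_length_words": file_length_words,
        "version": version,
        "shape_type": shape_type,
    }


# Build a synthetic 100-byte header purely for illustration.
sample = bytearray(100)
struct.pack_into(">i", sample, 0, 9994)       # 9994 is hex 0000270a
struct.pack_into(">i", sample, 24, 50)        # length counted in 16-bit words
struct.pack_into("<ii", sample, 28, 1000, 3)  # version, then shape type 3

fields = read_shapefile_header(bytes(sample))
print(fields)
```

The point is simply that the byte order is chosen per field, not per file, so the reading code has to say which convention it wants every single time.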
Personally I like big-endian because it’s easy for a human to read. With little-endian you have to mentally reverse the order of the bytes to see what number it is, but with big-endian you don’t have to do that. That’s because we always write numbers in big-endian on paper. In the number 73, the 7 is the most significant digit, and it comes first.
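That mental reversal is easy to hand over to the computer. A small sketch in Python, assuming you already have the raw bytes of a single integer (the helper name `swap_endianness` is hypothetical):

```python
def swap_endianness(raw: bytes) -> bytes:
    """Reverse the byte order of one integer's raw bytes."""
    return raw[::-1]


little = bytes.fromhex("03 00 00 00")  # the number 3, little-endian
big = swap_endianness(little)

print(big.hex(" "))  # 00 00 00 03

# Both byte sequences mean the same number once each is read
# with its own convention.
assert int.from_bytes(little, "little") == int.from_bytes(big, "big") == 3
```

Note that this only works one integer at a time. Reversing a whole file would scramble the order of the values as well as the bytes inside them.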
By the way, which convention do you think Windows usually uses? Yep, you got it, Windows normally uses little-endian. Just to make things difficult for humans. Actually, it’s not really Microsoft being awkward – there are good historical reasons for this choice, to do with (roughly speaking) little-endian being slightly simpler to build hardware for, a very long time ago when hardware was expensive. And the result is that normally, the hardware that Windows runs on is itself little-endian. But that’s getting way off topic for what I want to talk about.
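If you want to check what your own machine does rather than trust a generalisation, Python will tell you. This is only a sketch: on a typical Windows PC I’d expect it to report little-endian, but the output depends on the hardware you run it on.

```python
import struct
import sys

# "little" or "big", depending on the machine running this.
print(sys.byteorder)

# "=" asks struct for the machine's native byte order.
native = struct.pack("=I", 1)
print(native.hex(" "))

# The native packing should match whichever explicit convention
# sys.byteorder reports.
explicit = struct.pack("<I" if sys.byteorder == "little" else ">I", 1)
assert native == explicit
```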
So let’s recap:
- Little-endian means that the least significant byte comes first. It’s what you’ll normally find Windows uses by default.
- Big-endian means that the most significant byte comes first.
So the concept is quite simple. I doubt anyone has any trouble understanding the idea. Where the trouble comes is in trying to remember which is which.
And the terminology really doesn’t help here. Arguably, both the words big-endian and little-endian actually mean the opposite of what you’d logically think if you break the words down. The word big-endian makes it sound like the big (i.e. most significant) bits come at the end (i.e. last). But it’s the opposite – big-endian means the big bits come first.
Similarly little-endian means that the little (least significant) bits come first, even though it sounds like they should come last.
It would really make a lot more sense if the conventions were called big-first and little-first. But they aren’t and it sucks and there’s not much you or I can do about it.
And that’s my mnemonic to remember the convention with: I just remind myself that lots of things in programming are warped, strange, and counterintuitive, and big-endian/little-endian follows that pattern perfectly.
Actually, we’re not done yet. We still need a way to remember that the Windows world generally tends to use little-endian. In fact the idea of little-endian is so strongly embedded in Windows that several of the .NET stream classes are, out of the box, only capable of reading little-endian data. And that is oh so very useful when you have to write code to read – say, to pick a random example – an ESRI shape file that contains big-endian data. But I digress, that’s for the next TechieSimon article.
Let’s get back on topic and think: How do you remember that MS is little-endian?
My solution is: Just remember (again) that stuff in programming is warped and counter-intuitive. From the point of view of a human being, little-endian makes no sense whatsoever, because it’s the opposite of how you normally read numbers. But if you remember that everything is warped in programming then you’re fine. (Note: This works if you’re doing Windows desktop programming. Obviously, it’s not going to help you if you’re working on some other architecture that does in fact use big-endian).
So, there we have it. Little-endian = Little bits first because the terminology is warped. MS uses little-endian which is hard to read, because computing is warped. Simple, huh!
But, having figured out how to remember it, how do I really cement that in my brain?
Ah! Got it! In my experience, you understand things better after you’ve explained them to other people. So… I’ll write a blog about it! The act of writing a few witty put-downs about endian-ness should seal the knowledge deep in my brain, never to be forgotten.
And I hereby apologize that I have now written the blog. You’ve almost finished reading it, in fact. That means the blog is done and is publicly available for anyone to read, so there is no reason at all for you to write one. That means you’ll never be able to remember which way round is little-endian without looking it up. Hah! Maybe it’s time for you to bookmark the techiesimon link that’s currently sitting in your browser’s address bar just above these words. Nah!
Next time: How to read big-endian in .NET. Extension methods to the rescue.