Lots of people get units wrong. Please don't be one of those people.
This is a tutorial introduction to getting your computing units right. The history of these issues is covered in extensive detail at Wikipedia.
File or Data Size
Everyone knows that computers store things as 1s and 0s. These are called bits (short for binary digits).
All modern computers represent things like letters and numbers with a group of 8 bits, which is called a byte. That is, there are 8 bits to every byte.
Convention is to use a lower-case b to refer to quantities of bits and an upper-case B to refer to a number of bytes. So 220 B refers to 220 bytes, while 220 b refers to 220 bits.
If we want to talk about lots of bytes there are SI prefixes available to us:
- k (kilo) 1,000
- M (mega) 1,000,000
- G (giga) 1,000,000,000
- T (tera) 1,000,000,000,000
So it is perfectly okay to talk about 936 kB, 1.24 Mb and 50 GB, referring to kilobytes, megabits and gigabytes respectively. It is not okay to talk about 936 mb, because the prefix m (milli) means one-thousandth, so that would be 936 thousandths of a bit. You can't have parts of a bit. Please don't use the wrong case.
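To make the case rules concrete, here is a minimal Python sketch that turns a written quantity into a number of bits. The function name to_bits and the two lookup tables are my own inventions for illustration, and it assumes the input is always written as a number, a space, then the unit.

```python
# Decimal (SI) prefixes. Case matters, so lower-case "m" is deliberately absent.
SI_PREFIXES = {"": 1, "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}
# Lower-case b is a bit, upper-case B is a byte (8 bits).
BITS_PER_UNIT = {"b": 1, "B": 8}


def to_bits(quantity):
    """Convert a string such as '936 kB' or '1.24 Mb' to a number of bits."""
    value, unit = quantity.split()
    prefix, symbol = unit[:-1], unit[-1]
    if symbol not in BITS_PER_UNIT:
        raise ValueError(f"unknown unit symbol {symbol!r}; use b or B")
    if prefix not in SI_PREFIXES:
        raise ValueError(f"unknown prefix {prefix!r}; check the case")
    return float(value) * SI_PREFIXES[prefix] * BITS_PER_UNIT[symbol]


print(to_bits("936 kB"))      # 936,000 bytes, expressed in bits
print(to_bits("1.24 Mb"))     # megabits, so no factor of 8
print(to_bits("50 GB") / 8)   # divide by 8 to get back to bytes
```

Feeding it "936 mb" raises a ValueError, which is about the right level of grumpiness.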
Binary Prefixes
Confusingly, people often use terms like kB, MB and GB while meaning something slightly different.
When a computer stores a bunch of bytes in memory it tends to group them into chunks of powers of 2. This is because it needs to be able to give each spot in memory an address and it's easier to build hardware conforming to:
All addresses of the form 000000xx xxxxxxxx are stored in chip 1, 000001xx xxxxxxxx are stored in chip 2, etc.
In this example every chip has 10 bits of address space and can store 1024 bytes. If you were lazy, you would say that each chip stores 1 kB of data. But that means 1,000 bytes, not 1,024!
Similarly, with 2^20 bytes you get approximately one million: 1,048,576. So you can't say that this is a "megabyte" really, because that would mean exactly one million bytes.
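The chip example can be sketched in a few lines of Python. This is only an illustration of the idea, assuming 16-bit addresses whose low 10 bits pick a byte within a chip; the names CHIP_ADDRESS_BITS, CHIP_SIZE and locate are made up for this sketch.

```python
CHIP_ADDRESS_BITS = 10                # low bits select a byte within a chip
CHIP_SIZE = 1 << CHIP_ADDRESS_BITS    # 2**10 bytes per chip


def locate(address):
    """Split an address into (chip number, offset within that chip)."""
    chip = (address >> CHIP_ADDRESS_BITS) + 1   # the text counts chips from 1
    offset = address & (CHIP_SIZE - 1)          # keep only the low 10 bits
    return chip, offset


print(CHIP_SIZE)                      # 1024, not 1000
print(locate(0b0000001111111111))     # (1, 1023): last byte of chip 1
print(locate(0b0000010000000000))     # (2, 0): first byte of chip 2
print(2**20)                          # 1048576, not 1000000
```

Because the split is just a shift and a mask, the hardware gets it for free, which is why the sizes end up as powers of 2 rather than round decimal numbers.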
There are two main solutions to this problem. The first and most common is to ignore it. This entails using kB, MB, GB, etc. and letting the person reading it guess exactly how many bytes you mean. This works most of the time because usually an approximate file size is good enough. Will the file fit on my flash drive? About how long will this take to download? Near enough is good enough.
But I'm an engineering student so I find this imprecision annoying, and sometimes we need to be explicit about what we mean. For this we have a method developed by the International Electrotechnical Commission:
- Ki (kibi) 1,024
- Mi (mebi) 1,048,576
- Gi (gibi) 1,073,741,824
- Ti (tebi) 1,099,511,627,776
If you ever see numbers like 3 KiB, 2.3 GiB, etc., you know exactly how many bytes they really mean. Better yet, if you always use those forms to represent the powers of 2, you also know that the ordinary SI prefixes always mean powers of 10.
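As a rough sketch of how software might print the same byte count both ways, here is a small Python formatter. The function format_size and its binary flag are hypothetical names for this example, and the two-decimal rounding is an arbitrary choice.

```python
SI_UNITS = ["B", "kB", "MB", "GB", "TB"]         # powers of 1000
IEC_UNITS = ["B", "KiB", "MiB", "GiB", "TiB"]    # powers of 1024


def format_size(num_bytes, binary=False):
    """Format a byte count with SI prefixes, or IEC prefixes if binary=True."""
    base, units = (1024, IEC_UNITS) if binary else (1000, SI_UNITS)
    value = float(num_bytes)
    for unit in units[:-1]:
        if value < base:
            return f"{value:.2f} {unit}"
        value /= base
    return f"{value:.2f} {units[-1]}"


size = 2**20
print(format_size(size))                # 1.05 MB
print(format_size(size, binary=True))   # 1.00 MiB
```

Same number of bytes, two different labels, and neither one is lying.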
The only downside is that it sounds ridiculous to say "mebibytes" out loud. But hey, it might be cool some day. It's only been around since 1996. If you look carefully, some computer software will express quantities in MiB or GiB. DC++ is an example.
Transfer Speeds
There are two main things to say about speeds. First of all, a transfer speed is how much data you can transfer in a given amount of time. Therefore the speed is an amount of data per second.
When speaking out loud it's convenient to talk about a "one-point-five megabit connection" but when written it should be "1.5 Mb/s".
Secondly, standard Internet connection speeds are quoted in bits per second, but when you download files in your browser it usually shows you bytes per second. If you have a 1500 kb/s connection, you can estimate your download speed in bytes per second by dividing that number by 8, which gives 187.5 kB/s. In reality it won't be quite that fast, but there's a big difference in the number depending on whether you're using bits or bytes.
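The divide-by-8 step is small enough to write down as a Python one-liner. The function name is my own, and the result is a best case that ignores any real-world overhead.

```python
BITS_PER_BYTE = 8


def bits_to_bytes_rate(rate_in_bits):
    """Convert a rate in bits per second to bytes per second (same prefix)."""
    return rate_in_bits / BITS_PER_BYTE


connection = 1500                       # advertised speed in kb/s
print(bits_to_bytes_rate(connection))   # 187.5, i.e. 187.5 kB/s at best
```

The prefix carries through unchanged, so kb/s in gives kB/s out.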
Conclusion
I'm not as pedantic as some people. I have a lecturer who will happily write distances in Mm (megametres) instead of thousands of kilometres like almost anybody else would do.
All the same, the chances are you'll run across these numbers quite a lot and if you know exactly what they mean, more power to you. I will also be happier if I am reading something you've written and don't have to make mental conversions.
Finally and most importantly: sometimes it can be ambiguous. There is such a wide range of internet connections available these days that a download speed of 1 MB/s or a download speed of 1 Mb/s could both be reasonable. If you're working out how long something will take to transfer you can't rely upon common sense to guess the correct form.
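To see how much the case of one letter matters, here is a quick Python sketch comparing the two readings. The 700 MB file size is a made-up example, and transfer_seconds is a hypothetical helper that ignores overhead.

```python
def transfer_seconds(file_bytes, rate, rate_in_bytes):
    """Time to move file_bytes at the given rate, quoted in bytes/s or bits/s."""
    bytes_per_second = rate if rate_in_bytes else rate / 8
    return file_bytes / bytes_per_second


file_size = 700 * 10**6    # a hypothetical 700 MB file
one_mega = 10**6

print(transfer_seconds(file_size, one_mega, rate_in_bytes=True))    # 700.0 s at 1 MB/s
print(transfer_seconds(file_size, one_mega, rate_in_bytes=False))   # 5600.0 s at 1 Mb/s
```

That is under twelve minutes versus over an hour and a half for the same written number.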
7 Comments
I think we should give up on written SI prefixes. Scientific notation ftw!
2,000,000 m = 2,000 km = 2 Mm = 2e6 m
You prefer binary prefixes? Ok, let's do that then.
2,048 B = 2 KiB = 2b10 B
Except that it isn't, at least in computing terms, very rarely ambiguous: a 1m file over a 1m connection is clearly a 1MB file over a 1Mb/s connection, since data rates are always bits/s and file sizes are given in bytes.
Dammit, grammar.
It isn't ambiguous. Not 'very rarely', it just isn't.
Only pedants care, because they like to care about such things.
I can think of two situations in which it would make sense to talk of fractional bits: when talking about averages of variable word-length encodings, and when using derived units, such as b s^{-1} or b m^{-2}. 936 mb could be interpreted as a 0.936 probability that a bit was transmitted, writable etc.
Anonymous #2, I thought you were talking about a one-metre file over a one-metre connexion.
http://en.wikipedia.org/wiki/Entropy_(information_theory)
Sorry for the spam, but this article is really cool. It has CW in it too :)
http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
There was a comment on Twitter too:
desplesda: @atan2 "You can't have parts of a bit" - sure you can, fractions of a bit are used all the time in information theory! #annoyingpedantism
Obviously that's correct. In derived units or in probability you can talk about fractional bits. My own pedantic response would be that I said you can't _have_ part of a bit, which I think implies static possession well enough. So there. :P
Awesome link to Shannon's paper, Andrew. I've been hearing a lot about that chap lately in my communications unit at uni, mostly in the context of maximum data rates through a channel. I'll definitely read it carefully sometime.