

All posts by Dan Whalen (LinkedIn, Github)

Monday, June 9, 2014

Median vs Mean

The Average

Most of us learned our way around a mean early on in life.  When we were kids, a decent chunk of our lives was governed by a mean.  I'm referring, of course, to your averages at school.

Traditionally, we use the mean to calculate a student's overall mark for a class.  Back when you were a kid, your mean grade (colloquially, your "average") on tests, quizzes, and homework determined what you'd be doing with your summer, the timbre of your relationship with your parents, and how much time you'd be spending around your friends at school.  Your world revolved around a mean!

Later on, maybe not even until college, you learned about another type of average: "the median."  You memorized how to calculate it, maybe had to answer a question about it on a Stats 101 exam - and then promptly forgot all about it.

What you probably never got from your teacher was an analysis of what the two numbers signify.  Why have multiple types of "averages" anyhow?  When should I use the median, and when should I use the mean?  And when I see means and medians in print, in the news - in "the wild" - how should I interpret them?

In short, you probably know how to find a mean and a median, but don't really know how to make use of them.

Aw, but don't worry, man.  It's not your fault.  Unfortunately, this is a common problem in our math curriculums - we teach kids calculation tricks, but rarely manage to teach the thinking behind them.

So here's my little crash course on the median and the mean.  Read to this post's end, and you'll not only know the difference between the two, but you'll know enough to distinguish when and where to use 'em!

The mean

I'll start off with the mean, since that's the one you're more familiar with.

Look at this data series:

3, 4, 5, 1, 3, 18, 2, 6

Let's say you want to "simplify" this set, to distill it down to a single number that summarizes it.

To do that, we can calculate its mean.  We add up all the numbers, then divide the sum by how many there are.

3+4+5+1+3+18+2+6 = 42
42 / 8 = 5.25

So the mean of this series is 5.25.
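The post itself doesn't use code, but as a quick sketch, the same calculation in Python looks like this:

```python
# Mean: add up all the numbers, then divide by how many there are.
data = [3, 4, 5, 1, 3, 18, 2, 6]

mean = sum(data) / len(data)
print(mean)  # 5.25
```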

The median

Ok, starting with the same series:

3, 4, 5, 1, 3, 18, 2, 6

To get the median, arrange the data in value order (ascending or descending, it doesn't really matter).  The median will be the number dead center.

1, 2, 3, 3, 4, 5, 6, 18

Now, since our data series has an even number of values, there isn't one number dead center - there are two values sharing the middle.

Because 3 and 4 are our central values, we split the difference, and go right in between the two.  Our median is 3.5.
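As a sketch in Python (again, my own illustration, not part of the original post), the sort-then-pick-the-middle procedure looks like this:

```python
# Median: sort the values, then take the middle one
# (or the midpoint of the two middle values when the count is even).
data = [3, 4, 5, 1, 3, 18, 2, 6]

s = sorted(data)          # [1, 2, 3, 3, 4, 5, 6, 18]
n = len(s)
if n % 2 == 1:
    median = s[n // 2]
else:
    median = (s[n // 2 - 1] + s[n // 2]) / 2

print(median)  # 3.5
```

Python's standard library will also do this for you: `statistics.median(data)` gives the same 3.5.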

What's the difference?

A mean of 5.25, and a median of 3.5.  Why is the mean so much higher?

The problem is that pesky 18.  It's three times the next largest value (the 6).  That 18 is kind of an outlier, and it sways the results of the mean.

Since calculating the mean requires giving exactly equal weight - equal significance - to every number in the series, the 18 skews our average high.  The mean is the point where the upper half of the data and the lower half are "in balance."

On the other hand, the median gives 100% of the significance to whatever number happens to be sitting in the middle seat.  Therefore, the extremeness of the tail ends of the series has no effect on the median.

The lowest number in our example series could have been .0001, and 1000 the highest, and the median would still be 3.5.
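You can check that claim directly.  Here's a quick Python sketch using the standard library's `statistics` module:

```python
from statistics import mean, median

original = [1, 2, 3, 3, 4, 5, 6, 18]
stretched = [0.0001, 2, 3, 3, 4, 5, 6, 1000]  # extremes pushed way out

# The median doesn't budge...
print(median(original), median(stretched))  # 3.5 3.5

# ...but the mean gets dragged along by the extremes.
print(mean(original), mean(stretched))      # 5.25 vs ~127.9
```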

And that's really the point of having a mean, AND a median.  The mean reflects all values in the series, and so it's influenced by the degree of inequality, the degree of skew, in the data series.  It's for telling the story of the whole set.

The median tells a story of the typical, usual, normal, middle-of-the-road guy. 

Both the mean and the median are measures of "central tendency," a measure of the approximate "center of mass" of the series.  The numbers can swing high and low, but always around that central anchor.

Use the median if you just want to talk about where the anchor is.  Use the mean if you want to talk about the entire range through which the numbers swing.

My personal rule of thumb

So I kinda lied.  There's no "rule," per se.

It all depends on the situation.  There's not really a hard and fast answer to when and where using the median is "correct," and where the mean is the "correct" answer.  It all depends on finding the right way to tell the story you want to tell.

But I will give you my opinion.  My shortcut, my heuristic.  Don't put this as an answer on your test.  It's not the official, "right" way of thinking about the subject.  This is just between you and me.

My rule of thumb is:

Use the mean when comparing group A to group B.

Use the median when comparing the individual members of group A to each other.

To illustrate

Want to know how much shorter the Kindergarten class is than the First Graders?  Compare the mean height of the Kindergarteners to the mean of the First Graders.  

But what if you want to know whether Jimmy is particularly short for a Kindergartener?  Then compare his height to the median height of the Kindergarten class.

Want to know if homes in Buffalo are pricier than homes in Boston?  Compare mean home prices.

But if you're just curious how your house stacks up against your neighbors', I'd use the median for the mark of comparison.

Take a moment to logic it out, and you'll see where I'm coming from.

My reasoning

First, the median is a good tool for pitting members of a group against each other because it tempers the effects of abnormal extremes.

One weirdly tall Kindergartener, one ridiculously expensive home on the block - that's all it takes to skew a mean.  Sammy hitting a growth spurt doesn't mean the rest of his class is physically shrinking!

This is why economists pretty much always talk about incomes and house prices in terms of medians.  If they used the mean instead, growth in inequality would look like growth for all, for example.  One guy's extreme fortune would make it look like everyone is doing better.

Economists stick to medians to keep the typical, common income or home - the kind that most people actually have - at the center of the story.
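A little sketch (with made-up incomes, in thousands) of why economists prefer the median here - one windfall moves the mean a lot, and the median not at all:

```python
from statistics import mean, median

# Hypothetical incomes for a small town, in thousands.
town = [30, 35, 40, 45, 50, 55, 60]
# The same town, after one resident strikes it rich:
town_after = [30, 35, 40, 45, 50, 55, 6000]

print(mean(town), mean(town_after))      # 45.0 vs ~893.6 -- "growth for all"?
print(median(town), median(town_after))  # 45 both times
```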

Second, if you compare your income to the mean income of your peer group, you'd be sorta "double-counting" yourself.

Here's what I mean: Say you weren't making much.  Your low income is already bringing the mean income of your peer group down.

Now you compare your income to that mean.  Because your income is skewing the mean down, your income won't look as low as it actually is.  Your income is pulling the mean down closer to itself, making the group "typical" salary look a lot more like your own.

Compare it to the median instead, and you'll be looking at where you actually fall in the income distribution.  Like, if this were a competition, what place you'd be in.  If we ranked everyone by income, the median would be the guy right in the middle.  How many spaces do you have to move up to meet him?
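Here's the "double-counting" effect in a quick Python sketch, with invented salaries (in thousands):

```python
from statistics import mean, median

peers = [80, 85, 90, 95, 100]  # everyone else's salary (thousands)
mine = 40                      # my (low) salary

group = peers + [mine]

# My own 40 drags the group mean down toward me...
print(mean(group))   # ~81.7
# ...compared to the mean without me in it.
print(mean(peers))   # 90

# The median shows where I actually rank: half the group earns above this.
print(median(group)) # 87.5
```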

Ultimately, my reasoning comes down to this: the mean is the story of the total value in the set.  The median is the story of the ranking of those values.  The mean is about the group, the median is about the individuals in it.  So use them accordingly.

But in the end, it's on you

Either way, that's just my two cents.  The big take-away here should be that though they are both "averages," the mean and the median are different.  Be cognizant of that.

And when you see someone throwing these values around, just stop and think about how they were calculated.  I promise, that is the easiest way to work out what the figures are actually telling you.

A side note on Statistics...

The odd thing about stats is that there doesn't seem to be many hard and fast rules to it.  

The rules can be a little murky.  There's a reason for this though: statistics aren't exactly "real" numbers that fit just right in the "real" world.

To understand why, keep in mind the fact that all "statistics" are man-made numbers.  Statistics do not occur in nature.

There's no mountain range that limits its own peaks to an "average" elevation, no dog population that designates every 10th pup exactly to be the runt.  Mankind uses statistics to describe what's already out there in nature.  Nature doesn't bend itself specific ways in order to conform to our stats.

Statistics is just our way of making real data artificially easy to work with.

Think of it this way:  Imagine in the distant future, humans evolve these giant, veiny, Megamindy heads, with super powered brains inside of them.  

Let's say that hominid is able to look at a messy table of numbers - columns and rows of gobbledy-gook forever - and immediately see and pick out trends.  This guy reads streams of raw data the way you and I read books.

They would see what's going up, what's coming down.  They'd understand immediately if variable A is more volatile than variable B.  It would be obvious to them the range values are moving within, and with what speed and degree they fluctuate.

BUT WE ARE NOT THIS GUY.  Our minds aren't evolved to the point where we can look at a list of numbers and immediately glean the relevant information from them.

We simply do not digest huge quantities of numbers this way.  We need summaries.  We need the over-arching trends.  We need "just the gist."  We need to transform our data into simpler forms before we can swallow it.

We need statistics.

"Statistics" is the art of summarizing numbers.  It's how we can turn a series of data into a single, easy to digest factoid.  

Statistics are tools humans invented to make working with figures easier.  

As such, we should be careful how we think about them.  The average height of a Kindergartener is not the actual height of your Kindergartener.  The median income of a citizen of your city is not your actual income.

It's something of its own class entirely...


  1. You appear to have a typo in your 'Rule of Thumb' section... both lines refer to the mean. ;p

    So, yes... you are correct to say there's no hard rules in statistics. Statistics is like a big toolbox, and which one makes sense depends on the question you want to answer and the data at hand.

    However, the thing to take away is that the mean and the median often don't tell a whole story together. They are two minor tools in the toolbox: if the mean and median are really disparate, it could mean all sorts of crazy things about your data, like maybe it's multi-modal or maybe it's really skewed. You need other tools in the toolbox to start to differentiate those things.

    Furthermore, the arithmetic mean (BTW there's a multiplicative mean) is grounded strongly in our understanding of normal distributions, and our typical assumption that datasets of many types conform to that distribution. Central tendencies can be misleading if, for example, your data is uniformly distributed.

    Which is the really key thing: the tools in the statistical toolbox don't work without their batteries, and their power source is ASSUMPTIONS. You can't say squat if you aren't willing to say something about the numbers beyond that, yep, those are the numbers there...

    Taking the mean implicitly assumes that your data is a random sample of an infinite amount of data with a distribution that looks normal-ish and isn't skewed and is unimodal and not bound strongly at zero or anything like that. And maybe those are fine assumptions and maybe they aren't. The median makes fewer assumptions (it has more in common with so-called 'non-parametric' statistics), which is why it's useful, but the why and the how and when each is most useful is really up to the user. Ultimately, statistics is more like philosophy and logic than it is like math.

  2. And statistics is a lot more than parameter estimates, like taking the mean, by the way. Opposed to your bit about statistics being 'the art of summarizing numbers' (ehhhhh), I'd say it is more 'the science of figuring out how to answer questions with data'. A lot of what real statisticians do is inventing new methods to test hypotheses and figuring out how to deal with the various biases of different data-types.