Wednesday, May 22, 2013

Les Poissons, Les Poissons

(hee hee hee, haw haw haw)

In WWII, the British were getting regularly bombed by V-1 flying bombs and V-2 rockets. After a while, it turned out that certain neighborhoods of London were getting pelted a ton, while others weren't. Some people started to suspect that the Germans had exceptionally accurate bombing capabilities, and were avoiding neighborhoods where their spies lived.

In order to figure out what was going on, British statistician R. D. Clarke broke the city into a grid and counted the number of hits per area. His results almost perfectly matched what you'd expect from random chance, following what is known as the Poisson Distribution.

The Poisson Distribution is really good for figuring out the likelihood of a given number of events occurring over a fixed amount of time or space, based only on the expected number of those events. Clarke took the total number of hits and divided it by the number of grid squares, which gave an expected average of approximately 0.93 hits per square. The Poisson Distribution for this suggests that about 40% of London would never get bombed at all, while almost 25% of London would get bombed two or more times - and that's exactly what happened. It turned out that the Nazis weren't accurate at all, and it was just random chance that determined who would get hit on any particular day.
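If you want to check those percentages yourself, the calculation is short. Here's a minimal sketch in Python, taking the roughly 0.93 expected hits per grid square as given and just applying the Poisson formula:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of seeing exactly k events when the expected count is lam."""
    return lam ** k * exp(-lam) / factorial(k)

lam = 0.93  # expected rocket hits per grid square, from Clarke's data

p_zero = poisson_pmf(0, lam)                                # squares never hit
p_two_plus = 1 - poisson_pmf(0, lam) - poisson_pmf(1, lam)  # squares hit twice or more

print(f"P(0 hits)  = {p_zero:.0%}")      # ~39%
print(f"P(2+ hits) = {p_two_plus:.0%}")  # ~24%
```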

Now a happier topic: happy birthday!

On the off-chance that it actually is your birthday and you're reading this, then this is quite the fortuitous event. Lucky me! Go have some cake instead of reading my blog, sillypants.

Birthdays are spread out over a long period of time, which makes them great for demonstrating several fun statistics concepts. One of them is the always-fun birthday paradox, which says that it only takes a random group of 23 people before there's at least a 50% chance that two of them share a birthday. This number probably seems absurdly low, but it's completely true. In fact, this is one of the many fun party games you can try that involve math!
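The quickest way to convince yourself of the 23-person figure is to compute the complement - the probability that everyone's birthday is different - and subtract it from one. A minimal sketch, ignoring leap years and assuming birthdays are spread uniformly over the year:

```python
def prob_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday,
    assuming birthdays are independent and uniform over `days` days."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1 - p_all_distinct

print(f"{prob_shared_birthday(22):.1%}")  # ~47.6%
print(f"{prob_shared_birthday(23):.1%}")  # ~50.7% - past 50% at 23 people
```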

If you're a little bit like me, sometimes you like to wish your friends happy birthday, and facebook is a handy tool for reminding you to do that. If you're a lot like me, you might notice that there are extreme variations in the number of friends who have a birthday on any given day.

Obviously if you have fewer than 365 Facebook friends you're guaranteed to have several days where nobody has birthdays. Once you get over 365 friends, though, you might expect that your friends' birthdays would even out, and that eventually you might run out of days with no birthdays.

But as you may have noticed, some days tend to have way more birthdays than others. For instance, 7 of my friends have birthdays on January 28th, and 6 of my friends have birthdays on January 5th. Does this mean that a bunch of my friends' parents got frisky in April?

Turns out... probably not at all. I have 546 Facebook friends who have their birthdays listed, meaning that on average I should expect 1.496 birthdays per day. Intuition would suggest I should get mostly 1-2 birthdays per day, with the occasional day with 0 and sometimes 3. Can the distribution of number of birthdays be predicted, though?

Yes! This sort of problem appears absolutely perfect for the Poisson Distribution. In fact, when I went through and jotted down the number of birthdays on each day of the year, this is what I got:


It's impressive how close the two are. Also, maybe counter to what you'd expect, a full 20% of the year has no birthdays. My Facebook friends fit the Poisson Distribution with a Coefficient of Determination of 0.984.

In reality, the fact that there are days where 6 or 7 of my friends have birthdays isn't exceptional and rare, but expected - it would actually be weird if there weren't any. In order to expect no days with 0 birthdays, you would have to have over 2,150 friends - expecting an average of almost 6 birthdays per day.
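If you'd like to check this against your own friend count, here's a minimal sketch of the calculation. It models the number of birthdays on any given day as Poisson with a rate of friends/365 (the 546 is just my friend count from above; swap in your own):

```python
from math import exp

DAYS = 365

def expected_empty_days(n_friends, days=DAYS):
    """Expected number of calendar days with zero birthdays, modelling the
    birthday count on each day as Poisson with rate n_friends / days."""
    lam = n_friends / days
    return days * exp(-lam)

print(round(expected_empty_days(546)))  # ~82 days with no birthdays (~22% of the year)

# Smallest friend count where we'd expect fewer than one empty day all year
n = 1
while expected_empty_days(n) >= 1:
    n += 1
print(n, round(n / DAYS, 1))  # 2154 friends, averaging ~5.9 birthdays per day
```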

So that's your fun statistics thought of the day - whenever you have a large enough random sample, what normally might seem like an outlier actually tends to confirm just how random it really is.

Thursday, May 16, 2013

Homeopathy: Worse than "Just Water"

In 1796, German physician Samuel Hahnemann first proposed the idea of homeopathy. Based on the concept that "like cures like", homeopathic remedies attempt to cure symptoms suffered by patients by using highly diluted concentrations of a substance that would normally cause the same symptoms. However, these remedies are ineffective, wasteful, and can be damaging if they stop people from seeking real medical attention.

At first glance, the principle behind homeopathy may not seem that far-fetched. For instance, some vaccines are basically just preparations made from a non-life-threatening version of a disease in order to prepare your body to fight off that disease, and vaccines are great. Homeopathy, though, takes this concept - fighting an effect with its own cause - way too far, well past the point of plausibility.

An actual example from the Canadian Society of Homeopaths is the use of onion juice as a remedy for hay fever. Onions cause runny noses and itchy eyes, and hay fever causes runny noses and itchy eyes. Homeopathy claims that since both of these cause the same symptoms, onion juice can also cure the symptoms of hay fever.

Let me reiterate: homeopathy claims that using more of something that causes your problems will end up curing your problems. They literally claim that two wrongs make a right. Similarly, they claim that oysters can cure indigestion, arsenic stops diarrhea, and mercury can cure chronic pain.

If homeopathy doesn't already seem ridiculous, it's about to. It's quite obviously not a good idea to ingest arsenic and mercury, and homeopathic remedies certainly wouldn't sell if they immediately killed the people who bought them. This is where the second major claim of homeopathy comes into play: the more a remedy is diluted, the more potent its healing powers will be.

While having the obvious advantage of avoiding killing people by directly poisoning them, the dilutions used in most homeopathic remedies don't help the plausibility of homeopathy as a practice. Homeopathic remedies are commonly prepared by performing a series of dilutions by a factor of 100, where the number of dilutions is referred to as the C number of the remedy. For example, they could take one millilitre of an ingredient, add it to 99 millilitres of water and shake it, then take one millilitre of that and add it to a new 99 millilitres of water, and get a 2C solution. The original number of dilutions proposed by Hahnemann was 30C - a series of thirty 1:100 dilutions. This is an equivalent ratio of one part active ingredient in 1,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 (one novemdecillion, or 10^60) parts water.

For reference, a 6C homeopathic solution of salt water is supposed to help with irritability, but that's 40 billion times less concentrated than the salt in the ocean. Dissolving all the liquid that passes through an average 80-year-old over their lifetime into the combined water of all the world's lakes, oceans, and rivers would give a dilution of around 8C. What's intriguing is that if you started with a solution containing one mole of original material, at a dilution of 12C there's only about a 60% chance of finding a single molecule of the active ingredient, and at a dilution of 13C, you're looking at pure water. A common over-the-counter homeopathic remedy at 30C isn't just essentially water - it is water.
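Those 12C and 13C figures are easy to check: starting from one mole of active ingredient (about 6 × 10^23 molecules), each C step divides the molecule count by 100. A minimal sketch of that arithmetic:

```python
AVOGADRO = 6.022e23  # molecules in one mole of starting material

def expected_molecules(c_number, starting_molecules=AVOGADRO):
    """Expected number of active-ingredient molecules remaining after
    c_number successive 1:100 dilutions of one mole of starting material."""
    return starting_molecules / 100 ** c_number

for c in (6, 12, 13, 30):
    print(f"{c}C: {expected_molecules(c):.2g} molecules expected")

# 6C:  ~6e+11 - plenty of molecules left, but already a trillion-fold dilution
# 12C: ~0.6   - the "60% chance of a single molecule" mark
# 13C: ~0.006 - effectively pure water
# 30C: ~6e-37 - you'd need about 10^36 doses to expect a single molecule
```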

The most popular homeopathic remedy sold is Oscillococcinum, a 200C dilution of duck liver that is supposed to help with the flu. There are approximately 10^80 atoms in the universe, so in order to have enough water to get one molecule of duck liver extract in a solution at this dilution, one would need 10^320 universes' worth of water. Somehow this is still considered a very potent concentration. Oscillococcinum is labelled as consisting of 0.85 g sucrose and 0.15 g lactose - these are 100% sugar pills.

Homeopathic practitioners can't easily argue with math, and most will readily admit that there is no active ingredient present in the remedies that they sell to patients. Instead, they claim that water has a "memory" and that the dilution and shaking process involved in preparing a homeopathic solution leaves an imprint on the water, thus changing its properties.

There is absolutely no evidence for this. Logically, it also doesn't hold up. As has been pointed out by countless comedians and scientists, it doesn't make any sense that water could remember the tiny amounts of poison once dissolved in it, but forget all the sewage it's come in contact with. Both the ability of water to retain an imprint of a chemical, and its ability to selectively forget other chemicals, would violate fundamental laws of physics.

So homeopathy is based on a principle that doesn't make sense, at concentrations that are nonexistent. Yet it is still around. In the United Kingdom, the Royal Homeopathic Hospital has the Queen as its official patron, and in France, homeopathic remedies like Oscillococcinum are among the most common treatments for the flu. Why?

One leading theory is that any perceived benefits of homeopathic water are likely due to a mix of confirmation bias and the placebo effect. Studies have shown that just the act of going to a clean hospital-like environment, listening to a compassionate doctor (or someone perceived to be a doctor), and taking something that you believe to be a medicine can often help with certain conditions. Placebos can have a measurable effect on improving health in certain circumstances, and are well-enough studied that we know a fake needle is more effective than a fake capsule, which is in turn more effective than a fake pill.

Confirmation bias likely also plays an important role in people's opinions on homeopathy. As an example, homeopaths describe a process known as 'homeopathic aggravation' - a temporary worsening of symptoms following a dose of a homeopathic remedy. When a patient takes the dose and then starts to feel worse, the homeopath can claim that's all part of the plan, reassuring the patient and convincing them that the remedy is working. Of course, an alternative explanation for homeopathic aggravation is self-evident - a patient notices they have a runny nose, takes some non-medicine, and then the rest of the flu hits and they claim that's the 'aggravation'. Well, no - that's just how diseases work when they aren't subject to actual medicine.

After hearing these arguments, a homeopath may ask to agree to disagree: the remedies are so dilute they can't possibly have side effects, and they bring comfort and maybe some benefit to the people who buy them, so surely there's no harm in offering them as an alternative medicine.

The harm comes when people spend money on homeopathic remedies, mistaking them for real medicine, instead of going to real doctors. The placebo effect can have noticeable effects on health, but isn't generally capable of curing cancer. Patients with serious medical conditions who forego proven treatments from real doctors put themselves at severe risks they otherwise wouldn't need to experience, much like those who choose to avoid vaccines or sunscreen.

Homeopathy sold alongside real medicine, courtesy of your local Safeway.

Homeopathic remedies do not need to pass the same testing requirements as real medicine to be sold in Canada, yet they can still be sold deceptively on a grocery store shelf beside real medicine, as though they were equally valid options. Someone desperate for symptom relief who doesn't know any better could easily end up buying sugar pills and pure water by mistake, costing them money and delaying real treatment. This is where the real harm of homeopathy comes from.

Drug regulations require strict testing for a reason, and if homeopathy cannot provide evidence that it works or at least a plausible working mechanism then it should not be portrayed and sold commercially as an equally viable form of treatment.

Monday, May 13, 2013

NHL Playoffs: Two Weeks In

Hey there!

The playoffs have been going on for two weeks now, and I am pleasantly surprised to say that they've been going pretty well relative to what my model predicted. In fact, in each of the six series that have wrapped up so far, the winning team was the one my model gave the higher probability. Specifically, my model originally gave the following:

Blackhawks (77.0%) to beat Wild (23.0%)
Red Wings (59.1%) to beat Ducks (40.9%)
Sharks (62.2%) to beat Canucks (37.8%)
Kings (56.9%) to beat Blues (43.1%)
Penguins (64.2%) to beat Islanders (35.8%)
Senators (84.1%) to beat Canadiens (15.9%)

It also predicted the following at the outset:

Rangers (62.1%) to beat Capitals (37.9%)
Bruins (76.7%) to beat Maple Leafs (23.3%)

These last two series will be wrapped up tonight, and hopefully I can keep my success streak up. Currently, though, with a 6-0 record I am very pleased with the model so far. Wish me luck!

Today's post is gonna look a little bit at some of the behind-the-scenes math that goes into this model.

What's really important is to be able to take the odds of winning an individual game and convert those into the odds of winning the series as a whole. Fortunately this can be done pretty easily using a binomial distribution.

It turns out that there are a grand total of 70 ways for a best 4 out of 7 series to work out. They break down as follows:

  • 2 ways for a 4-0 (or 0-4) sweep (12.5% chance if teams are even)
  • 8 ways for a 4-1 or 1-4 finish (25.0%)
  • 20 ways for 4-2 or 2-4 (31.25%)
  • 40 ways for 4-3 or 3-4 (31.25%)

Because an NHL playoff series runs anywhere from 4 to 7 games, any advantage that a team has in an individual game gets compounded. For instance, a 50/50 chance of winning a particular game translates to a 50/50 chance of winning the series, but a 60/40 chance of winning each game becomes roughly a 70/30 chance of winning the series as a whole. This can be visualized as follows:


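Since the actual spreadsheet formulas aren't shown here, here's a minimal sketch of the same binomial idea in Python: sum the probabilities of clinching in 4, 5, 6, or 7 games, assuming a constant per-game win probability p and independent games.

```python
from math import comb

def series_win_prob(p, wins_needed=4):
    """Probability of winning a best-of-seven series, given a constant
    per-game win probability p and independent games."""
    total = 0.0
    for losses in range(wins_needed):             # 0 to 3 losses before the clincher
        games_before_clincher = wins_needed - 1 + losses
        sequences = comb(games_before_clincher, losses)
        total += sequences * p ** wins_needed * (1 - p) ** losses
    return total

print(f"{series_win_prob(0.5):.3f}")  # 0.500 - an even matchup stays even
print(f"{series_win_prob(0.6):.3f}")  # 0.710 - a 60/40 game edge becomes roughly 70/30
```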
The way that I've set up my model allows for the number of games already won to factor into the series probability, which is convenient for letting the model update every day following the results of the previous night's games. The effect of having a game in hand looks something like this:


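The games-in-hand effect can be sketched with a small recursion - again, not the actual Excel setup, just the idea: condition on the current series score and keep applying the same per-game probability to whatever games remain.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def series_win_prob_from(wins_a, wins_b, p, wins_needed=4):
    """Probability that team A wins the series from a score of (wins_a, wins_b),
    with a constant per-game win probability p for team A."""
    if wins_a == wins_needed:
        return 1.0
    if wins_b == wins_needed:
        return 0.0
    return (p * series_win_prob_from(wins_a + 1, wins_b, p)
            + (1 - p) * series_win_prob_from(wins_a, wins_b + 1, p))

print(f"{series_win_prob_from(0, 0, 0.6):.3f}")  # 0.710 before the series starts
print(f"{series_win_prob_from(1, 0, 0.6):.3f}")  # 0.821 after going up a game
print(f"{series_win_prob_from(0, 1, 0.6):.3f}")  # 0.544 after dropping the opener
```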
One other factor that could have an effect is home-ice advantage. The series format comes close to balancing the number of home games between the two teams, but whenever a series ends after an odd number of games, the team that opened at home ends up having played one more home game - so they ought to have an advantage, right?

Looking at the last 3 seasons of the NHL, 54.55% of games are won by the home team and 45.45% of games are won by the away team. If we factor this into the model, we get something like this:


Well that's not much of an advantage at all, is it? Probably a good thing.
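For the curious, here's a rough sketch of how small that edge works out to be when the only thing separating two otherwise-even teams is home ice. It assumes the standard 2-2-1-1-1 home/away pattern and uses the 54.55% home win rate from above:

```python
HOME_WIN = 0.5455  # share of games won by the home team (last three NHL seasons, per above)

# Under the 2-2-1-1-1 format, the higher seed hosts games 1, 2, 5, and 7
HOME_GAMES = {1, 2, 5, 7}

def series_prob_home_ice(wins_a=0, wins_b=0, game=1, wins_needed=4):
    """Probability the higher-seeded team wins the series when the two teams
    are otherwise even and the only edge is home ice."""
    if wins_a == wins_needed:
        return 1.0
    if wins_b == wins_needed:
        return 0.0
    p = HOME_WIN if game in HOME_GAMES else 1 - HOME_WIN
    return (p * series_prob_home_ice(wins_a + 1, wins_b, game + 1)
            + (1 - p) * series_prob_home_ice(wins_a, wins_b + 1, game + 1))

print(f"{series_prob_home_ice():.3f}")  # ~0.514 - barely better than a coin flip
```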

So there you go. See you again next week!

Monday, May 6, 2013

NHL Playoff Predictions

The NHL playoffs are upon us, and for the third time I'm dusting off my Excel playoff model to see if I can predict who's going to win.

As it stands (as of May 6th, 2013), my model predicts that the most likely final will be between the Ottawa Senators and the Chicago Blackhawks. Altogether, though, the top four teams are the Senators, Blackhawks, Bruins, and Sharks (collectively these account for a 77% chance of winning the whole thing).

One of the ways that I've been presenting the daily updates from the model is as follows:

As time progresses (along the bottom), the height of each colored segment represents the relative probability of that team winning. For instance, when the Senators lost on May 3rd, their bar shrank noticeably, and grew again after they won on the 5th. Again, the Bruins, Senators, Blackhawks, and Sharks account for a massive amount of the graph (and hopefully don't lose in the first round... that would be awkward).

So what makes me think I'm anywhere near accurate? If you asked me in person, I'd scratch my head and shrug a little. Particularly concerning are the long odds that sites like SportsClubStats and Bet365 give to some of the teams I predict to have a good chance of winning.

There are a couple of suggestions that I'm not totally inaccurate, though. Here are some of the results from previous years:
2010: Only correctly predicted the Blackhawks halfway through, after they had started leading in the semi-finals. Maybe not the best prediction...

2012: Predicted the Kings six weeks before they won, once the Blues started to slide a little bit. More surprising was the Eastern Conference, though, where the Devils admittedly were not predicted to do all that well.

Of course, the toughest part of checking how accurate a model is is actually coming up with an objective way of measuring that accuracy. Sure, the Kings won last year, but my model only gave them a 13.5% chance starting out. 13.5% is high relative to the other teams, but not really all that great overall. Can I really call it a win when a team with a 13.5% chance at the outset beat a bunch of teams at 5-10%?

One way to evaluate accuracy is to use a Brier score for each team, and take an average of all of them over time. A slightly modified Brier score would give a score of 1.0 to a 100% prediction that comes true, and 0.0 if it fails, with various decimal values in between based on what the given prediction was beforehand. If we compare the results from last year's model to what we would expect from pure chance, we get this:

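To be concrete, the modified score described above amounts to one minus the squared difference between the forecast probability and what actually happened - a sketch of that scoring rule, not necessarily the exact spreadsheet formula:

```python
def modified_brier(forecast, outcome):
    """Score one probabilistic prediction: 1.0 for a confident call that comes
    true, 0.0 for a confident call that fails, and values in between otherwise.
    `forecast` is the predicted probability; `outcome` is 1 if it happened, else 0."""
    return 1 - (forecast - outcome) ** 2

print(modified_brier(0.135, 1))  # ~0.25 - the Kings' pre-playoff Cup odds, scored after they won
print(modified_brier(0.135, 0))  # ~0.98 - the same forecast for a team that didn't win
print(modified_brier(1.0, 0))    # 0.0  - a "sure thing" that fails gets the worst score
```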
So that's cool: almost the whole way through the playoffs last year, my model gave more accurate estimates of who was going to win than chance would (assuming every game has a 50-50 chance of going either way). Part of the reason the score is so high near the end is that some teams had already been eliminated, and therefore got a "perfect" prediction score (even though that's a bit silly). If we remove these teams, we get something more like this:

There are three distinct dips in the graph, one at the end of each round of the playoffs. The scores dip because the predictions get more general at the start of each new round (an open-ended series is less predictable than one that's partly played out). Even accounting for all this, my model last year was still significantly and consistently above chance. Fancy!

So who knows if the Senators will actually win. It'd be pretty cool if they did, though...