Extreme Enginerding: 2017

Monday, October 23, 2017

Edmonton Election 2017

Another election has come and gone, and apart from a handful of new faces the biggest news is all the new stats! Let's take a look:

First of all, turnout was abysmal. A total of 194,826 people voted, resulting in a voter turnout of 31.5%. The best (blue) and worst (red) areas of the city in terms of voter turnout are shown here:

The colouring of the map is a bit funky since the mean and median are rather far apart, but it gives a decent impression of what happened. In general, it looks like neighborhoods around the river valley voted more often than neighborhoods away from it, which is interesting. The massive difference between the high (66.9%) and low (9.3%) turnout is absolutely astounding to me, and might suggest fairly significant challenges with connecting with voters in certain areas (especially if they can't see the river, apparently...).

Voter turnout can also be measured in a few other ways, including attrition along the ballot. For instance, of everyone who voted, 1.5% neglected to vote for a mayoral candidate, and 1.9% neglected to vote for any council candidate. 26.3% of voters picked a Catholic schools ballot vs. 66.6% Public ballots, and even then 6.5% of Catholics and 9.8% of Publics didn't end up voting for a school trustee anyway. Oddly enough, the total number of Catholic + Public voters doesn't equal the total number of voters, so I'm not entirely sure where the remaining 7.1% of voters did for school board...

Lighter colours represent 'under votes', or people who didn't make a pick for that particular round of voting.

Don Iveson was re-elected mayor with a solid victory. His support levels in Edmonton aren't dissimilar from last election, and are shown here (darker colours meaning higher support).

Iveson's support in general seems very solid in the center of the city, and a bit weaker in the north and southeast than the rest of the city. All that being said, his support ranged from 59.5-85.9% so he has a strong mandate from every part of the city.

Finally, similar to last election, I've taken a look at which councillors' support correlates most or least with the mayor's. Last year, it turned out that a general pattern emerged where the councillors whose support most often correlated with high mayoral support also generally agreed with the mayor on votes. This year, the correlations between councillors and the mayor are:

I'd say this supports the theory from last election - last term, McKeen, Esslinger, Knack, Walters, and Henderson all voted alongside the mayor on more than 80% of non-unanimous votes, while Banga, Caterina, and Nickel (76%, 75%, and 46%, respectively) agreed with the mayor less frequently. While the mayor has had a strong track record of gaining majority support for non-unanimous bills, it does seem as though the candidates who do better in polls where the mayor does worse to tend on average to disagree with him more often than not.

That suggests that perhaps this council will be a little bit closer in voting record than the last one - the four new councillors all showed up in the middle of the pack for mayoral correlations, so likely either they are wildcards for agreement with the mayor, or as new candidates their reputation hasn't yet been tested. Only time will tell!

Monday, June 26, 2017

Edmonton City Council Gender Parity

Back in October I took a quick look at the success rates of female candidates getting into city council. In 2013, 22% of candidates were female, but only one out of the twelve council seats ended up being held by a woman. The aim of that post was to investigate some of the source of the gender disparity on council - namely whether the distribution of female candidates in different races was causing the issue, or whether there was an inherent bias against female candidates.

Ultimately, I determined that the relative lack of successful female council winners was more likely due to distribution of candidates across races than individual bias - without accounting for incumbency, there was no evidence of anything other than relative equal chances of winning between female and male candidates (i.e the number of female winners since 2004 is more or less what you'd expect assuming all candidates are equally likely to win).

That was a pretty positive sign, as it suggests that the biggest factor holding back a demographically-balanced council is the availability of under-represented candidates to run (which is totally outside of the scope of this blog to discuss), and perhaps more importantly, the avoidance of clumping of under-represented demographics into the same few races.

One of the biggest issues with the 2013 election was that five wards had no women running at all, and half of all women were clustered into two ridings. This drastically reduced the expected number of women into council, regardless of the relative proportion of candidates who put their names forward.

So with all that said, I've been keeping track of candidates for the 2017 civic election which are being tracked at Daveberta. For each candidate, I've tried to ascertain their gender in order by how they refer to themselves (political candidates love speaking in the third person), or how they're referred to in third party posts, and if all else fails by name and presentation assumptions. If you notice any errors, please let me know.

(Last updated September 19, 2017)

Based on the current 71 candidates, 23 are female and 48 are male (female ratio of 32.4%, up from 22% in 2013). However, based on the distribution between wards, an expected 3.89 seats will be won by female candidates, which could be considered a relatively inefficient allocation of seats based on the ratio of candidates. 2 wards have no women running at all.

Overall, it's most likely that the number of female councillors after the election will be between 2 and 6 (90% confidence).

Edmonton Council (32% female candidates)

Edmonton Catholic School Board (65% female candidates)

Edmonton Public School Board (39% female candidates)

Now that the official nomination deadline has passed, these numbers ought to be pretty official! All in all, women running for city council are still a bit poorly distributed, leading to an expected under-representation of about 0.15 seats. On the other hand, men tend to be poorly distributed in the school board races, leading to expected over-representations of 0.28 and 0.86 seats for Catholic and Public boards respectively. All in all, the candidate distributions are fairly balanced though, and this is certainly a fairer election gender-wise than 2013.

Tuesday, May 9, 2017

Next Game Wins?

(Subtitle: Which Game Should You Win? Part 3)

Three years ago, my friend Andrew pitched in to the blog and asked which game in the playoffs was most worth winning. The results were a bit inconclusive, but from it he developed a database of all playoff outcomes since 1943, so a year later I looked at the dataset again and developed Markov-style chains of playoff odds based on different positions in the playoffs.

Now that it's playoff season again, people are naturally interested more than normal in hockey and I recently overheard someone comment that, though a series was currently at 2-1 for wins, the next team to win was undoubtedly going to win the series.

Good thing I have this handy database of all playoff outcomes ready, because that immediately intrigued me as to how likely it actually is that, at any given point, the next team to win a game will win the series overall. This is perhaps another way of asking the same question as before - how much does this upcoming playoff game matter to the grand scheme of things?

Before looking into the historical data, though, it's worth doing the math to see what the odds would be if the human element were removed (with all games having a 50/50 chance of going either way, and all games being independent). Obviously, if a best-of-seven playoff series is tied at 3-3, then the next game winner is guaranteed to win the series, so that's an easy starting point.

From there, it's not too hard to work backwards to figure out the rest of the odds. If a series is at 3-2, then there's a 50% chance that the leading team wins (which would give them the series win, and a 100% chance therefore of winning the series), and a 50% chance that we get to a 3-3 position, where the chance of the trailing team being the overall winner is again 50%. Overall, that makes the chance that the next game winner will be the series winner (50%*100%)+(50%*50%)=75%.

If we continue this way, then we can generate this table of values. For all following graphs, the 'home team' is the team that has home town advantage for the first two games:

So what's not surprising here is that the odds that the next team to win will be the series winner are always above 50%. That makes sense, because no matter what the position is beforehand the winner is improving their overall odds of winning the series. What's more interesting is how little games tend to matter when the series is lopsided.

Of course, games aren't all independent or aren't all 50/50 toss-ups. Historically, home teams win 54.5% of games, so let's see what happens if we recreate this table with that factored in. It's a bit more complicated, but essentially the same analysis as before, to get this table:

Here we start to see the effects of the playoff structure and the pattern with which it allocates home games to different teams. For instance, when the original home team is up 3-0, the upcoming game almost doesn't matter at all, but the situation isn't quite the same if the original away team is up 3-0. Similarly, both 3-2 game situations have different values. This can be perhaps more easily rationalized - if the original home team is up 3-2, then the upcoming game is going to be in their opponent's home town, which makes it more likely that that other team will win, but if they do then it's tied coming back home, so that's less of a big deal. On the other hand, if the original away team is leading 3-2, they're more likely to win this upcoming game 6, and can lock the series up right there.

Of course, this is all fun and games from a theoretical point of view, but what's actually been happening in real playoff series? Here we go:

This is definitely more interesting! Here we have a clear outlier from the theoretical projections from before, where the 'least important' game is game 5 when the original away team is up 3-1. At this point, the original home team would be playing back at home, but would be down by such a significant deficit, resulting in a situation where they end up with a fairly high 'last hurrah' win rate, before ultimately losing the series 2-4.

On the other hand, there's a surprisingly high predictive score for whoever wins the game after the original home team gets up 1-0, at 74% (8% higher than what you'd expect in a coin toss scenario). I imagine this indicates that the original home team is likely to win their first game, and that if the original away team can't bounce back then the series is likely sorted out by that point (at least, in harder-to-quantify matters than you'd expect).

So the answer to the question 'which playoff game is the most important' remains a solid "it depends", but now you have three different ways of looking at the question. Use them wisely, and enjoy the 2017 playoffs!

Monday, February 6, 2017

How often does the best team win the championship?

Imagine we have a four team single-elimination tournament. Team A is good enough that you'd expect them to win about 80% of all games, Team B ought to win about 60%, Team C should win 40%, and Team D ought to win about 20% of games (against random opponents). If we seeded a single elimination tournament with these teams, it could look something like this:

Given the information above, how likely is it that the best team in this tournament, Team A, ends up being the winner? In other words, how effective is this tournament structure and seeding system at determining the best team out of the four?

The first tool we'll need for this is the Log5 formula - given the true winning percentages of two teams, this formula tells you the odds of a given team winning. So for instance, Team A playing Team D is pretty lopsided, and Team A has a 94.18% chance of winning that game based on this formula. Similarly, Team B and Team C are much closer in relative skill, so Team B only has a 69.23% chance of winning that game.

Based on all of this, we can come up with the relative chances of any of the four teams winning the tournament overall:

So in our hypothetical situation, this single-elimination four-player seeded tournament with known team skillsets resulted in the best team, Team A, winning 72.26% of the time, and the worst team winning 1.06% of the time. Not shabby.

However, what if you didn't know the skill levels of the teams going into the tournament? How confident could you be that the eventual winner of the tournament was, in fact, the best team when they signed up? One way to determine this is by running a Monte Carlo simulation - let's sign up four teams of random skill values, run them through a tournament exactly the same way as we just did with our sample teams, and see who the winner is. Then let's do that 10,000 times, and see how often the best team wins.

The results are interesting: with randomly drawn skill values for all teams (mean= 0.5, stdev=0.13 [see below**]), we'd expect the winner of the tournament to be the strongest team only 44.3% of the time. About 10.5% of the time, the weakest team in the tournament would end up winning the whole thing!

So is a single elimination tournament a particularly good way of determining the best overall team from a pool of four teams? Probably not. What happens if we up the ante, and have a double elimination tournament? Double elimination tournaments are exactly what they sound like - any given team needs to lose twice before being eliminated. They tend to look something like this:

This sort of format ought to improve the chances of the best team winning, as a single unlucky (and unlikely) loss won't eliminate them too early. If we run the same sort of analysis, but with a double elimination tournament, we end up with the winner being the strongest team 51.1% of the time, and the winner being the weakest team only 7.7% of the time. A reasonable improvement all in all.

Unfortunately, this modest increase in chances of determining the truly best team is offset by the increased length and uncertainty in the tournament. A single elimination tournament needs three games total, whereas a double elimination tournament needs either six or seven games, depending on how the sixth one goes. This is also annoying to schedule and sell tickets for, as organizers have no way of knowing if the sixth game will be the exciting final or not.

Expanding a bit, we can do the same analysis with single, double, and triple (!) elimination format tournaments for tournament of as many teams as we want. Before we continue, though, I'll mention that for a 32 team single-elimination randomly-seeded tournament, the odds that the winner will have been the best team overall are 22.0%. Keep that in mind.

So why talk about these tiny little tournaments? It's to get you ready for the real deal: professional sports.

Professional sports leagues generally tend to fall into a regular season and playoffs, where the regular season is used to seed teams into some sort of order and filter out the best, who end up playing in a tournament bracket of one style or another. What happens if we do the same sort of analysis for each major (and some minor) sports league?

National Basketball Association

Overview: 30 teams play 82 games each in the regular season. Teams are sorted into conferences and divisions. The playoffs are a 16-team single elimination tournament bracket, where each round of the playoffs consists of a best-of-seven games series for elimination. Teams are seeded within conferences, and the bracket is fixed at the end of the regular season. This gives us:

Overall odds of the NBA Championship being won by the season's best team: 45.9%.

Pros: Best-of-seven series in playoffs reduces variability in results. Not seeding teams based on division standings, and only looking at conference standings, reduces the chances of weaker teams getting into the playoffs by virtue of leading weaker divisions.

Cons: Relatively long season for regular season, large range in potential length of playoffs.

National Hockey League

Overview: 30 teams play 82 games each in the regular season. Teams are sorted into conferences and divisions, and are more likely to play teams inside their divisions and conferences than outside of them. The playoffs are a 16-team single elimination tournament bracket, where each round of the playoffs consists of a best-of-seven games series for elimination. Teams are seeded within divisions such that the top three teams of each division are guaranteed a spot in the playoffs, and the highest-performing two teams of the conference that remain get in as wildcards. All of this results in the following distribution:

Overall odds the Stanley Cup winner was the season's best team: 45.4%.

Pros: Best-of-seven elimination in playoffs reduces variance and increases the chances of better teams triumphing.

Cons: Relatively long season for regular season, large range in potential length of playoffs.

Fun fact: Only real difference between NHL and NBA results is the seeding into the playoffs and how wildcards are handled, and that results in hardly any change at all.

Major League Baseball

Overview: 30 teams play 162 games each in the regular season. Teams are sorted into leagues and divisions, are are more likely to play teams inside their divisions and leagues than outside of them. The playoffs consist of the winners of each division and the two wildcard runners-up from the conference, and is a wonky sort of 10-team single elimination tournament where the first round are two single game wild-card playoffs, followed by four best-of-five division series, then two best-of-seven league winner series. Finally, the winners of each league play each other in a best-of-seven series to determine the winner. From this, we get:

Overall odds the World Series winner was the season's best team: 45.4%.

Pros: Teams are more likely to be correctly seeded heading into playoffs due to extensive regular season. Fewer teams in playoffs makes it less likely for weaker teams to get lucky.

Cons: Short playoff season with sudden-death games and best-of-five series increases variance.

Fun fact: If all rounds (including wildcard) in the MLB playoffs were best-of-seven series, the odds that the winner was actually the best team increase to 46.4%.

National Football League

Overview: 32 teams play 16 games each in the regular season. Teams are sorted into conferences and divisions, and are more likely to play teams inside their divisions and conferences than outside of them. The playoffs are a true 12-team single-elimination tournament, where each division leader and two runner-up wildcards are seeded from each conference. This results in:

Overall odds the Superbowl winner was the season's best team: 28.2%.

Pros: Short, fixed playoff schedule is predictable.

Cons: Single elimination sudden death games greatly increase the chances of weaker teams winning by chance and eliminating stronger teams. Relatively short season also doesn't guarantee accurate seeding of teams heading into playoffs.

Fun fact: Before, I mentioned that the chance of the winner of a randomly-seeded 32-team single-elimination tournament being the best team was 22.0%. That means the NFL season format isn't really all that much better than just having one six-week March Madness style showdown each season.

Some more minor tournaments that are still near and dear to my heart:

Canadian Football League

Overview: Nine teams play 18 games each in the regular season. Teams are sorted into two divisions. The playoffs are a single-elimination 5-team tournament, where the highest-ranked team in each division gets a bye to the division finals. Teams can cross-over into other divisions if the fourth-place team in one division has more points at the end of the regular season than the third-place team in the other division. This gives us:

Overall odds the Grey Cup will go to the season's best team: 38.0%.

Pros: Short, fixed playoff season is predictable. Still somehow have a longer regular season than the NFL. Higher ranked teams getting a bye to the division finals helps out the stronger teams.

Cons: Single elimination playoff format, which includes more than half the league, leads to a bit of a crapshoot. It's disappointing than the NHL with over four times as many teams is better suited to finding its best team each year.

Fun fact: Again, this isn't substantially better than just running a 9-team randomly-seeded single elimination tournament right at the beginning of the year. It's more fun, though.

Curling

Overview: Major curling tournaments involve 12 teams, who play each other once each in a round robin. The top four teams are seeded into a Page playoff system, where the top two teams are in quasi-double elimination system, and the remaining teams are in single elimination. This gives us:

Overall odds the winner was the tournament's best team: 37.3%.

Pros: Fixed playoff system is short and predictable. Page playoff system gives a bonus to the teams who perform best after a fair and balanced round-robin.

Cons: Single elimination format of playoffs increases variability.

Fun fact: Curling is fun and you should try it.

So there you go. Unsurprisingly, sports leagues with longer regular seasons and best-of-seven playoff series are better suited for determining the actual best teams each season, whereas leagues with shorter seasons and single elimination tournaments as less well-suited. Now you have numbers to show for it, at least!

Thursday, February 2, 2017

Redistricting in Alberta

Every eight to ten years, the Alberta Electoral Boundary Commission meets up to reconfigure the riding boundaries for upcoming elections. This is a fairly important part of our democracy, as cities are often growing, people tend to move around a lot, and keeping our boundaries updated to reflect current population distributions is a good way to ensure that everyone is equally represented in our legislature.

Fortunately enough, Alberta isn't like many US states where it is elected politicians themselves who decide where these boundaries will be drawn, which can tend to lead to gerrymandering as I've discussed before. So although in Alberta districts are determined by a neutral committee and appear to avoid obvious signs of manipulation for political purposes, they aren't really all that good at making sure everyone's vote counts the same no matter where they live. For instance, the Electoral Boundaries Commission Act specifies that the maximum population deviation from the average per riding is 25%. As well, provided the area is sparsely populated enough, up to 4 districts can have populations that are as much as 50% below the average population of a riding in the province.

This led to a situation where, based on the 2011 census, the largest district had a population of 51,800 people, more than twice the size of the smallest district at 23,050. Someone living in Dunvegan-Central Peace-Notley has nearly twice the voting power of the average Albertan when it comes to provincial elections, at a population a whopping 45% below provincial average. (As a side note, this still isn't as bad as on the federal stage, where Labrador is 73% below average, and five times less than the highest populated riding in Brantford-Brant, but that's a different story.)

So, as an infomercial might say at this point, "There must be a better way!"

The Electoral Boundaries Commission is accepting submissions now while they begin their redistricting process, and this seems as good a time as any to determine a better solution. What is a fair way to split the province up into 87 sections each with the same population?

One of the coolest solutions is to use the shortest splitline algorithm. As explained by CGP Grey, the shortest splitline algorithm is a repetitive process that searches for the shortest line that splits an area perfectly in two by population. Each half is then split again with the shortest line that produces equal halves, until ultimately we stop when we've gotten the desired number of sections split up, which are necessarily of exactly even population.

So lets try this for Alberta. The first thing we need is a population distribution of Alberta, which Statistics Canada helpfully has lying around on their website. It looks like this:

This is Alberta broken up into 5,711 census dissemination areas based on the 2011 census.

Next up, we would normally find the shortest line that crosses Alberta in such a way that exactly half of the population is above the line, and exactly half is below the line. Since Alberta has 87 districts, though, we actually want to find the shortest line that has 44/87 of the population above it, and 43/87 of the population below it. In my (slightly optimized) model, that looked like this:

Then we split each half again. The top half is an even number, so we can split it in two easily, whereas the bottom has to be split into 22/44 and 21/44 segments. That gives us this:

And so on and so forth until we've split all of Alberta up into equal segments. The final result of this ends up being this lovely stained glass window:

Of course, things can't always be perfect no matter how hard you try, so this is a solution for Alberta that has a maximum population in each riding of 0.38%. The largest riding has 42,052 people in it, and the smallest has 41,752. This is a solution to split up Alberta that has a maximum voter variance that is 118 times smaller than we have now, and a coefficient of variation that is 68 times smaller.

Also, just because the map was drawn with straight lines doesn't mean it has to stay like that. If we go back to our census dissemination area shapes from Stats Canada, we can convert an Edmonton distribution from this:

To this:

Which is actually starting to look pretty reasonable. Neighborhoods are kept together, and the areas are looking relatively compact.

The shortest splitline method is an objective and fair way to distribute votes such that everyone's votes are counted equally. I was able to redistrict all of Alberta using Excel - no fancy programming skills are needed. There's no reason that we can't have redistricting being as boring as updating census data and having a computer spit out a single solution each time we need it.

That being said, there are still some objections people could have with it - for instance, it doesn't necessarily give a hoot about municipal boundaries. Take Red Deer for example: after applying the algorithm to Alberta, Red Deer got sort of unfortunately split into four districts, each of which includes substantial amounts of surrounding countryside:

Oh no.

This would have people in Innisfail voting alongside southwest Red Deer, and people in Saskatchewan River Crossing voting alongside west Red Deer. That's probably a bit messed up. In cases like this, I think it's probably fair to use the splitline algorithm to get to a starting point, and then massaging the districts as needed to make sure they make a bit more sense. Because the algorithm got within a 0.38% maximum population variance to start with, it is likely quite straightforward after to swap around some areas as needed to keep the popluation variance still small. For instance, Public Interest Alberta recommends a maximum population deviation of 5%, which I'd suggest is reasonably easy to achieve if we're starting from a point where our deviation is essentially 0%.

So if I've convinced you that using algorithms to redistrict our population can lead to fairer, objective, even distributions of our political districts, and that those are things worth having in our democracy, head on over to the Commission's website and leave them a submission before February 8th!

Extreme Enginerding

Labels