Monday, October 31, 2016

Political Polls and margin of error


At the time of my writing this, it is about a week until the election. It's Trump vs. Clinton - Duel of the Century. I thought I would look into political polls as a mathematics application this week. Specifically, let's look at what is called the margin of error.

I looked at a couple of what I think are reputable websites. However, they seemed to not quite get this concept. For example, something like this was stated by a couple of sites:

A poll states that candidate A is at 52% with a margin of error of +/- 3%. This means the candidate could actually be polling anywhere from 49% to 55%.

Unless I've been lied to in my past math classes, I believe this is wrong information. This is a common misconception, but I didn't think I would find news agencies writing this.

Here is what I believe is the correct scoop. Polls usually have a confidence level. Part of the confusion is that when CNN, NBC, etc. mention their polls, they don't talk about this. Anyway, for most polls it is 90%. So, in actuality, a much truer statement is that there is a 90% chance that candidate A is between 49% and 55%. She (or he - I'll stick with "she" the rest of the way so I don't have to mention both genders each time. Why "she" rather than "he"? I flipped a coin. Seriously.) is probably in that range, but she can't be certain of that.

You can never be certain of polls. Common sense tells you that you can't have absolute certainty. If there are millions of voters in the U.S., and your survey covers a few thousand, how do you know you didn't just happen to survey only ones that are against candidate A? Yes, it's unlikely, but it could happen. So if a poll states A is ahead of B, 57% to 42%, with a margin of error of 5%, it's all over, right? No, it isn't. It's not looking good for B, but it's not all over.

We see surveys during election years a lot, but we also see them at other times without knowing it. The government's unemployment reports, bestselling book lists, and the top TV shows for the week are all based on random sampling of a relatively small group.

Students could figure out the margin of error. It goes like this:

Margin of error = z x squareroot(p(1-p)/n). The z-value is based on how confident you want to be in your poll result; you would have to look that up in a table. The value of p is your polling result and n is the number of people in your sample. (Oddly, the size of your total group, whether it is the entire U.S., the state of Oregon, or your bowling league, has nothing to do with the answer.)
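
If students want to check this with a computer, here is a minimal sketch in Python (the function name and the little table of z-values are my own choices for illustration):

    import math

    # z-values for a few common confidence levels, from a standard normal table
    Z_VALUES = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}

    def margin_of_error(p, n, confidence=0.90):
        """Margin of error for a polling proportion p from a sample of size n."""
        z = Z_VALUES[confidence]
        return z * math.sqrt(p * (1 - p) / n)

    # e.g. 56% support in a sample of 1,000 people at 90% confidence
    print(margin_of_error(0.56, 1000))  # roughly 0.0258, i.e. about 2.6 percentage points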

Common sense tells us that there is a trade-off. The more certain you want to be, the wider your interval is going to end up being. I might be able to state, from a recent survey of adult males, that I am 90% certain the average height of all adult males is between 5'7" and 5'11". On the other hand, if I want to be 99.99% certain, I might only be able to state that the average height is between 3' and 8'. You gain in certainty and you lose in precision.

Let's try one out.

We polled 1,000 people. Of those, 560 said they would vote for Candidate A. So, she is polling at 56%. We want to be 90% certain of the range her number would actually land in. Looking up the 90%, we find a z-value of 1.645.

1.645 x squareroot(.56(1-.56)/1,000) = .0258. If we round that to 2.6%, she can be 90% sure her actual number is between 53.4% and 58.6%.

Just for fun, here are some other possibilities.

Suppose we chose a confidence level of 95%:

95% corresponds to z = 1.96, so
1.96 x squareroot(.56(1-.56)/1,000) = 3.1%, giving a range of 52.9% to 59.1%

Suppose we take our original example and assume we surveyed twice as many people:
1.645 x squareroot(.56(1-.56)/2,000) = 1.8%, giving a range of 54.2% to 57.8%
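
For anyone following along in code, all three scenarios can be reproduced in a few lines using the same formula:

    import math

    # (p, n, z): polling result, sample size, z-value for the confidence level
    for p, n, z in [(0.56, 1000, 1.645), (0.56, 1000, 1.96), (0.56, 2000, 1.645)]:
        moe = z * math.sqrt(p * (1 - p) / n)
        print(f"p = {p:.0%}, n = {n}, z = {z}: +/- {moe:.1%}, "
              f"range {p - moe:.1%} to {p + moe:.1%}")
    # p = 56%, n = 1000, z = 1.645: +/- 2.6%, range 53.4% to 58.6%
    # p = 56%, n = 1000, z = 1.96: +/- 3.1%, range 52.9% to 59.1%
    # p = 56%, n = 2000, z = 1.645: +/- 1.8%, range 54.2% to 57.8%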

I was right. That was fun.







Monday, October 24, 2016

Standard Deviations and Baseball

It's World Series time, and I feel compelled to stick with a baseball theme this week. I've considered this application since I was not much more than a child. I wasn't sure how the math on it would work, and I'm still not certain, but I thought it would be worth exploring.

Batting averages are the ratio of hits to times at bat. So getting one hit in four times up to bat gives a batting average of .250.

It would make sense that the overall batting average in baseball might vary over the years. Things have changed since the game started in 1869. There used to be no night games, mostly because the electric light hadn't been invented yet. Night games have made it harder for hitters. On the other hand, they've outlawed spitballs, which has made it easier.

Does it all even out? Apparently not. There used to be quite a few batters who hit .400 or better for a season. No one has done that in the past few decades, though. I've wondered if there is a way to even things out mathematically. I've seen some attempts at this.

I found a person's website that has the major league batting average for each season. Over more than a century, it comes out to .263. The highest year ever was 1894, when it was .309. So maybe a player from that year could have their batting average dropped by .046 (.309 - .263 = .046). Similar adjustments could be made for players in each year.
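
Just to make that adjustment concrete, here is one way it might look in code (the .263 long-run average and the 1894 figure are the ones quoted above; the function name is mine):

    LONG_RUN_AVG = 0.263  # roughly the league average over more than a century

    def era_adjusted(player_avg, league_avg_that_year):
        """Shift a player's average by how far that year's league average sat above or below the long-run average."""
        return player_avg - (league_avg_that_year - LONG_RUN_AVG)

    # A .400 hitter in 1894, when the league hit .309, adjusts down to .354
    print(round(era_adjusted(0.400, 0.309), 3))  # 0.354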

Not a bad idea. I've seen other similar methods. However, I've thought that some measure of variance should come into play. I've had a theory that the standard deviation of batting averages has been decreasing over the years. So, there were more .400 hitters in the past, far above the league average, but I would guess that back then there were also more hitters far below the league average.

Why might that be? Now there are scouts going to colleges, high schools, Japan, the Dominican Republic, etc., looking for possible talent. In the early days, teams took what they could get. It wasn't necessarily the best baseball talent. Someone might come in from the coal mines, look pretty good, and you'd sign him to a contract. Over the years the process has improved.

To take a shot at proving my theory, I used a website that shows the league average for each year. I then grouped the yearly batting averages into spans (mostly twenty years each) and found the standard deviation of each span, in points (thousandths of batting average). It's not perfect, but I think it kind of backs me up. Here we go:

1871-1900  Standard deviation = 15.91
1901-1920  Standard deviation = 10.66
1921-1940  Standard deviation = 7.38
1941-1960  Standard deviation = 3.76
1961-1980  Standard deviation = 7.91
1981-2000  Standard deviation = 5.84
2001-2012  Standard deviation = 5.15

To really do this right, I probably should find the standard deviation for each individual year using every player's average, rather than using the yearly league averages. However, that seemed like a lot of work, so I settled for this. I bet there is some database out there with all the individual averages, along with the ability to adjust each year's mean and standard deviation and then adjust each player's batting average accordingly. It won't be me, but somebody should take that on.
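
In case somebody does want to repeat the calculation I did, the core of it is only a couple of lines of Python. I used the sample standard deviation and reported it in points (thousandths of batting average), which is how I read the numbers above; the actual yearly averages have to be filled in from the website:

    import statistics

    def era_stdev(yearly_league_averages):
        """Sample standard deviation of a run of yearly league batting averages, in points (thousandths)."""
        return statistics.stdev([avg * 1000 for avg in yearly_league_averages])

    # Usage: pass the yearly league averages for one era, for example
    # era_stdev([0.262, 0.259, ...])  # fill in the actual figures for 1941-1960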



Tuesday, October 18, 2016

No hitters

Sorry, but I can't help going back to baseball stats for the next couple of weeks. It is playoff time, after all.

Clayton Kershaw had a no-hitter going for a while a couple of days ago. No-hitters are pretty rare. I got to thinking that you could estimate the chances of one. Let's say a team would normally bat 0.250 against you. That is, they would get a hit every four times at bat, on average. What are your chances of a no-hitter? You need to retire 27 batters (3 in each of the 9 innings). The probability you retire the first batter is 0.75. The probability of retiring the first two batters is 0.75 x 0.75. The probability of getting through the first inning without a hit is 0.75 x 0.75 x 0.75, or 42.2%.

For the whole game, the probability of a no-hitter would be (0.75)^27 = 0.000423. Unlikely.
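
A quick check of that arithmetic in Python:

    p_out = 1 - 0.250          # probability of retiring any one batter
    p_first_inning = p_out ** 3
    p_no_hitter = p_out ** 27  # 27 batters retired without a hit

    print(round(p_first_inning, 3))  # 0.422
    print(round(p_no_hitter, 6))     # 0.000423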

You can give up walks or have batters reach on errors and still have it count as a no-hitter. I don't think we need to take that into account, though, as they do not count as official at-bats anyway. I had to think about that a bit, but I'm pretty sure I'm right on that.

Then I thought about estimating how many there should be in a season or any given period of time. I went back to 1998 because that is the last year major league baseball added teams. Since then, there have been 30 major league teams. With 162 games for each team, from 1998 to 2016 (19 years) there have been 162 x 30 x 19 = 92,340 no-hitter opportunities. If we use the above probability, the expected number of no-hitters during that time would be 92,340 x 0.000423 = 39.1.

How many have there actually been? 49. Keep that number in mind. We'll compare other outcomes to that.

So, not bad. In fact, a lot closer than I thought it would be.

The big question mark in all this, I think, is the batting average. The overall major league average is a little higher than 0.250, maybe 0.260. Doing the math again with 0.260 gives an estimate of 27.2 no-hitters (lower than the aforementioned 39.1).

But maybe we shouldn't be using the league average. You would figure the kind of pitcher who throws a no-hitter is better than average. And in fact, the list of those who have thrown no-hitters includes some of the best pitchers of the past 20 years - Jake Arrieta, Max Scherzer, Cole Hamels, Clayton Kershaw, and Justin Verlander. (And there have been some pitchers who had some talent, but also a good amount of luck on their side the day of their no-hitter.)

So maybe the right batting average to use would be 0.240 -- which gives 55.9 expected no-hitters.

Or maybe it should be 0.230 -- which gives 79.6 expected no-hitters.
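
Here is the same back-of-the-envelope estimate for a range of batting averages, using the 92,340 opportunities from above (the last line comes out a shade lower than 79.6 because the code keeps the full probability instead of rounding it first):

    OPPORTUNITIES = 162 * 30 * 19  # team-games from 1998 through 2016

    def expected_no_hitters(batting_avg):
        """Expected number of no-hitters if every opposing lineup hit at this average."""
        return OPPORTUNITIES * (1 - batting_avg) ** 27

    for avg in (0.260, 0.250, 0.240, 0.230):
        print(f"{avg:.3f}: {expected_no_hitters(avg):.1f}")
    # 0.260: 27.2
    # 0.250: 39.1
    # 0.240: 55.9
    # 0.230: 79.5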

Anyway, for those somewhat interested in baseball, this makes for an interesting math application of probability.

Monday, October 10, 2016

Running Pace

I'm running in a race in about a month. It's a half marathon, which is 13.1 miles. I think I can make it, but I'm not absolutely certain of that. Being in shape for a race is not really enough of a reason for me; the only reason I'm doing this one is the scenery. It goes across the Golden Gate Bridge. Twice, in fact. So that will be an adventure in itself. I have run two other races in the past that stand out in terms of the courses themselves. One was in Knoxville, Tennessee, which finished on the 50-yard line of the University of Tennessee stadium. The other was a half marathon in Indianapolis whose course included one lap (2.5 miles) on the Indy 500 track.

Anyway, on to the math. Since I'm not 100% sure I can even finish, I'm definitely not sure what pace I should try to run. There was a predictor in my latest copy of Runner's World magazine. It gave a way to predict your time in various races by looking at your times for shorter distances. I thought - good application.

They had predictors for the 5K, 10K, half marathon, and marathon. Since it applies to my situation, I'll just use the one for the half marathon.


  • The Workout - "Race" a 10K at 80 percent effort. 
  • The Formula - Take your 10K time in minutes (for example, a 55:30 is 55.5) and add 0.93. Multiply the result by 2.11.
  • When - Three to five weeks before race day.
  • Why - A 10K is great because it has the endurance aspect of a half marathon but doesn't require you to run too much so close to race day.
Yes, they could have condensed things quite a bit by using an equation rather than an explanation.

So the "formula" is f(x) =  2.11(x+0.93), 

To take their example of 55:30, you would have a half marathon time of f(55.5) = 2.11(55.5+0.93) = 119.067 minutes or 1 hour 59 minutes 4 seconds.

To put a little more algebra into this, we could say that we are hoping to run the half marathon in one hour 50 minutes, which is 110 minutes. What kind of 10K time would predict that?

Answer:  110 = 2.11(x+0.93), so  x = 51.203 or 51 minutes 12 seconds.
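
Both calculations fit in a pair of small functions; the names are mine, but the arithmetic is just the magazine's formula and its inverse:

    def predict_half_marathon(ten_k_minutes):
        """Predicted half-marathon time in minutes from a 10K time in minutes."""
        return 2.11 * (ten_k_minutes + 0.93)

    def required_10k(half_marathon_minutes):
        """The 10K time that would predict a given half-marathon goal."""
        return half_marathon_minutes / 2.11 - 0.93

    print(predict_half_marathon(55.5))  # about 119.07 minutes, roughly 1:59:04
    print(required_10k(110))            # about 51.2 minutes, roughly 51:12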

The other three races (5K, 10K, and marathon) have different but similar formulas, and they would all be great for Algebra I classes.

Tuesday, October 4, 2016

Morse Code

I saw something about Morse code and thought it might be an interesting topic as a mathematics application.

First, some background.

Samuel Morse was born in 1791. He attended Yale, graduating in 1810. He aspired to be a painter. I didn't realize he had this other career until I read about his paintings in David McCullough's book, The Greater Journey: Americans in Paris. Here is his portrait of President James Monroe.

He lost his wife and both parents in a three-year span. As an escape, he went to Europe. During this time he made some contacts that led to the invention of Morse code.

It didn't catch on for a few years. A U.S. congressman showed interest, and a test was done with a wire stretched from Washington, D.C. to Baltimore. Morse successfully sent the message "What hath God wrought," and the rest is history.

It relies on a series of dots and dashes, which can be communicated with electrical impulses or flashes of light. It was very important, but it began to fall out of favor with the invention of Bell's telephone, with which actual words could be used instead of a code for spelling them out. It is still used in various areas, including the signal lamps of the Coast Guard, and those without speech can use the tapping of Morse code to communicate. But for the most part, it is found in history books.

SOS, for example, is ...---... How many combinations of dots and dashes are needed to cover the alphabet? This can be found using the fundamental counting principle (if there are "m" ways to do one thing and "n" ways to do another, there are "m x n" ways to do both).

  • Using one symbol means a dot or a dash could be used - two choices.
  • Two symbols means there are 2 x 2 = 4 ways.
  • Three symbols means there are 2 x 2 x 2 = 8 ways.
  • Four symbols means there are 2 x 2 x 2 x 2 = 16 ways.

Since there are 26 letters in our alphabet, that still isn't enough. We could use five symbols, but that makes things more cumbersome. It can be done, though, by using one, two, three, or four symbols: since 2+4+8+16 = 30, that is plenty to cover the whole alphabet.

If we need more than just letters - digits, or symbols like ? and ; - we are going to need more combinations. For those, we can use five symbols. How many possibilities would that give us?

Two to the fifth power is 32, and that means we have 2+4+8+16+32 = 62 possibilities. That gives us enough for 26 letters, 10 digits, and 26 more symbols besides.
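
The counting is simple enough to verify in a couple of lines of Python (this just counts dot/dash sequences; it says nothing about which sequence goes with which character):

    # Number of distinct dot/dash sequences using exactly k symbols is 2**k.
    # Allowing anywhere from 1 up to max_len symbols, the counts add up.
    def sequences_up_to(max_len):
        return sum(2 ** k for k in range(1, max_len + 1))

    print(sequences_up_to(4))  # 2 + 4 + 8 + 16 = 30, enough for 26 letters
    print(sequences_up_to(5))  # 2 + 4 + 8 + 16 + 32 = 62, room for digits and punctuation too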