Combating Data Manipulation III: Polling

Polls are incredibly important and widely used, particularly during election season, and many people take those numbers to heart without thinking about where they came from.  People assume that when a poll says Obama’s approval rating is 47%, exactly 47% of the US population supports Obama.

A fascinating subset of science has been dedicated to examining the error in polls, especially the portion of error that is ignored or misrepresented.  There are a few parts to error in polls:

(1) Margin of error: this is what polls mean when they report the error in the poll.  Whenever a poll is conducted, it comes with the assumption that the people sampled in the poll represent the entire population, which is never exactly true.  The margin of error is the error that comes from sampling only part of the population: the smaller the sample, the less accurately it reflects the whole.  This can be statistically measured and is almost always mentioned when the poll is presented.

(2) Systematic error (as described by Charles Seife in Proofiness; Wayne Journell refers to this as sampling bias): this is very rarely brought up and can alter poll data a huge amount.  The best example described by both Charles Seife and Wayne Journell involves the 1936 presidential election of Franklin Delano Roosevelt vs. Alf Landon.  The Literary Digest had become famous for having the most accurate polls in the country (they claimed their error was “within a fraction of 1 per cent” (Seife 105)), and before the election they published a supposedly conclusive poll saying Alf Landon was going to win.  They had a massive sample size, but their result was incredibly wrong.  Why was it so wrong?  First, it has to do with what Journell refers to as convenience sampling, when pollsters contact the part of the population that is easiest to get in touch with.  For the 1936 poll, the Literary Digest found the people it polled through telephone listings and car registration data.  Unfortunately, this occurred during the Great Depression, when a huge portion of America did not have cars or phones, and those who did were richer and tended, at that time, to vote overwhelmingly Republican.  Secondly, the poll fell victim to volunteer bias: the people with the strongest opinions tend to respond to polls because they want their voices heard.  In this election, Roosevelt was the incumbent, so the people who supported him were happy with the state of politics as it was and tended not to feel the need to respond.  Those unhappy with the state of the country contributed more to the poll.  Thus, the Literary Digest had a teeny margin of error, but a systematic error so huge that it completely threw the poll off.
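To make the difference between the two kinds of error concrete, here is a minimal Python sketch.  Everything in it is an assumption for illustration: the electorate is made up (it is not the real 1936 data), the names `margin_of_error` and `poll` are my own, and the margin of error uses the standard textbook approximation for a simple random sample at 95% confidence.

```python
import math
import random

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error for a simple random sample."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

def poll(voters, sample_size, rng):
    """Share of a random sample of `voters` that supports candidate A (1 = yes)."""
    sample = rng.sample(voters, sample_size)
    return sum(sample) / sample_size

# Hypothetical electorate (made-up numbers, NOT the real 1936 data).
wealthy = [1] * 12_000 + [0] * 28_000          # 40,000 voters, 30% support A
everyone_else = [1] * 108_800 + [0] * 51_200   # 160,000 voters, 68% support A
electorate = wealthy + everyone_else
true_support = sum(electorate) / len(electorate)

rng = random.Random(1936)  # fixed seed so the run is repeatable

# Huge sample, but drawn only from the people who are easy to reach.
biased_result = poll(wealthy, 20_000, rng)
# Small sample, but drawn from the whole electorate.
fair_result = poll(electorate, 1_000, rng)

print(f"True support for A:      {true_support:.1%}")
print(f"Biased poll (n=20,000):  {biased_result:.1%} +/- {margin_of_error(20_000):.1%}")
print(f"Fair poll (n=1,000):     {fair_result:.1%} +/- {margin_of_error(1_000):.1%}")
```

The biased poll reports the smaller margin of error because its sample is bigger, yet it lands much farther from the true value.  The margin of error only measures the sampling part of the error, not the systematic part.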



Another example involves the 1948 presidential election between Dewey and Truman.  Gallup's polls showed Dewey with such a significant lead that Gallup, confident in his results, stopped polling weeks before the election.  However, the result was again incorrect; Gallup assumed that the undecided voters would vote just as the decided voters did.  This turned out not to be the case, and the result of the election was very different from the result of his poll.



It is important to note that polls are also subject to pure randomness; if a poll samples a population one hour, and then again the next hour, there is a good likelihood that the results will be different (perhaps because different people were reached, or because some people changed their minds in that hour).  It is also almost guaranteed that if you sample two different groups of people, even if they are equally varied and diverse, you will get different results.  This is just due to random factors that affect these polls and, while these are usually not strong enough to completely throw off a poll (like in the previous examples), they do play a role in error.
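This randomness is easy to see in a simulation.  The sketch below is only an illustration under my own assumptions: a hypothetical "true" approval rating of 47%, 1,000 respondents per poll, and perfectly random sampling (which real polls never quite achieve).

```python
import math
import random

TRUE_SUPPORT = 0.47   # hypothetical "real" approval rating
SAMPLE_SIZE = 1_000   # respondents per poll
NUM_POLLS = 1_000     # how many times the poll is repeated

def run_poll(rng):
    """Ask SAMPLE_SIZE random people; return the share who approve."""
    approvals = sum(rng.random() < TRUE_SUPPORT for _ in range(SAMPLE_SIZE))
    return approvals / SAMPLE_SIZE

rng = random.Random(47)  # fixed seed so the run is repeatable
results = [run_poll(rng) for _ in range(NUM_POLLS)]

# Standard 95% margin of error for this sample size (about 3 points).
moe = 1.96 * math.sqrt(0.25 / SAMPLE_SIZE)
within = sum(abs(r - TRUE_SUPPORT) <= moe for r in results) / NUM_POLLS

print("First five polls:", [f"{r:.1%}" for r in results[:5]])
print(f"Lowest: {min(results):.1%}, highest: {max(results):.1%}")
print(f"Share of polls within the margin of error: {within:.1%}")
```

Nothing about the population changes between polls, yet the results bounce around, and a small share of them land outside the stated margin of error purely by chance.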

Here’s another example (again from Seife’s book) that demonstrates response bias, which is when what people say to pollsters does not accurately reflect what they believe or what is true.  In 2007 the CDC conducted a poll about the sex lives of Americans; the results said that the average man has sex with seven women, but the average woman has sex with four men.  But…that makes no mathematical sense.  Assuming that they only polled heterosexual people, and that there are about as many women as men, the female average cannot be different from the male average, because every partnership adds one partner to a man's count and one to a woman's.  Even though they really calculated the median and not the mean, the probability of the two numbers being different is still very small.  Current analysts suggest that the discrepancy could be due to societal pressures causing people to change their answers from the true number.
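The claim about the averages can be checked with a small simulation.  This is a sketch under simplifying assumptions of my own: a closed group of 500 men and 500 women, only male-female partnerships, and partnerships assigned at random.

```python
import random
from statistics import mean

rng = random.Random(2007)  # fixed seed so the run is repeatable
NUM_MEN = NUM_WOMEN = 500  # hypothetical closed population

# Build 2,000 distinct (man, woman) partnerships at random.
partnerships = set()
while len(partnerships) < 2_000:
    partnerships.add((rng.randrange(NUM_MEN), rng.randrange(NUM_WOMEN)))

partners_per_man = [0] * NUM_MEN
partners_per_woman = [0] * NUM_WOMEN
for man, woman in partnerships:
    # Each partnership counts once for the man and once for the woman.
    partners_per_man[man] += 1
    partners_per_woman[woman] += 1

male_mean = mean(partners_per_man)
female_mean = mean(partners_per_woman)
print(f"Mean partners per man:   {male_mean}")
print(f"Mean partners per woman: {female_mean}")
```

Both means come out to 2,000 / 500 = 4, no matter how the partnerships are distributed, because both sides are counting the same set of partnerships.  A gap like seven versus four in the means would have to come from how people answer, not from the arithmetic.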

While a lot of the previous example is speculation on the part of analysts (we can’t really show that people were lying, although in 2003 some researchers apparently showed that if you ask people how many sexual partners they have had and then ask them again once they are attached to a lie detector machine, the number changes), it does bring some interesting concepts to light.  Polls are definitely not foolproof, and these examples illustrate different ways that polls can be inaccurate despite having really good sample sizes (and thus a small margin of error).  A pretty funny story Seife mentions involves an Associated Press poll on what Americans expected for 2007.  AP published two articles on this one poll.  Headlines?  “Americans optimistic for 2007” and “Americans see doom, gloom for 2007.”  In both articles, the margin of error of 3 percent was mentioned at the bottom along with the sample size (because the Associated Press made the false assumption that the overall relevant error can be calculated from just the sample size).  This just shows that polls can be really deceptive and sometimes outright wrong, and it is very important to understand the parts that go into formulating a correct poll so we know what questions to ask next time we encounter an article based on poll results.

I encourage everyone to take a look at the links in the works cited section (it’s always important to verify information, also they’re pretty interesting) and, if this stuff has interested you, pick up Seife’s Proofiness.

Works Cited

Wayne Journell, Interdisciplinary Education: “Lies, Damn Lies, and Statistics: Uncovering the Truth Behind Polling Data”

Robert W. Pearson, Statistical Persuasion

Charles Seife, Proofiness

Leonard Mlodinow, The Drunkard’s Walk: How Randomness Rules Our Lives

Combating Data Manipulation II: Error

Every measurement has some error associated with it.  If anyone read this statement, they would most likely say “I knew that,” and most people do understand that on a basic level, measurements contain error.  However, the extent to which this error affects our lives is not widely understood.  Error has wide implications in almost every aspect of life, but particularly in polling and voting (things that will be discussed in more detail in the next post).  This post is dedicated to simply understanding the mathematics of error.

So what is error?  Error is a measure of the uncertainty inherent in a measurement.  Every measurement has some amount of uncertainty.  Think about looking at a ruler: you can measure 4.3 inches, but you can’t be sure whether it is 4.32 inches or 4.33 inches because the ruler does not have tick marks fine enough to show that.  That gap is the uncertainty in the measurement.

Understanding the nature of error and data requires knowing the findings of Daniel Bernoulli in the late 1700s.  He took sets of data from completely different contexts (e.g. astronomy data and archery data) and hypothesized that the distribution of data would be very similar.  This, in fact, turned out to be the case; the “real” value always fell somewhere in the middle, with increasingly smaller amounts of data as you get farther away from the center.  This is known today as the “bell curve,” or a normal distribution of data.  In this bell curve, the center of data represents the mean.

So, how do these findings of Bernoulli (along with other mathematicians, such as Abraham De Moivre) relate to error?  Well, the shape of the normal distribution actually reflects the error.  The standard deviation is a measure of how spread out the data is: if the bell curve is very spread out, the standard deviation, and thus the error, is much higher.  Ratings of movies, politicians, restaurants, and even grades are all subject to error; that is, when the “measurement” is made and then repeated, there is a high chance that you will get different numbers, and standard deviation is a mathematical measure of how varied those numbers are.
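Here is a quick illustration of that idea in Python, using two made-up sets of repeated measurements of the same object.  The numbers are hypothetical; I chose them so that both sets have the same mean but very different spread.

```python
from statistics import mean, stdev

# Two hypothetical sets of repeated measurements (in inches) of one object.
careful_ruler = [4.2, 4.3, 4.3, 4.4, 4.3]   # readings cluster tightly
sloppy_ruler = [3.3, 4.3, 5.3, 3.8, 4.8]    # readings are spread out

for name, readings in [("careful", careful_ruler), ("sloppy", sloppy_ruler)]:
    print(f"{name}: mean = {mean(readings):.2f}, "
          f"standard deviation = {stdev(readings):.2f}")
```

Both sets center on the same value, so the mean alone cannot tell them apart.  The standard deviation is what shows that a single reading from the second set deserves much less trust.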

For example (from Mlodinow’s book, cited below), a few years ago it was announced that the unemployment rate in the United States was 4.7%, and then a few months later it had changed to 4.8%.  News sources declared that unemployment was slowly rising.  However, this is not necessarily true; these measurements are subject to error, and therefore there is no way to tell whether that 0.1% was due to a true rise in unemployment or just due to error.  Mlodinow states that if the unemployment rate were measured at noon and then remeasured at 1 PM, the number would likely be a little different due to error, but that does not mean that unemployment rose in an hour.
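To get a feel for how large that error could be, here is a rough back-of-the-envelope sketch.  The sample size of 60,000 is a number I am assuming purely for illustration, and the calculation treats the two measurements as independent simple random samples, which a real government survey is not.

```python
import math

def standard_error(rate, sample_size):
    """Standard error of a proportion from a simple random sample."""
    return math.sqrt(rate * (1 - rate) / sample_size)

SAMPLE_SIZE = 60_000                    # hypothetical survey size (assumption)
rate_before, rate_after = 0.047, 0.048  # the two reported unemployment rates

# Standard error of the DIFFERENCE between two independent estimates.
se_diff = math.sqrt(standard_error(rate_before, SAMPLE_SIZE) ** 2
                    + standard_error(rate_after, SAMPLE_SIZE) ** 2)
margin = 1.96 * se_diff                 # approximate 95% margin on the change
change = rate_after - rate_before
is_distinguishable = abs(change) > margin

print(f"Reported change:      {change:.2%}")
print(f"Margin on the change: {margin:.2%}")
print("Distinguishable from noise?", is_distinguishable)
```

Under these assumptions the margin on the change is larger than the 0.1% change itself, so the two numbers cannot be told apart from sampling noise alone.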

In the next post, I will go into more detail about how the idea of error applies to polls and voting!

Works Cited

Leonard Mlodinow, The Drunkard’s Walk: How Randomness Rules Our Lives

Per Request: How do bones regenerate?

I got a request to write a post on how bones regrow or grow for the first time, so here goes:

There are 3 different types of cells in bone.  Osteoblasts are responsible for making and putting down new bone, osteoclasts are responsible for taking up and getting rid of old bone, and osteocytes are osteoblasts that have been trapped inside the new bone and then work to maintain the bone after it has been constructed.  So, here’s what happens:

The osteoclasts drill through bone tissue to clean it out, and then osteoblasts follow behind, depositing new bone tissue along the inside walls of the tunnel and creating new tissue that surrounds either a blood vessel or a nerve.  When a bone is broken, the first thing osteoblasts lay down is what’s called “woven fiber” bone tissue; this tissue is much faster to create and lay down (providing quick structure) but is not as strong.  Later, because it takes more time, osteoblasts replace that woven fiber tissue with “lamellar” bone tissue, which is nicely organized in sheets and is very strong.

So, a sensible follow-up question would be: how does bone grow?  A bone begins as cartilage, and then layers of bone begin to develop around it while cells destroy and take up the old cartilage.  Then, cells begin to move to the ends of the bone towards the “active growth centers,” where they stretch out the bones.  After the bone has finished growing, the cells mineralize the bone to make it tough and strong.  Once a bone mineralizes, it permanently stops growing.  So, when humans stop growing, that is because all of our bone has been mineralized.

Comment if there are any questions, but that explains the basics about how bone tissue can regenerate and how bones can be fixed.

Combating Data Manipulation I: Common Mistakes in Data Presentation

The first part of combating data manipulation is recognizing when a graph or chart or data set is misleading in some way.

(1) To understand the first way this can happen, we need to understand what “measurable” means.  Something that is measurable has a unit (e.g. length can be measured in inches, weight in grams, etc.).  Many advertisements and commercials make use of numbers and measurements that functionally mean nothing, because they are numbers with no units.


So Pantene promises two times more shine in hair; how is shine measured?  What does that mean?  This number “two times” does not mean anything, but it sounds much more persuasive than saying “you’ll get some more shine in your hair.”



Here is another example.  Maybelline promises a 65% lift to the eyelashes, but how do we measure lift in eyelashes?  This is persuasive, but definitely not meaningful.

These numbers with no meaning are used purely for persuasion, and they are the first, and probably the easiest, kind of numerical manipulation that we can spot.

(2) The second way people manipulate data to persuade is known as “cherry-picking.”  This is when certain data is presented, but other data that does not support the point is ignored.  Charles Seife in Proofiness presents an example in George Bush’s descriptions of the apparent success of the No Child Left Behind Policy.  He claimed that students were in general improving their test scores, ignoring data from 12th grade students whose test scores actually decreased.  Secondly, those scores that did increase have actually been increasing slowly for decades, way before the passage of the legislation.  Bush cherry-picked data that supported his point and his policy.

(3) The third way is comparing data that should not be compared (e.g. comparing data from different years when the conditions were very different).  Seife uses the example of the Blue Dog Coalition criticizing the Bush Administration for borrowing a lot of money from foreign nations.  They claimed, “Throughout the first 224 years (1776-2000) of our nation’s history, 42 U.S. presidents borrowed a combined $1.01 trillion from foreign governments and financial institutions according to the U.S. Treasury Department. In the past four years alone (2001-2005), the Bush Administration has borrowed a staggering $1.05 trillion.”  While the comparison of numbers sounds incredibly convincing, if you think about it, the value of the dollar in 1776 was vastly different than the value of the dollar today or in 2005, and therefore this comparison does not make a whole lot of sense.  This is a common problem with the comparison of test scores from different years; Michael Winerip in The New York Times describes the mayor of New York boasting about the increase in test scores for fourth graders, when many of the educators actually admit that the test just got easier.  Therefore, the comparison across years is inaccurate.

(4) Finally, the last type of data manipulation is just changing the graph itself.  If one axis of the graph covers only 0-1 (of whatever unit the graph is displaying), the same data will look vastly different than if the axis went from 0-10 or 0-50.  Making sure that the axes and type of graph accurately display the data is extremely important.
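The effect of the axis range in (4) can be put into numbers.  In this small sketch, the function name `share_of_axis` and the change of 0.5 units are my own hypothetical choices; it simply computes how much of the axis height the same change takes up under each of the three ranges mentioned above.

```python
def share_of_axis(change, axis_min, axis_max):
    """Fraction of the axis height that a given change occupies."""
    return change / (axis_max - axis_min)

CHANGE = 0.5  # hypothetical change in the data, in the graph's units

for axis_max in (1, 10, 50):
    share = share_of_axis(CHANGE, 0, axis_max)
    print(f"Axis 0-{axis_max}: the change fills {share:.0%} of the graph's height")
```

The data is identical in all three cases.  Only the choice of axis decides whether the change looks dramatic or barely visible.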

Up Next: an in-depth discussion of error (especially as it relates to polls and voting)!

Works Cited

Charles Seife, Proofiness

Combating Data Manipulation: Introduction

After reading a few books on data presentation, statistics, and probability theory, I have decided to create a 10-post series about things everyone should know about data.  As Charles Seife mentions in the first few paragraphs of his book Proofiness, putting numbers to an opinion or thought tends to make people believe the opinion more than they would without that mathematical or numerical backing.  Seife, in his introduction, uses Joe McCarthy as an example of this: when he declared that the State Department harbored communists, his argument got even more persuasive when he declared that he could name 205 of them.  Afterwards, that number fluctuated from 207 to 57 to 81 to even fewer than those, but what mattered was that McCarthy had a number to back his statement up.

Sooner or later, everyone will encounter some data set, graph, chart, or statistic that is either made up entirely or somehow false.  Unfortunately, a lot of people end up basing opinions or votes on these mathematical falsehoods simply because the skills of identifying data manipulation are not widely taught.

These posts will have applications in statistics, economics, and political science, as well as many other fields, approached from a mathematical and statistical perspective.  And, while I will be using books such as Seife’s Proofiness and Leonard Mlodinow’s The Drunkard’s Walk: How Randomness Rules Our Lives, I will verify that the information I post here is as accurate as I can make it.

Works Cited:

Charles Seife, Proofiness

Leonard Mlodinow, The Drunkard’s Walk: How Randomness Rules Our Lives

Organic Foods: The Definition

Are organic foods better for you than non-organic foods?  Are non-organic foods dangerous because they use pesticides and herbicides?  The answer to these questions lies, in some part, in the definition of organic farming.

The definition is, surprisingly enough, extremely loose.  “Organic” does NOT mean that no pesticides or fertilizers were used.  It means only that the farm cannot use synthetic pesticides or fertilizers and cannot use genetically modified organisms (for more information, see the GMO’s page).  While this may make organic foods seem safer, because synthetic pesticides can be dangerous, this is not necessarily the case.  To put it in perspective, arsenic and cyanide are “natural” and would therefore be allowed as pesticides under that definition alone.  While there are certain exceptions in the organic farming definition (which include banning arsenic and cyanide), there are equally poisonous chemicals that are completely natural and therefore allowed under organic farming, but should definitely not be ingested by humans (such as hydrogen peroxide).

So, the bottom line: just because something is organic does not mean that pesticides and dangerous chemicals were not used.

More in-depth posts on the benefits and disadvantages of organic foods to come!