Combating Data Manipulation


After reading a few books on data presentation, statistics, and probability theory, I have decided to create a 10-post series about things everyone should know about data.  As Charles Seife mentions in the first few paragraphs of his book Proofiness, putting numbers to an opinion or thought tends to make people believe it more than they would without that numerical backing.  Seife, in his introduction, uses Joe McCarthy as an example: when McCarthy declared that the State Department harbored communists, his argument became even more persuasive when he declared that he could name 205 of them.  Afterwards, that number fluctuated from 207 to 57 to 81 to even fewer, but what mattered was that McCarthy had a number to back his statement up.

Sooner or later, everyone will encounter some data set, graph, chart, or statistic that is either made up entirely or somehow false.  Unfortunately, a lot of people end up basing opinions or votes on these mathematical falsehoods simply because the skills of identifying data manipulation are not widely taught.

These posts will have applications in statistics, economics, political science, and many other fields, approached from a mathematical and statistical perspective.  And, while I will be drawing on books such as Seife’s Proofiness and Leonard Mlodinow’s The Drunkard’s Walk: How Randomness Rules Our Lives, I will verify the information I post here so that it is as accurate as I can make it.

I. Common Mistakes in Data Presentation

The first part of combating data manipulation is recognizing when a graph or chart or data set is misleading in some way.

(1) To understand the first way this can happen, we need to ask what it means for something to be measurable.  Something that is measurable has a unit (e.g. length can be measured in inches, weight in grams, etc.).  Many advertisements and commercials use numbers that functionally mean nothing, because they are numbers with no units.



So Pantene promises two times more shine in hair; how is shine measured?  What does that mean?  This number “two times” does not mean anything, but it sounds much more persuasive than saying “you’ll get some more shine in your hair.”



Here is another example.  Maybelline promises a 65% lift to the eyelashes, but how do we measure lift in eyelashes?  This is persuasive, but definitely not meaningful.

These meaningless numbers are used purely for persuasion, and they are the first, and probably the easiest, kind of numerical manipulation to spot.

(2) The second way people manipulate data to persuade is known as “cherry-picking.”  This is when data that supports a point is presented, but data that does not support it is ignored.  In Proofiness, Charles Seife presents an example in George Bush’s descriptions of the apparent success of the No Child Left Behind policy.  Bush claimed that students were in general improving their test scores, ignoring data from 12th grade students whose test scores actually decreased.  Moreover, the scores that did increase had been increasing slowly for decades, long before the passage of the legislation.  Bush cherry-picked data that supported his point and his policy.

(3) The third way is comparing data that should not be compared (e.g. comparing data from different years when the conditions were very different).  Seife uses the example of the Blue Dog Coalition criticizing the Bush Administration for borrowing a lot of money from foreign nations.  They claimed, “Throughout the first 224 years (1776-2000) of our nation’s history, 42 U.S. presidents borrowed a combined $1.01 trillion from foreign governments and financial institutions according to the U.S. Treasury Department. In the past four years alone (2001-2005), the Bush Administration has borrowed a staggering $1.05 trillion.”  While the comparison sounds incredibly convincing, the value of the dollar in 1776 was vastly different from its value in 2005, so comparing the raw dollar amounts does not make much sense.  Comparisons of test scores from different years often have the same problem; Michael Winerip in The New York Times describes the mayor of New York boasting about an increase in test scores for fourth graders, even though many educators admitted that the test had simply gotten easier.  Therefore, the comparison across years is inaccurate.

(4) The last type of data manipulation is changing the graph itself.  If one axis of a graph covers only 0-1 (of whatever unit the graph is displaying), the same data will look vastly different than if the axis ran from 0-10 or 0-50.  Making sure that the axes and the type of graph accurately display the data is extremely important.
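
To get a feel for how much an axis choice can exaggerate a difference, here is a minimal Python sketch.  The two values (50 and 52) and the axis starting points are made-up numbers for illustration only, and `apparent_ratio` is a hypothetical helper, not something from any of the books cited here.

```python
def apparent_ratio(a, b, axis_min):
    """Ratio of the drawn bar heights when the vertical axis starts at axis_min."""
    return (b - axis_min) / (a - axis_min)

# Hypothetical values for two bars on a chart.
small, large = 50.0, 52.0

honest = apparent_ratio(small, large, axis_min=0.0)      # axis starts at zero
truncated = apparent_ratio(small, large, axis_min=49.0)  # axis starts at 49

print(f"axis from 0:  second bar looks {honest:.2f}x as tall")
print(f"axis from 49: second bar looks {truncated:.2f}x as tall")
```

The underlying data is identical in both cases; only the starting point of the axis changes how large the 4% difference appears.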

II. Error

Every measurement has some error associated with it.  If anyone read this statement, they would most likely say “I knew that,” and most people do understand that on a basic level, measurements contain error.  However, the extent to which this error affects our lives is not widely understood.  Error has wide implications in almost every aspect of life, but particularly in polling and voting (things that will be discussed in more detail in the next post).  This post is dedicated to simply understanding the mathematics of error.

So what is error?  Error is a measure of the uncertainty inherent in a measurement.  Every measurement has some amount of uncertainty.  Think about looking at a ruler: you can measure 4.3 inches, but you can’t be sure whether it is 4.32 inches or 4.33 inches because the ruler does not have tick marks that fine.  That is the uncertainty in the measurement.

Understanding the nature of error and data requires knowing the findings of Daniel Bernoulli in the late 1700s.  He took sets of data from completely different contexts (e.g. astronomy data and archery data) and hypothesized that the distribution of the data would be very similar.  This, in fact, turned out to be the case; the “real” value always fell somewhere in the middle, with fewer and fewer data points the farther you get from the center.  This is known today as the “bell curve,” or a normal distribution of data.  In this bell curve, the center of the data represents the mean.

So, how do these findings of Bernoulli (along with other mathematicians, such as Abraham de Moivre) relate to error?  Well, the spread of the normal distribution reflects the error: if the bell curve is very spread out, the standard deviation, and thus the error, is much higher.  Standard deviation is a measure of how spread out the data is.  Ratings of movies, politicians, restaurants, and even grades are all subject to error; that is, when the “measurement” is made and then repeated, there is a high chance that you will get different numbers, and standard deviation is a mathematical measure of how varied those numbers are.

For example (from Mlodinow’s book, cited below), a few years ago it was announced that the unemployment rate in the United States was 4.7%, and a few months later it had changed to 4.8%.  News sources declared that unemployment was slowly rising.  However, this is not necessarily true; these measurements are subject to error, and therefore there is no way to tell whether that 0.1% was due to a true rise in unemployment or just due to error.  Mlodinow states that if the unemployment rate were measured at noon and then remeasured at 1 PM, the number would likely be a little different due to error, but that would not mean that unemployment rose in an hour.
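
A small simulation can make this concrete.  The sketch below assumes a true unemployment rate that never changes (4.7%) and a survey of 10,000 randomly chosen people; the sample size is a hypothetical number picked for illustration, not the size of the real government survey, and `measured_rate` is a made-up helper name.

```python
import random

def measured_rate(true_rate, sample_size, rng):
    """Survey sample_size random people and return the observed unemployment rate."""
    unemployed = sum(1 for _ in range(sample_size) if rng.random() < true_rate)
    return unemployed / sample_size

rng = random.Random(0)   # fixed seed so the run is repeatable
TRUE_RATE = 0.047        # the "real" rate stays the same for every survey
SAMPLE_SIZE = 10_000     # hypothetical survey size (an assumption)

# Repeat the same survey five times without changing anything about the economy.
rates = [measured_rate(TRUE_RATE, SAMPLE_SIZE, rng) for _ in range(5)]
for r in rates:
    print(f"measured unemployment: {r:.1%}")
```

Under these assumptions the measured rate typically moves by a tenth of a percentage point or more from one survey to the next, even though nothing real has changed.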

III. Polling

Polls are incredibly important and widely used, particularly during election season, and many people take those numbers to heart without thinking about where they came from.  People assume that when a poll says Obama’s approval rating is 47%, exactly 47% of the US population supports Obama.

A fascinating subset of science has been dedicated to examining the error in polls, especially the portion of error that is ignored or misrepresented.  There are a few parts to error in polls:

(1) Margin of error: this is what polls mean when they report their error.  Whenever a poll is conducted, it comes with the assumption that the people sampled represent the entire population, which is never exactly true.  The margin of error is the error that comes from the sample being too small to reflect the entire population perfectly.  It can be calculated statistically and is almost always mentioned when the poll is presented.
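
As a rough illustration of how the margin of error depends on sample size, here is a short Python sketch using the standard textbook approximation for a simple random sample at 95% confidence.  The function name `margin_of_error`, the default proportion of 0.5 (the worst case), and the sample sizes are choices made for this example; real pollsters often apply further adjustments.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error for a simple random sample."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

for n in (100, 1000, 10000):
    print(f"n = {n:>5}: +/- {margin_of_error(n):.1%}")
```

Note that this number only captures the error from sample size.  It says nothing about who was sampled or how the questions were asked.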

(2) Systematic error (as described by Charles Seife in Proofiness; Wayne Journell refers to this as sampling bias): this is very rarely brought up and can alter poll data a huge amount.  The best example, described by both Seife and Journell, involves the 1936 presidential election of Franklin Delano Roosevelt vs. Alf Landon.  The Literary Digest had become famous for having the most accurate polls in the country (they claimed their error was “within a fraction of 1 per cent” (Seife 105)), and before the election they published polls saying conclusively that Alf Landon was going to win.  They had a massive sample size, but their result was incredibly wrong.  Why was it so wrong?  It has to do with what Journell refers to as convenience sampling, when pollsters contact the part of the population that is easiest to get in touch with.  For the 1936 poll, the Literary Digest found people through telephone listings and car registration data.  Unfortunately, this was during the Great Depression, when a large share of Americans did not have cars or phones, and those who did were wealthier and tended, at that time, to vote overwhelmingly Republican.  Secondly, the poll fell victim to volunteer bias: the people with the strongest opinions tend to respond to polls because they want their voices heard.  In this election, Roosevelt was the incumbent, so the people who supported him were happy with the state of politics as it was and tended not to feel the need to respond.  Those unhappy with the state of the country contributed more to the poll.  Thus, the Literary Digest had a teeny margin of error, but such a huge systematic error that it completely threw their poll off.
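
The following Python sketch illustrates why a huge but biased sample can lose to a small random one.  Everything in it is invented for illustration: the electorate, the 30% ownership rate, and the voting tendencies are hypothetical numbers, not 1936 data, and candidate "L" is just a label.

```python
import random

rng = random.Random(1)

def make_voter():
    """Return (owns_phone_or_car, votes_for_L) for one hypothetical voter."""
    owns_phone_or_car = rng.random() < 0.30           # invented ownership rate
    p_votes_L = 0.65 if owns_phone_or_car else 0.30   # invented voting tendencies
    return owns_phone_or_car, rng.random() < p_votes_L

population = [make_voter() for _ in range(200_000)]
true_share = sum(votes_L for _, votes_L in population) / len(population)

# Convenience sample: a huge poll, but drawn only from phone or car owners.
owners = [votes_L for owns, votes_L in population if owns]
biased_poll = rng.sample(owners, 50_000)
biased_share = sum(biased_poll) / len(biased_poll)

# A much smaller poll drawn at random from the whole population.
random_poll = rng.sample(population, 1_000)
random_share = sum(votes_L for _, votes_L in random_poll) / len(random_poll)

print(f"true support for L:        {true_share:.1%}")
print(f"biased poll (n = 50,000):  {biased_share:.1%}")
print(f"random poll (n = 1,000):   {random_share:.1%}")
```

In this toy setup the 50,000-person poll has a tiny margin of error and is still far from the truth, while the 1,000-person random poll lands close to it.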



Another example involves the 1948 presidential election between Dewey and Truman.  Gallup’s polls showed Dewey with such a significant lead that Gallup, confident in his results, stopped polling weeks before the election.  However, the result was again incorrect; Gallup assumed that the undecided voters would vote just as the decided voters did.  This turned out not to be the case, and the result of the election was very different from the result of his poll.



It is important to note that polls are also subject to pure randomness; if a poll samples one group of people one hour, and then again the next hour, there is a good likelihood that the results will be different (maybe because the people changed their minds in an hour).  There is also almost a guarantee that if you sample two different groups, even if they are equally varied and diverse, you will get different results.  This is just due to random factors that affect these polls and, while these are usually not strong enough to completely throw off a poll (as in the previous examples), they do play a role in error.

Here’s another example (again from Seife’s book) that demonstrates response bias, which is when what people tell pollsters does not accurately reflect what they believe or what is true.  In 2007 the CDC conducted a poll about the sex lives of Americans; the results said that the average man has sex with seven women, but the average woman has sex with four men.  But…that makes no mathematical sense.  Assuming that they only polled heterosexual people (and that there are roughly equal numbers of men and women), every partnership counts once for a man and once for a woman, so the female average cannot be different from the male average.  Even though the poll really reported the median and not the mean, a gap that large is still very unlikely.  Analysts suggest that the discrepancy could be due to societal pressures causing people to change their answers from the true number.
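
The claim about averages can be checked with a toy model.  The Python sketch below builds a hypothetical closed population of 1,000 men and 1,000 women and assigns partnerships at random; the group sizes and the number of partnerships are invented for the example.

```python
import random

rng = random.Random(2)
N_MEN = N_WOMEN = 1000   # hypothetical closed population with equal group sizes

# Each partnership is one (man, woman) pair, generated at random.
partnerships = {(rng.randrange(N_MEN), rng.randrange(N_WOMEN)) for _ in range(5000)}

partners_of_man = [0] * N_MEN
partners_of_woman = [0] * N_WOMEN
for man, woman in partnerships:
    partners_of_man[man] += 1       # the pair counts once for the man...
    partners_of_woman[woman] += 1   # ...and once for the woman

mean_men = sum(partners_of_man) / N_MEN
mean_women = sum(partners_of_woman) / N_WOMEN
print(mean_men, mean_women)   # the two means are identical by construction
```

However the partnerships are distributed, the two totals are the same count of pairs, so with equal group sizes the two means must match.  Honest answers could not produce averages of seven and four.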

While a lot of the previous example is speculation on the part of analysts (we can’t really show that people were lying, although in 2003 apparently some researchers showed that if you ask people how many sexual partners they have had and then ask them again once attached to a lie detector machine, the number changes), it does bring some interesting concepts to light.  Polls are definitely not foolproof, and these examples illustrate different ways that polls can be inaccurate despite having really good sample sizes (and thus small margins of error).  A pretty funny story Seife mentions involves an Associated Press poll about what Americans expected for 2007.  AP published two articles on this one poll.  Headlines?  “Americans optimistic for 2007” and “Americans see doom, gloom for 2007.”  The margin of error of 3 percent was mentioned at the bottom along with the sample size (because the Associated Press made the false assumption that the overall relevant error can be calculated from just the sample size).  This just shows that polls can be really deceptive and outright wrong sometimes, and it is very important to understand the parts that go into formulating a correct poll so we know what questions to ask next time we encounter an article based on poll results.

I encourage everyone to take a look at the links in the works cited section (it’s always important to verify information, also they’re pretty interesting) and, if this stuff has interested you, pick up Seife’s Proofiness.


IV. Voting

A little bit about how error interacts with voting:

Counting is difficult.  And yet, voting, which has such a huge impact in the United States, relies mainly on people’s ability to count.  And, as I’m sure most people know, counting large numbers of anything always leads to mistakes; inherent in every election is error.  Typically in elections, these errors don’t make much of a difference in declaring a winner because the error is much smaller than the difference in votes between the candidates.  This means that if you take the error (let’s call it x) and add or subtract x from the number of votes a candidate received, it would not change who won the election.  It happens in every election that a few hundred votes go missing or are just miscounted, but typically these are ignored and never mentioned because they wouldn’t make a big enough impact on the election.

But there are those few elections in which the error made a huge difference.  The example that comes to most people’s minds is the Bush v. Gore election in 2000.  Many votes were miscounted because of poor ballot layout or because people filled out the ballots improperly.  As Charles Seife writes, “Even under ideal conditions, even when officials count well-designed ballots with incredible deliberation, there are errors on the order of a few hundredths of a percent.  And that’s just the beginning.  There are plenty of other errors in any election.  There are errors caused by people entering data incorrectly.  There are errors caused by people filling out ballots wrong, casting their vote for the wrong person…[b]allots will be misplaced.  Ballots will be double-counted” (163).  When the winner is unsure, the first thing people do is order a recount, not taking into account that every single time they count, there will be errors associated with the end number, even if those errors are different among recounts.  It is impossible to get the “correct” number of votes for each candidate.

According to Seife and many other political thinkers and mathematicians, the 2000 election was a tie; the error was so much bigger than the difference between the candidates that no one could conclusively say one candidate won over the other.  As Seife writes, “It’s hard to swallow, but…the 2000 presidential election should have been settled with a flip of a coin” (166).  However, no one likes the idea that an election can be a tie, so we operate under the idea that we can find a real victor if we recount enough times.
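
The idea of comparing the counting error with the winning margin can be written as a crude rule of thumb.  The Python sketch below is not Seife’s calculation; the vote totals are hypothetical, the function `is_statistical_tie` is a made-up name, and the error rate of 0.0003 is simply one reading of “a few hundredths of a percent.”

```python
def is_statistical_tie(votes_a, votes_b, error_rate):
    """True when the possible counting error is at least as large as the margin."""
    total = votes_a + votes_b
    possible_error = total * error_rate   # ballots that may be miscounted
    return abs(votes_a - votes_b) <= possible_error

# Hypothetical totals, not real election returns.
close_race = is_statistical_tie(3_000_500, 3_000_000, 0.0003)   # margin of 500
clear_race = is_statistical_tie(3_300_000, 2_700_000, 0.0003)   # margin of 600,000

print(close_race)   # True: about 1,800 ballots of error swamps a 500-vote margin
print(clear_race)   # False: the same error is tiny next to 600,000 votes
```

When the function returns True, recounting cannot settle the question, because each recount carries an error of the same size as the first count.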


V. Law of Large Numbers

You conduct two polls to determine the percentage of people in a town that like Italian food.  For one poll, you ask and collect data from 5 people and determine that 80% of the people in that town like Italian food.  For the second poll, you ask and collect data from 1,000 people and determine that 60% of the people in that town like Italian food.  Which poll is more correct?

We explored this idea a little bit in a previous post about polling, but almost everyone will agree that the poll that sampled 1,000 people is more accurate than the poll that sampled 5 people.  But why is that?  The answer lies in a theorem developed by Jacob Bernoulli, now known as Bernoulli’s Theorem, or the Law of Large Numbers.

The idea is that the more trials used, the more accurate the probability will be or, according to Wolfram, “as the number of trials of a random process increases, the percentage difference between the expected and actual values goes to zero.”

So what does this NOT mean?  Say we somehow know that exactly 62% of the people in our town from the example above like Italian food, and after sampling 1,000 people we have 60%.  Some may say (understandably) that the next 1 or 2 or even 100 results will run above 62% to correct the difference and get us closer to the actual value of 62%.  This is commonly known as the gambler’s fallacy (based on the faulty idea that after 100 or 200 (or even more) losing tries at a slot machine, the gambler is “due” for a win).  But this is not correct; the probability that any one person in that town likes Italian food is always 62%, so there is a 62% chance that the 1,001st person will like Italian food and a 38% chance that they won’t.

This is related to another fallacy, called the law of small numbers: people tend to assume that a small sample of a population or of trials is representative of the larger population or of the underlying probability, which is not necessarily true.  If you take only people 500 through 600 that you asked about Italian food, there is no guarantee that 62% of them will like it.  The Law of Large Numbers simply states that as you take more and more observations, the observed proportion will get closer and closer to 62%.
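
Here is a minimal Python simulation of the town example, assuming each person independently likes Italian food with probability 62%.  The seed and the sample sizes are arbitrary choices for the illustration.

```python
import random

rng = random.Random(3)
TRUE_SHARE = 0.62   # the town's real share, as in the example above

# Ask 100,000 simulated residents, one after another.
likes = [rng.random() < TRUE_SHARE for _ in range(100_000)]

for n in (5, 100, 1_000, 100_000):
    observed = sum(likes[:n]) / n
    print(f"first {n:>6} people: {observed:.1%} like Italian food")

# People 500 through 600 taken alone are a small sample and can drift from 62%.
window = likes[499:600]
print(f"people 500-600 only: {sum(window) / len(window):.1%}")
```

Nothing in the simulation "corrects" earlier results; each answer is generated the same way.  The running percentage settles near 62% only because early deviations get diluted by the growing number of observations.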

This is an important law to keep in mind when looking at polling results and voting results, when insurance companies estimate the probability that some event may happen, and in many other situations.


VI. Conditional Probability

Conditional probability is the probability that some event A will occur given that some event B has already occurred.

What is its significance?

One of the applications of conditional probability is in law; it is part of a concept known as the Prosecutor’s Fallacy.  This rests on the incorrect assumption that the probability that A will occur given B is the same as the probability that B will occur given A.  The first, and most common, example is the Sally Clark case.

Sally Clark Case

Sally Clark’s first son was born in 1996 and died soon after his birth.  She then had a second son, who also died soon after his birth.  She said that they both died of SIDS (Sudden Infant Death Syndrome), but she was arrested for killing her two sons.  During her trial, a statistician declared that the probability of a child dying of SIDS is 1 in 8500, and therefore the probability of two children dying of SIDS is 1 in 73 million ((1/8500)^2).  This seems fairly persuasive, and in fact the jury thought so too and convicted her.

But the statistician ignored conditional probability: it turns out that the probability that a second child will die of SIDS, given that the first one has already died of SIDS, is substantially higher.  Additionally, the probability of a child dying of SIDS, given that the child is male, is also much higher.

Secondly, the jury should have weighed the two possibilities: that of both children dying of SIDS and that of Sally Clark killing both her sons.  It turns out that the probability that Sally Clark killed both her sons is much, much lower than the probability that they both died of SIDS.  The jury, the lawyer, and the statistician did not consider these when arguing the case.
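
The arithmetic behind the trial figure, and how sensitive it is to the independence assumption, can be sketched in a few lines of Python.  The 1 in 8500 figure comes from the text above; the dependence factor of 10 is an invented placeholder used only to show the effect, not a measured value.

```python
p_one = 1 / 8500   # probability of one SIDS death, as quoted at the trial

# Treating the two deaths as independent simply squares the probability.
p_two_independent = p_one ** 2
print(f"independent: 1 in {1 / p_two_independent:,.0f}")   # roughly 1 in 72 million

# Hypothetical: suppose a second SIDS death is 10 times more likely after a first.
DEPENDENCE_FACTOR = 10   # invented placeholder, not a measured value
p_two_dependent = p_one * (p_one * DEPENDENCE_FACTOR)
print(f"dependent:   1 in {1 / p_two_dependent:,.0f}")
```

Even this adjusted number is not the one the jury needed.  The relevant comparison is between the probability of two SIDS deaths and the probability of a double murder, both of which are rare.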

OJ Simpson Case

OJ Simpson was on trial, suspected of killing his ex-wife.  There was lots of evidence against him, including that he had abused her, but the defense argued that it was highly unlikely that an abuser would go on to kill his wife (the statistic was that 1 in 2500 abusers kill their significant others).  Therefore, to the jury, it seemed like OJ Simpson was likely to be innocent.

However, the pertinent statistic at this trial was not the one presented.  As Leonard Mlodinow puts it, “The relevant number is not the probability that a man who batters his wife will go on to kill her (1 in 2,500) but rather the probability that a battered wife who was murdered was murdered by her abuser” (120).  That probability was about 9 in 10: of abused women who were killed, about 90% were killed by their abusers.  Therefore, the statistics were actually in favor of the prosecution, not the defense.
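
The two statistics are compatible with each other, which a short Python sketch with counts can show.  The group size of 100,000 is a hypothetical round number; the 1 in 2,500 and 9 in 10 figures are the ones quoted above, and the number killed by someone else is derived from them rather than taken from any data.

```python
battered_women = 100_000   # hypothetical group size

# From the text: 1 in 2,500 battered women is later killed by her abuser.
killed_by_abuser = battered_women / 2500             # 40 women

# From the text: about 9 in 10 murdered battered women are killed by their
# abuser, so those 40 must be about 90% of all murders in the group.
murdered_total = killed_by_abuser / 0.9              # about 44 women
killed_by_someone_else = murdered_total - killed_by_abuser

p_kill_given_abuse = killed_by_abuser / battered_women       # defense's number
p_abuser_given_murdered = killed_by_abuser / murdered_total  # relevant number

print(f"P(abuser kills | abuse)            = {p_kill_given_abuse:.4%}")
print(f"P(abuser did it | abuse and murder) = {p_abuser_given_murdered:.0%}")
```

The first probability conditions only on the abuse; the second also conditions on the fact that a murder has happened, which was not in question at the trial.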

When thinking about probability, the lesson to take from these two cases is that it is very important to think about the relevant probability.

VII. Patterns and Randomness

What is “randomness”?

Something is considered “random” when it seemingly has no pattern and is unpredictable, meaning that if you rerun the same process multiple times, you will get different results even if everything remains the same.

How do we know if something is “random”?

Humans are notoriously bad at identifying randomness; in fact, humans consistently seem to find patterns even in perfectly random processes.  It is how the “monkey-flower” got its name and why superstitions exist.  When a student does really well on two tests in a row wearing the same shoes, instead of attributing it to studying or pure random chance, they carefully put the “lucky shoes” away in their closet until their next test.  When a Fortune 500 company loses a lot of money over a certain period of time, instead of suggesting that random processes in the market could have caused it, people fire the CEO because there needs to be a pattern and a reason for each event.

People are also pretty bad at identifying random processes when they are supposed to be random.  Leonard Mlodinow in his book The Drunkard’s Walk tells a story of the first few generations of iPods that Apple developed; they initially created the “shuffle” function to be purely random, as it is supposed to be.  However, purely random processes can repeat themselves, and Apple began to hear complaints that the “shuffle” function was not random because the same song would play back to back, or the same artist would play for three songs in a row.  Apple actually had to make “shuffle” less random so people would believe that it is more random.
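
To see how often a truly random shuffle produces the kind of repeats people complained about, here is a small Python simulation.  The music library (10 artists with 10 songs each) is hypothetical, and this is not a description of how Apple’s shuffle actually worked.

```python
import random

rng = random.Random(4)

# Hypothetical library: 10 artists (numbered 0-9) with 10 songs each.
library = [artist for artist in range(10) for _ in range(10)]

def same_artist_repeats(playlist):
    """Count places where two consecutive songs share an artist."""
    return sum(1 for a, b in zip(playlist, playlist[1:]) if a == b)

counts = []
for _ in range(1000):
    playlist = library[:]
    rng.shuffle(playlist)   # a genuinely random order
    counts.append(same_artist_repeats(playlist))

average = sum(counts) / len(counts)
share_with_repeat = sum(1 for c in counts if c > 0) / len(counts)

print(f"average back-to-back same-artist pairs per shuffle: {average:.1f}")
print(f"shuffles with at least one such pair: {share_with_repeat:.0%}")
```

With this library a random shuffle is expected to contain about nine back-to-back same-artist pairs, so a playlist with no such clusters would actually be the less random one.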

There are a lot of mathematical ways to determine whether a process is truly “random,” but the simplest way is to just entertain the idea that not everything follows a pattern.  Finding great actors, hiring CEOs, and getting accepted to schools are all affected somewhat by random processes.  Who knows, maybe if big TV shows like Modern Family or movies that make a lot of money like Avatar or Frozen had been released 10 years later or 10 years earlier, or even if we had rewound time and done it all again, they wouldn’t have done as well.

For those still interested in the idea of “randomness,” I put a YouTube video of Leonard Mlodinow giving a talk in the works cited section; you should check it out!

VIII. Correlation vs Causation

“Correlation” and “Causation” are two words that are thrown around a lot when talking about data analysis and interpretation…but what do they actually mean?

What’s correlation?

Correlation is a statistical term that describes how closely two variables are related to each other.  This can most easily be shown on a linear graph:

[Figure: scatter plot with a best-fit line (File:Loi d'Okun.png)]

The above graph shows a bunch of data points (ignore what the data points actually mean…this is just an example), and the line drawn through them is a best fit line, or the line that the people who drew this graph thought would best represent their data.  If the variables have a strong correlation, the data points fit that line very well.  If the variables have a weak correlation, the data points fit the line only loosely.  Measures of correlation are used to determine how strongly related two variables are.
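
One widely used measure of correlation is Pearson’s correlation coefficient, which ranges from -1 to 1.  The Python sketch below computes it by hand for two made-up data sets; the data values and the helper name `pearson_r` are inventions for this example and have nothing to do with the graph above.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1, 2, 3, 4, 5, 6]
tight = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]   # points hug a straight line
loose = [5, 1, 6, 2, 7, 4]                 # points barely follow a line

print(f"strong correlation: r = {pearson_r(xs, tight):.3f}")
print(f"weak correlation:   r = {pearson_r(xs, loose):.3f}")
```

A value near 1 or -1 means the points sit close to a straight line, and a value near 0 means they do not.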

Buzzfeed has a list of weird correlations.  The variables in it have a relationship, as is clear from the graphs provided.

What’s causation?

Causation means that a change in one variable causes a change in the other variable.  In the Buzzfeed article, lemon imports are correlated with highway deaths, meaning that there is a relationship between the two, but anyone who said more lemons save lives on the road would be laughed at, because it’s silly to think that one causes the other.

So, what do people mean when they say correlation doesn’t imply causation?

They mean that just because two variables have a relationship does not mean that one causes the other.  One of the most common examples of this is the debate over whether vaccines cause autism.  While there is very little evidence to back this up, there are some graphs floating around the internet that show a clear correlation between diagnoses of autism and numbers of vaccinations.  But just because these graphs show a positive linear slope does not mean that vaccines cause autism; it could be that as medicine advances we are able to understand and diagnose more cases of autism, and separately those same advances lead us to develop more vaccines and increase vaccination rates (for more information about vaccines, check out my recent post).

Another example was presented by Leonard Koppett, a sportswriter, who showed that there is a correlation between who wins the Super Bowl and changes in the stock market.  However, despite what some people believed after he stated this, this does not mean that the team that wins the Super Bowl can directly influence how the stock market works.

So, how do we determine which are causations?

One way is to perform controlled experiments to shed some light on whether one causes the other; if we can change one variable and see its effect on the other variable without any other factors affecting the outcome, then we can say with more certainty that there is causation.

An easier way is to see whether we can explain logically and rationally why there would be causation.  It’s easy to dismiss lemons vs. highway accidents as just correlation because it doesn’t make any sense for lemon imports to affect highway accidents, unless somehow lemons were repeatedly falling off trucks and causing lots of accidents.

The autism vs. vaccines example is a little tougher, because to someone who hasn’t studied this extensively there could potentially be an explanation for causation, so we have to wonder whether there could be other reasons for the correlation.  Once we do this, we realize that medical advancements can independently cause both, and then we have some rationale for this being just a correlation.  Unfortunately, a lot of misunderstandings about correlation vs. causation have caused misguided mistrust in vaccines, which can become a huge problem (for more information, see my recent post).
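
A hidden common cause is easy to simulate.  In the Python sketch below, a variable z drives both x and y, while x and y never influence each other; all of the numbers are randomly generated for illustration and are not real medical data, and `pearson_r` is a hand-written helper.

```python
import math
import random

rng = random.Random(5)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# z plays the role of the common cause ("medical advances" in the example).
z = [rng.gauss(0, 1) for _ in range(5000)]
x = [zi + rng.gauss(0, 0.5) for zi in z]   # driven by z, never by y
y = [zi + rng.gauss(0, 0.5) for zi in z]   # driven by z, never by x

r_xy = pearson_r(x, y)

# Remove the common cause and the relationship disappears.
x_rest = [xi - zi for xi, zi in zip(x, z)]
y_rest = [yi - zi for yi, zi in zip(y, z)]
r_rest = pearson_r(x_rest, y_rest)

print(f"correlation of x and y:             {r_xy:.2f}")
print(f"correlation after accounting for z: {r_rest:.2f}")
```

The first correlation is strong even though neither variable affects the other, which is exactly the situation a graph of two rising trends can hide.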

This is a very important concept for making educated decisions, because given a graph of two variables with a positive correlation it is important to step back and think about what the creators of the graph want it to say vs. what the graph actually says.  Sometimes it requires a little extra understanding and research, and sometimes it requires relying on experts, like those who research and publish papers on autism and vaccines.  However, everyone has the ability to interpret data and distinguish between correlation and causation.

Up next in Combating Data Manipulation – a little more about voting!

IX. Does Voting Work?



Does voting work?  This seems like an obvious question: all you do is have a bunch of people select a candidate and then count up the ballots.  But there is actually a lot of mathematics behind figuring out the most effective method of voting so that the results accurately reflect what the population believes.

Kenneth Arrow, a prominent economist, developed a list of standards that a good voting method should meet:

(1) Decisive: there should always be one winner.

(2) Pareto Principle: if all the voters vote for candidate A, then candidate A should win.

(3) Nondictatorship: no single voter should be able to decide the election.

(4) Independence of Irrelevant Alternatives: If candidate A wins the election over B and C, then removing candidates B or C should not change that outcome.  Candidates B and C are considered “irrelevant” because they didn’t win the election.

This list seems pretty reasonable, but once we start applying these principles to find a voting system that meets them, things get far more complicated…

Plurality vs. Majority

There are many different ways to determine the winner of an election.  Two of them are known as plurality and majority rule.  The difference between them is extremely subtle:

a) Plurality: this rule says that the winner of an election is the candidate who receives the most votes, even if that is less than a majority.  So, if there are candidates A, B, and C and candidate A gets 45% of the votes, B gets 35%, and C gets 20%, candidate A wins.  This is the rule used in most United States elections.

b) Majority: this rule says that the winner of an election is the candidate preferred over each of the other candidates by a majority of the voters.  The ballot allows voters to rank the candidates in order of preference, instead of just voting for one candidate.  The nice thing about this method is that if, say, you really love candidate A but would much rather have candidate B win than C if A cannot, the ballot lets you say so.  So, if we look at candidates A, B, and C again, we get something like the following data:

45% of the voters have the following preference: A>B>C

35% of the voters think the following: B>C>A

and 20% of the voters think the following: C>B>A

In this case, B is the winner of this election, because 35%+20%=55% of the voters prefer B to A and 45%+35%=80% of the voters prefer B to C, both of which are majorities.  So, in this case, you look at the voters’ preferences between each pair of candidates.

As you can see, it is the same election, but determining the election winner in different ways will give two different winners.
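
The two rules can be applied to the example ballots with a short Python sketch.  The function names (`plurality_winner`, `prefers`, `majority_winner`) are made up for this illustration, and the "majority" rule here is implemented as the head-to-head comparison described above.

```python
# Ballots from the example: ranking (best first) -> percent of voters.
ballots = {("A", "B", "C"): 45, ("B", "C", "A"): 35, ("C", "B", "A"): 20}

def plurality_winner(ballots):
    """Candidate with the most first-place votes."""
    first_choice = {}
    for ranking, share in ballots.items():
        first_choice[ranking[0]] = first_choice.get(ranking[0], 0) + share
    return max(first_choice, key=first_choice.get)

def prefers(ballots, x, y):
    """Percent of voters who rank x above y."""
    return sum(share for ranking, share in ballots.items()
               if ranking.index(x) < ranking.index(y))

def majority_winner(ballots):
    """Candidate who beats every other candidate head to head, if one exists."""
    candidates = next(iter(ballots))
    total = sum(ballots.values())
    for c in candidates:
        if all(prefers(ballots, c, other) * 2 > total
               for other in candidates if other != c):
            return c
    return None   # no candidate beats all the others head to head

print("plurality winner:", plurality_winner(ballots))   # A
print("majority winner: ", majority_winner(ballots))    # B
```

The ballots are identical in both calls; only the counting rule changes the outcome.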

So, how do we determine which method is better?

How do Arrow’s standards let us determine which of the rules described above (plurality or majority) is fairer?  Well, Arrow’s impossibility theorem states that no voting system can actually meet all of these principles.  Here are some problems with the two rules:

a) The plurality rule often breaks rule #4; candidates who did not win end up affecting the outcome, as occasionally happens in United States elections.  For example, in the Bush v. Gore election, the vote in Florida was so close that had Ralph Nader not been on the ballot, Gore could have won decisively.  In this case, the voting method did not meet Arrow’s principles of fair voting.

b) The majority rule falls prey to the Condorcet Paradox, which states that the preferences of a population can end up being irrational.

Say there are three voters and three candidates, our previous A, B, and C.  They have the following preferences:

Voter 1: A>B>C

Voter 2: B>C>A

Voter 3: C>A>B

So now we use the majority rule by pitting each candidate against each of the others:

A vs. B: A is preferable to B for 2/3 of the voters, so A wins.

B vs. C: B is preferable to C for 2/3 of the voters, so B wins.

C vs. A: C is preferable to A for 2/3 of the voters, so C wins.

So, A is more preferable than B, B is more preferable than C, and C is more preferable than A, or

A>B>C>A
But wait…that doesn’t make any sense.  This is an example where, even though the preferences of individual voters are very rational, when using the majority rule the preferences of the population become suddenly irrational.  This is a big problem for the majority rule.
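Both problems can be reproduced in a few lines.  This is a rough sketch under stated assumptions: problem (a) reuses the hypothetical 45/35/20 electorate from the earlier example (it is not real Florida data), problem (b) uses the three voters above, and the helper names are invented for illustration.

```python
# Problem (a): hypothetical electorate from the earlier example,
# given as (share of voters in %, ranking from most to least preferred).
electorate = [
    (45, ["A", "B", "C"]),
    (35, ["B", "C", "A"]),
    (20, ["C", "B", "A"]),
]

# Problem (b): the three voters from the Condorcet example, one vote each.
voters = [
    (1, ["A", "B", "C"]),
    (1, ["B", "C", "A"]),
    (1, ["C", "A", "B"]),
]


def plurality_winner(profile, running):
    """Each voter backs their highest-ranked candidate who is still running."""
    totals = {candidate: 0 for candidate in running}
    for weight, ranking in profile:
        top_choice = next(c for c in ranking if c in running)
        totals[top_choice] += weight
    return max(totals, key=totals.get)


def beats(profile, x, y):
    """Return True if a strict majority of voters rank x above y."""
    total = sum(weight for weight, _ in profile)
    for_x = sum(weight for weight, ranking in profile
                if ranking.index(x) < ranking.index(y))
    return 2 * for_x > total


def majority_winner(profile):
    """Return the candidate who wins every head-to-head contest, or None."""
    candidates = profile[0][1]
    for x in candidates:
        if all(beats(profile, x, y) for y in candidates if y != x):
            return x
    return None


# (a) Dropping the last-place candidate C changes who wins under plurality.
print("With C running:   ", plurality_winner(electorate, {"A", "B", "C"}))
print("Without C running:", plurality_winner(electorate, {"A", "B"}))

# (b) Every head-to-head contest has a winner, yet nobody beats everyone.
for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(x, "beats", y, ":", beats(voters, x, y))
print("Majority winner:", majority_winner(voters))
```

In the first case the winner flips from A to B once C drops out, even though C was never going to win.  In the second, all three head-to-head contests come out True, so the search for an overall majority winner comes back empty.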

So, does voting work?

Each one of Kenneth Arrow’s principles seems very logical and appealing, but at least one is violated in every voting system.  Some systems are preferable to others, but according to Arrow, there does not exist a perfect voting system that effectively takes the beliefs of all individual voters and consolidates them into one final decision.  What is interesting is how the number of parties affects these principles.  What if we had only two parties?  Would the voting system be fairer?  What if we had no parties at all?  All food for (scientific) thought…

For more information about voting, check out the previous post:

X. How to Combat Data Manipulation

We’ve spent the last 9 posts exploring the good and the bad of data.  We’ve seen examples of data being skewed or manipulated to advance a particular agenda, and we’ve seen examples of math being used to model aspects of life.  But, how can people recognize when numbers are being manipulated?  Here are a few steps to take when presented with a number:

(1) Ask where the number comes from: know who came up with the number and what methods they used.  For example, if given polling data, ask what the sample size of the poll was and how they found the people to poll.  You also may want to ask who conducted the poll, to see what kind of incentive they may have to twist the data.

(2) Ask yourself if you think they used correct methodology: do you really think that the pollsters adequately captured a representative sample of the whole population?  Do you think that makeup companies really conduct proper tests before determining how much strength or lift their mascara gives?  A lot of the time, catching data manipulation is all about thinking and using your own intuition before taking a statistic or number as truth.  Using an example from an earlier post, it doesn’t take too much math or previous knowledge to think about the statement of the Blue Dog Coalition: “Throughout the first 224 years (1776-2000) of our nation’s history, 42 U.S. presidents borrowed a combined $1.01 trillion from foreign governments and financial institutions according to the U.S. Treasury Department. In the past four years alone (2001-2005), the Bush Administration has borrowed a staggering $1.05 trillion” and recognize that the comparison doesn’t make a whole lot of sense once you factor in inflation and the changing value of the dollar.

(3) Know the error: there is error inherent in every measurement conducted, but the error is very rarely taken into account when reporting a number or statistic.  Sometimes, the error can be so large that the statistic basically doesn’t mean anything, and so it is important to keep this in mind when evaluating data.

(4) Interpret the data yourself whenever possible: many times when statistics are reported, they are followed by an interpretation of the data.  A fictional example could be “This study has shown that 64% of people who drank lemonade could run faster than before the lemonade, therefore drinking lemonade makes people fitter.”  64% is the statistic, but the rest of the sentence is interpreting the data for us.  However, the interpretation is a little silly and is not necessarily what the data is telling us.  Whenever possible, it is always safest to examine the data yourself, come to your own conclusion, and then compare it to the interpretation given.  This way, you can catch silly interpretations that aren’t really supported by the data.

(5) Make sure that the data is in the right context: this is very applicable to election and polling results.  It is very common, especially with presidential elections, for people to start collecting polling data to predict the outcome of an election almost a year in advance.  In that year, the candidates will campaign, undecided voters will decide based on speeches and debates, and sometimes decided voters will hear something they do or don’t like in a speech or debate and will change their minds.  A lot can happen in that year, and therefore the chances that those poll results will accurately predict the election outcome are not very high.  Nate Silver, in his book The Signal and the Noise (which I highly recommend for anyone interested), actually does the math and determines that the likelihood that a candidate who is ahead in the polls a year in advance will actually win is between 52% and 81%, depending on the size of their lead.  This is an example of when the data may have been correctly collected, but it just isn’t in the right context yet.

(6) Make sure that the data and the interpretation make sense: this one is a little tricky, because sometimes data that has been correctly collected and interpreted can give us very surprising information about the world.  However, looking at data and making sure that the conclusion obeys physical laws and is, to some extent, realistic can weed out a lot of silly and tampered data.  For example, in the Correlation vs. Causation post I said that there is a correlation between lemon imports and highway accidents.  But, just thinking about this can quickly rule it out as being a correct interpretation; why would lemon imports have anything to do with highway accidents?  Asking that question “why are these two variables connected” can shed a lot of light on whether the interpretation of data, or the data itself, is accurate.
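To make step (3) a little more concrete, here is a minimal sketch of the usual margin-of-error estimate for a poll.  It assumes a simple random sample and a 95% confidence level (z = 1.96), and the poll numbers (52% support among 400 or 1,000 respondents) are invented purely for illustration.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a proportion p measured on n people.

    Assumes a simple random sample; z = 1.96 corresponds to roughly
    95% confidence under the normal approximation.
    """
    return z * math.sqrt(p * (1 - p) / n)


# Hypothetical poll: 52% of respondents support a candidate.
for sample_size in (400, 1000):
    moe = margin_of_error(0.52, sample_size)
    low, high = 0.52 - moe, 0.52 + moe
    print(f"n={sample_size}: 52% +/- {moe * 100:.1f} points "
          f"(range {low * 100:.1f}% to {high * 100:.1f}%)")
```

With 400 respondents the range dips below 50%, so a headline saying the candidate is "ahead" would not really be supported by that poll alone.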

It’s always important to question any data that you are presented with, and relying on your own knowledge and instinct can really help combat any forms of data manipulation.

Works Cited

Jonathan K. Hodge, The Mathematics of Voting and Elections: A Hands-On Approach

Charles Seife, Proofiness

Leonard Mlodinow, The Drunkard’s Walk: How Randomness Rules Our Lives

Wayne Journell, Interdisciplinary Education: “Lies, Damn Lies, and Statistics: Uncovering the Truth Behind Polling Data”

Robert W. Pearson, Statistical Persuasion

Nate Silver, The Signal and the Noise

