No, I'm not referring to Donald Sterling's latest rant. But the recent Estately blog article "Nerdiest State in America" reminded me that when it comes to data bias, not all sources are created equal. The Estately blog has a section called "Ultimate Lists" - a decidedly quirky and not-so-serious montage of "analysis". I actually love this section - it's a good read. Recent entertaining titles include "How Weird is the State You Live In?", "The Top U.S. Cities for Douchebags", and "50 Largest U.S. cities Ranked From Most 'Country' to Least". Give the Estately team lots of credit for taking the dry topic of real estate statistics and bringing some fun to the party. What follows isn't a serious critique; rather, the Estately blog's creative use of sources plays nicely into our data bias topic.
Many of these stories use Facebook "likes" and "interests" as a primary source for the analysis. Assuming the stated methodology is accurate, there's a common data bias issue illustrated here: data sample bias. To uncover this bias, ask: How representative of reality is your sampling or polling method? Statistically speaking, if you have a truly unbiased sample, you can base your analysis on a relatively small number of observations. However, if your sample is biased - even if it contains millions of observations - the analysis will be significantly flawed. Broad, biased sampling of this type is commonly called a straw poll in politics; in statistical terms, it is a "non-probability sample". Not so good for evaluating outcomes...well, assuming you aren't just going for laughs.
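To make the small-but-unbiased versus large-but-biased point concrete, here is a minimal simulation sketch. Everything in it is invented for illustration: the population size, the 40% "true" rate, and the response rates (30% for people who hold the opinion, 10% for everyone else) are assumptions, not figures from any real poll.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000,000 people, about 40% of whom hold some opinion.
POPULATION = 1_000_000
TRUE_RATE = 0.40
population = [random.random() < TRUE_RATE for _ in range(POPULATION)]

# Small probability sample: every person has the same chance of being picked.
small_random = random.sample(population, 1_500)

# Large non-probability sample: people who hold the opinion respond 30% of
# the time, everyone else only 10% of the time (assumed response rates).
large_biased = [
    holds for holds in population
    if random.random() < (0.30 if holds else 0.10)
]


def share(sample):
    """Fraction of the sample that holds the opinion."""
    return sum(sample) / len(sample)


true_share = share(population)
small_estimate = share(small_random)
biased_estimate = share(large_biased)

print(f"true share:             {true_share:.3f}")
print(f"random sample (n={len(small_random):,}): {small_estimate:.3f}")
print(f"biased sample (n={len(large_biased):,}): {biased_estimate:.3f}")
```

Under these assumptions the biased sample is more than a hundred times larger than the random one, yet by construction it settles near 0.12 / 0.18, or about 67%, instead of the true 40%. Collecting more biased responses only makes the wrong answer more precise.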
Facebook is like a Straw Poll
Two of the key attributes of a good statistical sample are:
- The sampling frame must be representative of the universe of potential observations (i.e. not biased)
- The responses must be either random or complete rather than self-selected (i.e. not biased)
Some great illustrations of this come from political polling. The well-known granddaddy of all polling errors helped put the Literary Digest out of business. The magazine mailed out roughly 10 million straw poll ballots for the 1936 presidential election and predicted that Alf Landon would win. Over 2 million respondents sent in their pick. Can you imagine a modern-day poll with 2 million responses? With a sample that large, it would have to be predictive, right? Put that number in the context of today's scientific polls, which routinely collect a mere 1,500 to 3,000 responses and are accurate within 5 points. Should be a slam dunk prediction, right?
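For a sense of scale, here is the textbook margin-of-error calculation for those sample sizes. This is a sketch under stated assumptions: a simple random sample, a 95% confidence level (z = 1.96), and the worst-case split of p = 0.5. The function name `margin_of_error` is made up for this example.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a simple random sample of size n.

    Assumes every member of the population was equally likely to be sampled.
    """
    return z * math.sqrt(p * (1 - p) / n)


# Two typical scientific poll sizes, plus a Digest-sized response count.
for n in (1_500, 3_000, 2_000_000):
    print(f"n = {n:>9,}: +/- {100 * margin_of_error(n):.2f} points")
```

On paper, 1,500 responses give roughly 2.5 points, 3,000 give roughly 1.8 points, and 2 million give less than a tenth of a point. The catch is that the formula only describes random sampling error. It has nothing to say about a biased frame or self-selected responses.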
Better is Better than More (yet again)
Trouble is, the Digest's sampling methods failed on both our key points. First, the questionnaire went to 10 million individuals and only 2 million (20%) chose to fill it out (self-selected). Second, the frame itself was biased. As the election showed, these 2 million were not a representative sample of all likely voters but instead were a large non-representative subset. So what was at stake in the 1936 election? This was the height of the Great Depression. The frame for the poll was drawn largely from Digest subscribers, auto registrations, and listed phone numbers. Those requirements created frame bias, as only individuals with a certain level of wealth would meet any one of them, let alone several. In '36, Roosevelt won the presidency handily, with about 61% of the national vote. How was this possible given the Digest's poll? Well, F.D.R. was broadly supported by the less affluent, who generally couldn't afford cars or phones during the Depression, and who were in turn under-represented in the Digest poll.
Facebook provides a similar frame to the 1936 debacle.
- While a very large set, Facebook users provide a biased frame
- Within that frame, Facebook users self-select by "liking" pages and topics
But everyone has a Facebook account, right? No, not really. Particularly not an active account. Do a quick search for "Facebook user demographics". Look at any report. You'll see that even products with this broad a reach skew toward specific age, education, geographic, and economic groups. This is Facebook's frame - and it's still biased. But it is very broad, and that can be pretty useful. It's certainly not as biased as the subscription base to a left or right leaning publication! The bigger issue here is with using Facebook likes as the driver. Now we're self-selecting in multiple ways within that broad frame.
- Users must be active in order to have activity indicators such as likes.
- Users must decide to use the "Like" feature or "Interests" feature.
- More specifically, they must use that feature to like specific pages / topics or list specific interests.
How is that self-selecting? Well, I use Facebook nearly every day. But I've listed virtually no interests and very rarely like specific pages. Why? I like to make ad targeting a little more challenging for these guys! But, in the real world, I do in fact like at least half the interests crunched for the Nerdiest States story. Heck, I even like 30% - 40% of the topics used for the "Most Country" analysis - and I'm about as far from wearing a 10 gallon hat as you can be.
And it's because people decide whether to participate (the frame) and how to participate (self-selection) that this isn't a great source for statistical analysis. What is it great for? Well, if you self-select on an interest, that makes you a great target for correlated marketing. And that, of course, is what a lot of online business is all about.
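The stacked filters described above (having an account, being active, using the "Like" feature at all) can be sketched as a simple funnel. All of the numbers and the `Region` class are invented for illustration, and the sketch assumes each filter is independent of whether a person actually likes the topic, which is itself a generous assumption.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    truly_interested: float  # share of residents who actually like the topic
    has_account: float       # share of residents on the platform (the frame)
    is_active: float         # share of account holders who are active
    uses_likes: float        # share of active users who bother to "like" pages

    def observed_like_rate(self) -> float:
        """Recorded likes per resident after every filter is applied."""
        return (self.truly_interested * self.has_account
                * self.is_active * self.uses_likes)


# Two hypothetical states with the SAME real-world interest in a topic
# but different platform habits.
regions = [
    Region("State A", 0.30, 0.70, 0.60, 0.50),
    Region("State B", 0.30, 0.50, 0.50, 0.30),
]

for region in regions:
    print(f"{region.name}: true interest {region.truly_interested:.0%}, "
          f"observed likes {region.observed_like_rate():.1%}")
```

Both made-up states have identical real interest, yet State A shows well over twice the recorded likes per resident. A ranking built on likes would crown State A the winner purely because of how its residents use the platform.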
So basing demographic or lifestyle analysis on Facebook likes is similar to predicting Alf Landon for president!
So is Less Really More?
No, better is simply better. Better than less. Better than more.
We've seen how a huge sample size doesn't remove bias. Well, the opposite is surely true as well. A sample that is too small to be representative can be just as biased...and it's usually easier to spot, since you don't get mesmerized by the big number of responses. Humans are generally skeptical of declarative, broad-spectrum statements made by a handful of people - even when they are "experts". And guess what? We should be! Small groups are often self-selected, usually have something at stake, and often make statements that don't represent the larger group's interests.
Biased expert groups are everywhere - recognize these?
- Big Tobacco's medical experts
- The so-called science behind the inferiority of different races
- The claim that the Earth was flat
Okay, a little heavy-handed. But we see this in today's news all the time. My favorite recent one is Teke Wiggins' subtitle in his recent Inman Post, "Neighborhood Information...will Fuel Residential Segregation." Check out the blog post by our COO & President, Jon Bednarsh, for a perspective on what's really at stake in Teke's article. This isn't to say that all small samples are biased, of course, or that a biased sample always produces wrong results. But we'll dive into the bias behind Mr. Wiggins' series in my next post.
Image Credit: Sean MacEntee on Flickr.com