Data Diving … Right Off the Deep End

Written on 09 April 2016

by Ruth Fisher, PhD

Are you getting as much value from your Big Data or IoT analyses as you can? There’s a very good chance you’re not, and it might not be for lack of trying. Three big contributors are likely preventing you from extracting as much value as you could from your data:

You dive right into the data without first creating a roadmap;
You don’t understand the context and limitations of your data; and/or
Your analyses are too complex.

Create a Roadmap Before Diving into the Data

Creating a roadmap for your analyses before you dive into the data will help you increase the efficiency and accuracy of your analyses in two important ways.

First, defining the big picture (where you are and where you’d like to go) helps you guide and structure your analyses. With a good understanding of the big picture, you are also much less likely to get lost in your analyses and end up wandering down stray alleys. For both these reasons, a clear roadmap will help you move through your analyses much more quickly and efficiently.

Second, thinking about what you would like to accomplish with your data helps you better understand what your ideal data look like. This is crucial, because you must understand how the data you have differ from the data you would like to have, so that you can adjust your analyses and interpret your findings accordingly. Doing so is essential for increasing the accuracy of your analyses. This issue is discussed in more detail in the next section.

Understand the Context and Limitations of Your Data

Data are very contextual: they are collected in specific situations, under specific conditions, with specific intentions. If you lose sight of the context in which the data were collected, then you're very likely to misinterpret what your data represent, and therefore to be misinformed by any analyses you perform. Any of the following factors could contribute to your data misrepresenting what you’re trying to gauge.

Your data are bad proxies for what you’re using them for.

A classic example of a bad proxy is the ubiquitous use of GNP as a measure of a country’s well-being. Pundits generally take the view that if GNP is growing fast enough, then the country is booming, and otherwise it’s lagging. On a related note, there is also increasing concern that continued growth in global GNP will over-deplete the world’s resources and is thus unsustainable (see, for example, here). However, what both these views overlook is the fact that GNP is not, in fact, a good measure of people’s well-being. GNP does not capture, for example, a population’s access to education, healthcare, or clean air, or its job satisfaction. If a better proxy for well-being were found, then pundits would have more accurate measures of the state of citizens’ well-being in both good and bad times, and people would recognize that growth in global well-being is, in fact, sustainable.

Another good example of a bad proxy is using a person’s credit score to gauge whether or not he’d make a good employee. Credit scores are measures of a person’s likelihood of repaying a debt. Employers use credit scores to assess job-worthiness because it’s an easily accessible metric. But is it an appropriate measure of job worthiness? Maybe, but maybe not.

Your data have significant errors, inaccuracies, or omissions in them.

Analyses performed using inaccurate data can provide misleading results.

Patient health data, for example, contain a high incidence of errors, including inaccurate diagnoses, inaccurate medications and dosing information, and missing information. In the case of big data analyses of patient medical records, inaccuracies in patient data can lead to inaccurate conclusions about which treatments or courses of action are most effective for patients.

In the case of data omissions, analyses can lead to spurious, or false, conclusions. There is a website run by Tyler Vigen that provides fantastic examples of spurious correlations, such as that “US spending on science, space, and technology correlates with suicides by hanging, strangulation and suffocation.”

In some cases, spurious results are due to chance. Specifically, if you look at a large enough number of low-probability events, you’ll eventually find one that happens (this is sometimes called the law of truly large numbers, and it is the same trap as testing many comparisons and reporting only the one that "works"). Alternatively, spurious correlations between two data series can be due to omitted variable bias. For example, an analysis might indicate that a lot of people in the South who go to the local shopping mall buy ice cream. This might lead you to conclude that people who shop at the mall have a particular preference for ice cream. However, the true relationship might actually be that people go to the mall to avoid the heat, and when it’s hot outside, people eat more ice cream. An owner of shopping malls located throughout the country might rely on the mistaken interpretation of this correlation and make sure all his malls contain plenty of ice cream shops. That decision could then cause him to lose money in malls located in cold-weather climates.
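Both mechanisms above can be reproduced with a small simulation. The sketch below is illustrative only: the series are random numbers, and the temperature range, noise levels, and function names are assumptions chosen for the example, not figures from any real mall. The first function searches many unrelated random series for the one that best matches a target; the second generates mall visits and ice cream sales that are both driven by temperature and by nothing else.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_chance_correlation(n_series=2000, length=10, seed=0):
    """Compare a random target with many unrelated random series.

    Returns (largest |r| found, median |r|).
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(length)]
    correlations = sorted(
        abs(pearson(target, [rng.gauss(0, 1) for _ in range(length)]))
        for _ in range(n_series)
    )
    return correlations[-1], correlations[len(correlations) // 2]


def mall_and_ice_cream(n_days=20000, seed=0):
    """Temperature (the omitted variable) drives both visits and sales.

    Returns (correlation over all days, correlation on similar-temperature days).
    """
    rng = random.Random(seed)
    days = []
    for _ in range(n_days):
        temperature = rng.uniform(0, 40)           # hypothetical, degrees Celsius
        visits = temperature + rng.gauss(0, 5)     # heat pushes people indoors
        ice_cream = temperature + rng.gauss(0, 5)  # heat sells ice cream
        days.append((temperature, visits, ice_cream))
    overall = pearson([d[1] for d in days], [d[2] for d in days])
    # Hold temperature roughly fixed: keep only days between 20 and 22 degrees.
    band = [d for d in days if 20 <= d[0] < 22]
    within_band = pearson([d[1] for d in band], [d[2] for d in band])
    return overall, within_band


if __name__ == "__main__":
    best, median = best_chance_correlation()
    print(f"unrelated series: best |r| = {best:.2f}, median |r| = {median:.2f}")
    overall, within_band = mall_and_ice_cream()
    print(f"mall vs ice cream: all days r = {overall:.2f}, "
          f"similar-temperature days r = {within_band:.2f}")
```

In the first case a strong-looking correlation turns up simply because so many candidates were tried. In the second, the strong overall correlation largely disappears once temperature is held roughly constant, which is the signature of an omitted variable.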

Your data are out of date.

In dynamic environments, the reliance on untimely data can lead to inappropriate conclusions. As the saying goes, “Generals always fight the last war.” Whenever I travel, I wonder how much money the TSA is spending trying to prevent the next underwear or shoe bomber.

Untimely data also lead to inappropriate conclusions when the data are perishable. Airline tickets and hotel rooms are famously priced using revenue management methods. If information on inventories of unsold seats or rooms is not kept up to date, then pricing algorithms won’t work to maximize revenues while minimizing the numbers of seats and rooms that go unsold.
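One simple guard against perishable data is to check the age of a record before acting on it. The sketch below is a minimal illustration; the function name `is_stale` and the 15-minute default are assumptions made for the example, not a standard from any pricing system.

```python
from datetime import datetime, timedelta


def is_stale(last_updated, now, max_age=timedelta(minutes=15)):
    """Return True when a record is older than the allowed age.

    The 15-minute default is a hypothetical threshold; the right value
    depends on how quickly the underlying inventory changes.
    """
    return now - last_updated > max_age


if __name__ == "__main__":
    now = datetime(2016, 4, 9, 12, 0)
    fresh = datetime(2016, 4, 9, 11, 50)   # updated 10 minutes ago
    old = datetime(2016, 4, 9, 9, 0)       # updated 3 hours ago
    print(is_stale(fresh, now), is_stale(old, now))
```

A pricing routine could refuse to reprice, or fall back to a conservative price, whenever the inventory snapshot fails this check.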

Your data are mixed and matched from different sources.

When different data elements are taken from different data sources and then used together in analyses, there is a good chance that the analysis may lead to inaccurate results. In particular, mix-and-match data are often internally inconsistent.

Suppose, for example, that you have a data source that says that in Ancient Persia the price of a sack of wheat was one-tenth of a siglos, while the price of a bushel of apples was one-twentieth of a siglos. In this case, the information for the price of wheat is internally consistent with the price of apples. With this information, we wouldn’t necessarily know how many dollars that sack of wheat cost, but we would be relatively confident that a sack of wheat was twice as valuable as a bushel of apples at the time and place the data were taken from.

However, suppose we had one source of information that said that a sack of wheat used to cost one-tenth of a siglos, and a different source that said that a bushel of apples used to cost one-twentieth of a siglos. In this case, the data are not necessarily internally consistent, and we would be much less certain about the relative values of wheat and apples. What if the two measurements came from different time periods? Or different cities?
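The wheat-and-apples comparison can be made explicit by carrying provenance along with each price and refusing to compare records whose provenance differs. This is only a sketch: the `PriceRecord` type, its fields, and the placeholder source, period, and place labels are all hypothetical.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class PriceRecord:
    good: str
    price: Fraction  # price in sigloi
    source: str      # where the figure came from (hypothetical label)
    period: str      # when it was recorded (hypothetical label)
    place: str       # where it was recorded (hypothetical label)


def relative_price(a, b):
    """Return the price of `a` relative to `b`, but only if provenance matches."""
    mismatched = [
        field for field in ("source", "period", "place")
        if getattr(a, field) != getattr(b, field)
    ]
    if mismatched:
        raise ValueError(
            f"cannot compare {a.good} and {b.good}: differing {', '.join(mismatched)}"
        )
    return a.price / b.price


if __name__ == "__main__":
    wheat = PriceRecord("wheat", Fraction(1, 10), "source A", "period 1", "city X")
    apples = PriceRecord("apples", Fraction(1, 20), "source A", "period 1", "city X")
    print(relative_price(wheat, apples))  # wheat is worth twice as much as apples

    apples_elsewhere = PriceRecord("apples", Fraction(1, 20), "source B", "period 2", "city X")
    try:
        relative_price(wheat, apples_elsewhere)
    except ValueError as err:
        print(err)
```

Raising an error is a deliberately strict choice. In practice you might instead return the ratio together with a warning, so the analyst decides whether the mismatch matters.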

Your data are biased.

Perhaps the most insidious problems with data analysis occur when data are biased. Biases in data are especially likely to occur if the data have been collected from sources that exclude specific chunks of the underlying population that contribute to what you’re trying to analyze.

One of the easiest ways to determine if your data may be biased is to ask, “What criteria were used to determine if observations were either included in or excluded from my data?” If there are certain factors that cause certain types of observations to be either over-represented or under-represented in the data, then your data may very well be biased.

More obvious biases occur, for example, when data come from

Early users of a product or service. Early adopters tend to be more adventurous and less sensitive to price;
People who shop on mobile phones. Mobile phone shoppers tend to be more sensitive to price; or
Product reviews. Many reviewers have some sort of agenda that motivates them to post a review, which may make them unrepresentative of your target audience.

Less obvious biases occur, for example, when data come from

People who are asked to self-report information about themselves. Self-reported data are notoriously inaccurate; or
People who are successful at completing some task. These data may suffer from attrition bias by excluding information from people who tried, but failed, to complete the task.
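The question of which observations were included or excluded can be partly checked with numbers: compare each group's share of your sample with its share of the population you care about. The sketch below assumes you know, or can estimate, the population shares; the group names and figures are invented for illustration.

```python
def representation_ratios(sample_counts, population_shares):
    """Ratio of each group's sample share to its population share.

    A ratio near 1 means the group is represented proportionally,
    above 1 means over-represented, below 1 means under-represented.
    """
    total = sum(sample_counts.values())
    return {
        group: (sample_counts.get(group, 0) / total) / share
        for group, share in population_shares.items()
    }


if __name__ == "__main__":
    # Hypothetical figures: 80% of observations come from mobile shoppers,
    # who make up only 40% of the customers being analyzed.
    sample_counts = {"mobile": 800, "desktop": 200}
    population_shares = {"mobile": 0.4, "desktop": 0.6}
    for group, ratio in representation_ratios(sample_counts, population_shares).items():
        print(f"{group}: {ratio:.2f}")
```

This only catches biases along dimensions you can measure. Self-reporting errors and attrition leave no such trace in the group counts, which is part of what makes them less obvious.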

Simplify Your Analyses

When faced with a large set of data, there is a tendency to throw everything into the mix to see what works. There are two good reasons, however, to simplify your analyses as much as possible.

First, as analyses become more elaborate – that is, when they include more interrelationships among the different variables – you generally end up with complex, unintuitive outcomes. It then becomes difficult to navigate the relationships and associations in order to uncover the true insights.

Conversely, you can gain a much clearer understanding of the underlying dynamics of your situation by first examining simple relationships among the variables. Once you’ve nailed down the basics, you may then try to further elaborate on those basics in order to hone your results.

The second reason you should simplify your analyses as much as possible is that funky relationships (i.e., multicollinearity) among your different variables can end up clouding your results. Any such inter-relationships can make it difficult to understand the real underlying dynamics. Again, you’re much better off starting out with simple analyses to understand the basics, and then further elaborating thereafter to better understand the nuances.
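A quick way to spot multicollinearity before building anything elaborate is to look at how strongly your explanatory variables are correlated with each other. The sketch below uses simulated data and the variance inflation factor (VIF), which for exactly two predictors reduces to 1 / (1 - r^2). The variables and noise levels are assumptions made for the example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def vif_two_predictors(x1, x2):
    """Variance inflation factor when the model has exactly two predictors."""
    r = pearson(x1, x2)
    return 1 / (1 - r ** 2)


def demo(n=2000, seed=0):
    """Return (VIF for a near-duplicate predictor, VIF for an unrelated one)."""
    rng = random.Random(seed)
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    near_copy = [x + rng.gauss(0, 0.1) for x in x1]  # almost the same variable
    unrelated = [rng.gauss(0, 1) for _ in range(n)]  # independent of x1
    return vif_two_predictors(x1, near_copy), vif_two_predictors(x1, unrelated)


if __name__ == "__main__":
    collinear_vif, independent_vif = demo()
    print(f"near-duplicate predictor: VIF = {collinear_vif:.1f}")
    print(f"unrelated predictor:      VIF = {independent_vif:.2f}")
```

A VIF close to 1 means a predictor adds independent information. Values above roughly 5 to 10 are a commonly cited warning sign that two variables are telling you nearly the same thing, although the exact cutoff is a rule of thumb rather than a hard limit.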

Getting Better Value from Your Data

To perform the most effective and efficient analyses and generate the most value from your data, you should plan ahead. Before you jump into your data, you should first think about what you’re looking for, which outcomes you think you might find, and what information you have to get you there.

By creating a model based on theory, you will have a better understanding of where you’re going, how you plan to get there, and any adjustments you might have to make along the way, either to your data or to your analyses.

Next, you must understand the context of the data in your datasets so you will be aware of any limitations of your data. Only by understanding what information your data capture will you be able either (i) to adapt your analyses to account for the limitations or (ii) to view your results through the appropriate lenses.

Finally, simplify your analyses, at least initially, until you have a clear understanding of the basic dynamics underlying your system. Only after you’ve nailed down the basics should you try to further elaborate on those basics in order to gain a better understanding of nuances or otherwise hone your results.

By using foresight to guide your analyses, understanding the context and limitations of your data, and using simple analyses to uncover the basics, you will not only generate greater value from your data, but also do so more quickly and efficiently.
