Suppose I gave you, the Data Analyst, a dataset of information on sales of Ford automobiles, and suppose I told you to use that dataset to predict total national sales of Ford automobiles for the next 12 months. What would you want to know about the data you were given?
If you were given data with information on past sales of Ford automobiles, and if you wanted to use that information to come up with a forecast of future sales, you would want to know, for example,
- Which regions of the country were included in the data,
- Which models of vehicles were included,
- Which time periods of sales were included, and
- Which regions, models, and time periods with Ford sales were not included in the data, that is, which sales data are missing from your dataset.
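One way to make that last question concrete is to list every region, model, and time period you expect to see and then check which combinations never appear in the data. The sketch below is only an illustration: the field names (region, model, period, units), the function find_coverage_gaps, and all the values are hypothetical and do not come from any real Ford dataset.

```python
from itertools import product


def find_coverage_gaps(records, expected_regions, expected_models, expected_periods):
    """Return the (region, model, period) combinations that were expected
    but never appear in the records."""
    observed = {(r["region"], r["model"], r["period"]) for r in records}
    expected = set(product(expected_regions, expected_models, expected_periods))
    return sorted(expected - observed)


# Hypothetical sales records; the values are made up for illustration only.
sales = [
    {"region": "Midwest", "model": "F-150", "period": "2023-Q1", "units": 120},
    {"region": "Midwest", "model": "Mustang", "period": "2023-Q1", "units": 30},
    {"region": "South", "model": "F-150", "period": "2023-Q1", "units": 95},
]

gaps = find_coverage_gaps(
    sales,
    expected_regions=["Midwest", "South"],
    expected_models=["F-150", "Mustang"],
    expected_periods=["2023-Q1"],
)
print(gaps)
```

A gap found this way tells you only that a combination is absent. You still have to find out whether there were truly no sales, or whether sales occurred and were simply not recorded.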
More generally, you would want to know where the data came from, which information the data include, and which information you need to make your forecasts is missing from the data. To do a good job in forecasting future sales, you would also want to know such things as:
- Was the economy strong or weak when the data were collected? If the economy was strong in the past but it is expected to be weak in the future, then you would expect sales to decrease.
- Were Ford automobiles easily accessible to customers? That is, for people who wanted to buy Ford cars, were there Ford dealerships close by and with vehicles in stock? If Ford cars are expected to become more readily available in the future, then you would expect sales to increase.
- Did people who wanted to buy a car have access to other brands of similar cars? If other brands of cars are increasingly available, then that might decrease future demand for Ford automobiles.
In other words, you want to know about the environment in which your data were generated and how that environment might differ from the environment you expect in the future, during the period your sales predictions cover. You must align your data and your analysis across two environments: the one that generated your data and the one you want to predict.
A data-generating system is a concept I created to describe the process and environment from which data are generated. Understanding this system tells you, for example, which data were generated, how, when, where, and why the data were generated, and other factors affecting what the data look like.
A typical data-generating system looks like this:
There are five main components of a data-generating system:
- The Content Provider provides content to Customers.
- The Data Collector collects data from Customers. The Data Collector decides which information to collect, how to collect it, and from whom.
- The Customer “provides” or “generates” data, for example, by buying products, consuming content, or providing feedback.
- The Client or Data Analyst uses data collected by the Data Collector from the Customer to perform analyses.
- The Context and Environment are all the factors in the background or setting that affect the information that ends up being collected by the Data Collector from the Customer.
Return to the example where I said that you were given a dataset of information on sales of Ford automobiles and had to use that information to predict future sales of Fords. This is what that data-generating system looks like.
- The Content Provider is Ford, the Company. Ford provides cars for sale to Customers.
- The Data Collector is also Ford. In the course of normal business operations, Ford collects information on where and when Customers purchase Ford cars, together with the make, model, and terms of purchase.
- The Customers are the buyers of Ford cars. The nature of their purchases (make, model, and terms of purchase) is the information they provide to the Data Collector, who, in this case, is Ford.
- The Client is you. You are using information on past sales of Ford cars to predict future sales.
- Context and Environment information includes, for example, the state of the economy, the accessibility of Fords, the availability of other brands of cars, and so on.
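If it helps to keep track of these components for your own datasets, you can write them down in a simple record and check which ones you have not yet described. Here is a minimal sketch in Python. The class name DataGeneratingSystem, its field names, and the undocumented method are my own illustrative choices, not an established tool.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class DataGeneratingSystem:
    """The five components of a data-generating system, as free-text notes."""
    content_provider: Optional[str] = None
    data_collector: Optional[str] = None
    customer: Optional[str] = None
    client: Optional[str] = None
    context: List[str] = field(default_factory=list)

    def undocumented(self) -> List[str]:
        """Names of the components that have not been described yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# The Ford example from the text, with Context and Environment left blank at first.
ford = DataGeneratingSystem(
    content_provider="Ford, the Company",
    data_collector="Ford",
    customer="Buyers of Ford cars",
    client="You, the Data Analyst",
)
missing_before = ford.undocumented()
print(missing_before)

# Fill in the background factors named in the text.
ford.context = [
    "state of the economy",
    "accessibility of Fords",
    "availability of other brands of cars",
]
print(ford.undocumented())
```

The point of the undocumented check is simply to remind you which parts of the system you have not yet thought about before you start your analysis.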
The Content Provider
The Content Provider provides content to Customers, either directly or through the Data Collector. For example, Netflix, the Company, provides movie services to Customers. More specifically, Netflix provides content to the Netflix Platform Host, who provides movie services to Netflix Customers. As another example, Ford, the Company, provides automobiles to Customers, generally through Ford Dealers.
The Data Collector
The Data Collector collects data from Customers. He decides which information to collect, how to collect it, and from whom. The Data Collector is a gatekeeper. He decides which data to include and which to exclude. As a Data Analyst, one of the very first questions you should ask yourself when given a dataset is, “What were the filters or criteria used to determine which information made it into the dataset and which information was excluded?”
Examples of Data Collectors include the following:
- In the Netflix system, Netflix Platform Hosts or Developers are the Data Collectors. They create the design and layout of the Netflix platform, provide content from Netflix, the Company, to Customers, and collect information on views and ratings from Customers.
- In the healthcare system, Electronic Medical Record (EMR) Systems Developers and Healthcare Providers are both Data Collectors. EMR Developers design the entire data collection system that Healthcare Providers use to collect information from Patients: the data collection windows, the menu choices, input options, and so on. Healthcare Providers use the data collection instruments designed by EMR Developers to collect Patient information.
- In the Ford system, Ford Management and the Ford Dealers are the Data Collectors. Ford Management determines which data elements to collect, while Ford Dealers collect information from Customers.
When collecting information, the Data Collector first determines the purpose for collecting the data. Second, based on that purpose, the Data Collector sets the data inclusion criteria. These criteria determine both which observations to include in the dataset and which data elements to include for each observation. You must understand the Data Collector’s purpose and data inclusion criteria to understand which data have been included.
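To see how inclusion criteria act on both observations and data elements, consider the following sketch. It assumes a hypothetical purpose (studying financed purchases only) and made-up records. The function apply_inclusion_criteria and the field names are illustrative assumptions, not anything Ford actually uses.

```python
def apply_inclusion_criteria(records, keep_observation, keep_fields):
    """Split records into included and excluded observations, and report
    which data elements (fields) the criteria drop."""
    included, excluded = [], []
    for rec in records:
        if keep_observation(rec):
            # Keep only the chosen data elements for each included observation.
            included.append({k: rec[k] for k in keep_fields if k in rec})
        else:
            excluded.append(rec)
    all_fields = {k for rec in records for k in rec}
    dropped_fields = sorted(all_fields - set(keep_fields))
    return included, excluded, dropped_fields


# Hypothetical raw purchase records; the values are made up.
raw = [
    {"dealer": "A", "model": "F-150", "financed": True, "trade_in": False},
    {"dealer": "B", "model": "Mustang", "financed": False, "trade_in": True},
    {"dealer": "C", "model": "F-150", "financed": True, "trade_in": True},
]

# Hypothetical purpose: study financed purchases, so cash sales are excluded.
included, excluded, dropped_fields = apply_inclusion_criteria(
    raw,
    keep_observation=lambda rec: rec["financed"],
    keep_fields=["dealer", "model"],
)
print(included)
print(excluded)
print(dropped_fields)
```

Notice that an analyst who receives only the included records sees no Mustang sales and no trade-in information at all, and nothing in the dataset itself says why.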
Data collection instruments are tools for collecting information. Data collection instruments include, for example, written questionnaires, oral interviews, and observations of Customers’ actions, for example, through websites and browsers. The Data Collector’s purpose for collecting the data and his criteria for inclusion will shape the instruments and thus which information ends up being collected.
Choice architecture plays an increasingly important role in the design of data collection instruments. Wikipedia defines choice architecture as the design of different ways choices can be presented to consumers, together with the impact that presentation has on consumer decision-making. In other words, choice architecture recognizes that the way you present choices to people can affect which of the options they choose. Choice architecture techniques include which words are used, which options are provided, and the layout of the data collection instrument. As the internet becomes a more pervasive source of information, I believe choice architecture is increasingly being used to affect the decisions we all make.
For example, suppose you’re on YouTube watching a video, and the video reaches the end. Have you noticed that the next video in the queue will automatically start to play unless you specifically click to stop it from playing? It used to be the other way around – the next video wouldn’t start playing until you clicked on it. But developers discovered that people will tend to stay on the page longer if the next video automatically starts playing.
This is a case in which YouTube was able to change the choices Customers make by changing the default option. Generally, the default is that if you take no action, nothing happens: the next video in the queue doesn’t start playing until Customers click on it. YouTube changed the default so that an action occurs (the next video starts playing automatically) unless Customers click on the “do not play” button. In this case, when the default is to opt in instead of to opt out, you get a whole lot more people opting in.
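The effect of a default can be sketched with a small, made-up example. Assume each Customer has a true preference (wants the next video or not) and either clicks or takes no action. Customers who click get what they want, and everyone else gets the default. The ten Customers and the function name below are hypothetical, chosen only to show the mechanism, not to estimate any real effect.

```python
def count_plays(customers, default_plays):
    """Count how many Customers end up with the next video playing.

    Each customer is a (wants_next_video, will_click) pair. A Customer who
    clicks gets their preference; a Customer who takes no action gets the default.
    """
    plays = 0
    for wants_next_video, will_click in customers:
        outcome = wants_next_video if will_click else default_plays
        plays += int(outcome)
    return plays


# Ten hypothetical Customers: three click, seven take no action.
customers = (
    [(True, True)] * 2       # want the next video and click to get it
    + [(False, True)] * 1    # do not want it and click to stop it
    + [(True, False)] * 3    # want it but take no action
    + [(False, False)] * 4   # do not want it but take no action
)

plays_when_default_is_off = count_plays(customers, default_plays=False)
plays_when_default_is_on = count_plays(customers, default_plays=True)
print(plays_when_default_is_off, plays_when_default_is_on)
```

The Customers’ preferences are identical in both runs. Only the default changed, yet the recorded data on plays look very different.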
To summarize, the Data Collector designs the data collection instruments and collects data from Customers. As a data analyst seeking valid and reliable answers to your questions, you want to understand three things in particular. You want to understand the Data Collector’s purpose for collecting the data. You want to understand his criteria for inclusion of both the observations he chooses to include in the dataset, as well as the data elements he chooses to include for each observation. And you want to understand the Data Collector’s design and layout of the data collection instrument and the extent to which choice architecture is affecting the choices consumers make.
The Customer
The Customer, User, or Data Provider provides information to the Data Collector. Consider some examples of Customers and the data they provide:
- In the Netflix system, the Customers are Netflix Users, that is, the people who buy movie services from Netflix. The information they provide is the movies they choose to watch and the ratings they give.
- In the healthcare system, the Customers are Patients. Patients provide information on their health and well-being.
- In the Ford system, the Customers are buyers of Ford automobiles. They provide information on the terms of sale and their ability to pay.
The information the Data Collector and you, the Data Analyst, both want from the Customer is the same thing: honest and accurate responses by the Customer to the Data Collector’s information request. However, the information you and the Data Collector actually get from the Customer is whatever he provides. In short, you want the truth, but you get what the Customer chooses to give you.
Let’s consider some examples of the difference between the information you and the Data Collector want from the Customer, and the information you actually get. The info you want from Netflix Customers is ratings that reflect what the Customers actually thought about the movie. The info you get is whatever rating the Customers actually provide. The info you want from the Patient is his true symptoms and his true actions: Is he really eating healthy? Did he really stop smoking? What are all the medications he is currently taking? Did he really take all his medications as prescribed? The info you get is whatever information the Patient actually provides. The info you want from Ford Customers is their actual ability and intention to pay for the cars they’re buying. The info you get is whatever the Customers choose to say.
The information you and the Data Collector actually get misrepresents the information you both want in several different cases: Customers may simply misunderstand the question; Customers may intentionally lie; Customers may remember past events incorrectly; or Customers’ true intentions may be derailed by choice architecture. Any of these types of “wrong” information could lead your analyses to produce invalid and/or inaccurate results.
There’s another factor that might cause the information you get from Customers to misrepresent the information you want. We collect tangible information because it’s easy to quantify and capture. However, essential information is often intangible. For example, someone’s success may be due to grit, charm, or luck (simply being at the right place at the right time). But how do you capture that in your dataset? As important as they may be, intangibles are generally difficult to quantify, so they often fail to be captured in the information being collected.
Let’s summarize what we’ve learned about the Customer in our data-generating system. The Customer provides information to the Data Collector. The information we want from the Customer is honest and accurate information. The information we get from the Customer is what he does or says. The information we get misrepresents what the Customer actually means when:
- The Customer misunderstood the question, was careless, or was intentionally dishonest;
- The Customer was influenced by cognitive or memory biases;
- The Customer was influenced by choice architecture; and/or
- Other essential information, such as intangibles, wasn’t collected.
Questions to ask yourself when considering information provided by Customers include: Are all key terms clearly defined? Is the information being requested easy to access or recall, that is, effortless for the Customer to provide? Would the Customer feel comfortable providing an honest response? Is the Customer’s response being influenced by choice architecture? Mismatches between the Customers’ honest and accurate responses and the responses they actually provide will create inaccuracies in your analyses.
The Client
The Client uses information collected by the Data Collector from the Customer to answer a question. Some examples of Clients and the questions they seek to answer include:
- In the Netflix system, one Client is Netflix, the Company. Netflix uses information from Customers to answer such questions as: Which movies do Customers want to see? How do I provide better movie recommendations to Customers?
- In the healthcare system, one Client is Healthcare Providers. Providers use patient information collected through EMR systems to answer such questions as: How do I better treat my patients? How do I treat their illnesses? How do I prevent them from getting sick in the first place?
- In the Ford system, one Client is the Ford CEO. Ford’s CEO uses information collected from Customers to answer such questions as: Which Ford automobiles are most popular with Customers? How do I increase sales of Ford cars to Customers?
The Client wants the data collected from Customers to be as aligned as possible with the ideal dataset he needs to answer his question. When the Client’s actual and ideal data are mismatched, the Client must adjust his actual data, his interpretation of his data, and/or his analyses to compensate for the mismatch.
Context and Environment
The Context and Environment in your data-generating system complicate your analyses. The Context and Environment influence the actions Customers take. However, these influences are generally not captured in the data, so your data are often incomplete. Your data are missing factors that affected what happened. So when you perform your analyses on your incomplete data, you end up with results that lack the full explanatory power of why Customers acted as they did. Context and Environment provide the background that sets the stage for the actors in the data-generating system.
The Context and Environment of a data-generating system include any background information that primes the Customer to act in a certain way, that affects the information the Customer provides, and/or that determines the generalizability of the information provided. Specific types of background information include: climate, population, culture, whether or not the Customer can be identified, whether or not the Customer is being observed, the nature of the situation, the time frame involved, and whether or not choice architecture was used.
I use the term climate here to describe both the physical and economic climates in which the Customers’ actions took place.
What’s the Weather like? Is it hot out? Cold? Windy, raining, or snowing? The weather has a surprising impact on the actions we all take. How many flights are delayed or cancelled when the weather turns bad? Bad weather, which causes passengers to be stranded in airports for long periods of time, must surely be good for airport businesses – such as restaurants and gift shops. Sales must go up when airports are crowded due to flight delays and cancellations. Do you think airport restaurants take the weather into account when they analyze past sales and predict future sales? If they don’t, they probably should.
What’s the State of the Economy? The economy cycles through booms and busts, inflation and deflation, full employment and high unemployment. Like the weather, the state of the economy has a large impact on our moods and on the actions we take. In good times we might think nothing of spending an extra $100 on a night out. But in bad times that $100 might be needed to pay the electricity bill. Almost all businesses are susceptible to the state of the economy, and if they don’t take it into account when they analyze their sales data and make predictions, their predictions will be off the mark.
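One simple way to take the economy into account is to label each period in your sales data with the conditions that prevailed at the time, and then compare sales across those labels rather than averaging everything together. The sketch below assumes hypothetical yearly sales figures and hand-assigned labels (strong, weak). None of these numbers come from real data, and the function name is my own.

```python
from collections import defaultdict
from statistics import mean


def average_by_context(sales_by_period, context_by_period):
    """Average sales separately for each context label.

    Periods with no recorded context are grouped under "unknown" so the
    gap in background information stays visible instead of being hidden.
    """
    grouped = defaultdict(list)
    for period, units in sales_by_period.items():
        label = context_by_period.get(period, "unknown")
        grouped[label].append(units)
    return {label: mean(values) for label, values in grouped.items()}


# Hypothetical yearly unit sales and economic labels, for illustration only.
sales_by_period = {"2021": 100, "2022": 120, "2023": 80, "2024": 70}
context_by_period = {"2021": "strong", "2022": "strong", "2023": "weak"}

averages = average_by_context(sales_by_period, context_by_period)
print(averages)
```

If you expect a weak economy next year, the average for the weak periods is likely a better starting point for your forecast than the overall average. The "unknown" group is a reminder to go find the missing background information.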
The size and makeup of the population where the Customers’ actions took place will influence the Customers’ actions.
How large is the population? People in large and/or densely populated areas behave differently than people in smaller, more sparsely populated areas do. For example, people in large, dense areas have access to, and use, many more products and services than people in more sparsely populated areas. People in cities tend to use public transit, delivery services, and entertainment services (that is, they go to movies, plays, and museums) more often than people in smaller towns do. Also, actions and behaviors ripple and scale differently depending on the size and density of the population.
What do the Demographics of the Population look like? As with the size and density of the population, the age, sex, wealth, ethnicity, and so on of the population will affect people’s actions and behaviors. Communities with young families have play dates, communities with young adults have hangouts, and communities with older people have activities.
The beliefs of the population about what’s considered acceptable and unacceptable behavior will influence the Customers’ actions.
The Culture and Norms of the population include, for example, the expectations people have about what’s considered appropriate or inappropriate behavior. When you’re out and about, do you dress casually or more formally? When you stop to chat with someone you know, do you stand close to them while you talk or do you respect their personal space? When you go out to eat, do you tip the waiter? If so, how much? Culture has a large impact on people’s actions and behavior.
Right and Wrong. Like other aspects of culture, what’s considered right or wrong, and which actions are legal or illegal, will influence people’s actions and behaviors. Of course, just because something is illegal or immoral doesn’t mean people don’t do it, but they will be reluctant to admit to others what they have done. If information in your dataset involves non-mainstream actions, such as how much people drink, how fast they drive, or whether or not they lie to or cheat on their spouses, it’s very possible your data contain inaccurate information.
You, as a Data Analyst, should always consider the culture of the Customers in your dataset and how culture might affect their behavior and/or their reluctance to admit truthfully to their actions.
Whether or not the Customer can be identified has a large impact on his behavior.
Generally speaking, when people feel more anonymous, they tend to act in ways that are less restrained, less cooperative, less productive, and less socially responsible, and they tend to obtain less satisfaction from their actions.
People act differently when they can’t see the other person’s face. Studies show that when people negotiate with other people who they cannot see – that is, they’re either put in separate rooms or blocked from seeing the other party – they behave less cooperatively and feel less satisfaction than they do when they’re face-to-face in the same room with the other party.
People act differently when they are anonymous or otherwise can’t be held responsible for their actions. We’ve all seen trolling on the internet. How many people do you think would continue to act in such uncivil ways if the people they’re trolling could identify them? Even in the “real” world – that is, as opposed to online – people become disinhibited when they think they can’t be identified. Think mob behavior.
Given the amount of information collected from people online, the lack of accountability is a serious problem for understanding how people might “normally” behave, that is, when they’re less anonymous and more identifiable.
People act differently when they know they’re being observed. In particular, the Hawthorne Effect notes that when people are being observed, they tend to work more productively. I worked on a project in which we were trying to estimate the amount of waste that employees generated while fulfilling a certain task. Data were collected using a survey in which Employees recorded their actions as they performed their tasks. When I analyzed the data, I found that, lo and behold, the amount of waste incurred miraculously dropped to zero during the study period.
When analyzing data on Customers’ actions, consider the degree of anonymity your Customers faced and the extent to which that level of identification might have led to more inhibited or more disinhibited behavior than normal.
Nature of the Situation
The nature of the situation faced by the Customer will affect the actions he takes. Is the situation your dataset covers professional or personal? Usual or unusual? Voluntary or mandatory? More generally, what’s at stake? The type of situation involved and what’s at stake affect the Customer’s attitude, level of care, or level of effort, and thus the probable completeness and accuracy of the information provided. For example, hopefully, the information you provide on a subordinate worker’s evaluation form or on a college application form will be at least as accurate and precise as the information you provide on a feedback survey you received from your auto mechanic.
The time frame in which the Customer is acting affects the actions he takes. During shorter time frames – that is, more urgent situations or situations in which the environment is changing more quickly – Customers’ choices are limited, so they tend to make do with what’s available. During longer time frames, Customers have more time to think and more options to choose from, so their actions tend to be more thoughtful or deliberate.
Earlier, we indicated that Choice architecture is when Data Collectors provide choices to Customers in ways that affect the choices Customers make. Persuasion technologies are technologies or techniques designed to change people’s attitudes or behavior through persuasion and social influence. Persuasion technologies are a form of choice architecture. They are both derived from behavioral psychology and behavioral economics. Not only is the use of choice architecture becoming more pervasive on the internet, but employers of choice architecture are also constantly learning how to use it more effectively to sway our actions. As one source states, “Broadly speaking, most of the online services we think we’re using for “free” — that is, the ones we’re paying for with the currency of our attention — have some sort of persuasive design goal.”
Let’s summarize what we’ve learned about Context and Environment. But first, let’s go back to the beginning to understand how Context and Environment fit into the big picture of our data-generating system: We have a set of data that captures actions that Customers took. We want to understand what Customers did and why they did what they did, so we can predict what they’ll do in the future. Customers’ actions are shaped by the forces that act on Customers. These forces include first, the Customer himself – who he is and what he wants. The second force acting on the Customer is the Content Provider, who provides Content for the Customer to consume. The third force acting on the Customer is the Data Collector. The Data Collector determines which Customers to collect data from and how to collect that information.
And the last force acting on the Customer is the Context or Environment in which the Customer’s actions took place. The Context or Environment affects what the Customer does; however, information on Context or Environment is often invisible, that is, it’s not specifically captured in the data. So you, the Data Analyst, must explicitly recognize what the Context and Environment were and how they might have affected the actions the Customer took, that is, the information in your data. Context or Environment includes any background information that primes the Customer, affects the information the Customer provides, and/or determines the generalizability of information provided.
Specific aspects of Context or Environment include:
- The physical and economic climate
- The nature of the population, including its size, demographics, and culture
- The degree of anonymity the Customer experienced
- The time frame and stakes involved
- Any choice architecture that might have shaped the Customer’s actions
What a Data Analyst should look for in Context and Environment is the following. First, is the situation in which the data were collected unusual or different in any way from most? Is it taking place under an unusual climate or within an unusual population? If so, the information captured in your data might not generalize to other contexts.
Second, is the Customer identifiable and/or can he be held accountable for his actions? If so, your data might not capture what he actually did (that is, he might lie about what he did), or his actions might not reflect what he would normally do under less inhibited circumstances.
And the last thing to look for: Is the Data Collector trying to persuade the Customer to take a specific action? Is the environment one of success-breeds-success, where being at the right place at the right time is important? If so, then your data might not generalize to other contexts.
Summary and Conclusion
The concept of a data-generating system emphasizes the importance of different participants in the environment who may play an important role in shaping the data Analysts use to provide insights. The concept provides an easy framework for helping Analysts think about the extent to which the data they’re using may be influenced by the Content Provider, the Data Collector, the Customer, the Client/Analyst, and/or the Context or Environment. Having a good understanding of the potential influences on their data helps Analysts better align their analyses so as to generate more accurate results from their investigations.
This analysis is taken from an on-demand course posted on Experfy, Data Quality: Are Your Data Suitable For Answering Your Questions?