Suppose I gave you, the Data Analyst, a dataset of information on sales of Ford automobiles, and suppose I told you to use that dataset to predict total national sales of Ford automobiles for the next 12 months. What would you want to know about the data you were given?
If you were given data with information on past sales of Ford automobiles, and if you wanted to use that information to come up with a forecast of future sales, you would want to know, for example,
- Which regions of the country were included in the data,
- Which models of vehicles were included,
- Which time periods of sales were included, and
- Which regions, models, and time periods had Ford sales but were not included in the data. That is, which sales data are missing from your dataset.
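The coverage questions above can be turned into a quick check. The following is a minimal sketch, not a definitive method: the records, the `expected_*` lists, and the `missing_combinations` helper are all hypothetical, and the expected scope is an assumption you would have to confirm with whoever supplied the data.

```python
from itertools import product

# Hypothetical sales records: (region, model, quarter). Illustrative values only.
records = [
    ("Northeast", "F-150", "2023Q1"),
    ("Northeast", "Mustang", "2023Q1"),
    ("South", "F-150", "2023Q1"),
    ("South", "F-150", "2023Q2"),
]

# The scope you believe the data should cover (an assumption to confirm).
expected_regions = ["Northeast", "South", "West"]
expected_models = ["F-150", "Mustang"]
expected_quarters = ["2023Q1", "2023Q2"]


def missing_combinations(records, regions, models, periods):
    """Return every expected (region, model, period) with no record in the data."""
    observed = set(records)
    return [combo for combo in product(regions, models, periods)
            if combo not in observed]


gaps = missing_combinations(
    records, expected_regions, expected_models, expected_quarters
)

# Regions that never appear in the data at all.
missing_regions = sorted(set(expected_regions) - {region for region, _, _ in records})

total = len(expected_regions) * len(expected_models) * len(expected_quarters)
print(f"{len(gaps)} of {total} expected combinations have no records")
print("Regions entirely absent:", missing_regions)
```

A gap found this way does not say why the records are absent. A missing combination could mean there were no sales, or that sales occurred but were never recorded, and telling the two apart requires knowing how the data were generated.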
More generally, you would want to know where the data came from, which information the data include, and which information you need for your forecasts is missing from the data. To do a good job of forecasting future sales, you would also want to know such things as:
- Was the economy strong or weak when the data were collected? If the economy was strong in the past but it is expected to be weak in the future, then you would expect sales to decrease.
- Were Ford automobiles easily accessible to customers? That is, for people who wanted to buy Ford cars, were there Ford dealerships close by with vehicles in stock? If Ford cars are expected to become more readily available in the future, then you would expect sales to increase.
- Did people who wanted to buy a car have access to other brands of similar cars? If other brands of cars are becoming more available, then that might decrease future demand for Ford automobiles.
In other words, you want to know about the environment in which your data were generated and how that environment might differ from the one you expect in the future, during the period your sales predictions cover. Your data and your analysis must account for the differences between the environment that produced the data and the environment you want to predict.
A data-generating system is a concept I created to describe the process and environment from which data are generated. Understanding this system tells you, for example, which data were generated; how, when, where, and why they were generated; and which other factors affect what the data look like.
A typical data-generating system looks like this:
There are five main components of a data-generating system:
- The Content Provider provides content to Customers.
- The Data Collector collects data from Customers. The Data Collector decides which information to collect, how to collect it, and from whom.
- The Customer “provides” or “generates” data, for example, by buying products, consuming content, or providing feedback.
- The Client or Data Analyst uses data collected by the Data Collector from the Customer to perform analyses.
- The Context and Environment are all the factors in the background or setting that affect the information that ends up being collected by the Data Collector from the Customer.
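One way to keep these five components in view while working with a dataset is to write them down as a simple record. This is only an illustrative sketch: the `DataGeneratingSystem` class, its field names, and the way the Ford example is mapped onto it are hypothetical choices made for this example, not a standard structure.

```python
from dataclasses import dataclass, field


@dataclass
class DataGeneratingSystem:
    """A plain record of the five components (hypothetical field names)."""
    content_provider: str   # who provides content to Customers
    data_collector: str     # who decides what to collect, how, and from whom
    customer: str           # who generates the data
    analyst: str            # the Client or Data Analyst who uses the data
    context: list = field(default_factory=list)  # background factors

    def undocumented(self):
        """Return the names of components that have not been described yet."""
        return [name for name, value in vars(self).items() if not value]


# An assumed mapping of the Ford example; the data collector is left blank
# because nothing is yet known about who gathered the sales data.
ford = DataGeneratingSystem(
    content_provider="Ford and its dealerships",
    data_collector="",
    customer="Car buyers",
    analyst="You, the Data Analyst",
    context=["state of the economy", "dealership availability", "competing brands"],
)

print("Still to find out:", ford.undocumented())
```

Leaving a component blank makes the gap explicit: here the record shows that the analyst still needs to learn who collected the data and how.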