Here at Social Scientific, we perform a continuum of activities to uncover the stories behind data for our clients, who use this knowledge to make decisions. One such activity is building statistical models. In a series of posts, we discuss what a model is, what makes it statistical, and what model building is, and we describe a sample of statistical models and their use cases. This post focuses on what a model is.

A Model Is A Mathematical Representation Of How Data Is Generated

When you hear the word “model”, you probably think of a representation of a person or thing, typically on a smaller scale than the original. Examples include a gold figurine of an elephant, a clay architectural model of the Cathedral of Milan, and a cardboard Model T, one of the first Ford cars to be mass-produced. Here at Social Scientific, we use the word “model” in a similar way but with two main differences.

Think of observed data as the outcome of unobserved process(es)

Firstly, what is being represented is not a person or thing but a process. This process is one that produces the data we observe around us. For example, you can talk of the process that generates military coups around the world, the process that generates presidential election victories for the Republican Party in the USA, the process that generates rain on a given day, or even the process that generates high employee turnover within a company. Secondly, we build models with one or more mathematical equations as opposed to, say, gold, copper, clay, plastic, or cardboard. Here is an example of a model of a potential process that generates the annual income of workers in Brazil:

Y = A + B*X    (1)

Y is the annual income of a worker; A is the annual income of the same worker when she has 0 years of schooling; X is the number of years of schooling that worker has; and B is the additional income the worker gets for each year of schooling. In plain English, this equation tells us that we can predict the annual income of any worker in Brazil if we know three things: (i) how many years of schooling she has (X), (ii) what her annual income would be if she had 0 years of schooling (A) [note: we can approximate this using the annual income of workers who have 0 years of schooling], and (iii) how much additional income a worker gets for every year of school she completes (B).
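To make equation 1 concrete, here is a minimal Python sketch. The values chosen for A and B are hypothetical and picked only for illustration; they are not estimates for Brazil, and the function name is our own.

```python
# Equation 1: Y = A + B*X
# A and B below are made-up illustrative values, not real estimates.
A = 12_000.0  # hypothetical annual income with 0 years of schooling
B = 2_500.0   # hypothetical extra annual income per year of schooling


def predict_income(years_of_schooling, base_income=A, income_per_year=B):
    """Return the annual income that equation 1 predicts."""
    return base_income + income_per_year * years_of_schooling


for years in (0, 8, 12, 16):
    print(years, predict_income(years))

print(predict_income(12))  # 42000.0
```

With these assumed values, a worker with 12 years of schooling is predicted to earn 12,000 + 2,500 × 12 = 42,000 per year.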

Understand that modeling such process(es) is part-art, part-science

Even granting that equation 1 is a good representation of reality (i.e., this is how the data for worker salaries in Brazil are generated), there is a problem. If you go to Brazil and observe worker salaries there, you will notice that equation 1 doesn’t hold exactly. Specifically, some of the salaries are higher or lower than what equation 1 predicts. This compels us to modify our initial representation of reality. Thus, equation 1 is tweaked to become equation 2, below, where C is the amount by which an observed worker salary is higher or lower than what equation 1 predicts it will be.

Y = A + B*X + C   (2)

C is what statistical modelers refer to as the “error term”. In practice, C can represent much more. It can represent all errors in equation 1’s prediction stemming from invalid assumptions that are inherent in the equation/model itself. For example, C can represent mistakes in equation 1’s prediction stemming from the fact that the increment in annual income that education confers isn’t necessarily linear. That is, after a certain level of education, the income increment from each additional year of school may diminish.
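The diminishing-returns case can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the values of A and B, the quadratic shape of the "true" process, and the coefficient D are hypothetical. The point is only that when the real process is not linear, the gap between observed income and equation 1's prediction ends up in C.

```python
# Hypothetical values, chosen only for illustration.
A = 12_000.0  # income with 0 years of schooling
B = 2_500.0   # income gain per year of schooling under equation 1
D = 50.0      # assumed strength of diminishing returns


def equation_1(years):
    """Linear prediction: Y = A + B*X."""
    return A + B * years


def true_income(years):
    """An assumed 'true' process in which each extra year adds a bit less."""
    return A + B * years - D * years ** 2


def error_term(years):
    """C in equation 2: observed income minus equation 1's prediction."""
    return true_income(years) - equation_1(years)


for years in (0, 5, 10, 20):
    print(years, error_term(years))
```

Under these assumptions the error term is zero at 0 years of schooling and becomes more negative as schooling increases, which is the pattern you would expect if equation 1 overstates the payoff of later years of school.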

C can also represent potential confounders of the relationship between years of schooling and annual income. In this case, a confounder is any difference (other than number of years of schooling) between workers with more years of schooling and workers with fewer years of schooling that impacts annual income. An example of this is market demand conditions for a worker’s skills. When demand is high, remuneration is likely to be high even for workers who do not have more years of schooling.
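Here is a small, fully made-up example of what a confounder does. We assume six hypothetical workers, a hypothetical premium paid to workers whose skills are in high demand, and that the high-demand workers also happen to have more schooling. If we ignore demand and fit a straight line of income on schooling (ordinary least squares, which is our choice for this sketch), the slope no longer matches the B used to generate the data.

```python
A = 12_000.0              # hypothetical income with 0 years of schooling
B = 2_500.0               # hypothetical income gain per year of schooling
DEMAND_PREMIUM = 6_000.0  # hypothetical bonus when skills are in high demand

# (years of schooling, 1 if the worker's skills are in high demand else 0)
workers = [(8, 0), (10, 0), (12, 0), (12, 1), (14, 1), (16, 1)]

years = [x for x, _ in workers]
incomes = [A + B * x + DEMAND_PREMIUM * d for x, d in workers]


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


naive_slope = ols_slope(years, incomes)
print(naive_slope)  # 3400.0, larger than the B of 2500 used above
```

In this toy data the fitted slope is 3,400 rather than 2,500, because part of the demand premium is wrongly attributed to schooling.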

A Model Can Be Deterministic or Stochastic. The Former Doesn’t Have Uncertainty But The Latter Does. 

Now you know that a model is one or more mathematical equations representing a process that generates data. But a model can be deterministic or stochastic. A deterministic model is a model that has no uncertainty. In other words, whenever we collect data to validate the model’s predictions, we find no discrepancies. On the other hand, we find discrepancies when validating a stochastic model’s predictions.

Going back to our model of the process that generates the annual income of workers in Brazil, equation 1 is a deterministic version of that model, while equation 2 is a stochastic version. Usually, when we discuss statistical models, we are referring to stochastic models.
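The contrast between the two versions can be sketched in Python. As before, A and B are hypothetical values. We also assume, purely for illustration, that the error term C is drawn from a normal distribution with mean 0 and a standard deviation of 3,000; nothing in equation 2 itself requires that choice.

```python
import random

A = 12_000.0  # hypothetical income with 0 years of schooling
B = 2_500.0   # hypothetical income gain per year of schooling


def deterministic_income(years):
    """Equation 1: the same input always gives the same output."""
    return A + B * years


def stochastic_income(years, rng, error_sd=3_000.0):
    """Equation 2: equation 1 plus a random error term C (assumed normal)."""
    c = rng.gauss(0.0, error_sd)
    return A + B * years + c


rng = random.Random(42)  # seeded so the sketch is repeatable

print([deterministic_income(12) for _ in range(3)])
draws = [stochastic_income(12, rng) for _ in range(3)]
print(draws)
```

The deterministic version returns 42,000 every time for a worker with 12 years of schooling, while the stochastic version returns a different value on each draw, scattered around 42,000.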

In the next post, we will discuss what makes a model statistical. In the meantime, we hope you found the current one useful.
