We can understand this concept with the help of an example from ex post facto studies related to cancer.

1. Does smoking cause cancer?

2. Can we predict the chances of developing cancer on the basis of how long one has been smoking?

To answer these questions we would need experimental data, but we cannot run such experiments on human subjects for ethical reasons. Instead, we use non-experimental data and try to understand the nature of the relationship between the presumed cause and effect, using the techniques of correlation and regression.

Suppose we find the following:

- Cases of cancer are more frequent among smokers.
- The longer people have been smoking, the greater the frequency of cancer among them.

If we plot the data with the frequency or duration of smoking along the X axis and the cases of cancer along the Y axis, we get a scatter plot. The points do not fall on a single straight line, so the relationship cannot be read off in a straightforward way.
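How closely such scattered points follow a straight line is usually summarised by a correlation coefficient. The sketch below computes Pearson's r by hand. The numbers are hypothetical and chosen only for illustration, and the names `years_smoked`, `cases_per_1000` and `pearson_r` are assumptions, not part of any real study.

```python
from math import sqrt

# Hypothetical illustrative data (NOT real measurements):
# years of smoking for each group, and cancer cases per 1,000 people in it.
years_smoked = [0, 5, 10, 15, 20, 25, 30]
cases_per_1000 = [2, 4, 5, 9, 10, 14, 15]


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(years_smoked, cases_per_1000)
print(f"Pearson r = {r:.3f}")
```

A value of r close to +1 would indicate that the points lie close to a rising straight line, even though they do not lie exactly on it.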

In this case, we draw the straight line that passes as close as possible to all the points in the scatter plot. It is called the best-fit line (or regression line). Some points lie above the line and some lie below it. The vertical distances between the points and the line are the residuals, and the best-fit line is the one that keeps these residuals as small as possible overall, usually by minimising the sum of their squares.
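The idea of a best-fit line and its residuals can be sketched in a few lines of code. This is a minimal illustration of ordinary least squares for one predictor, using the same kind of hypothetical data as before. The data and the helper name `fit_line` are assumptions made for the example.

```python
# Hypothetical illustrative data (NOT real measurements).
years_smoked = [0, 5, 10, 15, 20, 25, 30]
cases_per_1000 = [2, 4, 5, 9, 10, 14, 15]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = m*x + c; returns (m, c)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx            # slope of the best-fit line
    c = mean_y - m * mean_x  # intercept: value of y when x is zero
    return m, c


m, c = fit_line(years_smoked, cases_per_1000)

# Residual = observed value minus the value predicted by the line.
residuals = [y - (m * x + c) for x, y in zip(years_smoked, cases_per_1000)]

print(f"slope m = {m:.3f}, intercept c = {c:.3f}")
for x, res in zip(years_smoked, residuals):
    position = "above" if res > 0 else "below"
    print(f"x = {x:2d}: residual {res:+.3f} ({position} the line)")
```

With a least squares fit that includes an intercept, the positive and negative residuals cancel out, so their sum is zero up to rounding error.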

The straight line so drawn does not pass through the origin (0,0); it meets the Y axis at some point above it. This means that cases of cancer are present even when the frequency or duration of smoking is zero.

This indirectly suggests that factors other than smoking (perhaps genetics) also play a part in the cases of cancer.

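The relationship is represented by the equation

Y = mX + c

where Y is the cases of cancer, X is the frequency or duration of smoking, m is the slope of the line and c is the point at which the line meets the Y axis.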

Suppose the linear analysis of the data gives m = 3.

It would mean that the predicted value of Y (the cases of cancer) rises by three units for each additional year of smoking, on top of the baseline value c, with some unexplained variation around the line.
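A small worked example may make the role of m clearer. The slope m = 3 comes from the text above. The intercept c = 2 and the function name `predict_cases` are purely hypothetical, chosen only so that there is something to compute.

```python
M = 3  # slope, as supposed in the text
C = 2  # intercept: a hypothetical value, assumed only for this illustration


def predict_cases(years_of_smoking):
    """Predicted Y from the line Y = mX + c."""
    return M * years_of_smoking + C


baseline = predict_cases(0)        # prediction with no smoking equals c
after_10_years = predict_cases(10)
after_11_years = predict_cases(11)

print("Prediction at 0 years:", baseline)
print("Prediction at 10 years:", after_10_years)
print("Increase from one extra year:", after_11_years - after_10_years)
```

Whatever value c takes, the difference between the predictions for two consecutive years is always m.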

With this equation we can understand and predict the cases of cancer to some extent. A more accurate analysis is possible by asking more complex questions that involve more than one independent variable (multiple regression), with the help of statistical software such as SPSS.
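SPSS handles this through its menus, but the underlying idea can be sketched in plain code. The example below is only an illustration, not SPSS's own procedure: it fits a regression with two independent variables by solving the least squares normal equations. The second variable (a family-history indicator), the data, and all function names are hypothetical. The outcome is deliberately constructed from known coefficients (1.0, 0.5 and 2.0) so that the fit can be checked.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so the inputs are not modified
    aug = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    # Back substitution
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


def fit_multiple_regression(rows, ys):
    """Least squares coefficients [intercept, b1, b2, ...]."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical predictors: (years of smoking, family history 0 or 1).
predictors = [(0, 0), (5, 1), (10, 0), (15, 1), (20, 0), (25, 1), (30, 0)]
# Outcome built from known coefficients, with no noise, purely for checking.
outcome = [1.0 + 0.5 * years + 2.0 * family for years, family in predictors]

intercept, b_years, b_family = fit_multiple_regression(predictors, outcome)
print(f"intercept = {intercept:.3f}")
print(f"effect per year of smoking = {b_years:.3f}")
print(f"effect of family history = {b_family:.3f}")
```

Each coefficient describes the change in Y associated with one independent variable while the other is held constant, which is what makes the analysis more informative than a single straight line.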

In the simplest terms, regression analysis helps us estimate and understand the likelihood of a causal relationship from simple correlational data.

-Arun Kumar, mentor Beautiful Mind -IAS