Linear regression is one of the most common algorithms for establishing relationships between variables in a dataset. Mathematical models are the tools data scientists need to perform predictive analytics. In this blog, I’ll explain the basic concepts and give examples of linear regression.
What is a regression model?
Regression models describe the relationships between dataset variables by fitting a line to the observed data. This mathematical analysis identifies which variables have the most influence on an outcome and how confident we can be in those estimates. The two types of variables are:
- Dependent: the factor you are trying to predict or understand.
- Independent: the factors that may affect the dependent variable.
A regression model is used when the dependent variable is quantitative; in the case of logistic regression, it can be binary. This blog, however, focuses on linear regression models, where both variables are quantitative.
Suppose you have data on monthly sales and average monthly rainfall over the last three years. Let’s say you plot this information on a chart. The y-axis represents the number of sales (dependent variable) and the x-axis represents total rainfall. Each dot in the graph shows the amount of rain in a particular month and the corresponding number of sales.
If you look at the data, you may notice a pattern: sales tend to increase in months with heavy rainfall. Still, it can be difficult to estimate how much you typically sell when it rains a given amount, say 3-4 inches. You can gain some certainty by drawing a line through the middle of the data points on the graph.
Using statistical software such as Excel, SPSS, R, or STATA, you can draw the line that best fits your data and output the equation that describes its slope and intercept.
Consider the equation Y = 200 + 3X for the above example. It says that 200 units were sold when it didn’t rain at all (that is, when X = 0). Assuming the relationship holds going forward, every additional inch of rain sells an average of 3 more units: 203 units for 1 inch of rain, 206 units for 2 inches, 209 units for 3 inches, and so on.
Normally, the regression equation also includes an error term (Y = 200 + 3X + error term). It accounts for the reality that the independent variable is never a perfect predictor of the dependent variable, and that the line only provides an estimate based on the available data. The larger the error term, the less certain the regression line.
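The arithmetic in this example can be sketched in a few lines of plain Python (the equation and rainfall values come from the example above; the error term is omitted):

```python
# Predicted sales from the example equation Y = 200 + 3X,
# where X is rainfall in inches. The error term is omitted here.
def predict_sales(rainfall_inches):
    return 200 + 3 * rainfall_inches

for inches in [0, 1, 2, 3]:
    print(inches, predict_sales(inches))  # 200, 203, 206, 209 units
```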
Basics of linear regression
A simple linear regression model uses a straight line to estimate the relationship between two quantitative variables. If you have multiple independent variables, use multiple regression instead.
Simple linear regression analysis does two things. First, it shows the strength of the relationship between the dependent and independent variables in the historical data. Second, it estimates the value of the dependent variable at a particular value of the independent variable.
Consider an example of this. Social researchers who want to know how an individual’s income affects their level of well-being can perform a simple regression analysis to see whether a linear relationship exists. The researchers obtain quantitative values for the dependent variable (happiness) and the independent variable (income) by surveying people in a specific geographic location.
For example, the data includes income and well-being (ranked on a scale of 1 to 10) for 500 people in Maharashtra, India. The researcher then plots the data points and fits a regression line to see how the income of the respondents affects their well-being.
Linear regression analysis rests on some assumptions about the data. These are:
- Linearity of the relationship between the dependent and independent variables. In other words, the line of best fit is a straight line, not a curve.
- Homogeneity of variance: the size of the prediction error does not vary significantly across the values of the independent variable.
- Independence of observations in the dataset, meaning there are no hidden relationships among them.
- Normality of the distribution of the dependent variable, which you can check with R’s hist() function.
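Some of these assumptions can be checked programmatically. Below is a minimal sketch in Python, assuming numpy is available and using synthetic residuals in place of a real fitted model:

```python
# Rough checks of two regression assumptions on a model's residuals,
# using synthetic data for illustration (numpy assumed available).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
residuals = rng.normal(0, 1, 200)  # stand-in for (observed - predicted)

# Homogeneity of variance: residual spread should be similar
# across low and high values of the independent variable.
low, high = residuals[x < 5], residuals[x >= 5]
print(low.std(), high.std())  # the two spreads should be close

# Normality: a histogram of residuals should look roughly bell-shaped
# (the R equivalent is hist()).
counts, edges = np.histogram(residuals, bins=10)
print(counts)
```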
Mathematics behind linear regression
The standard equation is y = c + ax, where y is the output (the value we want to estimate), x is the input variable (the value we know), a is the slope of the line, and c is a constant.
Here, the output changes linearly with the input. The slope determines how much x affects the value of y, and the constant is the value of y when x is zero.
Let’s understand this through another example of linear regression. Suppose you work for an automobile company and want to investigate the passenger car market in India, and say gross domestic product (GDP) affects passenger car sales. To plan your business better, you may want a linear equation relating the number of vehicles sold domestically to GDP.
To do this, you need sample data on annual passenger car sales and annual GDP figures. You may find that this year’s GDP affects next year’s sales: in years when GDP fell, car sales declined the following year.
A little more work needs to be done to prepare this data for machine learning analysis.
- Start with the equation y = c + ax, where y is the number of vehicles sold in a year and x is the GDP of the previous year.
- Use Python to build a model that finds c and a for this problem.
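A minimal sketch of such a model in Python, assuming numpy is available and using made-up sales and GDP figures rather than real Indian market data:

```python
# Fitting y = c + a*x with Python (numpy). The GDP and sales figures
# below are hypothetical placeholders, not real market data.
import numpy as np

gdp = np.array([1.8, 2.0, 2.3, 2.6, 2.9])    # prior-year GDP, trillion USD (hypothetical)
sales = np.array([2.7, 3.0, 3.4, 3.8, 4.2])  # cars sold, millions (hypothetical)

a, c = np.polyfit(gdp, sales, 1)  # slope and intercept of the best-fit line
print(f"y = {c:.2f} + {a:.2f}x")
```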
Performing a simple linear regression in R makes the results much easier to interpret and report.
In the same linear regression example, let’s rewrite the equation as y = B0 + B1x + e. Again, y is the dependent variable and x is the independent or known variable. B0 is the constant or intercept, B1 is the regression coefficient or slope, and e is the estimation error.
Statistical software like R finds the best-fitting line for the data by searching for the B1 that minimizes the model’s overall error.
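“Minimizing the overall error” has a closed-form answer for simple linear regression, known as ordinary least squares. A small Python sketch, reusing the rainfall example’s numbers:

```python
# Ordinary least squares for a single predictor: B1 is the slope that
# minimizes the sum of squared errors, and B0 follows from the means.
def ols(x, y):
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    b1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / \
         sum((xi - x_mean) ** 2 for xi in x)
    b0 = y_mean - b1 * x_mean
    return b0, b1

b0, b1 = ols([0, 1, 2, 3], [200, 203, 206, 209])
print(b0, b1)  # recovers the earlier example line: 200.0 and 3.0
```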
Follow these steps to get started.
- Load the passenger car sales dataset into the R environment.
- Run the command to generate a linear model that represents the relationship between passenger car sales and GDP.
- sales.gdp.lm <- lm(sales ~ gdp, data = sales.data)
- Use the summary() function to display the most important linear model parameters in tabular form.
Note: The output includes sections such as Call, Residuals, and Coefficients. The Call section shows the formula used. Residuals gives the minimum, quartiles, median, and maximum of the residuals, showing how well the model fits the actual data. The first row of the Coefficients table estimates the y-intercept and the second row gives the regression coefficient. The columns of this table are labeled Estimate, Std. Error, t value, and Pr(>|t|) (the p-value).
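For readers who prefer Python, the main columns of that Coefficients table can be reproduced by hand. A sketch assuming numpy, with hypothetical sales and GDP figures:

```python
# Computing the key Coefficients-table columns (Estimate, Std. Error,
# t value) by hand, using made-up sales/GDP numbers for illustration.
import numpy as np

gdp = np.array([1.8, 2.0, 2.3, 2.6, 2.9])
sales = np.array([2.7, 3.0, 3.4, 3.8, 4.2])

n = len(gdp)
X = np.column_stack([np.ones(n), gdp])        # design matrix: intercept + GDP
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
residuals = sales - X @ beta
sigma2 = residuals @ residuals / (n - 2)      # residual variance, n-2 df
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_values = beta / se
# R's summary() would add Pr(>|t|) from the t distribution with n-2 df.
print(beta, se, t_values)
```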
- Plug the (Intercept) value into the regression equation to predict sales across the whole range of GDP values.
- Examine the Estimate column to see the effect: the regression coefficient shows how much sales change as GDP changes.
- Check the Std. Error column to see how much the estimated relationship between sales and GDP is expected to vary.
- Look at the test statistic in the t value column to judge whether the result occurred by chance. The higher the t-value, the less likely that is.
- Examine the Pr(>|t|) column, the p-value: the probability of seeing an effect at least as large as the estimated one if the null hypothesis of no effect were true.
- Present the results using the estimated effects, standard errors, and p-values to clearly convey the meaning of the regression coefficients.
- Include a graph in the report. A simple linear regression can be shown as a scatter plot with the regression line and its equation.
- Calculate the error by measuring the distance between each observed y value and the predicted y value, squaring that distance at each value of x, and averaging the squares.
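That error measure is the mean squared error (MSE). A minimal Python sketch with illustrative numbers:

```python
# Mean squared error: average the squared gaps between the observed
# and predicted y values.
def mean_squared_error(y_observed, y_predicted):
    return sum((yo - yp) ** 2
               for yo, yp in zip(y_observed, y_predicted)) / len(y_observed)

observed = [205, 210, 202, 214]
predicted = [203, 209, 206, 212]   # illustrative values only
print(mean_squared_error(observed, predicted))  # 6.25
```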
The linear regression example above outlined how to build a simple linear regression model, find the regression coefficients, and calculate the estimation error. It also touched on the relevance of Python and R in predictive data analysis and statistics. Practical knowledge of such tools is essential for today’s data science and machine learning careers.
If you want to hone your programming skills, check out the Advanced Machine Learning Certificate Program by IIT Madras and upGrad. The online course also includes case studies, projects, and expert mentorship sessions to bring an industry orientation to the training.