This document discusses correlation, regression, and the general linear model. It defines correlation as assessing the relationship between two variables, while regression describes how well one variable can predict another. Pearson's r standardizes the covariance between variables. Linear regression finds the best-fitting line that minimizes the residuals through the least squares method. The coefficient of determination, r², indicates how much variance in the dependent variable is explained by the independent variable. Multiple regression extends this to include multiple independent variables. The general linear model encompasses both linear regression and multiple regression.
This document discusses correlation and regression analysis. It defines correlation as assessing the relationship between two variables, while regression determines how well one variable can predict another. Correlation does not imply causation. Pearson's r standardizes the covariance between variables and ranges from -1 to 1, indicating the strength and direction of their linear relationship. Regression finds the best-fitting linear relationship through the least squares method to minimize residuals and predict one variable from another. It provides the slope and intercept of the regression line. The coefficient of determination, r-squared, indicates how well the regression model fits the data.
Correlation by Neeraj Bhandari ( Surkhet.Nepal )Neeraj Bhandari
The regression coefficients are 0.8 and 0.2.
The coefficient of correlation r is the geometric mean of the regression coefficients, which is:
√(0.8 × 0.2) = 0.4
Therefore, the value of the coefficient of correlation is 0.4.
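The geometric-mean calculation above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original deck: the function name is hypothetical, and it relies on the standard identity r² = b_yx × b_xy, with r taking the common sign of the two regression coefficients.

```python
import math


def correlation_from_regression_coefficients(b_yx: float, b_xy: float) -> float:
    """Return r from the slope of y on x (b_yx) and the slope of x on y (b_xy).

    Hypothetical helper for illustration only.
    """
    product = b_yx * b_xy
    if product < 0:
        # Both slopes always share the sign of the covariance.
        raise ValueError("the two regression coefficients must share a sign")
    # r^2 = b_yx * b_xy, and r takes the common sign of the coefficients.
    sign = -1.0 if b_yx < 0 else 1.0
    return sign * math.sqrt(product)


r = correlation_from_regression_coefficients(0.8, 0.2)
print(round(r, 4))  # 0.4
```

With two negative coefficients (for example -0.8 and -0.2) the same function returns -0.4, which the bare square root alone would miss.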
Correlation and regression are used to study relationships in bi-variate data. Correlation measures the degree of relatedness between two variables, while regression aims to predict a dependent variable based on independent variables. Pearson's correlation coefficient r measures the linear correlation between two continuous variables from -1 to +1. Linear regression models include simple linear regression with one independent variable and multiple linear regression with multiple independent variables. The method of least squares is used to estimate the regression coefficients by minimizing the sum of the squared errors between observed and predicted values of the dependent variable.
This document provides an overview of regression analysis. It defines regression as a statistical technique for finding the best-fitting straight line for a set of data. Regression allows predictions to be made based on correlations between two variables. The relationship between correlation and regression is examined, noting that correlation determines the relationship between variables while regression is used to make predictions. Various aspects of the linear regression equation are described, including computing predictions, graphing lines, and determining how well data fits the regression line.
The document discusses bivariate distribution and correlation. It defines bivariate distribution as a distribution where each individual or unit of a set assumes two values, relating to two different variables. Correlation analyzes the relationship between two variables in a bivariate distribution. There are different types of correlation like positive, negative, no correlation, perfect correlation and weak/strong correlation. The coefficient of correlation, calculated using Pearson's method, measures the degree of association between two related variables. Regression analysis involves predicting the value of one variable based on the known value of another variable if they are significantly correlated.
This document provides an overview of supervised learning techniques, focusing on different types of regression algorithms. It begins with an introduction to regression and discusses simple linear regression, multiple linear regression, and the assumptions of regression analysis. It then covers common regression algorithms like polynomial regression and logistic regression. Key concepts explained include the slope and intercept of linear regression lines, residual errors, and ways to improve regression accuracy like regularization and dimensionality reduction. Logistic regression is highlighted as preferable to linear regression for qualitative response variables with more than two levels.
Regression analysis is used to model relationships between variables. Simple linear regression involves modeling the relationship between a single independent variable and dependent variable. The regression equation estimates the dependent variable (y) as a linear function of the independent variable (x). The parameters β0 and β1 are estimated using the method of least squares. The coefficient of determination (r²) measures how well the regression line fits the data. Additional tests like the t-test, confidence intervals, and F-test are used to test if the independent variable significantly predicts the dependent variable. While these tests can indicate a statistically significant relationship, they do not prove causation.
This document discusses summarizing bivariate data using scatterplots and correlation. It provides an example of fare data from a bus company that is modeled using linear and nonlinear regression. Linear regression finds a strong positive correlation between distance and fare, but the relationship is better modeled nonlinearly using the logarithm of distance. The nonlinear model accounts for 96.9% of variation in fares compared to 84.9% for the linear model.
Regression and correlation analysis allow researchers to assess relationships between variables. Regression fits a line to two variables that minimizes the sum of squared errors, representing how well the independent variable predicts the dependent variable. Correlation assesses the strength and direction of association, ranging from -1 to 1. R-squared indicates the proportion of variance in the dependent variable explained by the independent variable.
This presentation covered the following topics:
1. Definition of Correlation and Regression
2. Meaning of Correlation and Regression
3. Types of Correlation and Regression
4. Karl Pearson's methods of correlation
5. Bivariate Grouped data method
6. Spearman's Rank correlation Method
7. Scattered diagram method
8. Interpretation of correlation coefficient
9. Lines of Regression
10. regression Equations
11. Difference between correlation and regression
12. Related examples
1. Regression analysis is a statistical technique used to model relationships between variables and make predictions. It can be used to describe relationships, estimate coefficients, make predictions, and control systems.
2. Linear regression models describe straight-line relationships between variables, while non-linear models describe curved relationships. The goodness of fit of a model can be evaluated using the coefficient of determination.
3. The least squares method is used to fit regression lines by minimizing the sum of the squared vertical distances between observed and estimated y-values for a regression of y on x, or minimizing the sum of squared horizontal distances for a regression of x on y.
This document discusses linear regression analysis. It defines simple and multiple linear regression, and explains that regression examines the relationship between independent and dependent variables. The document provides the equations for linear regression analysis, and discusses calculating the slope, intercept, standard error of the estimate, and coefficient of determination. It explains that regression analysis is widely used for prediction and forecasting in areas like advertising and product sales.
The document discusses various statistical concepts related to correlation and regression. It defines the coefficient of correlation as a measure of the strength and direction of the relationship between variables ranging from +1 to -1. A value close to 0 indicates no relationship, while values close to +1 or -1 indicate a strong positive or negative linear relationship, respectively. It also discusses the covariance and correlation of random variables, Pearson correlation coefficient, Spearman rank correlation, partial correlation coefficient, and multiple correlation coefficients. Finally, it provides a definition of regression as a technique to determine the mathematical relationship between two variables using a regression line equation.
Linear regression models the relationship between two variables, where one variable is considered the dependent variable and the other is the independent variable. The linear regression line minimizes the sum of the squared distances between the observed dependent variable values and the predicted dependent variable values. This line can be used to predict the dependent variable value based on new independent variable values. Multiple linear regression extends this to model the relationship between a dependent variable and two or more independent variables. Other types of regression models include nonlinear, generalized linear, and exponential regression.
The document presents a regression analysis on the relationship between driving experience (the independent variable X) and the number of road accidents (the dependent variable Y). It finds the regression line to be Y = 76.66 - 1.5476X, indicating a negative relationship between accidents and experience. Using this line, it estimates the number of accidents would be 61.184 for 10 years experience and 30.232 for 30 years experience. It also calculates the coefficient of determination R² = 0.5894, meaning driving experience explains around 59% of the variance in road accidents.
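The two predictions quoted above follow directly from plugging X into the fitted line. A minimal sketch, assuming only the coefficients stated in the summary (the function name is hypothetical):

```python
def predict_accidents(years_experience: float) -> float:
    """Evaluate the quoted regression line Y = 76.66 - 1.5476 X."""
    intercept = 76.66
    slope = -1.5476
    return intercept + slope * years_experience


for years in (10, 30):
    # Rounded to three decimals to match the precision quoted in the summary.
    print(years, round(predict_accidents(years), 3))
```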
Regression.ppt basic introduction of regression with exampleshivshankarshiva98
Regression analysis attempts to explain variation in a dependent variable using independent variables. Simple linear regression fits a straight line to the data using an equation of y = b0 + b1x + ε. The coefficient of determination R² indicates how well the regression line represents the data, ranging from 0 to 1. Multiple linear regression generalizes this to use more than one independent variable to explain the dependent variable.
This document provides an introduction to regression and correlation analysis. It discusses simple and multiple linear regression models, how to interpret regression coefficients, and how to check the assumptions and adequacy of regression models. Key aspects covered include computing the regression line using the least squares method, interpreting the slope and intercept, checking the normality of residuals, and examining residual plots to validate the model. The goal of regression analysis is to model the relationship between a dependent variable and one or more independent variables.
This document discusses relationships between variables in experiments. It defines two types of relationships: functional and statistical. A functional relationship is a perfect mathematical relationship where each value of the independent variable corresponds to a single, unique value of the dependent variable. A statistical relationship is imperfect, with a range of possible dependent variable values for each independent variable value. The document also discusses simple linear regression analysis, how to estimate regression coefficients, and how to interpret them to understand the relationship between variables.
The document discusses simple linear regression and correlation. It explains how to calculate the slope and intercept of a regression line by using a scatterplot of two variables to visualize their relationship. It then shows how to compute Pearson's correlation coefficient r to quantify the strength of the linear relationship, with r closer to 1 indicating a stronger correlation. The example computes the slope, intercept, r, and tests if the correlation is statistically significant for a sample dataset about soda consumption and bathroom trips.
This document provides an overview of regression analysis and two-way tables. It defines key concepts such as regression lines, correlation, residuals, and marginal and conditional distributions. Regression finds the linear relationship between two variables to make predictions. The least squares regression line minimizes the sum of the squared vertical distances between the data points and the line. Correlation and the coefficient of determination r² measure how well the regression line fits the data. Two-way tables summarize the relationship between two categorical variables through marginal and conditional distributions.
The document discusses various statistical techniques for analyzing the relationship between two variables, including scatter plots, covariance, correlation coefficients, linear regression, and curvilinear regression. It provides formulas and assumptions for each method, and explains how to interpret the results to determine if variables are related and the strength and direction of their relationship.
CORRELATION-AND-REGRESSION.pdf for human resourceSharon517605
This document discusses correlation and regression analysis. It defines correlation as measuring the relationship between two quantitative variables. There are two main correlation coefficients - Pearson's r which measures the strength of a linear relationship between two variables, and Spearman's Rho which measures the monotonic relationship between two ranked variables. The document also discusses scatter plots/diagrams which can help visualize the relationship between two variables, and defines different types of correlations such as positive, negative, simple, partial and multiple correlations. It provides examples of how to calculate Pearson's r correlation coefficient and how to interpret the resulting value.
This document provides an overview of statistical concepts for analyzing experimental data, including z-tests, t-tests, and ANOVAs. It discusses developing experimental hypotheses and distinguishing between null and alternative hypotheses. Key concepts explained include p-values, type I and type II errors, and determining statistical significance. Examples are given of applying a t-test and ANOVA to compare brain volume changes before and after childbirth. Limitations of statistical analyses with respect to including entire populations are also noted.
Linear regression and correlation analysis ppt @ bec domsBabasab Patil
This document introduces linear regression and correlation analysis. It discusses calculating and interpreting the correlation coefficient and linear regression equation to determine the relationship between two variables. It covers scatter plots, the assumptions of regression analysis, and using regression to predict and describe relationships in data. Key terms introduced include the correlation coefficient, linear regression model, explained and unexplained variation, and the coefficient of determination.
2. Topics Covered:
Is there a relationship between x and y?
What is the strength of this relationship?
  Pearson's r
Can we describe this relationship and use it to predict y from x?
  Regression
Is the relationship we have described statistically significant?
  t test
Relevance to SPM
  GLM
3. The relationship between x and y
Correlation: is there a relationship between 2 variables?
Regression: how well does a certain independent variable predict the dependent variable?
CORRELATION ≠ CAUSATION
In order to infer causality: manipulate the independent variable and observe the effect on the dependent variable.
5. Variance vs Covariance
First, a note on your sample:
If you’re wishing to assume that your sample is
representative of the general population (RANDOM
EFFECTS MODEL), use the degrees of freedom (n – 1)
in your calculations of variance or covariance.
But if you’re simply wanting to assess your current
sample (FIXED EFFECTS MODEL), substitute n for
the degrees of freedom.
6. Variance vs Covariance
Do two variables change together?
Covariance:
$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
• Gives information on the degree to which two variables vary together.
• Note how similar the covariance is to variance: the equation simply multiplies x's error scores by y's error scores as opposed to squaring x's error scores.
Variance:
$$S_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
• Gives information on variability of a single variable.
7. Covariance
When X and Y increase together: cov(x, y) = pos.
When X increases as Y decreases: cov(x, y) = neg.
When no constant relationship: cov(x, y) = 0
$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
8. Example Covariance
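x    y    x_i - x̄    y_i - ȳ    (x_i - x̄)(y_i - ȳ)
0    3    -3          0          0
2    2    -1         -1          1
3    4     0          1          0
4    0     1         -3         -3
6    6     3          3          9
x̄ = 3, ȳ = 3, Σ (x_i - x̄)(y_i - ȳ) = 7
$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{7}{4} = 1.75$$
What does this number tell us?
[Scatter plot of the five (x, y) points; both axes run from 0 to 7]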
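The worked covariance example can be checked with a short Python sketch. The function names and the `sample` flag are illustrative choices, not from the slides; the flag switches between the n - 1 denominator and the n denominator described in the earlier note on samples.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def covariance(xs, ys, sample=True):
    """Covariance of paired values.

    Divides by n - 1 when sample=True, otherwise by n.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of the products of x's error scores and y's error scores.
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / (len(xs) - 1 if sample else len(xs))


x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]

print(covariance(x, y))                # 1.75
print(covariance(x, y, sample=False))  # 1.4
print(covariance(x, x))                # 5.0, the variance of x
```

Calling `covariance(x, x)` gives the variance of x, which mirrors the slide's remark that variance is covariance with the error scores squared rather than cross-multiplied.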
9. Problem with Covariance:
The value obtained by covariance is dependent on the size of
the data’s standard deviations: if large, the value will be
greater than if small… even if the relationship between x and y
is exactly the same in the large versus small standard
deviation datasets.
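10. Example of how covariance value relies on variance

| Subject | High variance: x | y | x error × y error | Low variance: x | y | x error × y error |
|---------|------------------|---|-------------------|-----------------|---|-------------------|
| 1 | 101 | 100 | 2500 | 54 | 53 | 9 |
| 2 | 81 | 80 | 900 | 53 | 52 | 4 |
| 3 | 61 | 60 | 100 | 52 | 51 | 1 |
| 4 | 51 | 50 | 0 | 51 | 50 | 0 |
| 5 | 41 | 40 | 100 | 50 | 49 | 1 |
| 6 | 21 | 20 | 900 | 49 | 48 | 4 |
| 7 | 1 | 0 | 2500 | 48 | 47 | 9 |
| Mean | 51 | 50 | | 51 | 50 | |

Sum of x error × y error: 7000 (high variance data), 28 (low variance data)
Covariance: 1166.67 (high variance data), 4.67 (low variance data)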
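As a rough check of the table, the sketch below recomputes both covariances and also the standardised value (Pearson's r, introduced on the next slide). The helper names are illustrative assumptions; the data are copied from the table.

```python
import math


def covariance(x, y):
    """Sample covariance with the (n - 1) denominator."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)


def pearson_r(x, y):
    """Covariance divided by the product of the two standard deviations."""
    s_x = math.sqrt(covariance(x, x))
    s_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (s_x * s_y)


high_x = [101, 81, 61, 51, 41, 21, 1]
high_y = [100, 80, 60, 50, 40, 20, 0]
low_x = [54, 53, 52, 51, 50, 49, 48]
low_y = [53, 52, 51, 50, 49, 48, 47]

cov_high = covariance(high_x, high_y)
cov_low = covariance(low_x, low_y)
r_high = pearson_r(high_x, high_y)
r_low = pearson_r(low_x, low_y)

print(round(cov_high, 2), round(cov_low, 2))  # 1166.67 4.67
print(round(r_high, 6), round(r_low, 6))      # 1.0 1.0
```

In both data sets y is exactly x - 1, so the relationship is identical; only the spread of the data differs, and only the covariance reacts to it.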
11. Solution: Pearson’s r
Covariance on its own does not tell us how strong the relationship is
Solution: standardise this measure
Pearson’s r: standardises the covariance value.
Divides the covariance by the product of the standard deviations of X and Y:
$$r_{xy} = \frac{\operatorname{cov}(x,y)}{s_x s_y}$$
12. Pearson’s R continued
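$$\operatorname{cov}(x,y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y}$$
$$r_{xy} = \frac{\sum_{i=1}^{n} Z_{x_i} Z_{y_i}}{n - 1}$$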
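The two forms of r (covariance over the product of standard deviations, and the averaged product of z-scores) should agree. A small sketch, assuming the five data points from the covariance example on slide 8 and illustrative function names:

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Standard deviation with the (n - 1) denominator."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def pearson_r_from_cov(x, y):
    """r = cov(x, y) / (s_x * s_y)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    return cov / (sample_sd(x) * sample_sd(y))


def pearson_r_from_z(x, y):
    """r = sum of products of z-scores, divided by (n - 1)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sx, sy = sample_sd(x), sample_sd(y)
    zx = [(xi - mx) / sx for xi in x]
    zy = [(yi - my) / sy for yi in y]
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)


x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]
r_cov = pearson_r_from_cov(x, y)
r_z = pearson_r_from_z(x, y)
print(round(r_cov, 4), round(r_z, 4))  # 0.35 0.35
```

Here cov(x, y) = 1.75 and both variances are 5, so r = 1.75 / 5 = 0.35.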
13. Limitations of r
When r = 1 or r = -1:
We can predict y from x with certainty
all data points are on a straight line: y = ax + b
r is actually r̂:
r = true r of whole population
r̂ = estimate of r based on data
r is very sensitive to extreme values:
[Figure: scatter plot, axes 0 to 5 and 0 to 6]
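To illustrate the sensitivity to extreme values, here is a sketch with made-up numbers (the data and the single extreme point are hypothetical, not taken from the slide's figure):

```python
import math


def pearson_r(x, y):
    """Pearson's r; the (n - 1) factors cancel, so raw sums are enough."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical cluster with almost no linear relationship
x = [1, 2, 3, 4, 5]
y = [3, 1, 4, 2, 3]
r_without = pearson_r(x, y)

# The same cluster plus one extreme point at (15, 15)
r_with = pearson_r(x + [15], y + [15])

print(round(r_without, 2), round(r_with, 2))  # 0.14 0.95
```

A single point far from the rest is enough to move the estimate from near zero to near one, which is why it is worth plotting the data before trusting r.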
14. Regression
Correlation tells you if there is an association
between x and y but it doesn’t describe the
relationship or allow you to predict one
variable from the other.
To do this we need REGRESSION!
15. Best-fit Line
Aim of linear regression is to fit a straight line, ŷ = ax + b, to data that gives best prediction of y for any value of x
This will be the line that minimises distance between data and fitted line, i.e. the residuals
[Figure: scatter plot with the fitted line ŷ = ax + b, labelling the slope, the intercept, a true value yi, its predicted value ŷ, and the residual error ε between them]
16. Least Squares Regression
To find the best line we must minimise the sum of
the squares of the residuals (the vertical distances
from the data points to our line)
Residual (ε) = y - ŷ
Sum of squares of residuals = Σ (y – ŷ)²
Model line: ŷ = ax + b
we must find values of a and b that minimise
Σ (y – ŷ)²
a = slope, b = intercept
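A sketch of the quantity being minimised, assuming the five data points from the covariance example on slide 8. The function name and the candidate lines are illustrative; the last candidate (a = 0.35, b = 1.95) is the least-squares line for these particular data.

```python
x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]


def sum_squared_residuals(a, b):
    """Sum over all points of (y - y_hat)^2 for the line y_hat = a*x + b."""
    return sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))


ssr_identity = sum_squared_residuals(1.0, 0.0)    # line y_hat = x
ssr_flat = sum_squared_residuals(0.0, 3.0)        # flat line at the mean of y
ssr_best = sum_squared_residuals(0.35, 1.95)      # least-squares line

print(round(ssr_identity, 2), round(ssr_flat, 2), round(ssr_best, 2))  # 26.0 20.0 17.55
```

Nudging either a or b away from the least-squares values makes the sum larger, which is the sense in which that line is the "best fit".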
17. Finding b
First we find the value of b that gives the min
sum of squares
[Figure: candidate lines with different intercepts b and their residuals ε]
Trying different values of b is equivalent to
shifting the line up and down the scatter plot
18. Finding a
Now we find the value of a that gives the min
sum of squares
[Figure: candidate lines with different slopes a and the same intercept b]
Trying out different values of a is equivalent to
changing the slope of the line, while b stays
constant
19. Minimising sums of squares
Need to minimise Σ(y – ŷ)²
ŷ = ax + b
so need to minimise:
Σ(y – ax – b)²
If we plot the sums of squares for all different values of a and b we get a parabola, because it is a squared term
So the min sum of squares is at the bottom of the curve, where the gradient is zero.
[Figure: sums of squares (S) plotted against values of a and b; min S is at the bottom of the curve, where gradient = 0]
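The "try different values and look for the bottom of the curve" idea can be sketched as a brute-force grid search. This is only an illustration, assuming the slide 8 data and an arbitrary grid of candidate values; it is not how the fit is normally computed.

```python
x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]


def sum_of_squares(a, b):
    """S(a, b) = sum of (y - a*x - b)^2 over all data points."""
    return sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))


step = 0.05  # grid spacing (an arbitrary choice)
candidates = [
    (sum_of_squares(i * step, j * step), i * step, j * step)
    for i in range(-20, 41)  # slopes a from -1.0 to 2.0
    for j in range(0, 81)    # intercepts b from 0.0 to 4.0
]

# Tuples compare on their first element, so min() picks the smallest S.
min_s, best_a, best_b = min(candidates)
print(round(best_a, 2), round(best_b, 2), round(min_s, 2))  # 0.35 1.95 17.55
```

The search lands on a = 0.35 and b = 1.95, the same values that setting the gradient to zero gives analytically.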
20. The maths bit
The min sum of squares is at the bottom of the curve
where the gradient = 0
So we can find a and b that give min sum of squares
by taking partial derivatives of Σ(y – ax – b)² with
respect to a and b separately
Then we set these derivatives to 0 and solve to give us the values of a
and b that give the min sum of squares
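One way to write out the steps just described, using the same symbols as the slides (a sketch of the standard derivation, with S denoting the sum of squares):

```latex
\begin{aligned}
S(a,b) &= \sum_{i=1}^{n} (y_i - a x_i - b)^2 \\
\frac{\partial S}{\partial b} &= -2 \sum_{i=1}^{n} (y_i - a x_i - b) = 0
  \;\Rightarrow\; b = \bar{y} - a\bar{x} \\
\frac{\partial S}{\partial a} &= -2 \sum_{i=1}^{n} x_i\,(y_i - a x_i - b) = 0 \\
\text{substituting } b:\quad
a &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
   = \frac{\operatorname{cov}(x,y)}{s_x^2}
   = \frac{r\, s_y}{s_x}
\end{aligned}
```

The last equality uses cov(x, y) = r s_x s_y, which is the definition of Pearson's r rearranged.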
21. The solution
Doing this gives the following equation for a:
$$a = \frac{r\, s_y}{s_x}$$
r = correlation coefficient of x and y
sy = standard deviation of y
sx = standard deviation of x
From this you can see that:
A low correlation coefficient gives a flatter slope (small value of a)
Large spread of y, i.e. high standard deviation, results in a steeper slope (high value of a)
Large spread of x, i.e. high standard deviation, results in a flatter slope (low value of a)
22. The solution cont.
Our model equation is ŷ = ax + b
This line must pass through the means (x̄, ȳ), so:
ȳ = ax̄ + b, which gives b = ȳ – ax̄
We can put our equation for a into this giving:
$$b = \bar{y} - \frac{r\, s_y}{s_x}\,\bar{x}$$
r = correlation coefficient of x and y
sy = standard deviation of y
sx = standard deviation of x
The smaller the correlation, the closer the
intercept is to the mean of y
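A short sketch applying the two formulas, a = r sy / sx and b = ȳ − a x̄, to the five data points from the covariance example on slide 8 (variable names are illustrative):

```python
import math

x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample standard deviations and covariance, all with the (n - 1) denominator
s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

r = cov_xy / (s_x * s_y)
a = r * s_y / s_x        # slope
b = mean_y - a * mean_x  # intercept

print(round(a, 4), round(b, 4))  # 0.35 1.95
```

Plugging x̄ = 3 into ŷ = 0.35x + 1.95 returns ȳ = 3, so the fitted line does pass through the point of means.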
23. Back to the model
$$\hat{y} = ax + b = \frac{r\, s_y}{s_x}\,x + \bar{y} - \frac{r\, s_y}{s_x}\,\bar{x}$$
Rearranges to:
$$\hat{y} = \frac{r\, s_y}{s_x}\,(x - \bar{x}) + \bar{y}$$
If the correlation is zero, we will simply predict the mean of y for every
value of x, and our regression line is just a flat straight line crossing the
y-axis at ȳ
But this isn’t very useful.
We can calculate the regression line for any data, but the important
question is how well does this line fit the data, or how good is it at
predicting y from x
24. How good is our model?
Total variance of y:
$$s_y^2 = \frac{\sum (y - \bar{y})^2}{n - 1} = \frac{SS_y}{df_y}$$
Variance of predicted y values (ŷ):
$$s_{\hat{y}}^2 = \frac{\sum (\hat{y} - \bar{y})^2}{n - 1} = \frac{SS_{pred}}{df_{\hat{y}}}$$
This is the variance explained by our regression model
Error variance:
$$s_{error}^2 = \frac{\sum (y - \hat{y})^2}{n - 2} = \frac{SS_{er}}{df_{er}}$$
This is the variance of the error between our predicted y values and
the actual y values, and thus is the variance in y that is NOT explained
by the regression model
25. Total variance = predicted variance + error variance
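$$s_y^2 = s_{\hat{y}}^2 + s_{er}^2$$
(This split is exact for the sums of squares, SSy = SSpred + SSer; for the variances it is exact only when all three use the same denominator.)
Conveniently, via some complicated rearranging
$$s_{\hat{y}}^2 = r^2 s_y^2$$
$$r^2 = s_{\hat{y}}^2 / s_y^2$$
so r² is the proportion of the variance in y that is explained by
our regression model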
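The decomposition can be checked numerically. The sketch below assumes the slide 8 data and fits the line with the closed-form least-squares slope; it works with sums of squares, where the split is exact.

```python
x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
a = s_xy / s_xx
b = mean_y - a * mean_x
y_hat = [a * xi + b for xi in x]

ss_y = sum((yi - mean_y) ** 2 for yi in y)              # total
ss_pred = sum((yh - mean_y) ** 2 for yh in y_hat)       # explained by the model
ss_er = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) # unexplained (error)

r_squared = ss_pred / ss_y

print(round(ss_y, 4), round(ss_pred, 4), round(ss_er, 4))  # 20.0 2.45 17.55
print(round(r_squared, 4))                                 # 0.1225
```

With r = 0.35 for these data, r² = 0.1225, so the line explains about 12% of the variation in y.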
26. How good is our model cont.
Insert r² sy² into sy² = sŷ² + ser² and rearrange to get:
$$s_{er}^2 = s_y^2 - r^2 s_y^2 = s_y^2\,(1 - r^2)$$
From this we can see that the greater the correlation
the smaller the error variance, so the better our
prediction
27. Is the model significant?
i.e. do we get a significantly better prediction of y
from our regression equation than by just predicting
the mean?
F-statistic:
$$F_{(df_{\hat{y}},\, df_{er})} = \frac{s_{\hat{y}}^2}{s_{er}^2} = \dots = \frac{r^2\,(n - 2)}{1 - r^2}$$
(via some complicated rearranging)
And it follows that:
$$t_{(n-2)} = \frac{r\,\sqrt{n - 2}}{\sqrt{1 - r^2}}$$
(because F = t²)
So all we need to
know are r and n
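A sketch of the two statistics as functions of r and n only. It assumes a single predictor (so the numerator of F has 1 degree of freedom and the error term has n - 2); the function names are illustrative, and turning the statistic into a p-value would additionally need the t or F distribution, which is not shown here.

```python
import math


def f_statistic(r, n):
    """F(1, n - 2) = r^2 (n - 2) / (1 - r^2) for simple linear regression."""
    return (r ** 2) * (n - 2) / (1 - r ** 2)


def t_statistic(r, n):
    """t(n - 2) = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# r and n from the five-point covariance example on slide 8
r, n = 0.35, 5
f_value = f_statistic(r, n)
t_value = t_statistic(r, n)

print(round(f_value, 4))            # 0.4188
print(round(t_value ** 2, 4))       # 0.4188, since F = t^2
```

With only five points and r = 0.35 the statistic is small, which matches the intuition that such a weak correlation in so few observations is unconvincing.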
28. General Linear Model
Linear regression is actually a form of the
General Linear Model where the parameters
are a, the slope of the line, and b, the intercept.
y = ax + b +ε
A General Linear Model is just any model that
describes the data as a linear combination of parameters plus error; a straight line is the simplest case
29. Multiple regression
Multiple regression is used to determine the effect of a number
of independent variables, x1, x2, x3 etc, on a single dependent
variable, y
The different x variables are combined in a linear way and
each has its own regression coefficient:
$$y = a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b + \varepsilon$$
The a parameters reflect the independent contribution of each
independent variable, x, to the value of the dependent variable,
y.
i.e. the amount of variance in y that is accounted for by each x
variable after all the other x variables have been accounted for
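A minimal sketch of fitting such a model by least squares in plain Python, by solving the normal equations (XᵀX)β = Xᵀy. The function names and the data are hypothetical: y is generated exactly as 2·x1 + 3·x2 + 1 with no error term, so the fit should recover those coefficients. In practice a numerical library would normally be used instead.

```python
def solve_linear_system(A, rhs):
    """Solve A @ beta = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix [A | rhs], copied so the inputs are left untouched.
    M = [list(map(float, A[i])) + [float(rhs[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system (collinear predictors?)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [mr - factor * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_multiple_regression(columns, y):
    """Least-squares fit of y = a1*x1 + ... + ak*xk + b; returns ([a1..ak], b)."""
    n = len(y)
    # Design matrix rows [x1_i, ..., xk_i, 1]; the trailing 1 carries the intercept b.
    X = [[col[i] for col in columns] + [1.0] for i in range(n)]
    k = len(X[0])
    XtX = [[sum(row[p] * row[q] for row in X) for q in range(k)] for p in range(k)]
    Xty = [sum(row[p] * yi for row, yi in zip(X, y)) for p in range(k)]
    beta = solve_linear_system(XtX, Xty)
    return beta[:-1], beta[-1]


# Hypothetical data: y = 2*x1 + 3*x2 + 1 exactly (no error term)
x1 = [0, 1, 2, 3, 4]
x2 = [1, 0, 2, 1, 3]
y = [4, 3, 11, 10, 18]

slopes, intercept = fit_multiple_regression([x1, x2], y)
print([round(s, 6) for s in slopes], round(intercept, 6))  # [2.0, 3.0] 1.0
```

Each recovered coefficient is the contribution of its own x variable with the other held fixed, which is the "after all the other x variables have been accounted for" reading given above.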
30. SPM
Linear regression is a GLM that models the effect of one
independent variable, x, on ONE dependent variable, y
Multiple Regression models the effect of several independent
variables, x1, x2 etc, on ONE dependent variable, y
Both are types of General Linear Model
GLM can also allow you to analyse the effects of several
independent x variables on several dependent variables, y1, y2,
y3 etc, in a linear combination
This is what SPM does and all will be explained next week!