Linear Regression Calculator
Find the best-fit line (y = mx + b) using least squares regression.
What Is the Linear Regression Calculator?
Linear regression finds the best-fit straight line through a set of data points by minimizing the sum of squared residuals (vertical distances from each point to the line). It quantifies the strength of the linear relationship and enables predictions at new x values with confidence intervals.
- Least squares line: computes slope m and intercept b that minimize Σ(yᵢ − ŷᵢ)².
- R² and correlation r: R² tells you what fraction of variance in y is explained by x; r indicates direction and strength.
- Residuals table: shows each data point, predicted value ŷᵢ, and residual eᵢ = yᵢ − ŷᵢ.
- Prediction: enter a new x value to get the predicted y with a 95% confidence interval.
- Outlier detection: highlights points with large standardized residuals (>2 standard deviations).
- Statistics panel: shows mean, standard deviation of x and y, sample size, and standard error of regression.
Formula
| Quantity | Formula | Meaning |
|---|---|---|
| Slope m | m = Σ(xᵢ−x̄)(yᵢ−ȳ) / Σ(xᵢ−x̄)² | Rate of change of y per unit x |
| Intercept b | b = ȳ − m·x̄ | Predicted y when x = 0 |
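| Correlation r | r = Σ(xᵢ−x̄)(yᵢ−ȳ) / (n·σₓ·σᵧ) | −1 to +1; strength and direction of linear relation (σₓ, σᵧ are population standard deviations; use n−1 in place of n with sample standard deviations) |
| R² (R-squared) | R² = r² | Fraction of variance in y explained by x |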
| Residual eᵢ | eᵢ = yᵢ − ŷᵢ | Difference between actual and predicted |
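The formulas in the table can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual implementation; the function name `fit_line` and the sample data are assumptions made for the example.

```python
from math import sqrt

def fit_line(xs, ys):
    """Return slope m, intercept b, correlation r, and R^2 for paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)          # Σ(xᵢ−x̄)²
    syy = sum((y - y_bar) ** 2 for y in ys)          # Σ(yᵢ−ȳ)²
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    m = sxy / sxx
    b = y_bar - m * x_bar
    # n·σₓ·σᵧ with population SDs simplifies to sqrt(sxx·syy)
    r = sxy / sqrt(sxx * syy) if syy > 0 else float("nan")
    return m, b, r, r * r

# Illustrative data lying exactly on y = 2x + 1
m, b, r, r2 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(m, b, r, r2)
```

Because the sample points lie exactly on a line, the fit recovers slope 2 and intercept 1 with r = 1.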
How to Use
1. Enter your data points as comma-separated pairs or paste a table. Each row is one observation (x value, y value).
2. Press Calculate (or Enter) to see slope m, intercept b, equation of the regression line, R², correlation r, and residuals.
3. Enter an x value in the "Predict y" field to get a predicted y and 95% confidence interval.
4. Toggle "Show residuals table" to see the full table of actual vs predicted values with residuals.
5. Toggle "Show statistics" to see descriptive statistics for both variables (mean, SD, range).
6. Use "Load sample data" buttons (height/weight, study hours/GPA, year/temperature) to explore with real-looking data.
7. Click Reset (or Escape) to clear all data and results.
Example Calculation
Example: Study Hours vs Test Score
Data: (2,55), (3,65), (4,72), (5,78), (6,85), (7,90)
Residuals
| x | y (actual) | ŷ (predicted) | Residual |
|---|---|---|---|
| 2 | 55 | 56.95 | −1.95 |
| 3 | 65 | 63.84 | 1.16 |
| 4 | 72 | 70.72 | 1.28 |
| 5 | 78 | 77.61 | 0.39 |
| 6 | 85 | 84.50 | 0.50 |
| 7 | 90 | 91.38 | −1.38 |
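As a check, the least squares fit for this data can be recomputed directly from the slope and intercept formulas. The sketch below is only an illustration (the variable names are assumptions); for these six points it gives a slope of about 6.886, an intercept of about 43.181, and R² of about 0.989, and the residuals sum to zero as they must for a least squares line with an intercept.

```python
# Data from the worked example: (study hours, test score)
data = [(2, 55), (3, 65), (4, 72), (5, 78), (6, 85), (7, 90)]

n = len(data)
x_bar = sum(x for x, _ in data) / n
y_bar = sum(y for _, y in data) / n
sxx = sum((x - x_bar) ** 2 for x, _ in data)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in data)

m = sxy / sxx            # slope
b = y_bar - m * x_bar    # intercept
print(f"y = {m:.4f}x + {b:.4f}")

residuals = []
for x, y in data:
    y_hat = m * x + b
    residuals.append(y - y_hat)
    print(f"x={x}  y={y}  y_hat={y_hat:.2f}  residual={y - y_hat:+.2f}")

sse = sum(e ** 2 for e in residuals)              # unexplained variation
sst = sum((y - y_bar) ** 2 for _, y in data)      # total variation in y
r2 = 1 - sse / sst
print(f"R^2 = {r2:.4f}")
```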
Understanding Linear Regression
What Is Linear Regression?
Linear regression is one of the most widely used statistical techniques for modeling the relationship between a continuous outcome variable (y) and one or more predictor variables (x). Simple linear regression uses one predictor; multiple linear regression uses several. The "linear" refers to linearity in the parameters (m and b), not necessarily to the shape of the relationship: a model such as y = b + m·x² is still linear in this sense.
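The least squares principle (minimize the sum of squared vertical distances from each data point to the regression line) has a unique closed-form solution whenever the x values are not all identical. This solution is the BLUE (Best Linear Unbiased Estimator) under the Gauss-Markov conditions: linearity, errors with zero mean at every x, uncorrelated errors, and equal error variance. Normality of the errors is not one of these conditions; it is an additional assumption that makes t-tests and confidence intervals exact in small samples.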
Interpreting R² and Correlation r
- r (Pearson correlation): measures the strength and direction of the linear relationship. r = +1 means perfect positive correlation; r = −1 means perfect negative; r = 0 means no linear association.
- R² (coefficient of determination): r², the proportion of variance in y explained by x. R² = 0.85 means 85% of variability in y is accounted for by x; the remaining 15% is unexplained (due to other factors or random noise).
- Interpretation thresholds (rough guidelines): R² < 0.3 = weak; 0.3–0.7 = moderate; > 0.7 = strong. These vary enormously by field: 0.6 is excellent in social science but weak in physics.
- Important caveat: high R² does not imply causation, and it does not mean the model is well-specified. Always check the residual plot for patterns.
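The caveat about high R² can be illustrated with a small sketch. The data below is synthetic (y = x², chosen only for this illustration): a straight line fits it with R² above 0.95, yet the residual signs form a clear curve (positive, then negative, then positive), which signals a misspecified model.

```python
# Synthetic, deliberately non-linear data: y = x^2
xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
m = sxy / sxx
b = y_bar - m * x_bar

residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
sse = sum(e ** 2 for e in residuals)        # unexplained variation
sst = sum((y - y_bar) ** 2 for y in ys)     # total variation
r2 = 1 - sse / sst                          # equals r^2 in simple regression

signs = ["+" if e > 0 else "-" for e in residuals]
print(f"R^2 = {r2:.3f}")
print("residual signs:", " ".join(signs))
```

A run of same-signed residuals in the middle with opposite signs at both ends is the "curved pattern" to look for in a residual plot.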
Residual Analysis
Residuals (eᵢ = yᵢ − ŷᵢ) are the leftovers after the regression line explains what it can. Ideal residuals look random, with no visible pattern. Patterns in residuals reveal problems:
- Funnel shape (heteroscedasticity): variance increases with x, which violates the equal-variance assumption; a log transform often helps.
- Curved pattern: the true relationship is non-linear; consider adding an x² term or using a different model form.
- Points far from zero: potential outliers. Standardized residuals beyond ±2 warrant investigation; beyond ±3 are very unusual.
- Serial correlation: in time series, consecutive residuals tend to share the same sign, which violates independence; the Durbin-Watson test detects this.
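Two of these checks are easy to sketch in code. The example below is a simplified illustration under stated assumptions: residuals are standardized by dividing by the standard error of the regression (ignoring leverage, which full statistical packages adjust for), and the Durbin-Watson statistic is computed as Σ(eₜ − eₜ₋₁)² / Σeₜ². The function names and the residual values are made up for the example.

```python
from math import sqrt

def standardized_residuals(residuals, n_params=2):
    """Divide each residual by the standard error of the regression."""
    n = len(residuals)
    see = sqrt(sum(e ** 2 for e in residuals) / (n - n_params))
    return [e / see for e in residuals]

def durbin_watson(residuals):
    """Values near 2 suggest no serial correlation; near 0, positive correlation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Illustrative residuals (they sum to zero, as least squares residuals do)
res = [-0.5, -0.4, -0.3, -0.6, -0.2, 4.0, -0.3, -0.4, -0.5, -0.8]

z = standardized_residuals(res)
flagged = [i for i, v in enumerate(z) if abs(v) > 2]
dw = durbin_watson(res)
print("flagged indices:", flagged)
print(f"Durbin-Watson = {dw:.3f}")
```

Only the sixth point (index 5) exceeds the ±2 threshold here, and the Durbin-Watson value close to 2 gives no sign of serial correlation.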
Regression vs Correlation
- Correlation is symmetric: the correlation between x and y equals the correlation between y and x. It measures association, not prediction.
- Regression is asymmetric: regressing y on x differs from regressing x on y. It gives a prediction equation with y as the outcome and x as the predictor.
- Causation: regression and correlation measure association only. Establishing causation requires a controlled experiment or careful causal reasoning about confounders, selection bias, and reverse causation.
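The asymmetry can be seen numerically. In this sketch (variable names are illustrative; the data is the study-hours example from earlier on this page), the slope from regressing y on x is not the reciprocal of the slope from regressing x on y, while the product of the two slopes equals r², which is the same in both directions.

```python
hours = [2, 3, 4, 5, 6, 7]
scores = [55, 65, 72, 78, 85, 90]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n
sxx = sum((x - x_bar) ** 2 for x in hours)
syy = sum((y - y_bar) ** 2 for y in scores)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))

slope_y_on_x = sxy / sxx      # score points per extra study hour
slope_x_on_y = sxy / syy      # study hours per extra score point
r_squared = sxy ** 2 / (sxx * syy)

print(f"y on x slope:        {slope_y_on_x:.4f}")
print(f"x on y slope:        {slope_x_on_y:.4f}")
print(f"1 / (y on x slope):  {1 / slope_y_on_x:.4f}")
print(f"product of slopes:   {slope_y_on_x * slope_x_on_y:.4f}")
print(f"r^2:                 {r_squared:.4f}")
```

The two regression lines coincide only when r = ±1.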
Frequently Asked Questions
What does R² mean and what is a "good" R²?
R² (R-squared, coefficient of determination) is the proportion of variance in y that is explained by x, ranging from 0 to 1. R² = 0.8 means 80% of the variability in your outcome is captured by the regression line; the remaining 20% is unexplained noise or other factors.
What counts as "good" depends entirely on the field. In controlled physics experiments, R² < 0.99 might be poor. In social science surveys predicting human behavior, R² = 0.3 can be very strong. R² alone doesn't tell you whether the model is correctly specified, so always examine the residual plot for patterns like curvature or funneling.
What is the difference between correlation and regression?
Correlation (r) measures the strength and direction of the linear association between two variables. It is symmetric (the correlation of x with y equals that of y with x) and unitless, ranging from −1 to +1.
Regression gives a prediction equation: for each unit increase in x, y changes by m units. Regression is asymmetric (regressing y on x differs from regressing x on y) and produces actionable predictions in specific units. Use correlation for "are these related?"; use regression for "how much does y change per unit of x?"
How do outliers affect regression?
Outliers, especially "high leverage" points far from the mean of x, can dramatically distort the slope and intercept. A single extreme point can drag the regression line toward it, inflating or deflating the slope.
Best practices:
- Always plot your data before running regression: visualize before you calculate
- Examine standardized residuals: values beyond ±2 warrant investigation; beyond ±3 are very unusual
- Use robust regression methods (Theil-Sen estimator, Huber regression) when outliers are present
- Investigate outliers before removing them; they may be genuine observations rather than measurement errors
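To make the effect of an outlier concrete, here is a minimal sketch comparing ordinary least squares with the Theil-Sen estimator mentioned above (slope = median of all pairwise slopes, intercept = median of yᵢ − slope·xᵢ). The data is synthetic: six points on y = 2x + 1 plus one corrupted value. The function names are illustrative, and this brute-force version is only practical for small datasets.

```python
from itertools import combinations
from statistics import median

def ols(xs, ys):
    """Ordinary least squares slope and intercept."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, y_bar - m * x_bar

def theil_sen(xs, ys):
    """Median of pairwise slopes; robust to a minority of outliers."""
    slopes = [(ys[j] - ys[i]) / (xs[j] - xs[i])
              for i, j in combinations(range(len(xs)), 2)
              if xs[j] != xs[i]]
    m = median(slopes)
    b = median(y - m * x for x, y in zip(xs, ys))
    return m, b

xs = [1, 2, 3, 4, 5, 6, 7]
ys = [3, 5, 7, 9, 11, 13, 40]   # last point should be 15; 40 is an outlier

ols_m, ols_b = ols(xs, ys)
ts_m, ts_b = theil_sen(xs, ys)
print(f"OLS:       slope={ols_m:.3f}, intercept={ols_b:.3f}")
print(f"Theil-Sen: slope={ts_m:.3f}, intercept={ts_b:.3f}")
```

The single high-leverage point more than doubles the least squares slope, while the Theil-Sen fit stays on the line followed by the other six points.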
Can I use regression to prove causation?
No. Regression quantifies association, not causation. A strong R² between ice cream sales and drowning rates does not mean ice cream causes drowning; both are driven by a confounding variable, hot weather, which increases outdoor activity.
Establishing causation requires either a randomized controlled experiment (randomly assigning x values) or careful causal reasoning ruling out confounders, selection bias, and reverse causation. Observational-study techniques like instrumental variables, difference-in-differences, and regression discontinuity can support causal claims, but each requires specific assumptions that must be defended on their own merits.
What is the standard error of the regression (SEE)?
The standard error of the estimate (SEE) measures the typical vertical distance from actual y values to the regression line: SEE = √(Σeᵢ²/(n−2)). The n−2 denominator accounts for the two estimated parameters (slope and intercept).
SEE serves two purposes: it summarizes goodness-of-fit (small SEE = tight fit), and it is used to calculate prediction intervals. For large samples at moderate x values, an approximate 95% prediction interval for a new observation is ŷ ± 2×SEE. The exact interval is narrowest at the mean of x and widens toward the extremes.
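A sketch of this calculation follows. It uses the standard widening factor √(1 + 1/n + (x₀ − x̄)²/Σ(xᵢ − x̄)²) and keeps the rough multiplier of 2 from the text; an exact interval would replace 2 with the t quantile for n − 2 degrees of freedom, which is noticeably larger for small samples. The function name and the parameter `k` are assumptions for this example, and this is not necessarily how the calculator computes its interval.

```python
from math import sqrt

def see_and_interval(xs, ys, x0, k=2.0):
    """Return (SEE, lower, upper) for an approximate prediction interval at x0."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three points for n - 2 degrees of freedom")
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    see = sqrt(sse / (n - 2))                    # SEE = sqrt(Σe² / (n − 2))
    y0 = m * x0 + b
    # Interval widens as x0 moves away from the mean of x
    half = k * see * sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
    return see, y0 - half, y0 + half

hours = [2, 3, 4, 5, 6, 7]
scores = [55, 65, 72, 78, 85, 90]

see, lo_mid, hi_mid = see_and_interval(hours, scores, 4.5)   # at the mean of x
_, lo_far, hi_far = see_and_interval(hours, scores, 10)      # extrapolation
print(f"SEE = {see:.3f}")
print(f"x0 = 4.5: [{lo_mid:.1f}, {hi_mid:.1f}]")
print(f"x0 = 10:  [{lo_far:.1f}, {hi_far:.1f}]")
```

The interval at x₀ = 10 is wider than the one at the mean of x, reflecting the extra uncertainty of extrapolating beyond the observed data.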
When is linear regression not appropriate?
Linear regression is inappropriate in several situations:
- True relationship is clearly non-linear (exponential growth, power law): try a log transform
- Outcome variable is binary (yes/no): use logistic regression
- Residuals show strong heteroscedasticity (variance increases with x): use weighted least squares
- Observations are not independent (repeated measures, time series): use mixed-effects or ARIMA models
- Predictor has a circular scale (angles, compass bearings): use circular statistics
The most common violation is non-linearity. Always visualize the data as a scatter plot before fitting. If the relationship looks curved, transform variables or use a non-linear model.
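As one example of a transform, exponential growth y = a·e^(kx) becomes a straight line after taking logs: ln y = ln a + k·x. The sketch below uses noise-free synthetic data (a = 3, k = 0.5, chosen only for illustration) and fits a line to (x, ln y); the names `fit_line`, `a`, and `k` are assumptions for this example.

```python
from math import exp, log

def fit_line(xs, ys):
    """Least squares slope and intercept."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, y_bar - m * x_bar

# Synthetic exponential data: y = 3 * e^(0.5 x)
xs = [0, 1, 2, 3, 4, 5]
ys = [3.0 * exp(0.5 * x) for x in xs]

# Fit a line to (x, ln y); the slope estimates k, the intercept estimates ln a
k, ln_a = fit_line(xs, [log(y) for y in ys])
a = exp(ln_a)
print(f"k = {k:.3f}, a = {a:.3f}")
```

The log transform requires all y values to be positive, and with noisy data the fit minimizes error on the log scale rather than on the original scale.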
What is multiple regression and how is it different?
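Multiple linear regression extends simple regression to multiple predictors: ŷ = b₀ + b₁x₁ + b₂x₂ + … + bₚxₚ. (It is sometimes loosely called "multivariate" regression, but strictly that term refers to models with more than one outcome variable.) Each coefficient estimates the association between that predictor and y while holding all other predictors constant.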
For example, predicting salary from both years of experience (x₁) and education level (x₂): b₁ is the salary increase per year of experience after controlling for education, and b₂ is the education premium after controlling for experience. This ability to "control for" confounders makes multiple regression essential in observational research. This tool handles simple (one predictor) regression; multiple regression requires matrix algebra or statistical software.
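For readers curious about what the matrix algebra involves, here is a minimal pure-Python sketch that solves the normal equations (XᵀX)·b = Xᵀy by Gaussian elimination. The function names and the data are illustrative assumptions; the data is synthetic and generated exactly from y = 1 + 2x₁ + 3x₂ so the recovered coefficients can be checked. Statistical software typically uses more numerically stable methods such as QR decomposition.

```python
def solve(a, rhs):
    """Solve a·coef = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [v] for row, v in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_multiple(rows, ys):
    """Return [b0, b1, ..., bp] minimizing the sum of squared residuals."""
    design = [[1.0] + list(row) for row in rows]     # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)

# Synthetic data generated from y = 1 + 2*x1 + 3*x2
rows = [(1, 0), (2, 1), (3, 1), (4, 3), (5, 2), (6, 5)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coefs = fit_multiple(rows, ys)
print([round(c, 6) for c in coefs])
```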