reading notes for code fellows
Linear regression will be important going forwards in analyzing and predicting data from data sets.
Scikit-learn is a Python module used for machine learning that contains functions for regression, classification, clustering, model selection, and dimensionality reduction. For linear regression, you may want to use the linear_model module, which contains methods intended for regression in which the target value is expected to be a linear combination of the input values. The first step is always to import your data, but once you’ve done that and set it up as a pandas dataframe you can import LinearRegression from the linear_model module and create a variable for it. The most important functions within this for what we’re doing are
.fit() - fits a linear model.predict()- predict the target data using the linear model with estimated coefficients.score() - returns the coefficient of determination.coef_ - gives the coefficients.intercept_ - gives the estimated interceptsWith the coefficients and intercept, you can create a useful dataframe to see their correlation and put the corresponding data into a .scatter() plot. Using .predict(x)[range], you can plot your predictions against the actual data. Since this often produces some errors, you can use .mean((original data - prediction) ** 2) to get the mean squared error.
Generally, when actually implementing linear regression on real data, you won’t use the entire data set, it will instead be split into training and test data sets. To properly randomize the data split, scikit has a function called train_test_split to do it for you. To test whether you still have errors in your data, you can use a residual plot, which should leave your data points randomly scattered around the zero line without structure.
I’m still not 100% clear on where some of the commands in the demo were coming from, and I will be excited to go a bit more in depth into all of the steps here.
Linear regression within coding seems a lot like linear regression in mathematics in that I think the hardest piece of it is just going to be figuring out all the values rather than figuring out how to code it.