Linear Regression Using Pandas

H
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# `RELIANCE - NSE Stock Data`\n",
    "\n",
    "The file contains RELIANCE - NSE Stock Data from 1-Jan-16 to 6-May-21\n",
    "\n",
    "The data can be used to forecast the stock prices of the future\n",
    "\n",
    "Its a timeseries data from the national stock exchange of India\n",
    "\n",
    "|| `Variable` | `Significance` |\n",
    "| ------------- |:-------------:|:-------------:|\n",
    "|1.|Symbol|Symbol of the listed stock on NSE|\n",
    "|2.|Series|To which series does the stock belong (Equity, Options Future)|\n",
    "|3.|Date|Date of the trade|\n",
    "|4.|Prev Close|Previous day closing value of the stock|\n",
    "|5.|Open Price|Current Day opening price of the stock|\n",
    "|6.|High Price|Highest price touched by the stock in current day `(Target Variable)`|\n",
    "|7.|Low Price|lowest price touched by the stock in current day|\n",
    "|8.|Last Price|The price at which last trade occured in current day|\n",
    "|9.|Close Price|Current day closing price of the stock|\n",
    "|10.|Average Price|Average price of the day|\n",
    "|11.|Total Traded Quantity|Total number of stocks traded in current day|\n",
    "|12.|Turnover||\n",
    "|13.|No. of Trades|Current day's total number of trades|\n",
    "|14.|Deliverabel Quantity|Current day deliveable quantity to the traders|\n",
    "|15.|% Dly Qt to Traded Qty|`(Deliverable Quantity/Total Traded Quantity)*100`|"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "data_path=\"./data/RILO - Copy.csv\"\n",
    "data=pd.read_csv(data_path)\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Renaming the columns to have snake_case naming style. (Just as a convention and for convenience)\n",
    "data.columns=[\"_\".join(column.lower().split()) for column in data.columns]\n",
    "data.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using `.describe()` on an entire DataFrame we can get a summary of the distribution of continuous variables:\n",
    "data.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Checking for null values\n",
    "data.isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### As shown above, we do not have any null values in our dataset. Now we can focus on feature selection and model building."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# By using the correlation method `.corr()` we can get the relationship between each continuous variable:\n",
    "correlation=data.corr()\n",
    "correlation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `Matplotlib`\n",
    "\n",
    "Matplotlib is a visualization library in Python for 2D plots of arrays. Matplotlib is a multi-platform data visualization library built on NumPy arrays and designed to work with the broader SciPy stack.\n",
    "\n",
    "One of the greatest benefits of visualization is that it allows us visual access to huge amounts of data in easily digestible visuals. Matplotlib consists of several plots like line, bar, scatter, histogram etc.\n",
    "\n",
    "Matplotlib comes with a wide variety of plots. Plots helps to understand trends, patterns, and to make correlations. They’re typically instruments for reasoning about quantitative information\n",
    "\n",
    "### `Seaborn`\n",
    "\n",
    "Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using seaborn and matplotlib to have a better visualization of correlation\n",
    "import seaborn as sn\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "plt.figure(figsize=(10,8))\n",
    "sn.heatmap(correlation,annot=True,linewidth=1,cmap='PuOr')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the above correlation matrix, we get a general idea of which variables can be treated as features to build our model. Lets list them out\n",
    "Considering all the variables having `|corr|>=0.5`\n",
    "\n",
    "- prev_close\n",
    "- no._of_trades\n",
    "- open_price\n",
    "- low_price\n",
    "- last_price\n",
    "- turnover\n",
    "- close_price\n",
    "- %_dly_qt_to_traded_qty\n",
    "- average_price\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "Now that we have a rough idea about our features, lets confirm their behaviour aginst target variable using scatter plots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(18,18))\n",
    "\n",
    "plt.subplot(3,3,1)\n",
    "plt.scatter(data.prev_close,data.high_price)\n",
    "plt.title('Relation with Previous Closing Price')\n",
    "\n",
    "plt.subplot(3,3,2)\n",
    "plt.scatter(data['no._of_trades'],data.high_price)\n",
    "plt.title('Relation with No. of trades')\n",
    "\n",
    "plt.subplot(3,3,3)\n",
    "plt.scatter(data.open_price,data.high_price)\n",
    "plt.title('Relation with Opening Price')\n",
    "\n",
    "plt.subplot(3,3,4)\n",
    "plt.scatter(data.low_price,data.high_price)\n",
    "plt.title('Relation with Low Price')\n",
    "\n",
    "plt.subplot(3,3,5)\n",
    "plt.scatter(data.last_price,data.high_price)\n",
    "plt.title('Relation with Last Price')\n",
    "\n",
    "plt.subplot(3,3,6)\n",
    "plt.scatter(data.turnover,data.high_price)\n",
    "plt.title('Relation with Turnover')\n",
    "\n",
    "plt.subplot(3,3,7)\n",
    "plt.scatter(data.close_price,data.high_price)\n",
    "plt.title('Relation with Closing Price')\n",
    "\n",
    "plt.subplot(3,3,8)\n",
    "plt.scatter(data['%_dly_qt_to_traded_qty'],data.high_price)\n",
    "plt.title('Relation with Deliverable quantity')\n",
    "\n",
    "plt.subplot(3,3,9)\n",
    "plt.scatter(data.average_price,data.high_price)\n",
    "plt.title('Relation with Average Price')\n",
    "\n",
    "plt.show"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From above visualization, we are clear to choose features for the linear-regression model. Those are:\n",
    "\n",
    "- prev_close\n",
    "- ~~no._of_trades~~\n",
    "- open_price\n",
    "- low_price\n",
    "- last_price\n",
    "- ~~turnover~~\n",
    "- close_price\n",
    "- ~~%_dly_qt_to_traded_qty~~\n",
    "- average_price\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "features=['prev_close','open_price','low_price','last_price','close_price','average_price']\n",
    "X=data[features]\n",
    "X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Target variable\n",
    "y=data.high_price\n",
    "y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# split data into training and validation data, for both features and target\n",
    "# The split is based on a random number generator. Supplying a numeric value to\n",
    "# the random_state argument guarantees we get the same split every time we\n",
    "# run this script.\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "train_X,val_X,train_y,val_y=train_test_split(X,y,test_size=0.2,random_state=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "# Define model\n",
    "model=LinearRegression()\n",
    "# Fit model\n",
    "model.fit(train_X,train_y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We use .score method to get an idea of quality of our model\n",
    "model.score(val_X,val_y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model Validation\n",
    "There are many metrics for summarizing model quality, but we'll start with one called `Mean Absolute Error (also called MAE)`. Let's break down this metric starting with the last word, error.\n",
    "\n",
    "\n",
    "`error=actual-predicted`\n",
    "\n",
    "So, if a stock cost Rs.4000 at a timeframe, and we predicted it would cost Rs.3980 the error is Rs.20.\n",
    "\n",
    "With the MAE metric, we take the absolute value of each error. This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality. In plain English, it can be said as\n",
    "\n",
    "> On average, our predictions are off by about X.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import mean_absolute_error\n",
    "# Get predicted prices of stock on validation data\n",
    "pred_y=model.predict(val_X)\n",
    "mean_absolute_error(val_y,pred_y)"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "e7370f93d1d0cde622a1f8e1c04877d8463912d04d973331ad4851f04de6915a"
  },
  "kernelspec": {
   "display_name": "Python 3.9.7 64-bit",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
About this Algorithm

RELIANCE - NSE Stock Data

The file contains RELIANCE - NSE Stock Data from 1-Jan-16 to 6-May-21

The data can be used to forecast the stock prices of the future

Its a timeseries data from the national stock exchange of India

Variable Significance
1. Symbol Symbol of the listed stock on NSE
2. Series To which series does the stock belong (Equity, Options Future)
3. Date Date of the trade
4. Prev Close Previous day closing value of the stock
5. Open Price Current Day opening price of the stock
6. High Price Highest price touched by the stock in current day (Target Variable)
7. Low Price lowest price touched by the stock in current day
8. Last Price The price at which last trade occured in current day
9. Close Price Current day closing price of the stock
10. Average Price Average price of the day
11. Total Traded Quantity Total number of stocks traded in current day
12. Turnover
13. No. of Trades Current day's total number of trades
14. Deliverabel Quantity Current day deliveable quantity to the traders
15. % Dly Qt to Traded Qty (Deliverable Quantity/Total Traded Quantity)*100
import pandas as pd

data_path="./data/RILO - Copy.csv"
data=pd.read_csv(data_path)
data
# Renaming the columns to have snake_case naming style. (Just as a convention and for convenience)
data.columns=["_".join(column.lower().split()) for column in data.columns]
data.columns
# Using `.describe()` on an entire DataFrame we can get a summary of the distribution of continuous variables:
data.describe()
# Checking for null values
data.isnull().sum()

As shown above, we do not have any null values in our dataset. Now we can focus on feature selection and model building.

# By using the correlation method `.corr()` we can get the relationship between each continuous variable:
correlation=data.corr()
correlation

Matplotlib

Matplotlib is a visualization library in Python for 2D plots of arrays. Matplotlib is a multi-platform data visualization library built on NumPy arrays and designed to work with the broader SciPy stack.

One of the greatest benefits of visualization is that it allows us visual access to huge amounts of data in easily digestible visuals. Matplotlib consists of several plots like line, bar, scatter, histogram etc.

Matplotlib comes with a wide variety of plots. Plots helps to understand trends, patterns, and to make correlations. They’re typically instruments for reasoning about quantitative information

Seaborn

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

# Using seaborn and matplotlib to have a better visualization of correlation
import seaborn as sn
import matplotlib.pyplot as plt

plt.figure(figsize=(10,8))
sn.heatmap(correlation,annot=True,linewidth=1,cmap='PuOr')
plt.show()

From the above correlation matrix, we get a general idea of which variables can be treated as features to build our model. Lets list them out Considering all the variables having |corr|>=0.5

  • prev_close
  • no._of_trades
  • open_price
  • low_price
  • last_price
  • turnover
  • close_price
  • %_dly_qt_to_traded_qty
  • average_price

Now that we have a rough idea about our features, lets confirm their behaviour aginst target variable using scatter plots.

plt.figure(figsize=(18,18))

plt.subplot(3,3,1)
plt.scatter(data.prev_close,data.high_price)
plt.title('Relation with Previous Closing Price')

plt.subplot(3,3,2)
plt.scatter(data['no._of_trades'],data.high_price)
plt.title('Relation with No. of trades')

plt.subplot(3,3,3)
plt.scatter(data.open_price,data.high_price)
plt.title('Relation with Opening Price')

plt.subplot(3,3,4)
plt.scatter(data.low_price,data.high_price)
plt.title('Relation with Low Price')

plt.subplot(3,3,5)
plt.scatter(data.last_price,data.high_price)
plt.title('Relation with Last Price')

plt.subplot(3,3,6)
plt.scatter(data.turnover,data.high_price)
plt.title('Relation with Turnover')

plt.subplot(3,3,7)
plt.scatter(data.close_price,data.high_price)
plt.title('Relation with Closing Price')

plt.subplot(3,3,8)
plt.scatter(data['%_dly_qt_to_traded_qty'],data.high_price)
plt.title('Relation with Deliverable quantity')

plt.subplot(3,3,9)
plt.scatter(data.average_price,data.high_price)
plt.title('Relation with Average Price')

plt.show

From above visualization, we are clear to choose features for the linear-regression model. Those are:

  • prev_close
  • no._of_trades
  • open_price
  • low_price
  • last_price
  • turnover
  • close_price
  • %_dly_qt_to_traded_qty
  • average_price
features=['prev_close','open_price','low_price','last_price','close_price','average_price']
X=data[features]
X
# Target variable
y=data.high_price
y
# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.

from sklearn.model_selection import train_test_split
train_X,val_X,train_y,val_y=train_test_split(X,y,test_size=0.2,random_state=0)
from sklearn.linear_model import LinearRegression
# Define model
model=LinearRegression()
# Fit model
model.fit(train_X,train_y)
# We use .score method to get an idea of quality of our model
model.score(val_X,val_y)

Model Validation

There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (also called MAE). Let's break down this metric starting with the last word, error.

error=actual-predicted

So, if a stock cost Rs.4000 at a timeframe, and we predicted it would cost Rs.3980 the error is Rs.20.

With the MAE metric, we take the absolute value of each error. This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality. In plain English, it can be said as

On average, our predictions are off by about X.

from sklearn.metrics import mean_absolute_error
# Get predicted prices of stock on validation data
pred_y=model.predict(val_X)
mean_absolute_error(val_y,pred_y)