#### House Price Prediction

```{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Finding the Best ML Algorithm for House Price Prediction Using K-Fold Cross-Validation and GridSearchCV"
]
},
{
"cell_type": "markdown",
"source": [
"In machine learning, fitting a model on the training data does not guarantee that it will work accurately on real data. We must make sure that the model has learned the correct patterns from the data and is not picking up too much noise. For this purpose, we use cross-validation. Cross-validation is a technique in which we train the model on a subset of the dataset and then evaluate it on the complementary subset."
]
},
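{
"cell_type": "markdown",
"source": [
"As a minimal sketch of the idea above, the cell below runs 5-fold cross-validation with scikit-learn. The synthetic `X_demo` and `y_demo` arrays and the choice of `LinearRegression` are illustrative assumptions, not this notebook's dataset or final model."
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import KFold, cross_val_score\n",
"\n",
"# Hypothetical data: one feature (area) and a noisy linear target (price)\n",
"rng = np.random.default_rng(0)\n",
"X_demo = rng.uniform(500, 3000, size=(200, 1))\n",
"y_demo = 0.05 * X_demo[:, 0] + rng.normal(0, 5, size=200)\n",
"\n",
"# Each of the 5 folds is held out once while the model trains on the other 4\n",
"cv = KFold(n_splits=5, shuffle=True, random_state=0)\n",
"scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv)\n",
"\n",
"# For regressors the default score is R^2, one value per fold\n",
"print(scores)\n",
"print(scores.mean())"
]
},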
{
"cell_type": "markdown",
"source": [
]
},
{
"cell_type": "code",
"execution_count": 72,
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"from matplotlib import pyplot as plt\n",
"%matplotlib inline \n",
"import matplotlib\n",
"matplotlib.rcParams[\"figure.figsize\"]=(20,10)"
]
},
{
"cell_type": "code",
"execution_count": 73,
"outputs": [],
"source": [
]
},
{
"cell_type": "code",
"execution_count": 74,
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>area_type</th>\n",
"      <th>availability</th>\n",
"      <th>location</th>\n",
"      <th>size</th>\n",
"      <th>society</th>\n",
"      <th>total_sqft</th>\n",
"      <th>bath</th>\n",
"      <th>balcony</th>\n",
"      <th>price</th>\n",
"    </tr>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>Super built-up  Area</td>\n",
"      <td>19-Dec</td>\n",
"      <td>Electronic City Phase II</td>\n",
"      <td>2 BHK</td>\n",
"      <td>Coomee</td>\n",
"      <td>1056</td>\n",
"      <td>2.0</td>\n",
"      <td>1.0</td>\n",
"      <td>39.07</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>Plot  Area</td>\n",
"      <td>Chikka Tirupathi</td>\n",
"      <td>4 Bedroom</td>\n",
"      <td>Theanmp</td>\n",
"      <td>2600</td>\n",
"      <td>5.0</td>\n",
"      <td>3.0</td>\n",
"      <td>120.00</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>Built-up  Area</td>\n",
"      <td>Uttarahalli</td>\n",
"      <td>3 BHK</td>\n",
"      <td>NaN</td>\n",
"      <td>1440</td>\n",
"      <td>2.0</td>\n",
"      <td>3.0</td>\n",
"      <td>62.00</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>Super built-up  Area</td>\n",
"      <td>3 BHK</td>\n",
"      <td>Soiewre</td>\n",
"      <td>1521</td>\n",
"      <td>3.0</td>\n",
"      <td>1.0</td>\n",
"      <td>95.00</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>Super built-up  Area</td>\n",
"      <td>Kothanur</td>\n",
"      <td>2 BHK</td>\n",
"      <td>NaN</td>\n",
"      <td>1200</td>\n",
"      <td>2.0</td>\n",
"      <td>1.0</td>\n",
"      <td>51.00</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"              area_type   availability                  location       size  \\\n",
"0  Super built-up  Area         19-Dec  Electronic City Phase II      2 BHK   \n",
"1            Plot  Area  Ready To Move          Chikka Tirupathi  4 Bedroom   \n",
"2        Built-up  Area  Ready To Move               Uttarahalli      3 BHK   \n",
"4  Super built-up  Area  Ready To Move                  Kothanur      2 BHK   \n",
"\n",
"   society total_sqft  bath  balcony   price  \n",
"0  Coomee        1056   2.0      1.0   39.07  \n",
"1  Theanmp       2600   5.0      3.0  120.00  \n",
"2      NaN       1440   2.0      3.0   62.00  \n",
"3  Soiewre       1521   3.0      1.0   95.00  \n",
"4      NaN       1200   2.0      1.0   51.00  "
]
},
"execution_count": 74,
"output_type": "execute_result"
}
],
"source": [
"df1.head()"
]
},
{
"cell_type": "code",
"execution_count": 75,
"outputs": [
{
"data": {
"text/plain": [
"area_type\n",
"Built-up  Area          2418\n",
"Carpet  Area              87\n",
"Plot  Area              2025\n",
"Super built-up  Area    8790\n",
"Name: area_type, dtype: int64"
]
},
"execution_count": 75,
"output_type": "execute_result"
}
],
"source": [
"df1.groupby('area_type')['area_type'].agg('count')"
]
},
{
"cell_type": "code",
"execution_count": 76,
"outputs": [],
"source": [
"df2 = df1.drop(['area_type','society','balcony','availability'] , axis=\"columns\")"
]
},
{
"cell_type": "code",
"execution_count": 77,
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>location</th>\n",
"      <th>size</th>\n",
"      <th>total_sqft</th>\n",
"      <th>bath</th>\n",
"      <th>price</th>\n",
"    </tr>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>Electronic City Phase II</td>\n",
"      <td>2 BHK</td>\n",
"      <td>1056</td>\n",
"      <td>2.0</td>\n",
"      <td>39.07</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>Chikka Tirupathi</td>\n",
"      <td>4 Bedroom</td>\n",
"      <td>2600</td>\n",
"      <td>5.0</td>\n",
"      <td>120.00</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>Uttarahalli</td>\n",
"      <td>3 BHK</td>\n",
"      <td>1440</td>\n",
"      <td>2.0</td>\n",
"      <td>62.00</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>3 BHK</td>\n",
"      <td>1521</td>\n",
"      <td>3.0</td>\n",
"      <td>95.00</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>Kothanur</td>\n",
"      <td>2 BHK</td>\n",
"      <td>1200</td>\n",
"      <td>2.0</td>\n",
"      <td>51.00</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"                   location       size total_sqft  bath   price\n",
"0  Electronic City Phase II      2 BHK       1056   2.0   39.07\n",
"1          Chikka Tirupathi  4 Bedroom       2600   5.0  120.00\n",
"2               Uttarahalli      3 BHK       1440   2.0   62.00\n",
"3        Lingadheeranahalli      3 BHK       1521   3.0   95.00\n",
"4                  Kothanur      2 BHK       1200   2.0   51.00"
]
},
"execution_count": 77,
"output_type": "execute_result"
}
],
"source": [
"df2.head()"
]
},
{
"cell_type": "markdown",
"source": [
"### Data Cleaning: Handling NA/Null values"
]
},
{
"cell_type": "code",
"execution_count": 78,
"outputs": [
{
"data": {
"text/plain": [
"location       1\n",
"size          16\n",
"total_sqft     0\n",
"bath          73\n",
"price          0\n",
"dtype: int64"
]
},
"execution_count": 78,
"output_type": "execute_result"
}
],
"source": [
"df2.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 79,
"outputs": [
{
"data": {
"text/plain": [
"location      0\n",
"size          0\n",
"total_sqft    0\n",
"bath          0\n",
"price         0\n",
"dtype: int64"
]
},
"execution_count": 79,
"output_type": "execute_result"
}
],
"source": [
"df3 = df2.dropna()\n",
"df3.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 80,
"outputs": [
{
"data": {
"text/plain": [
"array(['2 BHK', '4 Bedroom', '3 BHK', '4 BHK', '6 Bedroom', '3 Bedroom',\n",
"       '1 BHK', '1 RK', '1 Bedroom', '8 Bedroom', '2 Bedroom',\n",
"       '7 Bedroom', '5 BHK', '7 BHK', '6 BHK', '5 Bedroom', '11 BHK',\n",
"       '9 BHK', nan, '9 Bedroom', '27 BHK', '10 Bedroom', '11 Bedroom',\n",
"       '10 BHK', '19 BHK', '16 BHK', '43 Bedroom', '14 BHK', '8 BHK',\n",
"       '12 Bedroom', '13 BHK', '18 Bedroom'], dtype=object)"
]
},
"execution_count": 80,
"output_type": "execute_result"
}
],
"source": [
"df2['size'].unique()"
]
},
{
"cell_type": "markdown",
"source": [
"### Feature Engineering"
]
},
{
"cell_type": "markdown",
"source": [
"Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered applied machine learning in its own right."
]
},
{
"cell_type": "code",
"execution_count": 81,
"outputs": [],
"source": [
"df3 = df3.copy()  # work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning\n",
"df3['bhk'] = df3['size'].apply(lambda x: int(x.split(' ')[0]))"
]
},
{
"cell_type": "code",
"execution_count": 82,
"outputs": [
{
"data": {
"text/plain": [
"array([ 2,  4,  3,  6,  1,  8,  7,  5, 11,  9, 27, 10, 19, 16, 43, 14, 12,\n",
"       13, 18], dtype=int64)"
]
},
"execution_count": 82,
"output_type": "execute_result"
}
],
"source": [
"df3['bhk'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 83,
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>location</th>\n",
"      <th>size</th>\n",
"      <th>total_sqft</th>\n",
"      <th>bath</th>\n",
"      <th>price</th>\n",
"      <th>bhk</th>\n",
"    </tr>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>1718</th>\n",
"      <td>2Electronic City Phase II</td>\n",
"      <td>27 BHK</td>\n",
"      <td>8000</td>\n",
"      <td>27.0</td>\n",
"      <td>230.0</td>\n",
"      <td>27</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4684</th>\n",
"      <td>Munnekollal</td>\n",
"      <td>43 Bedroom</td>\n",
"      <td>2400</td>\n",
"      <td>40.0</td>\n",
"      <td>660.0</td>\n",
"      <td>43</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"                       location        size total_sqft  bath  price  bhk\n",
"1718  2Electronic City Phase II      27 BHK       8000  27.0  230.0   27\n",
"4684                Munnekollal  43 Bedroom       2400  40.0  660.0   43"
]
},
"execution_count": 83,
"output_type": "execute_result"
}
],
"source": [
"df3[df3.bhk>20]"
]
},
{
"cell_type": "code",
"execution_count": 84,
"outputs": [
{
"data": {
"text/plain": [
"array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],\n",
"      dtype=object)"
]
},
"execution_count": 84,
"output_type": "execute_result"
}
],
"source": [
"df3.total_sqft.unique()"
]
},
{
"cell_type": "code",
"execution_count": 85,
"outputs": [],
"source": [
"def is_float(x):\n",
"    try:\n",
"        float(x)\n",
"    except:\n",
"        return False\n",
"    return True"
]
},
{
"cell_type": "code",
"execution_count": 86,
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>location</th>\n",
"      <th>size</th>\n",
"      <th>total_sqft</th>\n",
"      <th>bath</th>\n",
"      <th>price</th>\n",
"      <th>bhk</th>\n",
"    </tr>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>30</th>\n",
"      <td>Yelahanka</td>\n",
"      <td>4 BHK</td>\n",
"      <td>2100 - 2850</td>\n",
"      <td>4.0</td>\n",
"      <td>186.000</td>\n",
"      <td>4</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>122</th>\n",
"      <td>Hebbal</td>\n",
"      <td>4 BHK</td>\n",
"      <td>3067 - 8156</td>\n",
"      <td>4.0</td>\n",
"      <td>477.000</td>\n",
"      <td>4</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>137</th>\n",
"      <td>8th Phase JP Nagar</td>\n",
"      <td>2 BHK</td>\n",
"      <td>1042 - 1105</td>\n",
"      <td>2.0</td>\n",
"      <td>54.005</td>\n",
"      <td>2</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>165</th>\n",
"      <td>Sarjapur</td>\n",
"      <td>2 BHK</td>\n",
"      <td>1145 - 1340</td>\n",
"      <td>2.0</td>\n",
"      <td>43.490</td>\n",
"      <td>2</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>188</th>\n",
"      <td>KR Puram</td>\n",
"      <td>2 BHK</td>\n",
"      <td>1015 - 1540</td>\n",
"      <td>2.0</td>\n",
"      <td>56.800</td>\n",
"      <td>2</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>410</th>\n",
"      <td>Kengeri</td>\n",
"      <td>1 BHK</td>\n",
"      <td>34.46Sq. Meter</td>\n",
"      <td>1.0</td>\n",
"      <td>18.500</td>\n",
"      <td>1</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>549</th>\n",
"      <td>2 BHK</td>\n",
"      <td>1195 - 1440</td>\n",
"      <td>2.0</td>\n",
"      <td>63.770</td>\n",
"      <td>2</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>648</th>\n",
"      <td>Arekere</td>\n",
"      <td>9 Bedroom</td>\n",
"      <td>4125Perch</td>\n",
"      <td>9.0</td>\n",
"      <td>265.000</td>\n",
"      <td>9</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>661</th>\n",
"      <td>Yelahanka</td>\n",
"      <td>2 BHK</td>\n",
"      <td>1120 - 1145</td>\n",
"      <td>2.0</td>\n",
"      <td>48.130</td>\n",
"      <td>2</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>672</th>\n",
"      <td>Bettahalsoor</td>\n",
"      <td>4 Bedroom</td>\n",
"      <td>3090 - 5002</td>\n",
"      <td>4.0</td>\n",
"      <td>445.000</td>\n",
"      <td>4</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"               location       size      total_sqft  bath    price  bhk\n",
"30            Yelahanka      4 BHK     2100 - 2850   4.0  186.000    4\n",
"122              Hebbal      4 BHK     3067 - 8156   4.0  477.000    4\n",
"137  8th Phase JP Nagar      2 BHK     1042 - 1105   2.0   54.005    2\n",
"165            Sarjapur      2 BHK     1145 - 1340   2.0   43.490    2\n",
"188            KR Puram      2 BHK     1015 - 1540   2.0   56.800    2\n",
"410             Kengeri      1 BHK  34.46Sq. Meter   1.0   18.500    1\n",
"549         Hennur Road      2 BHK     1195 - 1440   2.0   63.770    2\n",
"648             Arekere  9 Bedroom       4125Perch   9.0  265.000    9\n",
"661           Yelahanka      2 BHK     1120 - 1145   2.0   48.130    2\n",
"672        Bettahalsoor  4 Bedroom     3090 - 5002   4.0  445.000    4"
]
},
"execution_count": 86,
"output_type": "execute_result"
}
],
"source": [
"df3[~df3['total_sqft'].apply(is_float)].head(10)"
]
},
{
"cell_type": "markdown",
"source": [
"##### The output above shows that total_sqft can be a range (e.g. 2100 - 2850). In such cases we can take the average of the minimum and maximum values in the range. Other cases, such as 34.46Sq. Meter, could be converted to square feet using unit conversion, but I am going to drop such corner cases to keep things simple."
]
},
{
"cell_type": "code",
"execution_count": 87,
"outputs": [],
"source": [
"def convert_sqft_to_num(x):\n",
"    tokens = x.split('-')\n",
"    if len(tokens) == 2:\n",
"        return (float(tokens[0]) + float(tokens[1]))/2\n",
"    try:\n",
"        return float(x)\n",
"    except:\n",
"        return None\n",
"        "
]
},
{
"cell_type": "code",
"execution_count": 88,
"outputs": [],
"source": [
"df4 = df3.copy()\n",
"df4['total_sqft'] = df4['total_sqft'].apply(convert_sqft_to_num)"
]
},
{
"cell_type": "code",
"execution_count": 89,
"outputs": [],
"source": [
"df5 = df4.copy()"
]
},
{
"cell_type": "markdown",
"source": [
"##### Add a new feature called price per square foot (price is in lakh rupees, so it is multiplied by 100,000)"
]
},
{
"cell_type": "code",
"execution_count": 90,
"outputs": [],
"source": [
"df5['price_per_sqft'] = df5['price']*100000/df5['total_sqft']"
]
},
{
"cell_type": "code",
"execution_count": 91,
"outputs": [
{
"data": {
"text/plain": [
"1304"
]
},
"execution_count": 91,
"output_type": "execute_result"
}
],
"source": [
"len(df5.location.unique())"
]
},
{
"cell_type": "markdown",
"source": [
"#### Examine location, which is a categorical variable. We need to apply a dimensionality reduction technique here to reduce the number of locations"
]
},
{
"cell_type": "code",
"execution_count": 92,
"outputs": [],
"source": [
"df5.location = df5.location.apply(lambda x: x.strip())"
]
},
{
"cell_type": "code",
"execution_count": 93,
"outputs": [],
"source": [
"location_stats = df5.groupby('location')['location'].agg('count').sort_values(ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 94,
"outputs": [
{
"data": {
"text/plain": [
"location\n",
"Whitefield           535\n",
"Electronic City      304\n",
"Thanisandra          236\n",
"                    ... \n",
"LIC Colony             1\n",
"Kuvempu Layout         1\n",
"Kumbhena Agrahara      1\n",
"Kudlu Village,         1\n",
"1 Annasandrapalya      1\n",
"Name: location, Length: 1293, dtype: int64"
]
},
"execution_count": 94,
"output_type": "execute_result"
}
],
"source": [
"location_stats"
]
},
{
"cell_type": "code",
"execution_count": 95,
"outputs": [
{
"data": {
"text/plain": [
"1052"
]
},
"execution_count": 95,
"output_type": "execute_result"
}
],
"source": [
"len(location_stats[location_stats<=10])"
]
},
{
"cell_type": "markdown",
"source": [
"## Dimensionality Reduction"
]
},
{
"cell_type": "markdown",
"source": [
"#### Any location having 10 or fewer data points should be tagged as \"other\". This way the number of categories can be reduced by a huge amount. Later on, when we do one-hot encoding, this will leave us with fewer dummy columns"
]
},
{
"cell_type": "code",
"execution_count": 96,
"outputs": [],
"source": [
"location_stats_less_10 = location_stats[location_stats<=10]"
]
},
{
"cell_type": "code",
"execution_count": 97,
"outputs": [
{
"data": {
"text/plain": [
"location\n",
"BTM 1st Stage          10\n",
"Basapura               10\n",
"Sector 1 HSR Layout    10\n",
"Naganathapura          10\n",
"Kalkere                10\n",
"                       ..\n",
"LIC Colony              1\n",
"Kuvempu Layout          1\n",
"Kumbhena Agrahara       1\n",
"Kudlu Village,          1\n",
"1 Annasandrapalya       1\n",
"Name: location, Length: 1052, dtype: int64"
]
},
"execution_count": 97,
"output_type": "execute_result"
}
],
"source": [
"location_stats_less_10"
]
},
{
"cell_type": "code",
"execution_count": 98,
"outputs": [
{
"data": {
"text/plain": [
"1293"
]
},
"execution_count": 98,
"output_type": "execute_result"
}
],
"source": [
"len(df5.location.unique())"
]
},
{
"cell_type": "code",
"execution_count": 99,
"outputs": [],
"source": [
"df5.location = df5.location.apply(lambda x: 'other' if x in location_stats_less_10 else x)"
]
},
{
"cell_type": "code",
"execution_count": 100,
"outputs": [
{
"data": {
"text/plain": [
"242"
]
},
"execution_count": 100,
"output_type": "execute_result"
}
],
"source": [
"len(df5.location.unique())"
]
},
{
"cell_type": "code",
"execution_count": 101,
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>location</th>\n",
"      <th>size</th>\n",
"      <th>total_sqft</th>\n",
"      <th>bath</th>\n",
"      <th>price</th>\n",
"      <th>bhk</th>\n",
"      <th>price_per_sqft</th>\n",
"    </tr>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>Electronic City Phase II</td>\n",
"      <td>2 BHK</td>\n",
"      <td>1056.0</td>\n",
"      <td>2.0</td>\n",
"      <td>39.07</td>\n",
"      <td>2</td>\n",
"      <td>3699.810606</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>Chikka Tirupathi</td>\n",
"      <td>4 Bedroom</td>\n",
"      <td>2600.0</td>\n",
"      <td>5.0</td>\n",
"      <td>120.00</td>\n",
"      <td>4</td>\n",
"      <td>4615.384615</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>Uttarahalli</td>\n",
"      <td>3 BHK</td>\n",
"      <td>1440.0</td>\n",
"      <td>2.0</td>\n",
"      <td>62.00</td>\n",
"      <td>3</td>\n",
"      <td>4305.555556</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>3 BHK</td>\n",
"      <td>1521.0</td>\n",
"      <td>3.0</td>\n",
"      <td>95.00</td>\n",
"      <td>3</td>\n",
"      <td>6245.890861</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>Kothanur</td>\n",
"      <td>2 BHK</td>\n",
"      <td>1200.0</td>\n",
"      <td>2.0</td>\n",
"      <td>51.00</td>\n",
"      <td>2</td>\n",
"      <td>4250.000000</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"                   location       size  total_sqft  bath   price  bhk  \\\n",
"0  Electronic City Phase II      2 BHK      1056.0   2.0   39.07    2   \n",
"1          Chikka Tirupathi  4 Bedroom      2600.0   5.0  120.00    4   \n",
"2               Uttarahalli      3 BHK      1440.0   2.0   62.00    3   \n",
"3        Lingadheeranahalli      3 BHK      1521.0   3.0   95.00    3   \n",
"4                  Kothanur      2 BHK      1200.0   2.0   51.00    2   \n",
"\n",
"   price_per_sqft  \n",
"0     3699.810606  \n",
"1     4615.384615  \n",
"2     4305.555556  \n",
"3     6245.890861  \n",
"4     4250.000000  "
]
},
"execution_count": 101,
"output_type": "execute_result"
}
],
"source": [
"df5.head()"
]
},
{
"cell_type": "markdown",
"source": [
"## Outlier Removal Using Business Logic"
]
},
{
"cell_type": "code",
"execution_count": 102,
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>location</th>\n",
"      <th>size</th>\n",
"      <th>total_sqft</th>\n",
"      <th>bath</th>\n",
"      <th>price</th>\n",
"      <th>bhk</th>\n",
"      <th>price_per_sqft</th>\n",
"    </tr>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>9</th>\n",
"      <td>other</td>\n",
"      <td>6 Bedroom</td>\n",
"      <td>1020.0</td>\n",
"      <td>6.0</td>\n",
"      <td>370.0</td>\n",
"      <td>6</td>\n",
"      <td>36274.509804</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>45</th>\n",
"      <td>HSR Layout</td>\n",
"      <td>8 Bedroom</td>\n",
"      <td>600.0</td>\n",
"      <td>9.0</td>\n",
"      <td>200.0</td>\n",
"      <td>8</td>\n",
"      <td>33333.333333</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>58</th>\n",
"      <td>Murugeshpalya</td>\n",
"      <td>6 Bedroom</td>\n",
"      <td>1407.0</td>\n",
"      <td>4.0</td>\n",
"      <td>150.0</td>\n",
"      <td>6</td>\n",
"      <td>10660.980810</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>68</th>\n",
"      <td>Devarachikkanahalli</td>\n",
"      <td>8 Bedroom</td>\n",
"      <td>1350.0</td>\n",
"      <td>7.0</td>\n",
"      <td>85.0</td>\n",
"      <td>8</td>\n",
"      <td>6296.296296</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>70</th>\n",
"      <td>other</td>\n",
"      <td>3 Bedroom</td>\n",
"      <td>500.0</td>\n",
"      <td>3.0</td>\n",
"      <td>100.0</td>\n",
"      <td>3</td>\n",
"      <td>20000.000000</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"               location       size  total_sqft  bath  price  bhk  \\\n",
"9                 other  6 Bedroom      1020.0   6.0  370.0    6   \n",
"45           HSR Layout  8 Bedroom       600.0   9.0  200.0    8   \n",
"58        Murugeshpalya  6 Bedroom      1407.0   4.0  150.0    6   \n",
"68  Devarachikkanahalli  8 Bedroom      1350.0   7.0   85.0    8   \n",
"70                other  3 Bedroom       500.0   3.0  100.0    3   \n",
"\n",
"    price_per_sqft  \n",
"9     36274.509804  \n",
"45    33333.333333  \n",
"58    10660.980810  \n",
"68     6296.296296  \n",
"70    20000.000000  "
]
},
"execution_count": 102,
"output_type": "execute_result"
}
],
"source": [
"df5[df5.total_sqft/df5.bhk<300].head()"
]
},
{
"cell_type": "markdown",
"source": [
"##### Check the data points above. We have a 6 bhk apartment with 1020 sqft. Another one is 8 bhk with a total of 600 sqft. These are clear data errors that can be removed safely. Below, we remove every row with less than 300 sqft per bedroom"
]
},
{
"cell_type": "code",
"execution_count": 103,
"outputs": [
{
"data": {
"text/plain": [
"(13246, 7)"
]
},
"execution_count": 103,
"output_type": "execute_result"
}
],
"source": [
"df5.shape"
]
},
{
"cell_type": "code",
"execution_count": 104,
"outputs": [],
"source": [
"df6 = df5[~(df5.total_sqft/df5.bhk<300)]"
]
},
{
"cell_type": "code",
"execution_count": 105,
"outputs": [
{
"data": {
"text/plain": [
"(12502, 7)"
]
},
"execution_count": 105,
"output_type": "execute_result"
}
],
"source": [
"df6.shape"
]
},
{
"cell_type": "markdown",
"source": [
"### Outlier Removal Using Standard Deviation and Mean"
]
},
{
"cell_type": "code",
"execution_count": 106,
"outputs": [
{
"data": {
"text/plain": [
"count     12456.000000\n",
"mean       6308.502826\n",
"std        4168.127339\n",
"min         267.829813\n",
"25%        4210.526316\n",
"50%        5294.117647\n",
"75%        6916.666667\n",
"max      176470.588235\n",
"Name: price_per_sqft, dtype: float64"
]
},
"execution_count": 106,
"output_type": "execute_result"
}
],
"source": [
"df6.price_per_sqft.describe()"
]
},
{
"cell_type": "code",
"execution_count": 107,
"outputs": [],
"source": [
"def remove_pps_outliers(df):\n",
"    df_out = pd.DataFrame()\n",
"    for key, subdf in df.groupby('location'):\n",
"        m = np.mean(subdf.price_per_sqft)\n",
"        st = np.std(subdf.price_per_sqft)\n",
"        reduced_df = subdf[(subdf.price_per_sqft>(m-st)) & (subdf.price_per_sqft<=(m+st))]\n",
"        df_out = pd.concat([df_out,reduced_df],ignore_index=True)\n",
"    return df_out    "
]
},
{
"cell_type": "code",
"execution_count": 108,
"outputs": [],
"source": [
"df7 = remove_pps_outliers(df6)"
]
},
{
"cell_type": "code",
"execution_count": 109,
"outputs": [
{
"data": {
"text/plain": [
"(10241, 7)"
]
},
"execution_count": 109,
"output_type": "execute_result"
}
],
"source": [
"df7.shape"
]
},
{
"cell_type": "code",
"execution_count": 110,
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1080x720 with 1 Axes>"
]
},
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"def plot_scatter_chart(df,location):\n",
"    bhk2 = df[(df.location==location) & (df.bhk==2)]\n",
"    bhk3 = df[(df.location==location) & (df.bhk==3)]\n",
"    matplotlib.rcParams['figure.figsize'] = (15,10)\n",
"    plt.scatter(bhk2.total_sqft,bhk2.price,color='blue',label='2 BHK', s=50)\n",
"    plt.scatter(bhk3.total_sqft,bhk3.price,marker='+', color='green',label='3 BHK', s=50)\n",
"    plt.xlabel(\"Total Square Feet Area\")\n",
"    plt.ylabel(\"Price (Lakh Indian Rupees)\")\n",
"    plt.title(location)\n",
"    plt.legend()\n",
"    \n",
"plot_scatter_chart(df7,\"Rajaji Nagar\")"
]
},
{
"cell_type": "code",
"execution_count": 111,
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1080x720 with 1 Axes>"
]
},
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plot_scatter_chart(df7,\"Hebbal\")"
]
},
{
"cell_type": "code",
"execution_count": 112,
"outputs": [],
"source": [
"def remove_bhk_outliers(df):\n",
"    exclude_indices = np.array([])\n",
"    for location,location_df in df.groupby('location'):\n",
"        bhk_stats = {}\n",
"        for bhk,bhk_df in location_df.groupby('bhk'):\n",
"            bhk_stats[bhk] = {\n",
"                'mean':np.mean(bhk_df.price_per_sqft),\n",
"                'std':np.std(bhk_df.price_per_sqft),\n",
"                'count':bhk_df.shape[0]\n",
"            }\n",
"        for bhk,bhk_df in location_df.groupby(\"bhk\"):\n",
"            stats = bhk_stats.get(bhk-1)\n",
"            if stats and stats['count']>5:\n",
"                exclude_indices = np.append(exclude_indices,bhk_df[bhk_df.price_per_sqft < (stats['mean'])].index.values)\n",
"    return df.drop(exclude_indices,axis='index')       "
]
},
{
"cell_type": "code",
"execution_count": 113,
"outputs": [],
"source": [
"df8 = remove_bhk_outliers(df7)"
]
},
{
"cell_type": "code",
"execution_count": 114,
"outputs": [
{
"data": {
"text/plain": [
"(7329, 7)"
]
},
"execution_count": 114,
"output_type": "execute_result"
}
],
"source": [
"df8.shape"
]
},
{
"cell_type": "markdown",
"source": [
"### Outlier Removal Using Bathrooms Feature"
]
},
{
"cell_type": "code",
"execution_count": 115,
"outputs": [
{
"data": {
"text/plain": [
"array([ 4.,  3.,  2.,  5.,  8.,  1.,  6.,  7.,  9., 12., 16., 13.])"
]
},
"execution_count": 115,
"output_type": "execute_result"
}
],
"source": [
"df8.bath.unique()"
]
},
{
"cell_type": "code",
"execution_count": 116,
"outputs": [
{
"data": {
"text/plain": [
"Text(0, 0.5, 'Count')"
]
},
"execution_count": 116,
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 1080x720 with 1 Axes>"
]
},
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.hist(df8.bath,rwidth=0.8)\n",
"plt.xlabel(\"Number of bathrooms\")\n",
"plt.ylabel(\"Count\")"
]
},
{
"cell_type": "code",
"execution_count": 117,
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>location</th>\n",
"      <th>size</th>\n",
"      <th>total_sqft</th>\n",
"      <th>bath</th>\n",
"      <th>price</th>\n",
"      <th>bhk</th>\n",
"      <th>price_per_sqft</th>\n",
"    </tr>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>5277</th>\n",
"      <td>10 BHK</td>\n",
"      <td>4000.0</td>\n",
"      <td>12.0</td>\n",
"      <td>160.0</td>\n",
"      <td>10</td>\n",
"      <td>4000.000000</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>8486</th>\n",
"      <td>other</td>\n",
"      <td>10 BHK</td>\n",
"      <td>12000.0</td>\n",
"      <td>12.0</td>\n",
"      <td>525.0</td>\n",
"      <td>10</td>\n",
"      <td>4375.000000</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>8575</th>\n",
"      <td>other</td>\n",
"      <td>16 BHK</td>\n",
"      <td>10000.0</td>\n",
"      <td>16.0</td>\n",
"      <td>550.0</td>\n",
"      <td>16</td>\n",
"      <td>5500.000000</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>9308</th>\n",
"      <td>other</td>\n",
"      <td>11 BHK</td>\n",
"      <td>6000.0</td>\n",
"      <td>12.0</td>\n",
"      <td>150.0</td>\n",
"      <td>11</td>\n",
"      <td>2500.000000</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>9639</th>\n",
"      <td>other</td>\n",
"      <td>13 BHK</td>\n",
"      <td>5425.0</td>\n",
"      <td>13.0</td>\n",
"      <td>275.0</td>\n",
"      <td>13</td>\n",
"      <td>5069.124424</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"            location    size  total_sqft  bath  price  bhk  price_per_sqft\n",
"5277  Neeladri Nagar  10 BHK      4000.0  12.0  160.0   10     4000.000000\n",
"8486           other  10 BHK     12000.0  12.0  525.0   10     4375.000000\n",
"8575           other  16 BHK     10000.0  16.0  550.0   16     5500.000000\n",
"9308           other  11 BHK      6000.0  12.0  150.0   11     2500.000000\n",
"9639           other  13 BHK      5425.0  13.0  275.0   13     5069.124424"
]
},
"execution_count": 117,
"output_type": "execute_result"
}
],
"source": [
"df8[df8.bath>10]"
]
},
{
"cell_type": "markdown",
"source": [
"##### It is unusual for a home to have 2 or more bathrooms beyond its number of bedrooms, so we treat such rows as outliers"
]
},
{
"cell_type": "code",
"execution_count": 118,
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>location</th>\n",
"      <th>size</th>\n",
"      <th>total_sqft</th>\n",
"      <th>bath</th>\n",
"      <th>price</th>\n",
"      <th>bhk</th>\n",
"      <th>price_per_sqft</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>1626</th>\n",
"      <td>Chikkabanavar</td>\n",
"      <td>4 Bedroom</td>\n",
"      <td>2460.0</td>\n",
"      <td>7.0</td>\n",
"      <td>80.0</td>\n",
"      <td>4</td>\n",
"      <td>3252.032520</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>5238</th>\n",
"      <td>Nagasandra</td>\n",
"      <td>4 Bedroom</td>\n",
"      <td>7000.0</td>\n",
"      <td>8.0</td>\n",
"      <td>450.0</td>\n",
"      <td>4</td>\n",
"      <td>6428.571429</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>6711</th>\n",
"      <td>Thanisandra</td>\n",
"      <td>3 BHK</td>\n",
"      <td>1806.0</td>\n",
"      <td>6.0</td>\n",
"      <td>116.0</td>\n",
"      <td>3</td>\n",
"      <td>6423.034330</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>8411</th>\n",
"      <td>other</td>\n",
"      <td>6 BHK</td>\n",
"      <td>11338.0</td>\n",
"      <td>9.0</td>\n",
"      <td>1000.0</td>\n",
"      <td>6</td>\n",
"      <td>8819.897689</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"           location       size  total_sqft  bath   price  bhk  price_per_sqft\n",
"1626  Chikkabanavar  4 Bedroom      2460.0   7.0    80.0    4     3252.032520\n",
"5238     Nagasandra  4 Bedroom      7000.0   8.0   450.0    4     6428.571429\n",
"6711    Thanisandra      3 BHK      1806.0   6.0   116.0    3     6423.034330\n",
"8411          other      6 BHK     11338.0   9.0  1000.0    6     8819.897689"
]
},
"execution_count": 118,
"output_type": "execute_result"
}
],
"source": [
"df8[df8.bath>df8.bhk + 2]"
]
},
{
"cell_type": "code",
"execution_count": 119,
"outputs": [],
"source": [
"df9 = df8[df8.bath < df8.bhk + 2]"
]
},
{
"cell_type": "code",
"execution_count": 120,
"outputs": [
{
"data": {
"text/plain": [
"(7251, 7)"
]
},
"execution_count": 120,
"output_type": "execute_result"
}
],
"source": [
"df9.shape"
]
},
{
"cell_type": "code",
"execution_count": 121,
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>location</th>\n",
"      <th>size</th>\n",
"      <th>total_sqft</th>\n",
"      <th>bath</th>\n",
"      <th>price</th>\n",
"      <th>bhk</th>\n",
"      <th>price_per_sqft</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>1st Block Jayanagar</td>\n",
"      <td>4 BHK</td>\n",
"      <td>2850.0</td>\n",
"      <td>4.0</td>\n",
"      <td>428.0</td>\n",
"      <td>4</td>\n",
"      <td>15017.543860</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>1st Block Jayanagar</td>\n",
"      <td>3 BHK</td>\n",
"      <td>1630.0</td>\n",
"      <td>3.0</td>\n",
"      <td>194.0</td>\n",
"      <td>3</td>\n",
"      <td>11901.840491</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>1st Block Jayanagar</td>\n",
"      <td>3 BHK</td>\n",
"      <td>1875.0</td>\n",
"      <td>2.0</td>\n",
"      <td>235.0</td>\n",
"      <td>3</td>\n",
"      <td>12533.333333</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>1st Block Jayanagar</td>\n",
"      <td>3 BHK</td>\n",
"      <td>1200.0</td>\n",
"      <td>2.0</td>\n",
"      <td>130.0</td>\n",
"      <td>3</td>\n",
"      <td>10833.333333</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>1st Block Jayanagar</td>\n",
"      <td>2 BHK</td>\n",
"      <td>1235.0</td>\n",
"      <td>2.0</td>\n",
"      <td>148.0</td>\n",
"      <td>2</td>\n",
"      <td>11983.805668</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>...</th>\n",
"      <td>...</td>\n",
"      <td>...</td>\n",
"      <td>...</td>\n",
"      <td>...</td>\n",
"      <td>...</td>\n",
"      <td>...</td>\n",
"      <td>...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>10232</th>\n",
"      <td>other</td>\n",
"      <td>2 BHK</td>\n",
"      <td>1200.0</td>\n",
"      <td>2.0</td>\n",
"      <td>70.0</td>\n",
"      <td>2</td>\n",
"      <td>5833.333333</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>10233</th>\n",
"      <td>other</td>\n",
"      <td>1 BHK</td>\n",
"      <td>1800.0</td>\n",
"      <td>1.0</td>\n",
"      <td>200.0</td>\n",
"      <td>1</td>\n",
"      <td>11111.111111</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>10236</th>\n",
"      <td>other</td>\n",
"      <td>2 BHK</td>\n",
"      <td>1353.0</td>\n",
"      <td>2.0</td>\n",
"      <td>110.0</td>\n",
"      <td>2</td>\n",
"      <td>8130.081301</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>10237</th>\n",
"      <td>other</td>\n",
"      <td>1 Bedroom</td>\n",
"      <td>812.0</td>\n",
"      <td>1.0</td>\n",
"      <td>26.0</td>\n",
"      <td>1</td>\n",
"      <td>3201.970443</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>10240</th>\n",
"      <td>other</td>\n",
"      <td>4 BHK</td>\n",
"      <td>3600.0</td>\n",
"      <td>5.0</td>\n",
"      <td>400.0</td>\n",
"      <td>4</td>\n",
"      <td>11111.111111</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>7251 rows × 7 columns</p>\n",
"</div>"
],
"text/plain": [
"                  location       size  total_sqft  bath  price  bhk  \\\n",
"0      1st Block Jayanagar      4 BHK      2850.0   4.0  428.0    4   \n",
"1      1st Block Jayanagar      3 BHK      1630.0   3.0  194.0    3   \n",
"2      1st Block Jayanagar      3 BHK      1875.0   2.0  235.0    3   \n",
"3      1st Block Jayanagar      3 BHK      1200.0   2.0  130.0    3   \n",
"4      1st Block Jayanagar      2 BHK      1235.0   2.0  148.0    2   \n",
"...                    ...        ...         ...   ...    ...  ...   \n",
"10232                other      2 BHK      1200.0   2.0   70.0    2   \n",
"10233                other      1 BHK      1800.0   1.0  200.0    1   \n",
"10236                other      2 BHK      1353.0   2.0  110.0    2   \n",
"10237                other  1 Bedroom       812.0   1.0   26.0    1   \n",
"10240                other      4 BHK      3600.0   5.0  400.0    4   \n",
"\n",
"       price_per_sqft  \n",
"0        15017.543860  \n",
"1        11901.840491  \n",
"2        12533.333333  \n",
"3        10833.333333  \n",
"4        11983.805668  \n",
"...               ...  \n",
"10232     5833.333333  \n",
"10233    11111.111111  \n",
"10236     8130.081301  \n",
"10237     3201.970443  \n",
"10240    11111.111111  \n",
"\n",
"[7251 rows x 7 columns]"
]
},
"execution_count": 121,
"output_type": "execute_result"
}
],
"source": [
"df9"
]
},
{
"cell_type": "code",
"execution_count": 122,
"outputs": [],
"source": [
"df10 = df9.drop(['size','price_per_sqft'],axis = 'columns')"
]
},
{
"cell_type": "code",
"execution_count": 123,
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>location</th>\n",
"      <th>total_sqft</th>\n",
"      <th>bath</th>\n",
"      <th>price</th>\n",
"      <th>bhk</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>1st Block Jayanagar</td>\n",
"      <td>2850.0</td>\n",
"      <td>4.0</td>\n",
"      <td>428.0</td>\n",
"      <td>4</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>1st Block Jayanagar</td>\n",
"      <td>1630.0</td>\n",
"      <td>3.0</td>\n",
"      <td>194.0</td>\n",
"      <td>3</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>1st Block Jayanagar</td>\n",
"      <td>1875.0</td>\n",
"      <td>2.0</td>\n",
"      <td>235.0</td>\n",
"      <td>3</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>1st Block Jayanagar</td>\n",
"      <td>1200.0</td>\n",
"      <td>2.0</td>\n",
"      <td>130.0</td>\n",
"      <td>3</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>1st Block Jayanagar</td>\n",
"      <td>1235.0</td>\n",
"      <td>2.0</td>\n",
"      <td>148.0</td>\n",
"      <td>2</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"              location  total_sqft  bath  price  bhk\n",
"0  1st Block Jayanagar      2850.0   4.0  428.0    4\n",
"1  1st Block Jayanagar      1630.0   3.0  194.0    3\n",
"2  1st Block Jayanagar      1875.0   2.0  235.0    3\n",
"3  1st Block Jayanagar      1200.0   2.0  130.0    3\n",
"4  1st Block Jayanagar      1235.0   2.0  148.0    2"
]
},
"execution_count": 123,
"output_type": "execute_result"
}
],
"source": [
"df10.head()"
]
},
{
"cell_type": "markdown",
"source": [
"### Using One Hot Encoding For Location"
]
},
{
"cell_type": "markdown",
"source": [
"###### For categorical variables such as location, where no ordinal relationship exists between the categories, integer encoding is not enough. Using it lets the model assume a natural ordering between categories, which may result in poor performance or unexpected results (predictions halfway between categories). In this case, one-hot encoding can be applied instead: the integer-encoded variable is removed and a new binary variable is added for each unique category."
]
},
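{
"cell_type": "markdown",
"source": [
"A minimal sketch of the difference between integer encoding and one-hot encoding, using a small hypothetical `toy` frame rather than the housing data. The location names are only illustrative, and dropping the `other` column is an assumption made here so that a row of all zeros stands for that category."
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data, not rows from the housing dataset\n",
"toy = pd.DataFrame({'location': ['Whitefield', 'Yelahanka', 'other', 'Whitefield']})\n",
"\n",
"# Integer encoding: arbitrary codes that a model could misread as an ordering\n",
"toy['location_code'] = toy['location'].astype('category').cat.codes\n",
"\n",
"# One-hot encoding: one 0/1 column per category, with no implied ordering\n",
"one_hot = pd.get_dummies(toy['location'], dtype=int)\n",
"\n",
"# Drop one column; a row of all zeros then stands for 'other'\n",
"one_hot = one_hot.drop(columns='other')\n",
"\n",
"toy.join(one_hot)"
]
},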
{
"cell_type": "code",
"execution_count": 124,
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>1st Block Jayanagar</th>\n",
"      <th>1st Phase JP Nagar</th>\n",
"      <th>2nd Phase Judicial Layout</th>\n",
"      <th>2nd Stage Nagarbhavi</th>\n",
"      <th>5th Block Hbr Layout</th>\n",
"      <th>5th Phase JP Nagar</th>\n",
"      <th>6th Phase JP Nagar</th>\n",
"      <th>7th Phase JP Nagar</th>\n",
"      <th>8th Phase JP Nagar</th>\n",
"      <th>9th Phase JP Nagar</th>\n",
"      <th>...</th>\n",
"      <th>Vishveshwarya Layout</th>\n",
"      <th>Vishwapriya Layout</th>\n",
"      <th>Vittasandra</th>\n",
"      <th>Whitefield</th>\n",
"      <th>Yelachenahalli</th>\n",
"      <th>Yelahanka</th>\n",
"      <th>Yelahanka New Town</th>\n",
"      <th>Yelenahalli</th>\n",
"      <th>Yeshwanthpur</th>\n",
"      <th>other</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>1</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>...</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>1</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>...</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>1</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>...</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>1</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>...</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>1</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>...</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5 rows × 242 columns</p>\n",
"</div>"
],
"text/plain": [
"   1st Block Jayanagar  1st Phase JP Nagar  2nd Phase Judicial Layout  \\\n",
"0                    1                   0                          0   \n",
"1                    1                   0                          0   \n",
"2                    1                   0                          0   \n",
"3                    1                   0                          0   \n",
"4                    1                   0                          0   \n",
"\n",
"   2nd Stage Nagarbhavi  5th Block Hbr Layout  5th Phase JP Nagar  \\\n",
"0                     0                     0                   0   \n",
"1                     0                     0                   0   \n",
"2                     0                     0                   0   \n",
"3                     0                     0                   0   \n",
"4                     0                     0                   0   \n",
"\n",
"   6th Phase JP Nagar  7th Phase JP Nagar  8th Phase JP Nagar  \\\n",
"0                   0                   0                   0   \n",
"1                   0                   0                   0   \n",
"2                   0                   0                   0   \n",
"3                   0                   0                   0   \n",
"4                   0                   0                   0   \n",
"\n",
"   9th Phase JP Nagar  ...  Vishveshwarya Layout  Vishwapriya Layout  \\\n",
"0                   0  ...                     0                   0   \n",
"1                   0  ...                     0                   0   \n",
"2                   0  ...                     0                   0   \n",
"3                   0  ...                     0                   0   \n",
"4                   0  ...                     0                   0   \n",
"\n",
"   Vittasandra  Whitefield  Yelachenahalli  Yelahanka  Yelahanka New Town  \\\n",
"0            0           0               0          0                   0   \n",
"1            0           0               0          0                   0   \n",
"2            0           0               0          0                   0   \n",
"3            0           0               0          0                   0   \n",
"4            0           0               0          0                   0   \n",
"\n",
"   Yelenahalli  Yeshwanthpur  other  \n",
"0            0             0      0  \n",
"1            0             0      0  \n",
"2            0             0      0  \n",
"3            0             0      0  \n",
"4            0             0      0  \n",
"\n",
"[5 rows x 242 columns]"
]
},
"execution_count": 124,
"output_type": "execute_result"
}
],
"source": [
"dummies = pd.get_dummies(df10.location)\n",
"dummies.head()"
]
},
{
"cell_type": "code",
"execution_count": 125,
"outputs": [],
"source": [
"df11 = pd.concat([df10,dummies],axis = 'columns')"
]
},
{
"cell_type": "code",
"execution_count": 126,
"outputs": [],
"source": [
"df11 = df11.drop(['other'],axis = 'columns')"
]
},
{
"cell_type": "code",
"execution_count": 127,
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>location</th>\n",
"      <th>total_sqft</th>\n",
"      <th>bath</th>\n",
"      <th>price</th>\n",
"      <th>bhk</th>\n",
"      <th>1st Block Jayanagar</th>\n",
"      <th>1st Phase JP Nagar</th>\n",
"      <th>2nd Phase Judicial Layout</th>\n",
"      <th>2nd Stage Nagarbhavi</th>\n",
"      <th>5th Block Hbr Layout</th>\n",
"      <th>...</th>\n",
"      <th>Vijayanagar</th>\n",
"      <th>Vishveshwarya Layout</th>\n",
"      <th>Vishwapriya Layout</th>\n",
"      <th>Vittasandra</th>\n",
"      <th>Whitefield</th>\n",
"      <th>Yelachenahalli</th>\n",
"      <th>Yelahanka</th>\n",
"      <th>Yelahanka New Town</th>\n",
"      <th>Yelenahalli</th>\n",
"      <th>Yeshwanthpur</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>1st Block Jayanagar</td>\n",
"      <td>2850.0</td>\n",
"      <td>4.0</td>\n",
"      <td>428.0</td>\n",
"      <td>4</td>\n",
"      <td>1</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>...</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>1st Block Jayanagar</td>\n",
"      <td>1630.0</td>\n",
"      <td>3.0</td>\n",
"      <td>194.0</td>\n",
"      <td>3</td>\n",
"      <td>1</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>...</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>1st Block Jayanagar</td>\n",
"      <td>1875.0</td>\n",
"      <td>2.0</td>\n",
"      <td>235.0</td>\n",
"      <td>3</td>\n",
"      <td>1</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>...</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>1st Block Jayanagar</td>\n",
"      <td>1200.0</td>\n",
"      <td>2.0</td>\n",
"      <td>130.0</td>\n",
"      <td>3</td>\n",
"      <td>1</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>...</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>1st Block Jayanagar</td>\n",
"      <td>1235.0</td>\n",
"      <td>2.0</td>\n",
"      <td>148.0</td>\n",
"      <td>2</td>\n",
"      <td>1</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>...</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5 rows × 246 columns</p>\n",
"</div>"
],
"text/plain": [
"              location  total_sqft  bath  price  bhk  1st Block Jayanagar  \\\n",
"0  1st Block Jayanagar      2850.0   4.0  428.0    4                    1   \n",
"1  1st Block Jayanagar      1630.0   3.0  194.0    3                    1   \n",
"2  1st Block Jayanagar      1875.0   2.0  235.0    3                    1   \n",
"3  1st Block Jayanagar      1200.0   2.0  130.0    3                    1   \n",
"4  1st Block Jayanagar      1235.0   2.0  148.0    2                    1   \n",
"\n",
"   1st Phase JP Nagar  2nd Phase Judicial Layout  2nd Stage Nagarbhavi  \\\n",
"0                   0                          0                     0   \n",
"1                   0                          0                     0   \n",
"2                   0                          0                     0   \n",
"3                   0                          0                     0   \n",
"4                   0                          0                     0   \n",
"\n",
"   5th Block Hbr Layout  ...  Vijayanagar  Vishveshwarya Layout  \\\n",
"0                     0  ...            0                     0   \n",
"1                     0  ...            0                     0   \n",
"2                     0  ...            0                     0   \n",
"3                     0  ...            0                     0   \n",
"4                     0  ...            0                     0   \n",
"\n",
"   Vishwapriya Layout  Vittasandra  Whitefield  Yelachenahalli  Yelahanka  \\\n",
"0                   0            0           0               0          0   \n",
"1                   0            0           0               0          0   \n",
"2                   0            0           0               0          0   \n",
"3                   0            0           0               0          0   \n",
"4                   0            0           0               0          0   \n",
"\n",
"   Yelahanka New Town  Yelenahalli  Yeshwanthpur  \n",
"0                   0            0             0  \n",
"1                   0            0             0  \n",
"2                   0            0             0  \n",
"3                   0            0             0  \n",
"4                   0            0             0  \n",
"\n",
"[5 rows x 246 columns]"
]
},
"execution_count": 127,
"output_type": "execute_result"
}
],
"source": [
"df11.head()"
]
},
{
"cell_type": "code",
"execution_count": 128,
"outputs": [],
"source": [
"df12 = df11.drop(['location'],axis = 'columns')"
]
},
{
"cell_type": "code",
"execution_count": 129,
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>total_sqft</th>\n",
"      <th>bath</th>\n",
"      <th>price</th>\n",
"      <th>bhk</th>\n",
"      <th>1st Block Jayanagar</th>\n",
"      <th>1st Phase JP Nagar</th>\n",
"      <th>2nd Phase Judicial Layout</th>\n",
"      <th>2nd Stage Nagarbhavi</th>\n",
"      <th>5th Block Hbr Layout</th>\n",
"      <th>5th Phase JP Nagar</th>\n",
"      <th>...</th>\n",
"      <th>Vijayanagar</th>\n",
"      <th>Vishveshwarya Layout</th>\n",
"      <th>Vishwapriya Layout</th>\n",
"      <th>Vittasandra</th>\n",
"      <th>Whitefield</th>\n",
"      <th>Yelachenahalli</th>\n",
"      <th>Yelahanka</th>\n",
"      <th>Yelahanka New Town</th>\n",
"      <th>Yelenahalli</th>\n",
"      <th>Yeshwanthpur</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>2850.0</td>\n",
"      <td>4.0</td>\n",
"      <td>428.0</td>\n",
"      <td>4</td>\n",
"      <td>1</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>...</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>1630.0</td>\n",
"      <td>3.0</td>\n",
"      <td>194.0</td>\n",
"      <td>3</td>\n",
"      <td>1</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>...</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>1875.0</td>\n",
"      <td>2.0</td>\n",
"      <td>235.0</td>\n",
"      <td>3</td>\n",
"      <td>1</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>...</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>1200.0</td>\n",
"      <td>2.0</td>\n",
"      <td>130.0</td>\n",
"      <td>3</td>\n",
"      <td>1</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>...</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>1235.0</td>\n",
"      <td>2.0</td>\n",
"      <td>148.0</td>\n",
"      <td>2</td>\n",
"      <td>1</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>...</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5 rows × 245 columns</p>\n",
"</div>"
],
"text/plain": [
"   total_sqft  bath  price  bhk  1st Block Jayanagar  1st Phase JP Nagar  \\\n",
"0      2850.0   4.0  428.0    4                    1                   0   \n",
"1      1630.0   3.0  194.0    3                    1                   0   \n",
"2      1875.0   2.0  235.0    3                    1                   0   \n",
"3      1200.0   2.0  130.0    3                    1                   0   \n",
"4      1235.0   2.0  148.0    2                    1                   0   \n",
"\n",
"   2nd Phase Judicial Layout  2nd Stage Nagarbhavi  5th Block Hbr Layout  \\\n",
"0                          0                     0                     0   \n",
"1                          0                     0                     0   \n",
"2                          0                     0                     0   \n",
"3                          0                     0                     0   \n",
"4                          0                     0                     0   \n",
"\n",
"   5th Phase JP Nagar  ...  Vijayanagar  Vishveshwarya Layout  \\\n",
"0                   0  ...            0                     0   \n",
"1                   0  ...            0                     0   \n",
"2                   0  ...            0                     0   \n",
"3                   0  ...            0                     0   \n",
"4                   0  ...            0                     0   \n",
"\n",
"   Vishwapriya Layout  Vittasandra  Whitefield  Yelachenahalli  Yelahanka  \\\n",
"0                   0            0           0               0          0   \n",
"1                   0            0           0               0          0   \n",
"2                   0            0           0               0          0   \n",
"3                   0            0           0               0          0   \n",
"4                   0            0           0               0          0   \n",
"\n",
"   Yelahanka New Town  Yelenahalli  Yeshwanthpur  \n",
"0                   0            0             0  \n",
"1                   0            0             0  \n",
"2                   0            0             0  \n",
"3                   0            0             0  \n",
"4                   0            0             0  \n",
"\n",
"[5 rows x 245 columns]"
]
},
"execution_count": 129,
"output_type": "execute_result"
}
],
"source": [
"df12.head()"
]
},
{
"cell_type": "code",
"execution_count": 130,
"outputs": [
{
"data": {
"text/plain": [
"(7251, 245)"
]
},
"execution_count": 130,
"output_type": "execute_result"
}
],
"source": [
"df12.shape\n"
]
},
{
"cell_type": "code",
"execution_count": 131,
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>total_sqft</th>\n",
"      <th>bath</th>\n",
"      <th>bhk</th>\n",
"      <th>1st Block Jayanagar</th>\n",
"      <th>1st Phase JP Nagar</th>\n",
"      <th>2nd Phase Judicial Layout</th>\n",
"      <th>2nd Stage Nagarbhavi</th>\n",
"      <th>5th Block Hbr Layout</th>\n",
"      <th>5th Phase JP Nagar</th>\n",
"      <th>6th Phase JP Nagar</th>\n",
"      <th>...</th>\n",
"      <th>Vijayanagar</th>\n",
"      <th>Vishveshwarya Layout</th>\n",
"      <th>Vishwapriya Layout</th>\n",
"      <th>Vittasandra</th>\n",
"      <th>Whitefield</th>\n",
"      <th>Yelachenahalli</th>\n",
"      <th>Yelahanka</th>\n",
"      <th>Yelahanka New Town</th>\n",
"      <th>Yelenahalli</th>\n",
"      <th>Yeshwanthpur</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>2850.0</td>\n",
"      <td>4.0</td>\n",
"      <td>4</td>\n",
"      <td>1</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>...</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>1630.0</td>\n",
"      <td>3.0</td>\n",
"      <td>3</td>\n",
"      <td>1</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>...</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>1875.0</td>\n",
"      <td>2.0</td>\n",
"      <td>3</td>\n",
"      <td>1</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>...</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>1200.0</td>\n",
"      <td>2.0</td>\n",
"      <td>3</td>\n",
"      <td>1</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>...</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>1235.0</td>\n",
"      <td>2.0</td>\n",
"      <td>2</td>\n",
"      <td>1</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>...</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5 rows × 244 columns</p>\n",
"</div>"
],
"text/plain": [
"   total_sqft  bath  bhk  1st Block Jayanagar  1st Phase JP Nagar  \\\n",
"0      2850.0   4.0    4                    1                   0   \n",
"1      1630.0   3.0    3                    1                   0   \n",
"2      1875.0   2.0    3                    1                   0   \n",
"3      1200.0   2.0    3                    1                   0   \n",
"4      1235.0   2.0    2                    1                   0   \n",
"\n",
"   2nd Phase Judicial Layout  2nd Stage Nagarbhavi  5th Block Hbr Layout  \\\n",
"0                          0                     0                     0   \n",
"1                          0                     0                     0   \n",
"2                          0                     0                     0   \n",
"3                          0                     0                     0   \n",
"4                          0                     0                     0   \n",
"\n",
"   5th Phase JP Nagar  6th Phase JP Nagar  ...  Vijayanagar  \\\n",
"0                   0                   0  ...            0   \n",
"1                   0                   0  ...            0   \n",
"2                   0                   0  ...            0   \n",
"3                   0                   0  ...            0   \n",
"4                   0                   0  ...            0   \n",
"\n",
"   Vishveshwarya Layout  Vishwapriya Layout  Vittasandra  Whitefield  \\\n",
"0                     0                   0            0           0   \n",
"1                     0                   0            0           0   \n",
"2                     0                   0            0           0   \n",
"3                     0                   0            0           0   \n",
"4                     0                   0            0           0   \n",
"\n",
"   Yelachenahalli  Yelahanka  Yelahanka New Town  Yelenahalli  Yeshwanthpur  \n",
"0               0          0                   0            0             0  \n",
"1               0          0                   0            0             0  \n",
"2               0          0                   0            0             0  \n",
"3               0          0                   0            0             0  \n",
"4               0          0                   0            0             0  \n",
"\n",
"[5 rows x 244 columns]"
]
},
"execution_count": 131,
"output_type": "execute_result"
}
],
"source": [
"X = df12.drop('price',axis='columns')\n",
"X.head()"
]
},
{
"cell_type": "code",
"execution_count": 132,
"outputs": [
{
"data": {
"text/plain": [
"0        428.0\n",
"1        194.0\n",
"2        235.0\n",
"3        130.0\n",
"4        148.0\n",
"         ...  \n",
"10232     70.0\n",
"10233    200.0\n",
"10236    110.0\n",
"10237     26.0\n",
"10240    400.0\n",
"Name: price, Length: 7251, dtype: float64"
]
},
"execution_count": 132,
"output_type": "execute_result"
}
],
"source": [
"y = df12.price\n",
"y"
]
},
{
"cell_type": "code",
"execution_count": 133,
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state = 10)"
]
},
{
"cell_type": "code",
"execution_count": 134,
"outputs": [
{
"data": {
"text/plain": [
"0.8452277697873348"
]
},
"execution_count": 134,
"output_type": "execute_result"
}
],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"lr_clf = LinearRegression()\n",
"lr_clf.fit(X_train,y_train)\n",
"lr_clf.score(X_test,y_test)"
]
},
{
"cell_type": "markdown",
"source": [
"## Use K Fold cross validation to measure accuracy of our LinearRegression model"
]
},
{
"cell_type": "markdown",
"source": [
"In k-fold cross-validation, we split the data-set into k subsets (known as folds). In each of k iterations we train on k-1 folds and hold out the remaining fold to evaluate the trained model, so every fold is used for testing exactly once. Note that the code below uses `ShuffleSplit`, which draws 5 independent random 80/20 train/test splits rather than a strict k-fold partition.\n",
"A lower value of k gives a more pessimistically biased performance estimate, because each model is trained on a smaller share of the data. A higher value of k reduces that bias, but costs more computation and can give a more variable estimate. At the extremes, k = 2 resembles the single validation-set approach, whereas k = n (the number of samples) is leave-one-out cross-validation (LOOCV)."
]
},
{
"cell_type": "code",
"execution_count": 135,
"outputs": [
{
"data": {
"text/plain": [
"array([0.82430186, 0.77166234, 0.85089567, 0.80837764, 0.83653286])"
]
},
"execution_count": 135,
"output_type": "execute_result"
}
],
"source": [
"from sklearn.model_selection import ShuffleSplit\n",
"from sklearn.model_selection import cross_val_score\n",
"cv = ShuffleSplit(n_splits = 5, test_size = 0.2, random_state = 0)\n",
"cross_val_score(LinearRegression(),X,y,cv=cv)"
]
},
{
"cell_type": "markdown",
"source": [
"## Find best model using GridSearchCV "
]
},
{
"cell_type": "code",
"execution_count": 136,
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>model</th>\n",
"      <th>best_score</th>\n",
"      <th>best_params</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>linear_regression</td>\n",
"      <td>0.818354</td>\n",
"      <td>{'normalize': False}</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>lasso</td>\n",
"      <td>0.687430</td>\n",
"      <td>{'alpha': 2, 'selection': 'random'}</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>decision_tree</td>\n",
"      <td>0.720273</td>\n",
"      <td>{'criterion': 'friedman_mse', 'splitter': 'best'}</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"               model  best_score  \\\n",
"0  linear_regression    0.818354   \n",
"1              lasso    0.687430   \n",
"2      decision_tree    0.720273   \n",
"\n",
"                                         best_params  \n",
"0                               {'normalize': False}  \n",
"1                {'alpha': 2, 'selection': 'random'}  \n",
"2  {'criterion': 'friedman_mse', 'splitter': 'best'}  "
]
},
"execution_count": 136,
"output_type": "execute_result"
}
],
"source": [
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.linear_model import Lasso\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"def find_best_model_using_gridsearchcv(X,y):\n",
"    algos = {\n",
"        'linear_regression' : {\n",
"            'model': LinearRegression(),\n",
"            'params': {\n",
"                'normalize': [True, False]\n",
"            }\n",
"        },\n",
"        'lasso': {\n",
"            'model': Lasso(),\n",
"            'params': {\n",
"                'alpha': [1,2],\n",
"                'selection': ['random', 'cyclic']\n",
"            }\n",
"        },\n",
"        'decision_tree': {\n",
"            'model': DecisionTreeRegressor(),\n",
"            'params': {\n",
"                'criterion' : ['mse','friedman_mse'],\n",
"                'splitter': ['best','random']\n",
"            }\n",
"        }\n",
"    }\n",
"    scores = []\n",
"    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)\n",
"    for algo_name, config in algos.items():\n",
"        gs =  GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)\n",
"        gs.fit(X,y)\n",
"        scores.append({\n",
"            'model': algo_name,\n",
"            'best_score': gs.best_score_,\n",
"            'best_params': gs.best_params_\n",
"        })\n",
"\n",
"    return pd.DataFrame(scores,columns=['model','best_score','best_params'])\n",
"\n",
"find_best_model_using_gridsearchcv(X,y)"
]
},
{
"cell_type": "markdown",
"source": [
"#### Based on above results we can say that LinearRegression gives the best score. Hence we will use that. "
]
},
{
"cell_type": "markdown",
"source": [
"#### Test the model on a few properties"
]
},
{
"cell_type": "code",
"execution_count": 137,
"outputs": [],
"source": [
"def predict_price(location,sqft,bath,bhk):\n",
"    # Positions of the one-hot column for this location (empty if the location is unknown or 'other')\n",
"    loc_matches = np.where(X.columns==location)[0]\n",
"\n",
"    x = np.zeros(len(X.columns))\n",
"    x[0] = sqft\n",
"    x[1] = bath\n",
"    x[2] = bhk\n",
"    # Leave all location dummies at 0 when there is no matching column\n",
"    if len(loc_matches) > 0:\n",
"        x[loc_matches[0]] = 1\n",
"    return lr_clf.predict([x])[0]"
]
},
{
"cell_type": "code",
"execution_count": 138,
"outputs": [
{
"data": {
"text/plain": [
"83.49904676591962"
]
},
"execution_count": 138,
"output_type": "execute_result"
}
],
"source": [
"predict_price('1st Phase JP Nagar',1000,2,2)"
]
},
{
"cell_type": "code",
"execution_count": 139,
"outputs": [
{
"data": {
"text/plain": [
"184.58430202040012"
]
},
"execution_count": 139,
"output_type": "execute_result"
}
],
"source": [
"predict_price('Indira Nagar',1000, 3, 3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": []
}
],
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
```

# Finding the Best ML Algorithm for House Price Prediction Using K-Fold Cross-Validation and GridSearchCV

In machine learning, fitting a model to the training data does not by itself tell us that the model will work accurately on real, unseen data. We need some assurance that the model has learned the genuine patterns in the data and has not picked up too much noise. For this purpose we use cross-validation: a technique in which we train the model on a subset of the data-set and then evaluate it on the complementary subset.

``````import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.rcParams["figure.figsize"]=(20,10)``````
``df1 = pd.read_csv("Bengaluru_House_Data.csv")``
``df1.head()``

| | area_type | availability | location | size | society | total_sqft | bath | balcony | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Super built-up Area | 19-Dec | Electronic City Phase II | 2 BHK | Coomee | 1056 | 2.0 | 1.0 | 39.07 |
| 1 | Plot Area | Ready To Move | Chikka Tirupathi | 4 Bedroom | Theanmp | 2600 | 5.0 | 3.0 | 120.00 |
| 2 | Built-up Area | Ready To Move | Uttarahalli | 3 BHK | NaN | 1440 | 2.0 | 3.0 | 62.00 |
| 3 | Super built-up Area | Ready To Move | Lingadheeranahalli | 3 BHK | Soiewre | 1521 | 3.0 | 1.0 | 95.00 |
| 4 | Super built-up Area | Ready To Move | Kothanur | 2 BHK | NaN | 1200 | 2.0 | 1.0 | 51.00 |

``df1.groupby('area_type')['area_type'].agg('count')``
```area_type
Built-up  Area          2418
Carpet  Area              87
Plot  Area              2025
Super built-up  Area    8790
Name: area_type, dtype: int64```
``df2 = df1.drop(['area_type','society','balcony','availability'] , axis="columns")``
``df2.head()``

| | location | size | total_sqft | bath | price |
|---|---|---|---|---|---|
| 0 | Electronic City Phase II | 2 BHK | 1056 | 2.0 | 39.07 |
| 1 | Chikka Tirupathi | 4 Bedroom | 2600 | 5.0 | 120.00 |
| 2 | Uttarahalli | 3 BHK | 1440 | 2.0 | 62.00 |
| 3 | Lingadheeranahalli | 3 BHK | 1521 | 3.0 | 95.00 |
| 4 | Kothanur | 2 BHK | 1200 | 2.0 | 51.00 |

### Data Cleaning: Handling NA/Null values

``df2.isnull().sum()``
```location       1
size          16
total_sqft     0
bath          73
price          0
dtype: int64```
``````df3 = df2.dropna()
df3.isnull().sum()``````
```location      0
size          0
total_sqft    0
bath          0
price         0
dtype: int64```
``df2['size'].unique()``
```array(['2 BHK', '4 Bedroom', '3 BHK', '4 BHK', '6 Bedroom', '3 Bedroom',
'1 BHK', '1 RK', '1 Bedroom', '8 Bedroom', '2 Bedroom',
'7 Bedroom', '5 BHK', '7 BHK', '6 BHK', '5 Bedroom', '11 BHK',
'9 BHK', nan, '9 Bedroom', '27 BHK', '10 Bedroom', '11 Bedroom',
'10 BHK', '19 BHK', '16 BHK', '43 Bedroom', '14 BHK', '8 BHK',
'12 Bedroom', '13 BHK', '18 Bedroom'], dtype=object)```

### Feature Engineering

Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered applied machine learning in its own right.

``df3['bhk'] = df3['size'].apply(lambda x: int(x.split(' ')[0]))``
```<ipython-input-81-4c4c73fbe7f4>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df3['bhk'] = df3['size'].apply(lambda x: int(x.split(' ')[0]))
```
``df3['bhk'].unique()``
```array([ 2,  4,  3,  6,  1,  8,  7,  5, 11,  9, 27, 10, 19, 16, 43, 14, 12,
13, 18], dtype=int64)```
``df3[df3.bhk>20]``

| | location | size | total_sqft | bath | price | bhk |
|---|---|---|---|---|---|---|
| 1718 | 2Electronic City Phase II | 27 BHK | 8000 | 27.0 | 230.0 | 27 |
| 4684 | Munnekollal | 43 Bedroom | 2400 | 40.0 | 660.0 | 43 |

``df3.total_sqft.unique()``
```array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
dtype=object)```
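``````def is_float(x):
    # True if x parses as a single number, False otherwise (e.g. '2100 - 2850')
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True``````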
``df3[~df3['total_sqft'].apply(is_float)].head(10)``

| | location | size | total_sqft | bath | price | bhk |
|---|---|---|---|---|---|---|
| 30 | Yelahanka | 4 BHK | 2100 - 2850 | 4.0 | 186.000 | 4 |
| 122 | Hebbal | 4 BHK | 3067 - 8156 | 4.0 | 477.000 | 4 |
| 137 | 8th Phase JP Nagar | 2 BHK | 1042 - 1105 | 2.0 | 54.005 | 2 |
| 165 | Sarjapur | 2 BHK | 1145 - 1340 | 2.0 | 43.490 | 2 |
| 188 | KR Puram | 2 BHK | 1015 - 1540 | 2.0 | 56.800 | 2 |
| 410 | Kengeri | 1 BHK | 34.46Sq. Meter | 1.0 | 18.500 | 1 |
| 549 | Hennur Road | 2 BHK | 1195 - 1440 | 2.0 | 63.770 | 2 |
| 648 | Arekere | 9 Bedroom | 4125Perch | 9.0 | 265.000 | 9 |
| 661 | Yelahanka | 2 BHK | 1120 - 1145 | 2.0 | 48.130 | 2 |
| 672 | Bettahalsoor | 4 Bedroom | 3090 - 5002 | 4.0 | 445.000 | 4 |

##### The rows above show that total_sqft can be a range (e.g. 2100 - 2850). In such cases we can take the average of the minimum and maximum values of the range. Other values, such as 34.46Sq. Meter, could be converted to square feet with a unit conversion, but to keep things simple I just turn such corner cases into missing values (None)
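``````def convert_sqft_to_num(x):
    # A range such as '2100 - 2850' becomes the average of its two ends
    tokens = x.split('-')
    if len(tokens) == 2:
        return (float(tokens[0]) + float(tokens[1]))/2
    try:
        return float(x)
    except ValueError:
        # Anything else (e.g. '34.46Sq. Meter') becomes a missing value
        return None
``````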
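The unit conversion mentioned above could look roughly like the following. This is only an illustrative sketch: `parse_total_sqft` and the `UNIT_TO_SQFT` table are hypothetical names that are not part of the notebook, and only two assumed unit spellings are handled. Anything else (for example `4125Perch`) still becomes a missing value.

```python
import re

# Conversion factors to square feet (assumed unit spellings; extend as needed)
UNIT_TO_SQFT = {
    "sq. meter": 10.7639,  # 1 square metre is about 10.7639 square feet
    "sq. yards": 9.0,      # 1 square yard is exactly 9 square feet
}

def parse_total_sqft(value):
    """Return the area in square feet, or None if the text is not understood."""
    text = str(value).strip()
    # Range such as '2100 - 2850': use the midpoint
    parts = text.split("-")
    if len(parts) == 2:
        try:
            return (float(parts[0]) + float(parts[1])) / 2
        except ValueError:
            return None
    # Plain number such as '1056'
    try:
        return float(text)
    except ValueError:
        pass
    # Number followed by a known unit such as '34.46Sq. Meter'
    match = re.fullmatch(r"([0-9]*\.?[0-9]+)\s*(.+)", text)
    if match:
        factor = UNIT_TO_SQFT.get(match.group(2).strip().lower())
        if factor is not None:
            return float(match.group(1)) * factor
    return None

print(parse_total_sqft("2100 - 2850"))     # midpoint of the range
print(parse_total_sqft("34.46Sq. Meter"))  # converted from square metres
print(parse_total_sqft("4125Perch"))       # unknown unit, so None
```

Used with `df4['total_sqft'].apply(parse_total_sqft)`, this would keep a few more rows than the simpler function, at the cost of trusting the unit labels in the raw data.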
``````df4 = df3.copy()
df4['total_sqft'] = df4['total_sqft'].apply(convert_sqft_to_num)``````
``df5 = df4.copy()``
##### Add a new feature: price per square foot (price is in lakh rupees, hence the factor of 100000)
``df5['price_per_sqft'] = df5['price']*100000/df5['total_sqft']``
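As a quick check of the formula, here is the first row of the data worked by hand. It assumes, as the plot labels later in the notebook suggest, that `price` is given in lakh rupees (1 lakh = 100,000), so the result is in rupees per square foot.

```python
# First row of the data: 39.07 lakh rupees for 1056 sqft
price_lakh = 39.07
total_sqft = 1056.0

# Convert lakh to rupees, then divide by the area
price_per_sqft = price_lakh * 100000 / total_sqft
print(round(price_per_sqft, 2))  # 3699.81
```

This matches the `price_per_sqft` value shown for row 0 in `df5.head()`.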
``len(df5.location.unique())``
`1304`

#### Examine location, which is a categorical variable. We need to reduce the number of distinct locations before encoding it

``df5.location = df5.location.apply(lambda x: x.strip())``
``location_stats = df5.groupby('location')['location'].agg('count').sort_values(ascending=False)``
``location_stats``
```location
Whitefield           535
Electronic City      304
Thanisandra          236
...
LIC Colony             1
Kuvempu Layout         1
Kumbhena Agrahara      1
Kudlu Village,         1
1 Annasandrapalya      1
Name: location, Length: 1293, dtype: int64```
``len(location_stats[location_stats<=10])``
`1052`

## Dimensionality Reduction

#### Any location with 10 or fewer data points is tagged as "other". This reduces the number of categories by a huge amount. Later on, when we do one-hot encoding, it will leave us with far fewer dummy columns

``location_stats_less_10 = location_stats[location_stats<=10]``
``location_stats_less_10``
```location
BTM 1st Stage          10
Basapura               10
Sector 1 HSR Layout    10
Naganathapura          10
Kalkere                10
..
LIC Colony              1
Kuvempu Layout          1
Kumbhena Agrahara       1
Kudlu Village,          1
1 Annasandrapalya       1
Name: location, Length: 1052, dtype: int64```
``len(df5.location.unique())``
`1293`
``df5.location = df5.location.apply(lambda x: 'other' if x in location_stats_less_10 else x)``
``len(df5.location.unique())``
`242`
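The same idea, stripped of pandas, can be sketched in a few lines. The helper name `group_rare_categories` and the toy data are made up for illustration; the threshold of 10 mirrors the `<=10` filter used above.

```python
from collections import Counter

def group_rare_categories(values, max_rare_count=10, other_label="other"):
    """Replace every category that occurs max_rare_count times or fewer with other_label."""
    counts = Counter(values)
    return [other_label if counts[v] <= max_rare_count else v for v in values]

# Toy data: 'Whitefield' is frequent, the other two locations are rare
locations = ["Whitefield"] * 12 + ["LIC Colony"] + ["Kalkere"] * 10
grouped = group_rare_categories(locations)
print(sorted(set(grouped)))  # ['Whitefield', 'other']
```

Note that a location with exactly 10 rows is also folded into "other", which is what the notebook's filter does as well.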
``df5.head()``

| | location | size | total_sqft | bath | price | bhk | price_per_sqft |
|---|---|---|---|---|---|---|---|
| 0 | Electronic City Phase II | 2 BHK | 1056.0 | 2.0 | 39.07 | 2 | 3699.810606 |
| 1 | Chikka Tirupathi | 4 Bedroom | 2600.0 | 5.0 | 120.00 | 4 | 4615.384615 |
| 2 | Uttarahalli | 3 BHK | 1440.0 | 2.0 | 62.00 | 3 | 4305.555556 |
| 3 | Lingadheeranahalli | 3 BHK | 1521.0 | 3.0 | 95.00 | 3 | 6245.890861 |
| 4 | Kothanur | 2 BHK | 1200.0 | 2.0 | 51.00 | 2 | 4250.000000 |

## Outlier Removal Using Business Logic

``df5[df5.total_sqft/df5.bhk<300].head()``

| | location | size | total_sqft | bath | price | bhk | price_per_sqft |
|---|---|---|---|---|---|---|---|
| 9 | other | 6 Bedroom | 1020.0 | 6.0 | 370.0 | 6 | 36274.509804 |
| 45 | HSR Layout | 8 Bedroom | 600.0 | 9.0 | 200.0 | 8 | 33333.333333 |
| 58 | Murugeshpalya | 6 Bedroom | 1407.0 | 4.0 | 150.0 | 6 | 10660.980810 |
| 68 | Devarachikkanahalli | 8 Bedroom | 1350.0 | 7.0 | 85.0 | 8 | 6296.296296 |
| 70 | other | 3 Bedroom | 500.0 | 3.0 | 100.0 | 3 | 20000.000000 |

##### Check the data points above. One is a 6 BHK apartment with only 1020 sqft; another is 8 BHK with a total of 600 sqft. These are clear data errors that can be removed safely
``df5.shape``
`(13246, 7)`
``df6 = df5[~(df5.total_sqft/df5.bhk<300)]``
``df6.shape``
`(12502, 7)`
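The business rule behind this filter is simply a minimum area per bedroom. A tiny sketch, with the hypothetical helper `has_plausible_area` and without any handling of missing areas, makes the threshold explicit:

```python
MIN_SQFT_PER_BHK = 300  # threshold used in the notebook

def has_plausible_area(total_sqft, bhk, min_sqft_per_bhk=MIN_SQFT_PER_BHK):
    """True if the property has at least min_sqft_per_bhk square feet per bedroom."""
    return total_sqft / bhk >= min_sqft_per_bhk

print(has_plausible_area(1020.0, 6))  # False: 170 sqft per bedroom
print(has_plausible_area(1056.0, 2))  # True: 528 sqft per bedroom
```

The value 300 is a judgment call rather than a fixed standard; a different market might warrant a different threshold.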

### Outlier Removal Using Standard Deviation and Mean

``df6.price_per_sqft.describe()``
```count     12456.000000
mean       6308.502826
std        4168.127339
min         267.829813
25%        4210.526316
50%        5294.117647
75%        6916.666667
max      176470.588235
Name: price_per_sqft, dtype: float64```
``````def remove_pps_outliers(df):
    # Per location, keep only rows within one standard deviation of the mean price per sqft
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)
        reduced_df = subdf[(subdf.price_per_sqft>(m-st)) & (subdf.price_per_sqft<=(m+st))]
        df_out = pd.concat([df_out,reduced_df],ignore_index=True)
    return df_out``````
``df7 = remove_pps_outliers(df6)``
``df7.shape``
`(10241, 7)`
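To see what the per-location filter does, here is a minimal sketch on a single toy group, using only the standard library. `keep_within_one_std` is a hypothetical helper; it uses the population standard deviation (`pstdev`, i.e. ddof=0), which is what `np.std` computes by default.

```python
from statistics import mean, pstdev

def keep_within_one_std(values):
    """Keep values v with mean - std < v <= mean + std (population std, like np.std)."""
    m = mean(values)
    st = pstdev(values)
    return [v for v in values if (m - st) < v <= (m + st)]

# Toy price-per-sqft values for one location; 20000 is an obvious outlier
prices = [4000, 4200, 4500, 4700, 5000, 20000]
print(keep_within_one_std(prices))  # [4000, 4200, 4500, 4700, 5000]
```

One standard deviation is a fairly aggressive cut: on roughly normal data it discards about a third of the rows, which is consistent with the drop from 12502 to 10241 rows above.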
``````def plot_scatter_chart(df,location):
    # Compare 2 BHK and 3 BHK prices against area for one location
    bhk2 = df[(df.location==location) & (df.bhk==2)]
    bhk3 = df[(df.location==location) & (df.bhk==3)]
    matplotlib.rcParams['figure.figsize'] = (15,10)
    plt.scatter(bhk2.total_sqft,bhk2.price,color='blue',label='2 BHK', s=50)
    plt.scatter(bhk3.total_sqft,bhk3.price,marker='+', color='green',label='3 BHK', s=50)
    plt.xlabel("Total Square Feet Area")
    plt.ylabel("Price (Lakh Indian Rupees)")
    plt.title(location)
    plt.legend()

plot_scatter_chart(df7,"Rajaji Nagar")``````
``plot_scatter_chart(df7,"Hebbal")``
``````def remove_bhk_outliers(df):
    # Within each location, drop n-BHK rows priced below the mean price per sqft of (n-1)-BHK rows
    exclude_indices = np.array([])
    for location,location_df in df.groupby('location'):
        bhk_stats = {}
        for bhk,bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean':np.mean(bhk_df.price_per_sqft),
                'std':np.std(bhk_df.price_per_sqft),
                'count':bhk_df.shape[0]
            }
        for bhk,bhk_df in location_df.groupby("bhk"):
            stats = bhk_stats.get(bhk-1)
            if stats and stats['count']>5:
                exclude_indices = np.append(exclude_indices,bhk_df[bhk_df.price_per_sqft < (stats['mean'])].index.values)
    return df.drop(exclude_indices,axis='index')``````
``df8 = remove_bhk_outliers(df7)``
``df8.shape``
`(7329, 7)`

### Outlier Removal Using Bathrooms Feature

``df8.bath.unique()``
`array([ 4.,  3.,  2.,  5.,  8.,  1.,  6.,  7.,  9., 12., 16., 13.])`
``````plt.hist(df8.bath,rwidth=0.8)
plt.xlabel("Number of bathrooms")
plt.ylabel("Count")``````
`Text(0, 0.5, 'Count')`
``df8[df8.bath>10]``

| | location | size | total_sqft | bath | price | bhk | price_per_sqft |
|---|---|---|---|---|---|---|---|
| 5277 | Neeladri Nagar | 10 BHK | 4000.0 | 12.0 | 160.0 | 10 | 4000.000000 |
| 8486 | other | 10 BHK | 12000.0 | 12.0 | 525.0 | 10 | 4375.000000 |
| 8575 | other | 16 BHK | 10000.0 | 16.0 | 550.0 | 16 | 5500.000000 |
| 9308 | other | 11 BHK | 6000.0 | 12.0 | 150.0 | 11 | 2500.000000 |
| 9639 | other | 13 BHK | 5425.0 | 13.0 | 275.0 | 13 | 5069.124424 |

##### It is unusual to have 2 more bathrooms than number of bedrooms in a home
``df8[df8.bath>df8.bhk + 2]``

| | location | size | total_sqft | bath | price | bhk | price_per_sqft |
|---|---|---|---|---|---|---|---|
| 1626 | Chikkabanavar | 4 Bedroom | 2460.0 | 7.0 | 80.0 | 4 | 3252.032520 |
| 5238 | Nagasandra | 4 Bedroom | 7000.0 | 8.0 | 450.0 | 4 | 6428.571429 |
| 6711 | Thanisandra | 3 BHK | 1806.0 | 6.0 | 116.0 | 3 | 6423.034330 |
| 8411 | other | 6 BHK | 11338.0 | 9.0 | 1000.0 | 6 | 8819.897689 |

``df9 = df8[df8.bath < df8.bhk + 2]``
``df9.shape``
`(7251, 7)`
``df9``

| | location | size | total_sqft | bath | price | bhk | price_per_sqft |
|---|---|---|---|---|---|---|---|
| 0 | 1st Block Jayanagar | 4 BHK | 2850.0 | 4.0 | 428.0 | 4 | 15017.543860 |
| 1 | 1st Block Jayanagar | 3 BHK | 1630.0 | 3.0 | 194.0 | 3 | 11901.840491 |
| 2 | 1st Block Jayanagar | 3 BHK | 1875.0 | 2.0 | 235.0 | 3 | 12533.333333 |
| 3 | 1st Block Jayanagar | 3 BHK | 1200.0 | 2.0 | 130.0 | 3 | 10833.333333 |
| 4 | 1st Block Jayanagar | 2 BHK | 1235.0 | 2.0 | 148.0 | 2 | 11983.805668 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10232 | other | 2 BHK | 1200.0 | 2.0 | 70.0 | 2 | 5833.333333 |
| 10233 | other | 1 BHK | 1800.0 | 1.0 | 200.0 | 1 | 11111.111111 |
| 10236 | other | 2 BHK | 1353.0 | 2.0 | 110.0 | 2 | 8130.081301 |
| 10237 | other | 1 Bedroom | 812.0 | 1.0 | 26.0 | 1 | 3201.970443 |
| 10240 | other | 4 BHK | 3600.0 | 5.0 | 400.0 | 4 | 11111.111111 |

7251 rows × 7 columns

``df10 = df9.drop(['size','price_per_sqft'],axis = 'columns')``
``df10.head()``

| | location | total_sqft | bath | price | bhk |
|---|---|---|---|---|---|
| 0 | 1st Block Jayanagar | 2850.0 | 4.0 | 428.0 | 4 |
| 1 | 1st Block Jayanagar | 1630.0 | 3.0 | 194.0 | 3 |
| 2 | 1st Block Jayanagar | 1875.0 | 2.0 | 235.0 | 3 |
| 3 | 1st Block Jayanagar | 1200.0 | 2.0 | 130.0 | 3 |
| 4 | 1st Block Jayanagar | 1235.0 | 2.0 | 148.0 | 2 |

### Using One Hot Encoding For Location

###### For categorical variables such as location, where no ordinal relationship exists between the categories, integer encoding is not enough. Using it lets the model assume a natural ordering between categories, which may result in poor performance or unexpected results (such as predictions halfway between categories). A one-hot encoding can be used instead: the categorical variable is removed and a new binary variable is added for each unique category.
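``````dummies = pd.get_dummies(df10.location)
dummies.head()``````

| | 1st Block Jayanagar | 1st Phase JP Nagar | 2nd Phase Judicial Layout | 2nd Stage Nagarbhavi | 5th Block Hbr Layout | 5th Phase JP Nagar | 6th Phase JP Nagar | 7th Phase JP Nagar | 8th Phase JP Nagar | 9th Phase JP Nagar | ... | Vishveshwarya Layout | Vishwapriya Layout | Vittasandra | Whitefield | Yelachenahalli | Yelahanka | Yelahanka New Town | Yelenahalli | Yeshwanthpur | other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |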

5 rows × 242 columns

``df11 = pd.concat([df10,dummies],axis = 'columns')``
``df11 = df11.drop(['other'],axis = 'columns')``
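Dropping the `other` column is deliberate: with k categories, k-1 dummy columns are enough, because the dropped category is represented by a row of all zeros. Keeping all k columns would make them perfectly collinear with the intercept of a linear model. The sketch below shows the idea in plain Python; `one_hot` is a hypothetical helper, not the pandas implementation.

```python
def one_hot(values, drop=None):
    """Return (column_names, rows) with one 0/1 column per category, minus `drop`."""
    columns = sorted(c for c in set(values) if c != drop)
    rows = [[1 if v == c else 0 for c in columns] for v in values]
    return columns, rows

locations = ["Whitefield", "other", "Hebbal", "Whitefield"]
columns, rows = one_hot(locations, drop="other")
print(columns)  # ['Hebbal', 'Whitefield']
for row in rows:
    print(row)
# [0, 1]
# [0, 0]   <- 'other' is the all-zeros row
# [1, 0]
# [0, 1]
```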
``df11.head()``
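
| | location | total_sqft | bath | price | bhk | 1st Block Jayanagar | 1st Phase JP Nagar | 2nd Phase Judicial Layout | 2nd Stage Nagarbhavi | 5th Block Hbr Layout | ... | Vijayanagar | Vishveshwarya Layout | Vishwapriya Layout | Vittasandra | Whitefield | Yelachenahalli | Yelahanka | Yelahanka New Town | Yelenahalli | Yeshwanthpur |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1st Block Jayanagar | 2850.0 | 4.0 | 428.0 | 4 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1st Block Jayanagar | 1630.0 | 3.0 | 194.0 | 3 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1st Block Jayanagar | 1875.0 | 2.0 | 235.0 | 3 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1st Block Jayanagar | 1200.0 | 2.0 | 130.0 | 3 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1st Block Jayanagar | 1235.0 | 2.0 | 148.0 | 2 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |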

5 rows × 246 columns

``df12 = df11.drop(['location'],axis = 'columns')``
``df12.head()``
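
| | total_sqft | bath | price | bhk | 1st Block Jayanagar | 1st Phase JP Nagar | 2nd Phase Judicial Layout | 2nd Stage Nagarbhavi | 5th Block Hbr Layout | 5th Phase JP Nagar | ... | Vijayanagar | Vishveshwarya Layout | Vishwapriya Layout | Vittasandra | Whitefield | Yelachenahalli | Yelahanka | Yelahanka New Town | Yelenahalli | Yeshwanthpur |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2850.0 | 4.0 | 428.0 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1630.0 | 3.0 | 194.0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1875.0 | 2.0 | 235.0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1200.0 | 2.0 | 130.0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1235.0 | 2.0 | 148.0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |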

5 rows × 245 columns

``````df12.shape
``````
`(7251, 245)`
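``````X = df12.drop('price',axis='columns')
X.head()``````

| | total_sqft | bath | bhk | 1st Block Jayanagar | 1st Phase JP Nagar | 2nd Phase Judicial Layout | 2nd Stage Nagarbhavi | 5th Block Hbr Layout | 5th Phase JP Nagar | 6th Phase JP Nagar | ... | Vijayanagar | Vishveshwarya Layout | Vishwapriya Layout | Vittasandra | Whitefield | Yelachenahalli | Yelahanka | Yelahanka New Town | Yelenahalli | Yeshwanthpur |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2850.0 | 4.0 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1630.0 | 3.0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1875.0 | 2.0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1200.0 | 2.0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1235.0 | 2.0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |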

5 rows × 244 columns

``````y = df12.price
y``````
```0        428.0
1        194.0
2        235.0
3        130.0
4        148.0
...
10232     70.0
10233    200.0
10236    110.0
10237     26.0
10240    400.0
Name: price, Length: 7251, dtype: float64```
``````from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state = 10)``````
``````from sklearn.linear_model import LinearRegression
lr_clf = LinearRegression()
lr_clf.fit(X_train,y_train)
lr_clf.score(X_test,y_test)``````
`0.8452277697873348`

## Use K Fold cross validation to measure accuracy of our LinearRegression model

In k-fold cross-validation, we split the data-set into k subsets (known as folds). In each of k iterations we train on k-1 folds and hold out the remaining fold to evaluate the trained model, so every fold is used for testing exactly once. Note that the code below uses `ShuffleSplit`, which draws 5 independent random 80/20 train/test splits rather than a strict k-fold partition.

A lower value of k gives a more pessimistically biased performance estimate, because each model is trained on a smaller share of the data. A higher value of k reduces that bias, but costs more computation and can give a more variable estimate. At the extremes, k = 2 resembles the single validation-set approach, whereas k = n (the number of samples) is leave-one-out cross-validation (LOOCV).
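The k-fold splitting scheme can be sketched in plain Python. `k_fold_indices` is a hypothetical helper written for illustration: it uses contiguous folds without shuffling, whereas scikit-learn's `ShuffleSplit`, which the notebook uses, draws independent random splits instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

for train, test in k_fold_indices(n_samples=10, k=5):
    print("train:", train, "test:", test)
```

Every index lands in the test set exactly once, which is the property that distinguishes k-fold from repeated random splitting.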

``````from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
cv = ShuffleSplit(n_splits = 5, test_size = 0.2, random_state = 0)
cross_val_score(LinearRegression(),X,y,cv=cv)``````
`array([0.82430186, 0.77166234, 0.85089567, 0.80837764, 0.83653286])`
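For a regressor, `cross_val_score` reports the estimator's default score, which is R². A single summary number is usually taken as the mean over the splits, as in this small check:

```python
from statistics import mean

# The five scores printed above
scores = [0.82430186, 0.77166234, 0.85089567, 0.80837764, 0.83653286]
print(round(mean(scores), 6))  # 0.818354
```

The same value shows up as `best_score` for linear regression in the grid search results later in the notebook, since that search uses the same splitter.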

## Find best model using GridSearchCV

``````from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor

def find_best_model_using_gridsearchcv(X,y):
    # Candidate models with the hyperparameter values to try for each
    algos = {
        'linear_regression' : {
            'model': LinearRegression(),
            'params': {
                'normalize': [True, False]
            }
        },
        'lasso': {
            'model': Lasso(),
            'params': {
                'alpha': [1,2],
                'selection': ['random', 'cyclic']
            }
        },
        'decision_tree': {
            'model': DecisionTreeRegressor(),
            'params': {
                'criterion' : ['mse','friedman_mse'],
                'splitter': ['best','random']
            }
        }
    }
    scores = []
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    for algo_name, config in algos.items():
        gs =  GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)
        gs.fit(X,y)
        scores.append({
            'model': algo_name,
            'best_score': gs.best_score_,
            'best_params': gs.best_params_
        })

    return pd.DataFrame(scores,columns=['model','best_score','best_params'])

find_best_model_using_gridsearchcv(X,y)``````

| | model | best_score | best_params |
|---|---|---|---|
| 0 | linear_regression | 0.818354 | {'normalize': False} |
| 1 | lasso | 0.687430 | {'alpha': 2, 'selection': 'random'} |
| 2 | decision_tree | 0.720273 | {'criterion': 'friedman_mse', 'splitter': 'best'} |
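A note for readers on newer library versions: in scikit-learn 1.2 and later, `LinearRegression` no longer accepts `normalize`, and the `DecisionTreeRegressor` criterion `'mse'` has been renamed `'squared_error'`, so the grids above would raise errors there. The sketch below shows one possible adjustment and lists the combinations a grid search would try. Swapping in `fit_intercept` is my substitution, not the author's choice, and scores obtained with it may differ from the table above.

```python
from itertools import product

# Parameter grids adjusted for scikit-learn 1.2 or later (an assumption about your environment)
param_grids = {
    "linear_regression": {"fit_intercept": [True, False]},
    "lasso": {"alpha": [1, 2], "selection": ["random", "cyclic"]},
    "decision_tree": {
        "criterion": ["squared_error", "friedman_mse"],
        "splitter": ["best", "random"],
    },
}

def expand_grid(grid):
    """List every parameter combination that a grid search would try."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]

for model_name, grid in param_grids.items():
    combos = expand_grid(grid)
    print(model_name, len(combos), combos[0])
```

Each dictionary in `param_grids` can be passed as the `params` entry of the corresponding model in `find_best_model_using_gridsearchcv`. With 5 splits, the lasso and decision tree grids each cost 4 × 5 = 20 model fits.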

#### Test the model on a few properties

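``````def predict_price(location,sqft,bath,bhk):
    # Positions of the one-hot column for this location (empty if the location is unknown or 'other')
    loc_matches = np.where(X.columns==location)[0]

    x = np.zeros(len(X.columns))
    x[0] = sqft
    x[1] = bath
    x[2] = bhk
    # Leave all location dummies at 0 when there is no matching column
    if len(loc_matches) > 0:
        x[loc_matches[0]] = 1
    return lr_clf.predict([x])[0]
``````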
``predict_price('1st Phase JP Nagar',1000,2,2)``
`83.49904676591962`
``predict_price('Indira Nagar',1000, 3, 3)``
`184.58430202040012`