Building a House Price Prediction Machine Learning Model

4 min read · Oct 5, 2020

In this blog post, I'm going to share my experience of building a machine learning model for predicting house prices.

Machine learning has played a major role in recent years in everyday speech commands, spam recognition, image detection, product recommendation and medical diagnosis. Current machine learning algorithms help us enhance security alerts, ensure public safety and improve medical care. Machine learning systems also provide better customer service and safer automobile systems. When I started experimenting with machine learning, I wanted to come up with an application that would solve a real-world problem but would not be too complicated to implement. I also wanted to practice working with regression algorithms.

People looking to buy a new home tend to be conservative with their budgets and market strategies. Existing approaches calculate house prices without predicting future market trends and price increases. The goal of this post is to predict house prices for Bangalore real estate customers with respect to their budgets and priorities. Future prices are predicted by analyzing previous market trends and price ranges, as well as upcoming developments.

Software:

First of all, we need data on house prices, which can be downloaded from Kaggle. The dataset is called Bangalore house price data.

Jupyter notebook or Google Colab

Anaconda Navigator-Spyder Anaconda3

Exploratory data analysis

Import the necessary libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Load dataset

Load the CSV file from your local machine:

from google.colab import files
uploaded = files.upload()

After reading the uploaded file into a DataFrame (df) and printing df.shape, you should see (13320, 9) in the output, which means that our dataset has 13320 rows and 9 columns.

Let's see how the dataset actually looks. To do so, we can call df.head(), which returns the first five rows.
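As a rough illustration of the shape and head() checks, here is a sketch on a tiny hand-made DataFrame. The rows and the location column are made up for the example; only bath, balcony and price are column names mentioned in this post.

```python
import pandas as pd

# Tiny stand-in for the Kaggle data. The values and the "location" column
# are illustrative assumptions, not rows from the real file.
df = pd.DataFrame({
    "location": ["Whitefield", "Hebbal", "Yelahanka"],
    "bath": [2.0, 3.0, 2.0],
    "balcony": [1.0, 2.0, 1.0],
    "price": [39.07, 120.0, 62.0],
})

shape = df.shape      # (number of rows, number of columns)
preview = df.head(2)  # first two rows; head() with no argument returns five

print(shape)
print(preview)
```

On the real dataset the same two calls would report the (13320, 9) shape quoted above and print the first rows of the file.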

Exploratory Data Analysis:

Next we work with the data we downloaded from Kaggle. The data contains many errors, so we have to clean it before it is ready to be used in our website. The dataset has 3 numeric columns and 6 object (non-numeric) columns. Among the numeric columns, the maximum house price is 3600 lakh rupees and the minimum house price is 8 lakh rupees. We also draw pairplots of the numeric columns, and a correlation heatmap, which shows that bath is more strongly correlated with price than balcony is.
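The minimum and maximum quoted above are the kind of numbers describe() reports. A minimal sketch, again on made-up rows rather than the Kaggle file (the size column and all values are assumptions for illustration):

```python
import pandas as pd

# Made-up rows for illustration; "size" and every value here are assumptions.
df = pd.DataFrame({
    "size": ["2 BHK", "4 BHK", "3 BHK", "1 BHK"],
    "bath": [2.0, 5.0, 3.0, 1.0],
    "balcony": [1.0, 3.0, 2.0, 0.0],
    "price": [39.07, 3600.0, 62.0, 8.0],
})

# describe() summarises only the numeric columns by default.
summary = df.describe()
print(summary.loc[["min", "max"]])

# dtypes shows which columns are numeric and which hold text.
print(df.dtypes)
```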

sns.pairplot(df)

Check the correlation between parameters

Now you need to check for strong correlations among the given parameters. If any exist, remove one parameter from each strongly correlated pair. In my dataset there were no strong correlations among the features.
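One way to run this check in code is to list every pair of features whose absolute correlation exceeds a cutoff. This is a sketch under assumptions: the 0.9 threshold, the bedrooms column and the sample values are all hypothetical, chosen so that exactly one pair trips the check.

```python
import numpy as np
import pandas as pd

# Hypothetical feature table; "bedrooms" is an invented column that
# deliberately tracks "bath" closely.
features = pd.DataFrame({
    "bath": [1, 2, 2, 3, 4, 5],
    "balcony": [1, 0, 2, 1, 3, 2],
    "bedrooms": [1, 2, 2, 3, 4, 4],
})

corr = features.corr().abs()

# Keep only the upper triangle so each pair is reported once
# and the diagonal (always 1.0) is ignored.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

threshold = 0.9  # arbitrary cutoff for "strong"
strong_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]
print(strong_pairs)
```

If the list comes back empty, as it apparently did for the real data, no feature needs to be dropped.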

Correlation Heatmap:

Bath has a stronger correlation with price than balcony does.
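The same comparison can be read off numerically by taking the price column of the correlation matrix. The sample values below are invented to mimic the pattern described, so the numbers say nothing about the real data.

```python
import pandas as pd

# Invented values for illustration only.
df = pd.DataFrame({
    "bath": [1, 2, 2, 3, 4, 5],
    "balcony": [1, 0, 2, 1, 3, 2],
    "price": [30.0, 55.0, 60.0, 95.0, 150.0, 210.0],
})

# Correlation of every feature with the target, strongest first.
corr_with_price = (
    df.corr()["price"].drop("price").sort_values(ascending=False)
)
print(corr_with_price)
```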

Finding and Removing Outliers

Here we look for outliers using a histogram, a Q-Q plot and a boxplot.
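The post does not say which rule was used to remove the outliers once they were spotted. One common choice, and the one a boxplot's whiskers are based on, is the 1.5 × IQR rule. The sketch below assumes that rule and uses invented prices.

```python
import pandas as pd

# Invented prices (in lakh) with one obvious outlier at the end.
prices = pd.Series(
    [40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70.0, 3600.0], name="price"
)

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

cleaned = prices[prices.between(lower, upper)]
print(cleaned.tolist())
```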

Drop the categorical variables from the dataset:
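A minimal sketch of this step, assuming "categorical" here means every non-numeric column. The area_type and location column names and all values are assumptions made for the example.

```python
import pandas as pd

# Made-up rows; "area_type" and "location" stand in for the text columns.
df = pd.DataFrame({
    "area_type": ["Plot Area", "Built-up Area", "Plot Area"],
    "location": ["Whitefield", "Hebbal", "Yelahanka"],
    "bath": [2.0, 3.0, 2.0],
    "balcony": [1.0, 2.0, 1.0],
    "price": [39.07, 120.0, 62.0],
})

# Keep only the numeric columns; everything else is dropped.
numeric_df = df.select_dtypes(include="number")
print(numeric_df.columns.tolist())
```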

Dataset preparation (Splitting & Scaling) and Model Selection and Evaluation

Then we can split the data into X and y, and split the dataset into train and test sets. There's more than one way to do regression analysis. What we're looking for is the best prediction accuracy given our data, so we train several different models:

Linear Regression
Lasso
Support vector machine
Random Forest
XGBoost
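The split, scaling and training steps can be sketched with scikit-learn as follows. This is an assumption-heavy illustration, not the script from this post: the data is synthetic, the 80/20 split and the hyperparameters are arbitrary, "accuracy" is taken to mean the R² returned by score(), and XGBoost is left out because it lives in a separate package.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data: two numeric features and a price-like target.
rng = np.random.default_rng(0)
X = rng.uniform(1, 5, size=(200, 2))
y = 30 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 2, size=200)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then reuse it on the test rows.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(alpha=0.1),
    "Support Vector Machine": SVR(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# score() returns R^2 on the held-out rows for scikit-learn regressors.
scores = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    scores[name] = model.score(X_test_s, y_test)

# Print the models from best to worst R^2.
for name, r2 in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: R^2 = {r2:.3f}")
```

Because the synthetic target is linear in the features, the ranking printed here will not match the one reported for the real data.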

Linear Regression is a machine learning algorithm based on supervised learning. It performs a regression task. Regression models a target prediction value based on independent variables. It is mostly used for finding out the relationship between variables and for forecasting. Regression models differ in the kind of relationship they assume between the dependent and independent variables, and in the number of independent variables they use.

Linear regression predicts the value of a dependent variable (y) based on a given independent variable (x). So, this regression technique finds a linear relationship between x (input) and y (output). Hence the name Linear Regression.
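To make the idea concrete, here is a small least-squares fit with NumPy on invented, noise-free points generated from y = 20x + 10, so the fitted line should recover that slope and intercept. The numbers are purely illustrative and have no connection to the house price data.

```python
import numpy as np

# Invented, noise-free points that lie exactly on y = 20x + 10.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 20.0 * x + 10.0

# Design matrix with a column of ones so the fit includes an intercept.
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution of A @ [slope, intercept] = y.
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

# Use the fitted line to predict y for a new input.
predicted = slope * 6.0 + intercept
print(slope, intercept, predicted)
```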

The regression accuracy of each model, as output by the script above, looks like this:

As you can see, the Random forest regressor showed the best accuracy, so we decided to use this algorithm for production.