Abstract
Researchers and many other stakeholders are concerned with whether housing
values rise or fall. Previous research has relied extensively on linear
regression to explain how housing prices fluctuate. This study instead frames
changing home values as a classification task and uses machine learning to
forecast whether they will rise or fall. The pipeline combines feature
extraction, data cleaning and principal component analysis with data
transformation steps, including outlier and missing-value handling and
numerical transformation. Four criteria (accuracy, precision, F1-score and
support) are used to assess how well the machine learning approaches perform.
The study also deploys classification and clustering algorithms, both of
which operate on the numerically transformed data. The classification results
are illustrated with confusion matrices and classification reports.
Introduction
The rising need for homes is rooted in the advancement of civilizations.
Accurate housing-market forecasts have always interested sellers, buyers and
banks. The uncertainty surrounding house price forecasting has already been
studied by many academics, and the contributions of researchers from around
the world have produced numerous ideas. Some of these theories claim that a
region's geography and culture dictate whether housing values rise or fall,
while other schools of thought place more emphasis on the macroeconomic
factors that contribute significantly to these increases.[1]
A property's value is a number drawn from a continuous range, so predicting
housing costs is a regression problem. To forecast a house price, a person
typically looks for comparable properties in the neighbourhood and makes an
educated guess based on the data gathered. All of this points to house price
forecasting being a predictive task that calls for machine learning
expertise, which is what inspired me to pursue this line of work.
Environment, economy, worldview and governance are the primary historical
determinants. [2]
1.1. Categorization of Houses for the Real Estate Agency
Housing costs can be categorised using the following main descriptions.
Location (urban, downtown, typical price, etc.): Housing costs are inversely
associated with distance from developed areas and strongly associated with
the level of urban development. House prices also correlate strongly with
traffic convenience (subway, bus, road rating, etc.).[3]
Characteristics of the home (kind, year of construction, degree of
refurbishment, floor, etc.): In most cases, the price increases with the
quality of the housing.
Supportive services: The quality and accessibility of nearby facilities have
a beneficial impact on property values. These include public amenities such
as parks, hospitals and shopping malls. With these factors, a house's value
can be properly assessed and compared.
Personal choices and lifestyles: These are the only variables that cannot be
quantified, yet they are frequently the deciding factors. Comfort, for
example, is difficult to quantify and is occasionally overlooked in the data
collected.
Numerous factors therefore influence the cost of housing.
2. Analysis of the data
2.1. Preliminary analysis
Statistical analysis is the study of patterns, trends and correlations in
data. It is a prominent research instrument used by academics, government
organisations, corporations and other institutions.
Figure 1. Statistical Measures of the data
After gathering the sample data, we can summarise it using descriptive
statistics, then test hypotheses and estimate population values using
inferential statistics. [4] Finally, we can interpret and generalise the
results. Technical analysis is typically used for short-term trading and
focuses on data and information; because of the short time horizon,
headlines may reflect the results of technical indicators. Figure 1 shows the
statistical measures of the data: the number of values in each column, the
mean, the minimum and maximum, and the 25%, 50% and 75% percentiles of each
feature.
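The measures listed for Figure 1 can be sketched for a single numeric column as follows. This is a minimal illustration using only the Python standard library, not the tooling used in the study; the function name `describe` and the `sale_prices` values are hypothetical, and the quartiles use linear interpolation between data points.

```python
import statistics


def describe(values):
    """Return count, mean, min, quartiles and max for one numeric column."""
    data = sorted(values)
    # "inclusive" treats the minimum and maximum as the 0th and 100th
    # percentiles and interpolates linearly in between.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "count": len(data),
        "mean": statistics.mean(data),
        "min": data[0],
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": data[-1],
    }


# Hypothetical sale prices, for illustration only.
sale_prices = [100, 200, 300, 400, 500]
summary = describe(sale_prices)
for name, value in summary.items():
    print(f"{name:>5}: {value}")
```

Running the same summary over every numeric feature gives a table of the kind shown in Figure 1.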
2.2. Correlation Testing
Since the purpose of this research is to build a precise predictive model
through statistical analysis, it is essential to have data that makes the
model simple to train. As a result, we used a variety of correlation tests,
as shown in Figure 2. The tests are run on the data to find the relevant,
correlated features for modelling and training the classifier. Feature
extraction then makes it simple to keep only the features (that is, columns)
that contribute to the model.[5]
Figure 2. Correlation tests: Spearman's, Cramér's V and Phik
Pearson's r and Kendall's tau, which were also used to produce correlation
matrix graphics, behave similarly to the Spearman's correlation test shown in
Figure 2. The Phik and Cramér's V scores range from 0 to 1, whereas
Spearman's coefficient ranges from -1 to 1.
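To make the -1 to 1 range of Spearman's coefficient concrete, the sketch below computes it as Pearson's r applied to ranks. It is a standard-library illustration under assumed data, not the profiling tool that produced Figure 2; the helper names and the `area` and `price` values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson's r: covariance divided by the product of the spreads."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman's coefficient is Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical feature and target: price grows monotonically but not linearly.
area = [50, 80, 120, 200, 310]
price = [a ** 2 for a in area]

rho_up = spearman(area, price)                  # perfectly increasing
rho_down = spearman(area, [-p for p in price])  # perfectly decreasing
r_linear = pearson(area, price)                 # below 1: not a straight line
print(rho_up, rho_down, r_linear)
```

Because Spearman's test only looks at ranks, it rewards any monotonic relationship, while Pearson's r rewards only linear ones.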
2.3. Statistical Representation of Sale Prices and Correlated Features
Figure 3 plots the features that are highly correlated with the sale price.
Houses are categorized for the real estate agency using these features, each
plotted against the sale price of the houses.[6]

Figure 3. Scatter plots of features highly correlated with the Sale Price
Sale prices rise with a large pool area, a fireplace, the garage area and
the number of bedrooms and kitchens. Figure 4 shows that most of the
higher-valued houses were built after the 20th century, whereas some houses
from the 1880s and 1890s also sold at very high prices because of their
uniqueness and architecture. Houses built in the 19th century otherwise sold
at low prices, with no spikes in the trend.

Figure 4. Number of Houses built yearly
Figure 5 shows that houses with a large first-floor area sell at high prices
compared with houses that have a second floor. Houses with a second floor
are smaller in area but sell at comparable prices. The company should
therefore either build second-floor houses with less area, to earn more, or
build first-floor houses with a large area.

Figure 5. Scatter plot of Houses with first and second floor
3. Machine Learning Approach
Once the data has been preprocessed and cleared of imprecise information, it
can be used for prediction. This part describes the technique we used to
forecast real estate prices and how the model is trained and tested.[7] The
aim is a system capable of predicting house prices using linear regression,
the technique we chose. In Figure 6, the predicted value is generated from a
variety of independent variables. The strength of the association between
the variable to be forecast and the predictor variables determines its
value.

Figure 6. Results of Linear Regression Model
The model's prediction score was 78%, with 11 slope coefficients and an
intercept of -786016.6559. The regression graph in Figure 6 displays the
predicted prices and the test data supplied to the model. Green dots mark
the data points where the variables intersect, and the blue line is the
regression line.
Table 1. Regression Model Characteristics
| Model | Score | Intercept | Coefficient of determination | Slope |
| Linear Regression | 0.78 | -786016.65 | 0.78603713020093 | 11 |
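The quantities in Table 1 (slope, intercept and coefficient of determination) can be illustrated with a one-feature least-squares fit. This is a simplified sketch under assumed data, not the study's model, which has 11 slope coefficients; the function names and the area and price values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Hypothetical training and test splits, for illustration only.
area_train = [50, 80, 120, 200]
price_train = [110, 170, 250, 410]
area_test = [60, 150, 300]
price_test = [135, 305, 615]

slope, intercept = fit_line(area_train, price_train)
score = r_squared(area_test, price_test, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f} score={score:.4f}")
```

With a single feature the score computed on held-out data plays the same role as the 0.78 reported in Table 1; with several features each one receives its own slope.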
4. Categorical Classification
To classify homes according to their overall condition (OverallCond), we
converted OverallCond into a categorical variable with three possible
labels.



Figure 7. Classification Reports and Confusion Matrix
If the overall condition is between 1 and 3, the home is labelled poor. If
it is between 4 and 6, it is labelled average. If it is between 7 and 10, it
is labelled good. Figure 7 shows the confusion matrix and the classification
report with scores.
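The three-label conversion described above can be sketched as a small mapping function. The function name `condition_label` is hypothetical, and the sketch assumes OverallCond is an integer rating from 1 to 10.

```python
def condition_label(overall_cond):
    """Map an OverallCond rating (1-10) to poor, average or good."""
    if not 1 <= overall_cond <= 10:
        raise ValueError(f"OverallCond out of range: {overall_cond}")
    if overall_cond <= 3:
        return "poor"
    if overall_cond <= 6:
        return "average"
    return "good"


# Hypothetical ratings, for illustration only.
ratings = [1, 3, 4, 6, 7, 10]
labels = [condition_label(r) for r in ratings]
print(labels)
```

The resulting labels become the target variable for the classifiers.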
5. Clustering
The k-means clustering algorithm was used to cluster the data. The elbow
method was applied to the data to produce a graphical representation from
which the value of k can be determined.

Figure 8. Elbow Method for clustering
K-means clustering was not found to be a good option for the real estate
agency's data because of its low accuracy rate.[4]
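The elbow method can be sketched as follows: run k-means for several values of k, record the within-cluster sum of squared distances, and look for the k after which the improvement flattens. This is a plain-Python illustration under assumed data with a deliberately simple, deterministic initialisation; it is not the implementation used in the study, and the points are hypothetical.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_inertia(points, k, iterations=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    # Deterministic start: k points spread evenly through the list.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[c]
            for c, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(dist2(p, c) for c in centroids) for p in points)


# Hypothetical 2-D data with three visible groups, for illustration only.
points = [
    (1, 1), (1, 2), (2, 1), (2, 2),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 1), (20, 2), (21, 1), (21, 2),
]
inertias = {k: kmeans_inertia(points, k) for k in range(1, 5)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

On this toy data the curve drops sharply up to k = 3 and then flattens, which is the "elbow". Real housing data may not show such a clear bend.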
6. Discussion
In this research, we present a technique for predicting real estate prices
in an area's surrounding neighbourhoods. We cleaned the data by removing
erroneous values and outliers. The linear regression model was then fitted
to a subset of this dataset, and the remainder was used to test the chosen
model. The technique could be extended to forecast property prices in other
Indian cities and rural regions. After adding features such as trends in a
given location and comparisons with other properties, it could also be
deployed as a live web application. A method similar to ours could be
created to forecast increases in a property's price as well.
6.1. Accuracies of the models
Figure 7 reports the per-class precision, recall, F1-score and support for
each classifier, listed in that order. For the quadratic discriminant
analysis (QDA) classifier they are (0.83, 0.9, 0.87, 342), (0.5, 0.3, 0.4,
87) and (0.66, 0.22, 0.33, 9) for the average, good and poor classes
respectively. For the linear discriminant analysis (LDA) classifier they are
(0.83, 0.9, 0.87, 342), (0.5, 0.37, 0.43, 87) and (0.667, 0.22, 0.33, 9).
For the logistic regression classifier they are (0.83, 0.918, 0.87, 342),
(0.4, 0.3, 0.4, 87) and (0, 0, 0, 9). The overall classification accuracy is
78.8%, 78.8% and 78.5% for QDA, LDA and logistic regression respectively.
The regression model's prediction score was 78%, with 11 slope coefficients
and an intercept of -786016.6559. The regression graph in Figure 6 shows the
predicted prices and the test data given to the model. Green dots mark the
data points where the variables intersect, and the blue line is the
regression line.
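For readers unfamiliar with how a classification report is derived from a confusion matrix, the sketch below computes precision, recall, F1-score and support per class, plus overall accuracy. The matrix values are hypothetical and are not the study's results; the function name `classification_report` is chosen for illustration only.

```python
def classification_report(matrix, labels):
    """Per-class (precision, recall, f1, support) and overall accuracy.

    matrix[i][j] counts samples whose true class is labels[i] and whose
    predicted class is labels[j].
    """
    report = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        support = sum(matrix[i])                  # samples truly in class
        predicted = sum(row[i] for row in matrix) # samples predicted as class
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / support if support else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        report[label] = (precision, recall, f1, support)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(len(labels))) / total
    return report, accuracy


# Hypothetical confusion matrix (rows: true class, columns: predicted class).
labels = ["average", "good", "poor"]
matrix = [
    [8, 2, 0],
    [3, 5, 0],
    [1, 0, 1],
]
report, accuracy = classification_report(matrix, labels)
for label, (precision, recall, f1, support) in report.items():
    print(f"{label:>8}: {precision:.2f} {recall:.2f} {f1:.2f} {support}")
print(f"accuracy: {accuracy:.2f}")
```

A class that is never predicted, as happens for the poor class under logistic regression above, gets a precision, recall and F1-score of 0.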
7. Conclusion
This study uses machine learning to frame changing home values as a
classification task and forecast whether they will rise or fall. A large
pool area, a fireplace, the garage area and the number of bedrooms and
kitchens all contribute to higher sale prices. While many homes built after
the 20th century have higher valuations, several homes from the 1880s and
1890s also sold at premium prices because of their uniqueness and
construction. Homes built in the 19th century otherwise sold at low prices,
with no sudden surges in the trend. The company should either build
second-floor houses with less area, to earn more, or build first-floor
houses with a large area. We cleaned the data by removing erroneous values
and outliers. The linear regression model was then fitted to a portion of
this dataset, and the remainder was used to evaluate the selected model. The
regression model's prediction score was 78%, with 11 slope coefficients and
an intercept of -786016.6559. For QDA, LDA and logistic regression, the
overall classification accuracy is 78.8%, 78.8% and 78.5% respectively.
K-means clustering, by contrast, was not found to be a good option for the
real estate agency's data because of its low accuracy rate. The system is
novel and applies the factors affecting house prices to the needs of real
estate agencies.