FOR SALE HOUSE ANALYSIS AND PREDICTION FOR REAL ESTATE AGENCY

 

Abstract

Researchers and many other stakeholders have taken an interest in why housing values rise or fall. Previous research has relied extensively on linear regression to explain how housing prices fluctuate. This study uses machine learning to frame changing home values as a classification task and to forecast whether they will rise or fall. The research applies several methods for feature extraction, data cleaning and principal component analysis, together with data transformation methods that include outlier and missing-value handling and numerical transformation. Four criteria, namely accuracy, precision, F1-score, and support, are used to assess how well the machine learning approaches perform. The study also deploys classification as well as clustering algorithms. The numerical transformation of the data is what makes clustering and classification possible. The results of the classification algorithms are illustrated with confusion matrices and classification reports.

 

 

 

 

 

 

 

 

Introduction

The growing need for homes is rooted in the advancement of civilizations. The accuracy of housing market forecasts has always interested sellers, purchasers and banks alike. The uncertainty surrounding house price forecasting has already been studied by numerous academics, and the contributions of researchers from around the world have produced many theories. Several of these theories claim that a region's geography and culture dictate whether housing values will rise or fall, while other schools of thought place more emphasis on the macroeconomic factors that contribute significantly to these increases.[1] A property's value is a figure drawn from a given range, so predicting housing prices is a regression problem. To forecast a house price, a person often looks for comparable properties in the neighbourhood and makes an educated guess based on the data gathered. All of this points to house price forecasting being a predictive task that calls for machine learning expertise, and this is what inspires me to pursue this line of work. Environment, economy, worldview, and governance are the primary historical determinants of house prices. [2]

1.1.        Categorization of Houses for the Real estate agency

Housing costs can be categorised by the following main descriptors. Location (urban, downtown, typical price, etc.): housing costs are inversely associated with distance from developed areas and strongly associated with the level of urban development. Traffic convenience (subway, bus, road rating, etc.): house prices correlate strongly with traffic convenience.[3] Characteristics of the home (type, year of construction, degree of refurbishment, floor, etc.): in most cases, the price increases with the quality of the housing. Supporting services: the quality and accessibility of nearby facilities have a positive impact on property values. These facilities include public amenities such as parks, hospitals, and shopping malls. With these descriptors, a house's value can be properly assessed and compared.

Personal choices and lifestyles are the only variables that cannot be quantified, yet they are frequently the deciding factors. Comfort in particular is difficult to quantify and is occasionally overlooked in the data collected. Numerous factors influence a group's cost of housing.

 

2.      Analysis of the data

2.1.        Preliminary analysis

Statistical analysis is the study of patterns, trends, and correlations in statistical data. It is a prominent research instrument used by academics, government organisations, corporations, and other institutions.

Figure 1. Statistical Measures of the data

After gathering the data from our sample, we can evaluate and summarise it using descriptive and inferential statistics. Inferential analysis then lets us test hypotheses and estimate population figures, [4] and finally assess and generalise the results. Technical analysis, by contrast, is typically used for short-term trading and focuses on data and information; because of the short period involved, headlines may reflect the results of technical indicators. Figure 1 presents the statistical measures of the data: the total number of values in each column, the mean, the minimum and maximum values, and the 25%, 50% and 75% percentiles of each feature.
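As a minimal sketch of the measures listed for Figure 1, the snippet below computes the count, mean, minimum, quartiles and maximum of one numeric column. The `describe` helper and the `sale_prices` values are hypothetical and are not taken from the study's dataset.

```python
import statistics

def describe(values):
    """Return count, mean, min, quartiles and max for a list of numbers."""
    # method="inclusive" interpolates linearly between the sorted data points.
    q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "25%": q25,
        "50%": q50,
        "75%": q75,
        "max": max(values),
    }

# Hypothetical sale prices, not taken from the study's dataset.
sale_prices = [120000, 150000, 180000, 210000, 240000]
summary = describe(sale_prices)
for name, value in summary.items():
    print(f"{name:>5}: {value}")
```

With pandas, `DataFrame.describe()` produces a comparable table for every numeric column at once and additionally reports the standard deviation.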

2.2.        Correlation Testing

Since the purpose of this research is to build a precise predictive model through statistical analysis, it is essential to have data that makes the model simple to train. As a result, we used a variety of correlation tests, as shown in Figure 2. Correlation tests are run on the data to find the relevant, correlated features for modelling and training the classifier. Feature extraction makes it simple to use only the features (that is, the columns) that improve the parameter sharing mechanism.[5]

 


Figure 2. Correlation tests: Spearman's, Cramér's V and Phik

 

Pearson's r and Kendall's τ, which were also used to obtain correlation matrix plots, work similarly to the Spearman's correlation test shown in Figure 2. The correlation score ranges from 0 to 1 in the Phik and Cramér's V tests, whereas the Spearman's test ranges from -1 to 1.
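To make the two score ranges concrete, the sketch below implements Spearman's rank correlation (Pearson's r applied to ranks) for numeric columns and Cramér's V (based on the chi-squared statistic) for categorical columns. Phik is not covered. All function names and sample values are hypothetical and are not taken from the study.

```python
from collections import Counter
from math import sqrt

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson's r for two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman's rho: Pearson's r on the ranks, so it lies in [-1, 1]."""
    return pearson(average_ranks(x), average_ranks(y))

def cramers_v(a, b):
    """Cramér's V for two categorical lists, so it lies in [0, 1]."""
    n = len(a)
    joint, count_a, count_b = Counter(zip(a, b)), Counter(a), Counter(b)
    chi2 = 0.0
    for value_a, n_a in count_a.items():
        for value_b, n_b in count_b.items():
            expected = n_a * n_b / n
            observed = joint.get((value_a, value_b), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(count_a), len(count_b)) - 1
    return sqrt(chi2 / (n * k)) if k else 0.0

# Hypothetical values, not taken from the study's dataset.
living_area = [850, 1200, 1500, 1500, 2100, 2600]
sale_price = [110000, 135000, 180000, 175000, 240000, 320000]
garage_type = ["none", "none", "attached", "attached", "attached", "attached"]
price_band = ["low", "low", "high", "high", "high", "high"]

rho = spearman(living_area, sale_price)
v = cramers_v(garage_type, price_band)
print(f"Spearman rho = {rho:.3f}, Cramér's V = {v:.3f}")
```

Spearman's rho keeps the sign of the relationship, which is why it can be negative, while Cramér's V only measures the strength of association between categories.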

2.3.        Statistical Representation of Sale Prices and Correlated Features

Figure 3 shows the features most highly correlated with the sale price. Houses are categorised for the real estate agency by these features, which are plotted against the sale prices of the houses.[6]

Figure 3. Scatter plots of features highly correlated with the Sale Price

Sale prices rise with a large pool area, a fireplace, garage area, and the number of bedrooms and kitchens. Figure 4 shows that most of the higher-valued houses were built after the 20th century, whereas some houses from the 1880s and 1890s also sold at very high prices because of their uniqueness and architecture. Houses built in the 19th century otherwise sold at low prices, without any spikes in the rising trend.

Figure 4. Number of Houses built yearly

Figure 5 shows that houses with a large first-floor area sell at high prices compared with houses that have a 2nd floor. Houses with a 2nd floor are smaller in area but sell at prices comparable to 1st-floor houses. The company should therefore build either houses with a 2nd floor and less area, to earn more money, or 1st-floor houses with a large area.

Figure 5. Scatter plot of Houses with first and second floor

3.      Machine Learning Approach

Now that the data has been preprocessed and cleared of imprecise information, we can use it for prediction. This section describes the technique used to forecast real estate prices and how the model is trained and tested.[7] The aim is a system capable of predicting house prices using linear regression, the technique we have chosen. In Figure 6, the predicted value is generated from a variety of independent variables. The strength of the association between the variable to be forecast and the independent variables determines its value.

Figure 6. Results of Linear Regression Model

The model's prediction score was 78%, with 11 slope coefficients and an intercept of -786016.6559. The regression graph in Figure 6 displays the predicted prices and the test data supplied to the model. Green dots mark the data points, and the blue line is the regression line.

Table 1. Regression Model Characteristics

Model               Score   Intercept     Coefficient of determination   Slope coefficients
Linear Regression   0.78    -786016.65    0.78603713020093               11
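To illustrate the quantities reported in Table 1, the sketch below fits an ordinary least-squares line and computes the coefficient of determination. It is a simplification under stated assumptions: it uses a single hypothetical feature (living area), whereas the study's model has 11 slope coefficients, one per feature, and it reads the reported score as the coefficient of determination. The data values are invented for illustration.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for one feature."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(x, y, slope, intercept):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Hypothetical living areas and sale prices, not the study's data.
area = [1000, 1500, 2000, 2500, 3000]
price = [150000, 200000, 260000, 300000, 360000]

slope, intercept = fit_line(area, price)
score = r_squared(area, price, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={score:.4f}")
```

A large negative intercept such as the one in Table 1 is not a price prediction in itself; it is the value the fitted line would take if every feature were zero.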

 

4.      Categorical Classification

To classify homes by their overall condition (OverallCond), we convert OverallCond into a categorical variable with three possible labels.

Figure 7. Classification Reports and Confusion Matrix

If the overall condition is between 1 and 3, the home is considered poor; between 4 and 6, average; and between 7 and 10, good. Figure 7 shows the confusion matrix and the classification report with scores.
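The three thresholds can be expressed as a small mapping function. This is a minimal sketch; the `condition_label` name and the sample ratings are hypothetical.

```python
def condition_label(overall_cond):
    """Map an OverallCond rating (1 to 10) to poor, average or good."""
    if not 1 <= overall_cond <= 10:
        raise ValueError(f"OverallCond out of range: {overall_cond}")
    if overall_cond <= 3:
        return "poor"
    if overall_cond <= 6:
        return "average"
    return "good"

# Hypothetical ratings, not taken from the study's dataset.
ratings = [2, 5, 7, 9, 4, 1]
labels = [condition_label(r) for r in ratings]
print(labels)
```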

5.       Clustering

The k-means clustering algorithm was used to cluster the data. The elbow method was applied to the data to produce a graphical representation for determining the value of k.

Figure 8. Elbow Method for clustering

K-means clustering was found not to be a good option for the real estate agency's data because of its low accuracy rate.[4]
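As a rough sketch of the elbow method, the snippet below runs a basic k-means (Lloyd's algorithm) for several values of k and records the within-cluster sum of squares, often called inertia. The value of k after which the inertia stops dropping sharply is the elbow. The toy points and the deterministic start are assumptions made for illustration; the study's actual features and initialisation are not stated.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_inertia(points, k, iterations=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    ordered = sorted(points)
    n = len(ordered)
    # Deterministic start (an assumption): k points spread through the sorted data.
    if k == 1:
        centroids = [ordered[0]]
    else:
        centroids = [ordered[i * (n - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)

# Toy 2-D points forming three well-separated groups (hypothetical data).
points = [
    (1, 1), (1, 2), (2, 1), (2, 2),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 1), (20, 2), (21, 1), (21, 2),
]
inertia = {k: kmeans_inertia(points, k) for k in range(1, 5)}
for k, value in inertia.items():
    print(f"k={k}: inertia={value:.2f}")
```

On this toy data the inertia falls steeply up to k = 3 and barely changes afterwards, which is the shape an elbow plot is meant to reveal. Real housing features would first need to be scaled so that no single column dominates the distances.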

6.      Discussion

In this research, we present a technique for predicting real estate prices in an area's surrounding neighbourhoods. We cleaned the data by removing erroneous records and outliers. The linear regression model was then fitted to a subset of this dataset, and test data was used to evaluate the chosen model. This technique could be extended to forecast property prices in other Indian cities and rural regions. After adding features to the system, such as trends in a certain place or comparisons with other properties, it could also be turned into a live web application. A method similar to ours can also be created to forecast increases in a property's price.
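The cleaning and splitting steps could look roughly like the sketch below. The text does not say how outliers were identified or how the data was divided, so the interquartile-range rule, the 25% test fraction, the fixed seed and all names and values here are assumptions made purely for illustration.

```python
import random
import statistics

def remove_outliers_iqr(rows, key, factor=1.5):
    """Keep rows whose value lies within factor * IQR of the quartiles."""
    values = [row[key] for row in rows]
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - factor * spread, q3 + factor * spread
    return [row for row in rows if low <= row[key] <= high]

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical records (price in thousands); the last one is an outlier.
houses = [
    {"area": 900, "price": 100}, {"area": 1000, "price": 110},
    {"area": 1100, "price": 120}, {"area": 1200, "price": 130},
    {"area": 1300, "price": 140}, {"area": 1400, "price": 150},
    {"area": 1500, "price": 160}, {"area": 1600, "price": 170},
    {"area": 1700, "price": 1000},
]
clean = remove_outliers_iqr(houses, "price")
train, test = train_test_split(clean)
print(len(houses), len(clean), len(train), len(test))
```

Fitting on `train` and scoring on `test` keeps the evaluation honest, because the model never sees the rows it is judged on.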

6.1.         Accuracies of the models

Figure 7 reports the precision, recall, F1-score and support for each predicted class. For the QDA classifier these are (0.83, 0.9, 0.87, 342), (0.5, 0.3, 0.4, 87) and (0.66, 0.22, 0.33, 9) for the average, good and poor classes respectively. For the LDA classifier they are (0.83, 0.9, 0.87, 342), (0.5, 0.37, 0.43, 87) and (0.667, 0.22, 0.33, 9). For the logistic regression classifier they are (0.83, 0.918, 0.87, 342), (0.4, 0.3, 0.4, 87) and (0, 0, 0, 9). The overall classification accuracy is 78.8%, 78.8% and 78.5% for QDA, LDA and logistic regression respectively. The linear regression model's score was 78%, with 11 slope coefficients and an intercept of -786016.6559. The regression graph in Figure 6 shows the predicted prices and the test data given to the model; the data points are marked by green dots and the regression line is drawn in blue.
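For readers unfamiliar with these metrics, the sketch below derives precision, recall, F1-score, support and overall accuracy from a confusion matrix. The matrix is hypothetical and does not reproduce Figure 7; it is chosen so that the poor class is never predicted, which yields an all-zero row like the (0, 0, 0, 9) reported for logistic regression. The `per_class_report` name is likewise an assumption.

```python
def per_class_report(matrix, labels):
    """Compute per-class metrics from a confusion matrix.

    matrix[i][j] counts samples whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    report = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        support = sum(matrix[i])                   # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support,
        }
    return report, correct / total

# Hypothetical confusion matrix, not the one shown in Figure 7.
labels = ["average", "good", "poor"]
matrix = [
    [8, 2, 0],  # true average
    [3, 5, 0],  # true good
    [2, 0, 0],  # true poor, never predicted as poor
]
report, accuracy = per_class_report(matrix, labels)
for label, scores in report.items():
    print(label, {name: round(value, 3) for name, value in scores.items()})
print("accuracy:", accuracy)
```

Because support is simply the number of true samples per class, a class with very few samples, such as poor, contributes little to overall accuracy even when it is never predicted correctly.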

7.       Conclusion

This study uses machine learning to frame changing home values as a classification task and to forecast whether they will rise or fall. A large pool area, a fireplace, garage area, and the number of bedrooms and kitchens all contribute to higher sale prices. While many homes built after the 20th century have higher valuations, several homes from the 1880s and 1890s also sold at premium prices because of their uniqueness and construction. Homes built in the 19th century are otherwise priced low, with no sudden surges in the upward trend. The company should build either houses with a 2nd floor and less area, to earn more money, or 1st-floor houses with a large area. We cleaned the data by removing erroneous records and outliers. The linear regression model was then fitted to a portion of this dataset, and test data was used to evaluate the selected model. The regression model's score was 78%, with 11 slope coefficients and an intercept of -786016.6559. The overall classification accuracy is 78.8%, 78.8% and 78.5% for QDA, LDA and logistic regression respectively, whereas k-means clustering was found not to be a good option for the real estate agency's data because of its low accuracy rate. The system is novel and incorporates the factors that affect house prices for real estate agencies.

 
