Image Source: Zillow

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(sf)
library(data.table)
library(spdep)
library(caret)
library(dplyr)
library(ckanr)
library(MASS)
library(FNN)
library(viridis)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(jtools)     # for regression model plots
library(broom)
library(tufte)
library(ggplot2)
library(rmarkdown)
library(stargazer)
library(splm)
library(kableExtra)
library(corrplot)
library(tidycensus)
library(FNN)
library(psych)
# functions and data directory
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"

source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
options(tigris_use_cache = TRUE)

# Colors
bluePalette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
blue2Palette5 <- c("#08519c","#3182bd","#6baed6","#bdd7e7","#eff3ff")
orangePalette5 <- c("#ffffb2","#fecc5c","#fd8d3c","#f03b20","#bd0026")
greyPalette5 <- c("#f7f7f7","#cccccc","#969696","#636363","#252525")
viridisPalette <- c("#fde725", "#5ec962", "#21918c", "#3b528b","#440154")

# LoadAPI
census_api_key("52c094b268b1f62d3845601acbd2ae784e75ccc9", overwrite = TRUE)

1. Introduction

This project aims to help Zillow produce stronger housing market predictions. While Zillow’s current model performs satisfactorily, this approach offers a different lens into how the model could be built stronger by modelling a range of internal and external housing factors, as well as other variables. This includes a mix of demographic variables, spatial correlations, internal housing features, and amenities of the model city, Philadelphia.

The result of our analysis is a home price prediction model based on a hypothesis of the impact of various internal and external factors as well as spatial clustering on sale value. The generalizability and accuracy of the model can further be explained through tests and measures of error/error percentage, explored below.

2. Data Analysis

2.1 Data Wrangling

For this project, we used home sale value data from Zillow. This dataset included important factors about each unit such as number of bedrooms, bathrooms, garage space, floor area, and other internal characteristics. Additionally, the model also borrowed data from OpenDataPhilly around crime, trees, Heat Index, Farmers Markets, and more. Finally, it collected data from the Census Bureau around education, poverty, race and population.

2.2 Summary Statistics

Through a series of summary statistics, we can determine which variables of these will be best suited for our model. The following three tables look at the summary statistics of internal characteristics, nearby amenities, and demographic data.

numericVars <- 
 select_if(st_drop_geometry(sd), is.numeric) %>% na.omit()

stargazer(numericVars, type = "text", title = "Summary Statistics by Internal Characteristics", summary = TRUE, digits = 2)

## 
## Summary Statistics by Internal Characteristics
## ===================================================================================
## Statistic            N        Mean          St. Dev.         Min           Max     
## -----------------------------------------------------------------------------------
## objectid            429  360,324,401.00    12,349.73     360,302,736   360,345,502 
## category_code       429       1.00            0.00            1             1      
## census_tract        429      129.51          109.76          15            388     
## depth               429      76.32           39.22            0            337     
## exempt_land         429      57.76           848.89           0          13,580    
## exterior_condition  429       3.05            1.25            1             7      
## fireplaces          429       0.13            0.49            0             5      
## frontage            429      22.54           27.28            0            388     
## garage_spaces       429       0.24            0.53            0             3      
## geographic_ward     429      30.53           14.02            1            65      
## house_number        429     2,335.85        2,234.71          3           9,218    
## interior_condition  429       2.64            1.18            1             7      
## number_of_bathrooms 429       1.77            0.76            1             4      
## number_of_bedrooms  429       3.14            0.83            1             8      
## number_of_rooms     429       6.53            1.63            4            15      
## number_stories      429       2.29            0.52            1             3      
## off_street_open     429     1,892.79        1,385.27         118          8,330    
## parcel_number       429  312,587,766.00  148,309,143.00  11,020,200    888,210,597 
## sale_price          429    391,861.80      304,197.40         0         3,150,000  
## street_code         429    54,943.27       23,246.32       11,160        89,390    
## total_area          429     2,238.17        4,881.59          0          57,987    
## total_livable_area  429     1,565.13         752.21          784          6,562    
## year_built          429     1,935.16         31.35          1,856         2,023    
## zip_code            429    19,138.19         11.07         19,114        19,153    
## pin                 429 1,001,366,978.00   169,449.80   1,001,051,076 1,001,682,780
## sale_year           429     2,022.26          0.44          2,022         2,023    
## musaID              429    12,452.80        6,976.42         93          23,693    
## TotalPop            429     4,190.82        1,510.95        1,027         9,047    
## MedHHInc            429    66,712.56       23,595.44       14,798        144,531   
## MedRent             429     1,086.22         258.56          441          2,339    
## pctWhite            429       0.43            0.29         0.0003         1.12     
## pctBachelors        429       0.02            0.02          0.00          0.12     
## pctPoverty          429       0.17            0.11          0.01          0.53     
## -----------------------------------------------------------------------------------

#filter to numeric values only
numericVars_sdknn <- 
 select_if(st_drop_geometry(sdknn), is.numeric) %>% na.omit()
#table
stargazer(numericVars_sdknn, type = "text", title = "Summary Statistics of Nearby Amenities", summary = TRUE, digits = 2)

## 
## Summary Statistics of Nearby Amenities
## ====================================================================
## Statistic     N         Mean       St. Dev.      Min         Max    
## --------------------------------------------------------------------
## objectid    23,881 360,323,428.00 12,257.32  360,302,636 360,352,508
## sale_price  23,881   271,846.80   240,083.40      0       5,990,000 
## crimebuff   23,881     124.32       61.15         1          274    
## crime_nn1   23,881     55.66        33.19       7.63       611.50   
## crime_nn2   23,881     67.01        36.65       11.71      611.97   
## crime_nn3   23,881     78.34        40.18       16.34      798.31   
## crime_nn4   23,881     88.39        43.69       20.69      893.84   
## crime_nn5   23,881     97.56        47.06       24.76      952.43   
## medbuff     23,881      0.81         1.06         0           5     
## med_nn1     23,881    1,610.20     1,065.29     24.22     7,948.94  
## med_nn2     23,881    1,945.62     1,145.59     77.40     8,222.91  
## med_nn3     23,881    2,250.18     1,193.67    219.05     8,516.00  
## med_nn4     23,881    2,598.14     1,281.80    510.08     8,844.52  
## med_nn5     23,881    2,878.34     1,369.47    664.10     9,139.81  
## HVI_SCORE   23,881     -0.33         4.15       -8.35       8.17    
## schoolbuff  23,881      3.05         2.13         0          14     
## school_nn1  23,881     349.98       197.48      9.89      1,799.39  
## school_nn2  23,881     447.53       208.29      46.57     2,291.26  
## school_nn3  23,881     531.39       226.94      68.48     2,490.30  
## school_nn4  23,881     607.32       245.87     125.38     2,612.69  
## school_nn5  23,881     676.50       266.53     172.15     2,729.48  
## playbuff    23,881      7.90         9.07         0          45     
## play_nn1    23,881     788.40      1,148.03     8.93      6,502.55  
## play_nn2    23,881     971.65      1,422.61     11.65     8,388.83  
## play_nn3    23,881    1,067.08     1,554.41     11.65     9,261.06  
## play_nn4    23,881    1,135.92     1,634.49     16.73     9,727.01  
## play_nn5    23,881    1,191.68     1,687.84     43.40     10,049.93 
## grocerbuff  23,881     16.74         6.30         2          35     
## fmarketbuff 23,881      1.55         1.56         0           8     
## fmarket_nn1 23,881    1,758.53     1,510.20     23.82     9,311.52  
## fmarket_nn2 23,881    2,208.60     1,726.27    177.92     10,644.13 
## fmarket_nn3 23,881    2,622.36     1,868.10    240.85     11,683.51 
## fmarket_nn4 23,881    3,180.40     2,222.16    601.64     13,237.11 
## fmarket_nn5 23,881    3,604.04     2,466.98    844.08     14,291.91 
## treebuff    23,881     939.55       761.23       12         4,103   
## --------------------------------------------------------------------

# join sd and neighborhoods 
joined_nhoods_sd <- st_join(nhoods, sd, join = st_intersects)
#filter to numeric values only
numericVars_nhoods <- 
 select_if(st_drop_geometry(joined_nhoods_sd), is.numeric) %>% na.omit()
#table
stargazer(numericVars_nhoods, type = "text", title = "Summary Statistics of Demographic Variables", summary = TRUE, digits = 2)

## 
## Summary Statistics of Demographic Variables
## ===================================================================================
## Statistic            N        Mean          St. Dev.         Min           Max     
## -----------------------------------------------------------------------------------
## TotalPop.x          429     4,190.82        1,510.95        1,027         9,047    
## MedHHInc.x          429    66,712.56       23,595.44       14,798        144,531   
## MedRent.x           429     1,086.22         258.56          441          2,339    
## pctWhite.x          429       0.43            0.29         0.0003         1.12     
## pctBachelors.x      429       0.02            0.02          0.00          0.12     
## pctPoverty.x        429       0.17            0.11          0.01          0.53     
## objectid            429  360,324,401.00    12,349.73     360,302,736   360,345,502 
## category_code       429       1.00            0.00            1             1      
## census_tract        429      129.51          109.76          15            388     
## depth               429      76.32           39.22            0            337     
## exempt_land         429      57.76           848.89           0          13,580    
## exterior_condition  429       3.05            1.25            1             7      
## fireplaces          429       0.13            0.49            0             5      
## frontage            429      22.54           27.28            0            388     
## garage_spaces       429       0.24            0.53            0             3      
## geographic_ward     429      30.53           14.02            1            65      
## house_number        429     2,335.85        2,234.71          3           9,218    
## interior_condition  429       2.64            1.18            1             7      
## number_of_bathrooms 429       1.77            0.76            1             4      
## number_of_bedrooms  429       3.14            0.83            1             8      
## number_of_rooms     429       6.53            1.63            4            15      
## number_stories      429       2.29            0.52            1             3      
## off_street_open     429     1,892.79        1,385.27         118          8,330    
## parcel_number       429  312,587,766.00  148,309,143.00  11,020,200    888,210,597 
## sale_price          429    391,861.80      304,197.40         0         3,150,000  
## street_code         429    54,943.27       23,246.32       11,160        89,390    
## total_area          429     2,238.17        4,881.59          0          57,987    
## total_livable_area  429     1,565.13         752.21          784          6,562    
## year_built          429     1,935.16         31.35          1,856         2,023    
## zip_code            429    19,138.19         11.07         19,114        19,153    
## pin                 429 1,001,366,978.00   169,449.80   1,001,051,076 1,001,682,780
## sale_year           429     2,022.26          0.44          2,022         2,023    
## musaID              429    12,452.80        6,976.42         93          23,693    
## TotalPop.y          429     4,190.82        1,510.95        1,027         9,047    
## MedHHInc.y          429    66,712.56       23,595.44       14,798        144,531   
## MedRent.y           429     1,086.22         258.56          441          2,339    
## pctWhite.y          429       0.43            0.29         0.0003         1.12     
## pctBachelors.y      429       0.02            0.02          0.00          0.12     
## pctPoverty.y        429       0.17            0.11          0.01          0.53     
## -----------------------------------------------------------------------------------

2.3 Correlation Matrix

This section evaluates which of the factors that we chose are highly related to one another, and hence will not prove to be useful for the model. In the matrix below, we can see that factors like street code and pincode are correlated, in addition to number of bed rooms and rooms, ward and parcel number, and lastly frontage and total area. In demographic factors, percent poverty and median rent and median household income seem to be correlated. For this reason, we will consider the following factors and leave out the rest: Exterior and interiors conditions, garage space, number of bedrooms, number of bathrooms, number of stories, fireplace, tyoe of view, year of housing built, and total built area.

ggcorrplot(
 round(cor(numericVars_nhoods), 1), 
 p.mat = cor_pmat(numericVars_nhoods),
 colors = c("blue", "white", "red"),
 type="lower",
 insig = "blank",
 tl.cex = 6) +  
 labs(title = "Correlation across Different Variables") +
 mapTheme()

2.4 Home Price Correlation Matrix Scatterplots

Plot 1: Total area versus Sales Price As the total area of the unit increases, so does the price. This may indicate that places with larger houses, typically away from dense settlements might have higher sales prices.

Plot 2: Year built versus sales price Newer construction is definitely more valued. Houses built post 1950s bring values of more than a million dollars. However, there seem to be many outliers that make it difficult to generalize.

Plot 3: Number of bedrooms versus sales price

Looks like sales price increases as the number of bedrooms increases. This might be because of simply more space means more prices but also because after a certain number of bedrooms, prices increases indepently too.

Plot 4: Trees in buffer versus sales price This plot definitely shows a high slope, or that sales prices go further up in neighborhoods or tracts with high number of trees.

css1 <- ggplot(numericVars, aes(x = sale_price, y = total_area)) +
  geom_point(colour = "grey", size = 1) +
  geom_smooth(method = "lm", se = FALSE, colour = "blue", size = 1) +
  labs(subtitle = "Plot1: House Area",
       x = "Price",
       y = "House Area") +
  theme_minimal()


css2 <- ggplot(numericVars, aes(x = sale_price, y = year_built)) +
  geom_point(colour = "grey", size = 1) +
  geom_smooth(method = "lm", se = FALSE, colour = "blue", size = 1) +
  labs(subtitle = "Plot2: Year Built",
       x = "Price",
       y = "Year house was built") +
  theme_minimal()

#css2

css3 <- ggplot(numericVars, aes(x = sale_price, y = number_of_bedrooms)) +
  geom_point(colour = "grey", size = 1) +
  geom_smooth(method = "lm", se = FALSE, colour = "blue", size = 1) +
  labs(subtitle = "Plot3 : Number of Bedrooms",
       x = "Price",
       y = "Bedrooms") +
  theme_minimal()

#css3


css4 <- ggplot(numericVars_sdknn, aes(x = sale_price, y = treebuff)) +
  geom_point(colour = "grey", size = 1) +
  geom_smooth(method = "lm", se = FALSE, colour = "blue", size = 1) +
  labs(subtitle = "Plot4 : Trees in buffer",
       x = "Price",
       y = "Number of trees in a buffer") +
  theme_minimal()

#css4

grid.arrange(css1, css2, css3, css4)

2.5 Mapping Home Price

The following map visualizes the home sale price dataset. The census tract geometry was chosen for this visualization to look at how nodes in Philadephia differ in mean home value. According to this map, the most expensive areas are in Center City, University City, and northwest Philadelphia.

# for age_house
sd <- sd %>%
  mutate(age = ifelse(year_built > 1000, 2021 - year_built, 0))

# grouping by census tract
sd_ct <- sd %>%
  group_by(census_tract) %>%
 summarise(
    sale_price = mean(sale_price, na.rm = TRUE),
    total_area = mean(total_area, na.rm = TRUE), 
    age = mean(age, na.rm = TRUE)
  )  %>%
  rename(NAME10 = census_tract)

sd_ct$NAME10 <- as.character(as.numeric(sd_ct$NAME10))

#joined_data <- left_join(sd_ct, ct, by = "NAME10")
joined_data <- st_join(nhoods, sd_ct, join = st_intersects)

joined_data <- joined_data[is.finite(joined_data$sale_price), ]

joined_data <- joined_data[is.finite(joined_data$total_area), ]

joined_data <- joined_data[is.finite(joined_data$age), ]

ggplot() +
  geom_sf(data = ct, color = "#5D4954", size = 0.3) +
  geom_sf(data = joined_data, aes(fill = sale_price), color = "#5D4954", size = 0.3) +
  scale_fill_viridis_c(option = "viridis", guide = "legend",
                       name = "Sale Price", 
                       breaks = seq(min(joined_data$sale_price), max(joined_data$sale_price), length.out = 5)) +
  labs(title = "Home Sale Prices by Census Tract",
       subtitle = "Philadelphia",
       caption = "  ") +
  mapTheme()

2.6 Mapping Independent Variables

Map 1: Total area This map shows the size of houses by area. It seems like most houses in the city have about 100 square footage but the ones in the West of the city are super big.

Map 2: Age This map shows home age by census tracts and it seems like most houses are close to a 100 years old but newer construction has happened in the west and north parts of the city.

Map 3: Median Household Income This map shows median household income and it seems like the western and northern parts of the city have comparatively higher median household income. This will be useful in looking at observed home pricing and help predict home pricing as well.

Map 4: Median Rent This map shows median house rent, similar to the household income map, the western and northern parts of the city have higher rents.

Map 5: Percent White Population This map has percent white population. There is a higher percent white population as we move away from the city.

# Livable area
map1 <- ggplot() +
  geom_sf(data = ct, color = "#5D4954", size = 0.3) +
  geom_sf(data = joined_data, aes(fill = total_area), color = "#5D4954", size = 0.3) +
  scale_fill_viridis_c(option = "viridis", guide = "legend",
                       name = "Area", 
                       breaks = seq(min(joined_data$total_area), max(joined_data$total_area), length.out = 5)) +
  labs(title = "Home Total Area by Census Tract",
       subtitle = "Total areas for houses in Philly",
       caption = "") +
  mapTheme()

# Age
map2 <- ggplot() +
  geom_sf(data = ct, color = "#5D4954", size = 0.3) +
  geom_sf(data = joined_data, aes(fill = age), color = "#5D4954", size = 0.3) +
  scale_fill_viridis_c(option = "viridis", guide = "legend",
                       name = "Age", 
                       breaks = seq(min(joined_data$age), max(joined_data$age), length.out = 5)) +
  labs(title = "Home Age by Census Tract",
       subtitle = "Age of housing in Philly",
       caption = "") +
  mapTheme()

# Median Household Income
map3 <- ggplot() +
  geom_sf(data = ct, color = "#5D4954", size = 0.3) +
  geom_sf(data = joined_data, aes(fill = MedHHInc), color = "#5D4954", size = 0.3) +
  scale_fill_viridis_c(option = "viridis", guide = "legend",
                       name = "Median Income in Dollars") +
  labs(title = "Median Household Income by Census Tract",
       subtitle = "Household income in Philly",
       caption = "") +
  mapTheme()

# Median House Rent
map4 <- ggplot() +
  geom_sf(data = ct, color = "#5D4954", size = 0.3) +
  geom_sf(data = joined_data, aes(fill = MedRent), color = "#5D4954", size = 0.3) +
  scale_fill_viridis_c(option = "viridis", guide = "legend",
                       name = "Median Rent in Dollars") +
  labs(title = "Median House Rent by Census Tract",
       subtitle = "House rent in Philly",
       caption = "") +
  mapTheme()

# Percent White Population
map5 <- ggplot() +
  geom_sf(data = ct, color = "#5D4954", size = 0.3) +
  geom_sf(data = joined_data, aes(fill = pctWhite), color = "#5D4954", size = 0.3) +
  scale_fill_viridis_c(option = "viridis", guide = "legend",
                       name = "Percent White Population", 
                       breaks = seq(min(joined_data$pctWhite), max(joined_data$pctWhite), length.out = 5)) +
  labs(title = "Percent White Population by Census Tract",
       subtitle = "House rent in Philly",
       caption = "") +
  mapTheme()

map1

map2

map3

map4

map5

2.7 Other Maps

Some other maps that would be helpful to our analysis include kernel density maps that visualize the density and spread of the various selected public amenities and other nearby facilities/neighbourhood features of Philadelphia. These include tree, hospital, farmer’s market, school, play area, grocery store, and crime spread.

Map 1: Tree the greatest density of trees exist across central Philadelphia, across center city and university city, as well as various parts of northwest Philadelphia. Trees are more scarce in other areas.

Map 2: Hospital Hospitals are not very well-distributes across Philadelphia, with the greatest density in center city, and few in other areas.

Map 3: Farmers’ Market Farmers’ markets also seem to concentrate in center city, as well as university city and northwest Philadelphia.

Map 4: School Schools are most evenly distributed across the city, while the northeast Philadelphia area has less schools spatially.

Map 5: Grocery Store Grocery stores are also similarly well-spread throughout, with certain pockets of concentration in the typically dense areas (center city, university city) as well as a particular node in North Philadelphia.

Map 6: Play streets Play streets are visibly more concentrated in center city, with other nodes of concentration across the city.

Map 7: Crime Crime (the crime identified in this set are those self-chosen as crimes that would most affect residential areas, i.e. violent or property crime) has a distinct concentration in South Philadelphia and parts of North Philadelphia.

## TREE DENSITY
treemap <- ggplot() + geom_sf(data = ct, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(tree)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "lightgreen", high = "darkgreen", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density of Trees, Philadelphia") +
  mapTheme()

## MED DENSITY
medmap <- ggplot() + geom_sf(data = ct, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(med)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.7, bins = 20, geom = 'polygon') +
  scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density of Hospitals, Philadelphia") +
  mapTheme()

## FARMERS MARKET DENSITY
fmarketmap <- ggplot() + geom_sf(data = ct, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(fmarket)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.7, bins = 20, geom = 'polygon') +
  scale_fill_gradient(low = "beige", high = "brown", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density of Farmers' Markets, Philadelphia") +
  mapTheme()

## SCHOOL DENSITY
schoolmap <- ggplot() + geom_sf(data = ct, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(school)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.8, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "lightyellow", high = "lightblue", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density of Schools, Philadelphia") +
  mapTheme()

## GROCER DENSITY
grocermap <- ggplot() + geom_sf(data = ct, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(grocer)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.7, bins = 20, geom = 'polygon') +
  scale_fill_gradient(low = "lightyellow", high = "lightgreen", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density of Grocers, Philadelphia") +
  mapTheme()

## PLAY AREA DENSITY
playmap <- ggplot() + geom_sf(data = ct, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(tree)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "lightgreen", high = "lightblue", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density of Play Areas, Philadelphia") +
  mapTheme()

## CRIME DENSITY
crimepts <- crime %>%
  filter(text_general_code %in% c("Aggravated Assault No Firearm", "Thefts", "Other Assaults", "Theft from Vehicle", "Aggravated Assault Firearm", "Burglary Residential", "Robbery No Firearm", "Robbery Firearm", "Rape", "Homicide - Criminal"),
         lat > -1) %>%
  dplyr::select(lat, lng) %>%
  na.omit() %>%
  st_as_sf(coords = c("lng", "lat"), crs = "EPSG:4326") %>%
  st_transform('ESRI:102286') %>%
  distinct()

crimemap <- ggplot() + geom_sf(data = ct, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(crimepts)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "#25CB10", high = "#FA7800", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density of Violent Crimes, Philadelphia") +
  mapTheme()

treemap

medmap

fmarketmap

schoolmap

grocermap

playmap

crimemap

3. Methodology: Machine Learning Geospatial Analysis

Based on our data wrangling and data visualizations, we were able to select variables to best support a model to predict home value. Specifically, this model uses machine learning and geospatial analysis to create a hedonic model based on a range of variables. These variables are as previously described, including internal characteristics, public amenities/services, and other spatial information.

Among various statistical models abstracting reality, the Ordinary Linear Regression (OLS) model was chosen due to its relatively transparent process and computational efficiency. Initially, 70% of the data served as a ‘Training Set’ for building the predictive model, while the remaining 30% constituted the ‘Test Set’ for evaluating the accuracy and generalizability of the model. Through the evaluation, the test set was analyzed through various new variables (level of error, lag error, R-squared, and other variables).

These methods are explored and interpreted further in a step-by-step manner in the following section.

4. Hedonic Model Results

4.1 Table of Results (Training Set)

As previously mentioned, our original dataset is split into a training and test set, with the training set building the model. This modelling training is is applied an OLS regression model. For internal characteristics of units, we used the number of bedrooms, bathrooms, stories, year built, exterior, interior, garage space, fireplaces, view type, and total area. For amenities and public service we used crime, school, grocery, heat, trees, playstreets, and hospitals. As part of spatial features we looked at percent white, median household income, and percent poverty. This training regression is pulled to get a summary table of each of the variables as well as the r-squared. This summary table shows the estimate values for the coefficients as well as the r-squared value. The R-squared value is about 0.56, indicating that approximately 56% of the variance of sale price can be explained by our model.

sdmerge.modelling <- sdmerge %>%
  filter(toPredict == "MODELLING")

set.seed(750)
inTrain <- createDataPartition(
              y = paste(sdmerge.modelling$sale_price
                        ), 
              p = .70, list = FALSE)
training <- sdmerge.modelling[inTrain,] 
test <- sdmerge.modelling[-inTrain,]  

reg.training <- 
  lm(sale_price ~ ., data = as.data.frame(training) %>% 
                             dplyr::select(sale_price,
                                           exterior_condition,
                                           garage_spaces,
                                           interior_condition, 
                                           MedRent,
                                           pctWhite,
                                           pctPoverty,
                                           number_of_bedrooms,
                                           number_of_bathrooms,
                                           number_stories,
                                           fireplaces,
                                           garage_spaces,
                                           view_type,
                                           year_built,
                                           total_area,
                                           crimebuff,
                                           crime_nn4,
                                           crime_nn5,
                                           playbuff,
                                           schoolbuff,
                                           school_nn2,
                                           school_nn3,
                                           school_nn5,
                                           medbuff,
                                           grocerbuff,
                                           treebuff,
                                           fmarketbuff,
                                           HVI_SCORE,
                                           ))

test <-
  test %>% #create predictions here, compare predictions to the observed
  mutate(Regression = "Baseline Regression",
         sale_price.Predict = predict(reg.training, test),
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price.Predict)%>%
  filter(sale_price < 5000000) 
# 
# mean(test$sale_price.APE, na.rm = TRUE)
# mean(test$sale_price)
# mean(test$sale_price.AbsError, na.rm = TRUE)

tidyr.t <- tidy(reg.training)
r_squared <- summary(reg.training)$r.squared
tidyr.t <- rbind(tidyr.t, c("R-squared", NA, NA, NA, r_squared))
kable(tidyr.t, format = "html", digits = 4, caption = "Summary Table for Training Set") %>%
  kable_classic(full_width = T, html_font = "Helvetica", font_size = 15) %>%
  row_spec(0, bold = T) %>%
  footnote(general = "", fixed_small_size = T, general_title = "")

Summary Table for Training Set
term	estimate	std.error	statistic	p.value
(Intercept)	-345427.576237817	87201.1290883935	-3.96127412395849	7.48551228104291e-05
exterior_condition	-22464.3204990188	2280.8224051337	-9.84921949576427	7.95201409155221e-23
garage_spaces	21445.0170492749	1846.15603569271	11.616037125068	4.47694619138194e-31
interior_condition	-27376.7537282187	2327.66129210213	-11.7614851529771	8.18917190512961e-32
MedRent	88.3085472141884	7.69777285965703	11.4719606338349	2.36040688662893e-30
pctWhite	71436.995176292	11047.5985833915	6.46629171372025	1.03175392846032e-10
pctPoverty	-19795.163407382	15307.0737134964	-1.29320363760503	0.195958219843113
number_of_bedrooms	-8928.06586540461	1405.74595219789	-6.35112329610164	2.19175899702311e-10
number_of_bathrooms	54200.6882615458	2832.14465853308	19.1376835566087	8.67277724331793e-81
number_stories	55469.0142181736	2561.34498284452	21.65620585657	1.29228892205375e-102
fireplaces	131977.050015673	4871.49752175237	27.0916795967491	2.79752026020309e-158
view_type0	30164.7663167493	39225.3064320629	0.76901289143512	0.441896346863053
view_typeA	42560.5610636493	34061.9525013105	1.24950444523143	0.211497840301431
view_typeB	22733.8446905374	44665.9572640299	0.508974755788907	0.610776528067202
view_typeC	137058.493672024	34449.3418189484	3.97855188039143	6.96239442162772e-05
view_typeD	1320.74763122881	38150.6467765912	0.0346192723537051	0.972383740258559
view_typeE	5175.73188579092	38303.2582857241	0.135125107299813	0.892514560088899
view_typeH	43697.3713165848	36831.7445584247	1.1864051469859	0.235478928947373
view_typeI	28241.8319466729	32914.2503607722	0.858042690844086	0.390880993879816
year_built	181.63042596967	41.0328751236527	4.42646111008126	9.63883824241568e-06
total_area	13.0919713374087	0.427334465341374	30.6363572312153	1.11090627608545e-200
crimebuff	-513.734803477118	65.6222415318643	-7.82866893121387	5.21942684322719e-15
crime_nn4	-1370.77359094705	294.237347651186	-4.65873418819723	3.20571756978093e-06
crime_nn5	1580.60105777607	285.339657858229	5.53936690623423	3.08070891665012e-08
playbuff	-435.669818952762	276.452786302405	-1.57592847871026	0.115060904345149
schoolbuff	-1941.00471070437	1061.19115881317	-1.82908111755773	0.0674049745077005
school_nn2	60.4001103179244	29.7042248738397	2.03338449578997	0.0420292012746491
school_nn3	-82.1364671700063	44.9218890835185	-1.82842860898611	0.067502773298799
school_nn5	44.5936858264166	24.1702810863808	1.84498002596849	0.0650578358860428
medbuff	18077.5575981134	1675.27114253341	10.7908249232872	4.65499360238264e-27
grocerbuff	1588.25993391847	423.854848752827	3.7471788717101	0.000179426257512695
treebuff	120.486094704403	3.69088968185975	32.6441874696436	9.25138387252468e-227
fmarketbuff	306.297353615634	1351.17667003135	0.226689344487073	0.820668040361049
HVI_SCORE	-1064.31695809294	996.774445125994	-1.06776108004897	0.285643468590333
R-squared	NA	NA	NA	0.569180698634477

4.2 Table of goodness of fit (Test Set)

The following table shows the r-squared, or the the proportion of variance in the dependent variable that can be explained by the independent variable, the Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE). While important in evaluating the model, R-squared does not indicate whether the model itself is reliable, or whether the coefficient estimates and predictions are biased. Thus, the following table looks at the goodness of fit for the test set of the model, to better understand the predictive accuracy and performance of the model. According to this table, the calculation of the MAE for this seed/iteration of the model is about $76913, while the MAPE (mean absolute percentage error) is about 47%. These metrics offer insights into the model’s predictive accuracy and the degree of error in the predicted values. Given that lower values for MAE and MAPE are desirable, this means that our model is of higher error. This will be further explored in the following sections.

R = cor(test$sale_price, test$sale_price.Predict, method = "pearson", use = "complete.obs")
R2 = R*R
MAE = mean(test$sale_price.AbsError, na.rm = T)
MAPE = mean(test$sale_price.APE, na.rm = T)

testsum <- as.data.frame(cbind(R2, MAE, MAPE))

colnames(testsum)<-c("R Squared", "MAE", "MAPE")

kable(testsum, caption = "", align = "c") %>%
  kable_classic(full_width = T, html_font = "Helvetica", font_size = 15) %>%
  row_spec(0, bold = T) %>%
  footnote(general = "", fixed_small_size = T, general_title = "")


R Squared	MAE	MAPE
0.5583617	76894.81	0.556145

4.3 Cross-Validation MAE histogram

A cross-validation MAE histogram will allow us to evaluate the generalizability of the model over different data iterations. According to the histogram, the mean absolute error has a frequency curve that is highest at approximately $80k, which is very similar to its mean value. This spread however is not symmetrical, with a somewhat wider spread and certain outliers (particularly towards the high end) meaning our model may not be as accurate.

fitControl <- trainControl(method = "cv", number = 100)

reg.cv <- 
  train(sale_price ~ ., data = st_drop_geometry(sdmerge.modelling) %>% 
                             dplyr::select(sale_price,
                                           exterior_condition,
                                           garage_spaces,
                                           interior_condition, 
                                           MedRent,
                                           pctWhite,
                                           pctPoverty,
                                           number_of_bedrooms,
                                           number_of_bathrooms,
                                           number_stories,
                                           fireplaces,
                                           garage_spaces,
                                           view_type,
                                           year_built,
                                           total_area,
                                           crimebuff,
                                           crime_nn4,
                                           crime_nn5,
                                           playbuff,
                                           schoolbuff,
                                           school_nn2,
                                           school_nn3,
                                           school_nn5,
                                           medbuff,
                                           grocerbuff,
                                           treebuff,
                                           fmarketbuff,
                                           HVI_SCORE
                                           ),
     method = "lm", trControl = fitControl, na.action = na.pass)


MAE.cv = data.table(MAE.cv= reg.cv$resample[,3])
MAE.sd <- as.character(round(sd(reg.cv$resample[,3])))
MAE.mean <- as.character(round(mean(reg.cv$resample[,3])))

caption <- paste("Sd =",MAE.sd, "  Mean =",MAE.mean, " K = 100")

ggplot(MAE.cv, aes(x = MAE.cv))+
  geom_histogram(fill=bluePalette5[3], color = greyPalette5[1], size = 2, xlim=c(20,50))+
  labs(title = "Cross-Validation MAE histogram",
       subtitle = caption,
       x="MAE",
       y="Count",
       caption = "")  +
  plotTheme()

4.4 Predicted prices as a function of observed prices

The following scatterplot plots predicted prices as a function of observed prices. According to this scatterplot, the model performs rather well, considering the line of best fit for the the points almost completely is overlapped by the line from the prediction model. However, it is still interesting to observethe various outliers that would result in high error, particularly those with a much higher observed price than predicted price. These points are visualized much higher than the prediction line.

set.seed(750)

sdmerge.all <- sdmerge %>%
  filter(toPredict %in% c("MODELLING", "CHALLENGE"))
 
reg.sdmerge.all <- 
  lm(sale_price ~ ., data = as.data.frame(sdmerge.all) %>% 
                             dplyr::select(sale_price,
                                           exterior_condition,
                                           garage_spaces,
                                           interior_condition, 
                                           MedRent,
                                           pctWhite,
                                           pctPoverty,
                                           number_of_bedrooms,
                                           number_of_bathrooms,
                                           number_stories,
                                           fireplaces,
                                           garage_spaces,
                                           view_type,
                                           year_built,
                                           total_area,
                                           crimebuff,
                                           crime_nn4,
                                           crime_nn5,
                                           playbuff,
                                           schoolbuff,
                                           school_nn2,
                                           school_nn3,
                                           school_nn5,
                                           medbuff,
                                           grocerbuff,
                                           treebuff,
                                           fmarketbuff,
                                           HVI_SCORE,
                                           ))

sdmerge.all <-
  sdmerge.all %>%
  mutate(Regression = "Baseline Regression",
         sale_price.Predict = predict(reg.sdmerge.all, sdmerge.all),
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price.Predict)%>%
  filter(sale_price < 5000000) 

ggplot(sdmerge.all, aes(x = sale_price.Predict, y = sale_price)) +
  geom_point(size = 0.8, alpha = 0.8, color = "darkblue") +
  geom_abline(color = greyPalette5[4]) +
  geom_smooth(method = "lm", se = FALSE, colour = bluePalette5[5], size = 1) +
  labs(title = "Predicted prices as a function of Observed prices",
       subtitle = "Grey line: Line of Best Fit | Blue line: Prediction",
       x = "Predicted price",
       y = "Observed price",
       caption = "") +
  plotTheme()

4.5 Map of residuals with Moran’s I test

The following Map of Residuals of the test set of the model allows us to understand the error and accuracy of the model. In the map, points with a deeper blue shade are overvalued while very light shades are undervalued. Unfortunately, our model appears not to be performing well across different neighbourhoods in Philadelphia, with deep blue (overvaluing home value) in parts of Fishtown and South Philadelphia, and light blue (near white) more towards the northeast fringes. These spatial patterns will be examined through the Moran’s I test below.

ggplot() +
  geom_sf(data = ct, color = "#5D4954", size = 0.3) +
  geom_sf(data = test, aes(colour = q5(sale_price.Error)), alpha = 0.4, size = 0.5, show.legend = "point")+
  labs(title = "Map of Residuals (Test Set)", 
       subtitle = "",
       caption = '') +
  scale_colour_manual(values = bluePalette5,
                      labels=round(as.numeric(qBr(test,"sale_price.Error"))),
                      name="Residuals ($)",
                      na.translate=FALSE) +
  mapTheme()

The following Moran’s I test and chart visualize the spatial autocorrelation of the observations. The point of the test is to understand the similarity of home value spatially–that is, to those near a particular home. The line is slightly to the right of the average, suggesting that there is a moderate level of positive spatial autocorrelation or clustering, meaning certain homes close together may be similar in price, but this may not be the case across the board–especially for extreme values.

coords.test <-  st_coordinates(test)
neighborList.test <- knn2nb(knearneigh(coords.test, 5)) # k = 5, five nearest points

spatialWeights.test <- nb2listw(neighborList.test, style="W")

test$lagPrice <- lag.listw(spatialWeights.test, test$sale_price, NAOK = TRUE)
test$lagPriceError <- lag.listw(spatialWeights.test, test$sale_price.Error, NAOK = TRUE)

#imputing mean values!
missing_values <- which(is.na(test$sale_price.Error))
test$sale_price.Error[missing_values] <- mean(test$sale_price.Error, na.rm = TRUE)

moranTest <- moran.mc(test$sale_price.Error, spatialWeights.test, nsim = 999)

#saveRDS(moranTest, file = "moranTest.RDS")

MT <- ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  plotTheme()

SL1 <- ggplot(test, aes(x = lagPrice, y = sale_price))+
  geom_point(size =0.8, alpha = 0.8, color = bluePalette5[3])+
  geom_abline(color = greyPalette5[4])+
  geom_smooth(method = "lm", se = FALSE, colour = bluePalette5[4], size = 1)+
  labs(title = "Price vs Spatial Lag Price",
       subtitle = "Grey line : Perfect     Blue line : Average",
       x="Spatial Lag Price",
       y="Observed price",
       caption = " ") +
  plotTheme()
 

SL2 <- ggplot(test, aes(x = lagPriceError, y = sale_price.Error))+
  geom_point(size =0.8, alpha = 0.8, color = bluePalette5[3])+
  geom_abline(color = greyPalette5[4])+
  geom_smooth(method = "lm", se = FALSE, colour = bluePalette5[4], size = 1)+
  labs(title = "Price Error vs Spatial Lag Price Error",
       subtitle = "Grey line : Perfect     Blue line : Average",
       x="Spatial Lag Error",
       y="Error",
       caption = "") +
  plotTheme()


MT

Lastly, the following two charts show functions to their respective spatial lag value. The concept of price clustering is best visualized here. We see a rather strong positive relationship with the price value and the spatial lag price, meaning that an increase in the price results in an increase of the spatial lag price, representing nearby houses. This is similar for error as well, where greater price error results in greater price error for nearby houses.

grid.arrange(SL1, SL2, ncol=2)

4.6 Map of Predicted Values

The following maps visualize all the sale price values across Philadelphia. The map on the right has predicted values while the map on the left with observed values. Based on these two visualizations, the prediction somewhat matches that of the observed values when comparing across neighborhood clusters.

PP <- ggplot() +
  geom_sf(data = ct, color = "#5D4954", size = 0.3) +
  geom_sf(data = sdmerge.all, aes(colour = q5(sale_price.Predict)), alpha = 0.4, size = 0.5, show.legend = "point")+
  labs(title = "Map of Predicted House Price", 
       subtitle = "Entire data set",
       caption = '')+
  scale_colour_manual(values = orangePalette5,
                      labels=round(as.numeric(qBr(sdmerge.all,"sale_price.Predict"))/1000),
                      name="Predicted\nPrice\n(1,000$)",
                      na.translate=FALSE)+
  mapTheme()

OP <- ggplot() +
  geom_sf(data = ct, color = "#5D4954", size = 0.3) +
  geom_sf(data = sdmerge.all, aes(colour = q5(sale_price)), alpha = 0.4, size = 0.5, show.legend = "point")+
  labs(title = "Map of Observed House Price", 
       subtitle = "Entire data set",
       caption = '')+
  scale_colour_manual(values = bluePalette5,
                      labels=round(as.numeric(qBr(sdmerge.all,"sale_price"))/1000),
                      name="Observed\nPrice\n(1,000$)",
                      na.translate=FALSE)+
  mapTheme()

grid.arrange(OP, PP, ncol = 2)

4.7 Map of MAPE by Census Tract

Using the test set predictions, the following map was created below to visualize mean absolute percentage error (MAPE) across Philadelphia. In this case, census tract was chosen as the level of geography.

However, it is important to note that many of the MAPE values were not included in the map because they were infinite and unable to be represented in the map. Only finite values for this column were included, leading to less visualized datapoints. Also, given that the values are mean values across the census tract, it is possible that the APE values for different homes in one census tracts were very high or very low, leading to a “cancelling out” effect where the MAPE appears lower than it should for the error that occurred in the model. This also could have lead to the infinite values that could not be mapped at all.

test.MAPE <- test %>%
  group_by(census_tract) %>%
  summarise(meanMAPE = mean(sale_price.APE), meanPrice = mean(sale_price)) 

test.MAPE <- st_join(sd_ct, test.MAPE, join = st_intersects)
test.MAPE <- st_join(nhoods, test.MAPE, join = st_intersects)
test.MAPE <- test.MAPE[is.finite(test.MAPE$meanMAPE), ]

ggplot() +
  geom_sf(data = ct, color = "#5D4954", size = 0.3) +
  geom_sf(data = test.MAPE, aes(fill = meanMAPE),  size = 0.5) +
  scale_fill_viridis_c(option = "viridis", guide = "legend",
                       name = "Mean Average Percent Error", 
                       breaks = seq(min(test.MAPE$meanMAPE), max(test.MAPE$meanMAPE), length.out = 5)) +
  labs(title = "Mean Average Percent Error across Census Tracts, Philadelphia",
       subtitle = "",
       caption = "") +
  mapTheme()

4.8 Scatterplot plot of MAPE by census tract as a function of mean price

The following scatterplot shows MAPE by census tract as a function of mean price by census tract. Similar to the map, this plot visualizes most of the available observations having very low MAPE values. It is very likely that the MAPE are not accurately represented in this scatterplot, resulting in a skewed diagram, given that the previous tables showed higher absolute error.

ggplot(test.MAPE , aes(x = meanMAPE, y = meanPrice))+
  geom_point()+
  geom_smooth(method = "lm", se = FALSE, colour = bluePalette5[4], size = 1)+
  labs(title = "Scatterplot of MAPE as a function of mean price",
       subtitle = "by census tract",
       x="MAPE",
       y="price",
       caption = "") +
  theme(axis.title.x = element_text(size = 10),
        axis.title.y = element_text(size = 10),
        text = element_text(size = 10),
        panel.background = element_rect(fill = greyPalette5[1]),
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5, face ="italic"),
        plot.caption=element_text(hjust = 0.5)
        )

4.9 Split by Census Groups

We split our study area into two groups with respect to poverty and income. In the poverty context map below, the blues correspond to areas with low poverty and white correspond to areas with high poverty. Also, in the income context map, the blues correspond to high income and whites correspond to low income. When we compare this to the map of predicted house prices, it is evident that the model accurately predicts for relatively high and relatively low values accurately. Thus, the model is generalizable for Philadelphia. For example, the Western edge of the city, where there is low poverty and high income has predicted house values which are quite high. Whereas, for the southwestern part of the city which has high poverty and low income, the model predicts lower house prices. This is of course based on the assumption that areas with low poverty and high income already correspond to higher observed house prices.

nhoods.split <- nhoods %>%
  mutate(povertyContext = ifelse(pctPoverty > .15, "High Poverty", "Low Poverty"),
         incomeContext = ifelse(MedHHInc > mean(MedHHInc,na.rm = T), "High", "Low"))

CT1 <- ggplot() +
  geom_sf(data = ct, color = "#5D4954", size = 0.3) +
  geom_sf(data = nhoods.split, aes(fill = povertyContext), color = greyPalette5[2], size = 0.2) +
  scale_fill_manual(values = c(bluePalette5[1],bluePalette5[4]),
                    name = "Poverty Context",
                    na.translate=FALSE) +
  labs(title = "Poverty Context", 
       subtitle = "Low Poverty Percentage : Percent of poverty < 15%",
       caption = '') +
  mapTheme()

CT2 <- ggplot() +
  geom_sf(data = ct, color = "#5D4954", size = 0.3) +
  geom_sf(data = nhoods.split, aes(fill = incomeContext), color = greyPalette5[2], size = 0.2) +
  scale_fill_manual(values = c(bluePalette5[4],bluePalette5[1]),
                    name = "Income Context",
                    na.translate=FALSE) +
  labs(title = "Income Context", 
       subtitle = "High Income : Income > Median Income",
       caption = '') +
  mapTheme()

grid.arrange(CT1, CT2, ncol = 2)

5. Model Discussion

This model is not the most effective and we would not recommend it to Zillow because of the high values of mean absolute errors and mean absolute percent error. This model does predict most values accurately but is poor in accounting for outliers. Some of the most interesting variables were demographic like median household income and percent poverty because Philadelphia is already a very polar city; areas very distinctly have very high or very low incomes and this very directly helps us predict for the city well. It was also interesting to see variables like heat and trees feed into the model to make it more accurate. This means that people have in the past and do very actively care about climatic factors and want to take that into consideration while buying a home.

We could predict variation in prices closest to the mean well but not for outliers. According to the scatterplot under 5.4, the model performs rather well, considering the line of best fit for the points almost completely is overlapped by the line from the prediction model. However, it is still interesting to observe the various outliers that would result in high error, particularly those with a much higher observed price than predicted price. Thus, the model does not predict outliers that well, suggesting it is not generalizable beyond its current observations.

The model does take into account very interesting factors and features that make it much more local and context based. Adding some of these has proven to improve the models fit like trees, crime and heat index, getting the accuracy of the model to 56%. This model thus performs well in generalizability but poorly in accuracy because of outliers.

According to our maps, the model performs very well in corresponding to areas with high or low observed house prices. The two maps are very aligned and it shows that the model accurately corresponds to different neighborhoods in Philadelphia.

6. Conclusion

In conclusion, our model was only able to account for about 56% of the variation in home value. To improve our model in the future, we would consider further variables like percent home ownership, proximity to public transit, transit patterns of homeowners, etc. This model can also be improved by using more recent (2023) data, and by accounting for external occurrences that cause anomalies, namely the COVID-19 pandemic’s impact on the job market and the new wave of remote work on real estate. Regardless, this model provides an excellent baseline hedonic model of which can be improved upon to support Zillow’s home value estimates.

Zillow Home Value Predictions

A hedonic home price prediction model to help Zillow better predict housing value.

Stephanie Cheng, Shreya Bansal

2023-10-20