Summary

Real estate listing descriptions offer insight into market inventory and perceived buyer preferences. Overall themes in the listings and the evolution of themes through time are an important summary of local market conditions. Real estate markets exhibit complicated dynamic behavior, and learning time-varying themes from listings is challenging for two reasons: first, the listings are textual descriptions of a home; and second, the descriptions reflect the local market at a single point in time. In this brief, we describe the Dynamic Linear Topic Model, a new method of learning from text that allows us to overcome both of these challenges. In the listings data, we learn themes related to location, outdoor spaces, renovations, views, and more. For each theme, we learn a unique time-varying pattern. Interactive visualizations of the learned topics illustrate their dynamic behavior.

Unstructured data is a rich source of information

Data can typically be categorized as either structured or unstructured. Structured data, like that found in spreadsheets and relational databases, has a standard format. The prices of homes in a neighborhood are an example of structured data: they can be presented in a two-column table with one column for an identifier of each house and the second column for the price. As a result, structured data is easy to query, sort, and explore with standard tools. By contrast, unstructured data -- like pictures, video, and text -- lacks a standard format for presenting information. Idiosyncrasies in form and style mean that identifying valuable pieces of information is more difficult. The challenge of learning from unstructured data is pushing the frontiers of artificial intelligence, machine learning, and statistics.

At Zillow, structured data has been the driver of innovations such as the Zestimate and Zillow Home Value Index (ZHVI). Directly quantifiable characteristics of a home like square footage, year built, and number of bedrooms facilitate analytical comparisons of one property to another, which is useful for buyers, sellers, and real estate professionals; however, the unique character that distinguishes a house from a home is difficult to quantify with numbers alone. The feeling of home is more often evoked by pictures, videos, and text descriptions.

In this post, we describe one way that important features of a home can be extracted from listing descriptions. Unstructured listing descriptions complement pictures and videos by offering information and context that visual media cannot fully provide. As an example, details about renovations and original elements of the home can be easily conveyed with text, whereas critical updates to plumbing, heating/cooling, and electrical systems are not easily captured by pictures. The same is true of specific details about building materials such as hardwood flooring, cabinets, and countertops. For some buyers, oak floors and granite counters are attractive features, but the actual building materials are not always clear in pictures. Important aspects of the neighborhood and surrounding environment are also included in text descriptions. Terms like family-friendly, vibrant, and diverse are keywords that shed light on the nature of a community.

Real estate listings exhibit complicated time-varying patterns

The goal of this analysis is threefold: first, we want to learn the important themes in a particular market from a set of real estate listings and understand how these themes change over time; second, we want to understand the relative importance of each theme in the overall market; and third, we want to understand how much each theme contributes to individual listing descriptions.

One way of learning themes from text is with topic models. Topic models represent themes in text with probabilities; a topic is defined as a probability distribution over all the possible words that could be used to create a document. A real estate topic about kitchens may place high probability on words like appliances, stainless steel, and granite but very low probability on words like garden, patio, and backyard. On the other hand, a topic focused on outdoor spaces may place high probability on garden, patio, backyard, but give very little weight to kitchen-related terms. Our first goal is to learn these topics from the listing descriptions.
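
As a minimal sketch of this definition, a topic can be stored as a mapping from vocabulary words to probabilities that sum to one. The six-word vocabulary and all of the numbers below are invented for illustration; they are not estimates from the listings data.

```python
# Two hypothetical topics over a tiny, made-up vocabulary.
# Each topic is a probability distribution: non-negative values that sum to 1.
kitchen_topic = {
    "appliances": 0.30, "stainless": 0.27, "granite": 0.23,
    "garden": 0.08, "patio": 0.07, "backyard": 0.05,
}
outdoor_topic = {
    "appliances": 0.05, "stainless": 0.05, "granite": 0.05,
    "garden": 0.40, "patio": 0.25, "backyard": 0.20,
}

def top_words(topic, n=3):
    """Return the n most probable words in a topic, most probable first."""
    return sorted(topic, key=topic.get, reverse=True)[:n]

print(top_words(kitchen_topic))  # ['appliances', 'stainless', 'granite']
print(top_words(outdoor_topic))  # ['garden', 'patio', 'backyard']
```

Both topics share the same vocabulary; they differ only in where they place their probability.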

One particularly important element of real estate markets is that they are always changing. Evolution in market conditions is reflected in prices, the amount of time needed to sell a home, and even in the listing description. Sellers and agents describe a home to make it most appealing in a dynamic marketplace. Emphasizing a home's garden when selling in the summer or its fireplace when selling in winter is a natural way of appealing to current buyer preferences. While some market themes may recur seasonally, others may significantly increase or decrease in importance from one year to the next, like when neighborhoods become more desirable or fall out of favor. Such market trends can persist for many years. Our second goal is to learn the complicated dynamic patterns that topics exhibit.

While some Zillow users are interested in overall market patterns, most are looking for a specific home. The choice of a home is very personal: some prefer homes that are cozy and cute; others are looking for renovated classics. Ideally, the unique features of a home match the individual preferences of the buyer. Classifying individual listings by their themes may lead to improved recommendations to Zillow users. This is why our third goal is to understand how much each topic contributes to individual listings.

Real estate listings from Zillow

The data used for this analysis is a collection of more than 11,000 listing descriptions from Seattle, Washington from November of 2007 to March of 2017. Each listing is summarized by the number of times each word appears. This bag of words approach is standard in topic modeling, and it facilitates the dynamic modeling strategy discussed below.
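
The bag of words summary can be sketched in a few lines. The tokenization rule (lowercase, alphabetic tokens only) and the sample listing are assumptions made for illustration; the preprocessing actually applied to the Seattle listings is not described here.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# A made-up listing description, used only for illustration.
listing = "Updated kitchen with granite counters. Kitchen opens to a private patio."
counts = bag_of_words(listing)

print(counts["kitchen"])  # 2
print(counts["patio"])    # 1
```

Word order is discarded: only the counts are kept, which is what makes a listing easy to compare with a topic's word probabilities.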

Dynamic Linear Topic Models (DLTM)

To describe the statistical method used to learn the topics and their dynamic patterns, let's think about a procedure for writing a new listing. The procedure begins with a vocabulary, which is the predefined set of words that can appear in all the listings. Next, we choose the number of themes that we want the collection to contain. For simplicity, let's imagine a set of listings containing two different themes: location and indoor spaces. Each of these topics will correspond to a different bag of words. We can think of each bag as containing ping pong balls marked by vocabulary words. For example, in the bag of words associated with location, there will be many ping pong balls with words like neighborhood and school so that if we reach into the bag and randomly select a ball, it is likely that the word on the selected ball will describe the location.

Let's write a new listing in March 2018 with 100 words. To choose the first word in the listing, we need to reach into one of the two bags to draw a ping pong ball. Which one do we choose? In this model, we choose the bag by flipping a coin. If we flip heads, we reach into the location bag. If we flip tails, we choose from the indoor spaces bag. Let's say we flipped tails, and we choose the first word in our listing by reaching into the indoor spaces bag and randomly drawing one of the ping pong balls. It's marked kitchen. Kitchen is the first word in our new listing.

To write the second word, we repeat the process by flipping the same coin again to choose one of the two bags. Let's say this time we flipped heads, and we draw the word downtown from the location bag. Downtown is our second word. This process repeats until the hundredth word is selected and the new listing is complete.
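
The word-by-word procedure above can be sketched as a short simulation. The bag contents and probabilities are invented, and the function name `write_listing` is hypothetical; this illustrates the coin-and-bag story, not the code used in the analysis.

```python
import random

def write_listing(bags, p_heads, n_words, rng):
    """Generate a listing one word at a time.

    bags: dict with keys "location" and "indoor", each mapping word -> probability.
    p_heads: probability the coin lands heads, which selects the location bag.
    """
    words = []
    for _ in range(n_words):
        # Flip the coin to choose a bag.
        bag_name = "location" if rng.random() < p_heads else "indoor"
        bag = bags[bag_name]
        # Draw one ball from the chosen bag according to its word probabilities.
        word = rng.choices(list(bag), weights=list(bag.values()), k=1)[0]
        words.append(word)
    return words

# Made-up bags for the two-topic example.
bags = {
    "location": {"neighborhood": 0.4, "school": 0.3, "downtown": 0.3},
    "indoor": {"kitchen": 0.5, "bedroom": 0.3, "fireplace": 0.2},
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
listing = write_listing(bags, p_heads=0.5, n_words=100, rng=rng)
print(listing[:5])
```

Each word is drawn independently, so the listing is a bag of words rather than a grammatical sentence.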

In June 2018, we want to write another new listing. To represent the passage of time, the bags of words are slightly changed. In the location bag, there are a few more ping pong balls with the term lake and a few fewer ping pong balls with highway. The indoor spaces bag will also slightly change. In addition to the bags themselves changing, the coin we flip to choose the bag also changes. Our new coin may not have the same probability of flipping heads as the original coin. For example, instead of a 50-50 chance of flipping heads with the original coin, suppose that there is a 70 percent chance of flipping heads with our new coin. With both the bags of words and the coin changed by the passage of three months, we again write the document by flipping the coin, choosing a bag, and selecting a ping pong ball from the bag.
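
To make the passage of time concrete, the sketch below tracks one bag and the coin at the two dates. The ball counts are invented; only the direction of the changes (more lake, fewer highway, heads moving from 50 to 70 percent) comes from the example above.

```python
def proportions(ball_counts):
    """Convert ball counts in a bag to the share of the bag held by each word."""
    total = sum(ball_counts.values())
    return {word: count / total for word, count in ball_counts.items()}

# Hypothetical ball counts for the location bag in March 2018.
location_march = {"neighborhood": 40, "school": 30, "highway": 20, "lake": 10}

# By June 2018 a few balls have changed: more "lake", fewer "highway".
location_june = dict(location_march, lake=15, highway=15)

# The coin changes too: the chance of heads (location bag) moves from 0.5 to 0.7.
p_heads = {"2018-03": 0.5, "2018-06": 0.7}

print(proportions(location_march)["lake"])  # 0.1
print(proportions(location_june)["lake"])   # 0.15

# Expected number of location words in a 100-word listing at each date.
expected_location_words = {month: 100 * p for month, p in p_heads.items()}
print(expected_location_words)
```

Both kinds of change matter: the bag proportions describe how a topic's wording drifts, while the coin describes how much of a listing the topic occupies.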

In this example, we are writing new listings with a randomized procedure. When analyzing the actual data, we reverse the process. The listings are observed, but we do not observe the bags of words or the coin used to choose the bags. We must learn them. Our first goal is to use the listings data to estimate the percentage of ping pong balls in the location bag associated with each word. We estimate the percentages in each bag of words at each time point. This corresponds to learning the topics and how the topics change over time. Our second goal is to learn the probability of flipping heads (i.e., selecting the location topic). We want to learn the probability of each topic at each time point. This corresponds to learning the importance of each topic in the corpus as a whole and how that importance changes over time. In models with more than two topics, the process is similar to flipping a coin with many sides.

The innovation of the Dynamic Linear Topic Model is the ability to model complicated patterns in the probability of selecting each topic. Imagine that we want to use the same coin to select topics in March 2019 as we do in March 2018. It is possible to model this repeated use of the same coin as seasonality in the topic probabilities. It is also possible to model probabilities that rapidly increase or decrease from one month to the next with polynomial trends.
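
One hypothetical way to picture seasonality and trend in the coin is sketched below for the two-topic case. The functional form (a straight-line trend plus a single 12-month sinusoid on the log-odds scale) and every parameter value are assumptions chosen for illustration; the model's actual specification is not given in this post.

```python
import math

def topic_probability(month_index, level, slope, amplitude, period=12):
    """Probability of choosing a topic in a given month (two-topic case).

    The log-odds are a straight-line trend plus one sinusoid that repeats
    every `period` months; the logistic function maps them into (0, 1).
    """
    seasonal = amplitude * math.sin(2 * math.pi * month_index / period)
    log_odds = level + slope * month_index + seasonal
    return 1 / (1 + math.exp(-log_odds))

# Pure seasonality (slope = 0): months 12 apart share the same coin.
march_2018 = topic_probability(3, level=0.0, slope=0.0, amplitude=0.5)
march_2019 = topic_probability(15, level=0.0, slope=0.0, amplitude=0.5)
print(round(march_2018, 6) == round(march_2019, 6))  # True

# Adding a positive trend makes the later March more likely to pick the topic.
trend_2018 = topic_probability(3, level=0.0, slope=0.02, amplitude=0.5)
trend_2019 = topic_probability(15, level=0.0, slope=0.02, amplitude=0.5)
print(trend_2019 > trend_2018)  # True
```

A higher-order polynomial in `month_index` could replace the linear term to allow faster rises and falls.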

Topics in the listings data

The topics that we present below were learned from the data without any specification of overall theme or any other user guidance. The topics reflect clusters of words that appear together with high probability. The names are given after the analysis by observing the theme of some top keywords.

The topics are ordered in terms of their overall prevalence in the listings data. Topic 1 is the most prevalent topic in the collection. Topic 10 is the least prevalent. The visualization below presents the average contribution of the 10 topics used in the analysis to the collection of listings as a whole. The figure presents the time-varying patterns that each topic exhibits (objective two of the analysis) from Q1 2008 to Q4 2016. Detailed comments about the dynamic behavior of each topic are made when the topics themselves (objective one) are introduced. The breakdown of each individual listing by topic (objective three) is available but not the focus of the results below.

[Interactive chart: average contribution of each of the 10 topics to the collection of listings, Q1 2008 to Q4 2016]

Topic 1: Common elements of a home

The first topic represents the common elements of a home. With keywords like room, bedroom, bath, floors, dining, and kitchen, the topic identifies words that are used most frequently to describe the inside of a house. The keyword probabilities slightly fluctuate through time, but the overall theme remains clear. Unsurprisingly, this topic is the most prevalent in the collection.

On average, 30% of each listing is consistently characterized by the common elements of a home.

Topic 2: Location

Topic 2 is defined by keywords like home, parks, located, close, access, downtown, Seattle, shops, restaurants, school, walk, rail, and neighborhood. A home's location is the clear theme of this topic. Note that in the first quarter of 2008, the most likely word in the topic is home, but by Q4 2016 parks emerges as the most likely keyword. The overall probability of home is cut in half, and other location-themed terms have increased in probability.

When the proportional contribution of topic 2 to the corpus as a whole is examined, we find that the prevalence of topic 2 peaks in September of 2008 (21.7%). It bottoms out in May 2010 (13.8%), and then steadily climbs through 2016 to recover the September 2008 level.

Topic 3: Modern design elements and outdoor spaces

Topic 3 is characterized by keywords home, design, natural, light, open, spaces, modern, private, outdoor, garden, deck, patio, and landscaping. While the probabilities of these keywords are necessarily small because of the large number of words in the vocabulary, the probabilities of many terms undergo significant changes from Q1 2008 to Q4 2016: the term spaces increases in probability from 0.0214 to 0.0496, a 130% increase over 9 years; the term open increases in probability by 31%, and the probability of outdoor increases by 83%.
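
The percentage changes quoted for keyword probabilities are relative changes. As a quick check using the two rounded probabilities given for spaces (the helper name `percent_change` is ours):

```python
def percent_change(start, end):
    """Relative change from start to end, expressed as a percentage of start."""
    return 100 * (end - start) / start

# Probability of the term "spaces" in Q1 2008 and Q4 2016, as quoted above.
change = percent_change(0.0214, 0.0496)
print(round(change))  # 132
```

The rounded inputs give about 132%, in line with the roughly 130% increase reported.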

In addition to the evolution of the term probabilities within the topic, the average proportion of listings that are characterized by topic 3 may change with time. In topic 3, it is natural to expect seasonal fluctuations given the outdoor emphasis of the topic. We find that topic 3 is typically more prevalent in Q2 and Q3 and is markedly lower in Q1 and Q4.

Topic 4: Remodeled and renovated homes

In topic 4, keywords include new, updated, windows, roof, painted, plumbing, remodeled, kitchen, refinished, and hardwoods. These words all describe homes that are renovated. The term newer increases in probability from 0.013 in Q1 2008 to 0.062 in Q4 2016, a 376% increase.

On average, 10% of each listing is characterized by home improvements and updates.

Topic 5: Luxury items

Luxury finishes are captured in topic 5 with keywords like master, bath, suite, kitchen, granite, counter, stainless, appliances, tub, shower, gas, range, walk, closet, and custom. The focus on luxury kitchens significantly increases from 2008 to 2016 with two separate jumps. The first jump occurs from Q1 2008 to Q1 2010, with the probability of the term kitchen increasing by 60% over those two years. A second jump occurs from Q1 2014 to Q4 2016, with kitchen increasing in probability by an additional 30%.

Luxury finishes accounted for 13% of each listing on average in Q1 of 2008. By Q3 2008, only 9% of each listing was characterized by luxury finishes. This rapid 30% decrease in the prevalence of luxury finishes coincides with the subprime mortgage crisis of 2007 - 2009.

Topic 6: Charm

The emphasis of topic 6 is on features of a home that are often identified as charming. Keywords include charm, craftsman, bungalow, updated, porch, tub, nook, original, ballard, queen, anne, and vintage. Charm and craftsman account for more than 16% of the probability in this topic. Also note that the popular Seattle neighborhood of Ballard increases in probability by 50% over 9 years.

On average, 5% of each listing is characterized by what we are calling the charming features of a home.

Topic 7: Scenery and views

The view of a home is clearly the theme of topic 7, with 20% of the topic allocated to the single keyword view. Other keywords include deck, lake, washington, puget, sound, mountains, olympic, cascade, sunsets, mt, rainier, city, and beach. The term Puget, referring to the body of water Puget Sound, increases in probability by 78% over the 2008 - 2016 time period.

The proportion of each listing characterized by the view is a nearly constant 5% over that same time period.

Topic 8: Classic homes

We are calling topic 8 the classic homes topic, with keywords french, doors, details, windows, glass, tiled, tudor, crown, molding, ceilings, backyard, original, built, oak, and leaded. Though more difficult to label than some other topics, these keywords evoke images of homes from a different era.

The prevalence of topic 8 initially climbs and reaches its peak of 8.1% in Q4 of 2010. It subsequently falls by half over the next 6 years and finishes Q4 2016 at 3.9% of each listing (on average).

Topic 9: University of Washington

The evolution of topic 9 is the most dramatic of all the topics. In 2008, the dominant keyword is newer, but beginning in 2010, a new set of keywords emerges and newer significantly drops in prominence. By the end of 2016, the topic is characterized by keywords UW, university, village, ravenna, trees, plantings, burke, gilman, trail, children, and hospital. This topic clearly identifies the University of Washington (UW) area of Seattle. Shopping plaza University Village, the walking and biking Burke Gilman Trail, and residential neighborhood Ravenna give topic 9 a clear geographic theme. The University of Washington Hospital and Seattle Children's Hospital are nearby as well.

On average, 1% of each listing is allocated to keywords affiliated with topic 9.

Topic 10: Spurious topic

One of the biggest challenges in topic modeling is knowing how many distinct topics the collection of documents contains. In this analysis, we prespecified that there were 10 topics in the listings data, but after computing the individual topics and probabilities, we learn that the tenth topic is not needed. If we re-run the computational method multiple times, the first nine topics are identically learned in each of the runs, but topic 10 is always different. The inability to replicate the tenth topic tells us that whatever theme it may have is not reliable. We select the number of topics required to characterize a collection of documents by using numerical summaries that tell us how well the data is described with different numbers of topics (e.g., 5, 10, 15, 20, 25 topics) and examining whether topics are identically learned when we perform the analysis again. We found that using 9 topics is the best and most reliable way to characterize this collection.
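
A replication check of this kind can be sketched as follows. How "identically learned" was measured is not stated here, so the cosine similarity, the 0.9 threshold, and the toy three-word topics are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two word-probability vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def stable_topics(run_a, run_b, threshold=0.9):
    """Return indices of topics in run_a that have a close match in run_b."""
    stable = []
    for i, topic in enumerate(run_a):
        # Topic order can differ between runs, so compare against every topic.
        best = max(cosine(topic, other) for other in run_b)
        if best >= threshold:
            stable.append(i)
    return stable

# Toy topics over a three-word vocabulary from two hypothetical runs.
run_a = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
run_b = [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.4, 0.3, 0.3]]

print(stable_topics(run_a, run_b))  # [0, 1]
```

In this toy example the first two topics reappear in the second run (in a different order) while the third does not, mirroring the behavior described for topic 10.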
