Data in Real Estate (Part 1): Creating Accessibility


Like every other industry, real estate has been affected by the nearly limitless amount of information available through the Internet. Consequently, it is now more important than ever that the information clients receive is accurate and reliable. Understanding the nature of this data, and the way in which real estate Web sites interpret and coordinate it, builds an awareness of how complex such data management is and of how that complexity is being simplified for the clients served.

But what is data, exactly? Raw data, data normalization, aggregate data — the word is thrown around with such frequency that it may prove difficult to come up with a concrete and coherent definition of such a broad term, even when applied to the real estate field.

Before a company can provide the data its clients want (quality data), it is vital that it has at least a basic understanding of what data, and the terminology associated with it, mean in real estate.

I. Data and its Processes

The use of the broad term “data” can be broken down into several specialized areas.

Raw data is the most basic form of data. It is sometimes called primary data and provides researchers with direct information about the concept under investigation. Raw data may include original research or new information that has never been publicized before.

  • Such data often contains errors (such as redundancies or repetition), is uncoded, is unformatted or spread across several different formats, or needs to be put in some logical order. For example, the date formats 31st January 1999, 31/01/1999, 31/1/99, 31 Jan, etc. would all need to be processed and stored in a single format.

  • Data normalization is a method of reducing the duplication of information. Normalization prevents multiple occurrences of a given piece of information within a database or table, which would create inconsistencies when updating data within the table.
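The date example and the duplication problem described above can be sketched together. This is a minimal illustration, not a production cleaner: the function names, the day-first assumption, and the sample addresses are all hypothetical, and an input with no year (such as "31 Jan") is flagged rather than guessed.

```python
import re
from datetime import datetime

# Formats taken from the examples in the text; all are assumed to be day-first.
_FORMATS = ("%d %B %Y", "%d/%m/%Y", "%d/%m/%y")


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if it cannot be parsed safely."""
    # Drop ordinal suffixes so "31st January 1999" becomes "31 January 1999".
    cleaned = re.sub(r"(\d{1,2})(st|nd|rd|th)\b", r"\1", raw.strip())
    for fmt in _FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    # Inputs such as "31 Jan" carry no year, so they are flagged, not guessed.
    return None


def deduplicate(records):
    """Keep the first occurrence of each (address, normalized date) pair."""
    seen = set()
    unique = []
    for address, raw_date in records:
        key = (address.strip().lower(), normalize_date(raw_date))
        if key not in seen:
            seen.add(key)
            unique.append((address, key[1]))
    return unique


# Hypothetical raw records: the first two describe the same listing event.
records = [
    ("12 Main St", "31/01/1999"),
    ("12 main st ", "31st January 1999"),
    ("9 Oak Ave", "31/1/99"),
]
cleaned_records = deduplicate(records)
```

Once every date is stored in one format, the first two records become identical and the repeat can be dropped, which is the kind of inconsistency normalization is meant to prevent.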

Secondary data is gathered and processed by a third party and used in order to further the specific research being undertaken. This data is produced by analyzing, explaining, and combining the information from the primary source with additional outside information (establishing correlations, drawing conclusions, etc.).

  • Data analysis, then, is the process of examining and summarizing data in order to identify useful information, placing emphasis on making inferences about the relationships found there.

  • U.S. Census Bureau projections, Bureau of Labor Statistics employment data, IRS statistics, and FBI crime information are among the most popular sources from which real estate companies obtain this type of secondary data. For example, the following figures for New York are taken from the Census Bureau's 2006 American Community Survey (ACS) estimates of median family income in the past 12 months, broken down by state and by number of earners.

New York                        Estimate    Margin of Error

Total                             62,138            +/-364
No earners (dollars)              25,014            +/-594
1 earner (dollars)                43,352            +/-771
2 earners (dollars)               80,015            +/-612
3 or more earners (dollars)      100,049          +/-1,295

This data was gathered and organized by the ACS and Census Bureau, and was ultimately compiled in such a way that it serves a purpose. Rather than presenting a random list of income levels in no discernible order, the processing done here enables a real estate company to present the data in a format that makes sense to a reader (by state and by number of earners).
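As a rough illustration of how a data provider might hold these figures and read the margin of error, the sketch below stores the New York values from the table above and turns each estimate into a low-to-high range. The dictionary layout and function name are assumptions made for the example; the ACS publishes its margins of error at the 90 percent confidence level.

```python
# Figures copied from the New York table above (2006 ACS, in dollars).
# Each entry is (estimate, margin of error).
median_family_income_ny = {
    "Total": (62138, 364),
    "No earners": (25014, 594),
    "1 earner": (43352, 771),
    "2 earners": (80015, 612),
    "3 or more earners": (100049, 1295),
}


def estimate_range(estimate, margin_of_error):
    """Return the (low, high) bounds implied by an estimate and its margin."""
    return estimate - margin_of_error, estimate + margin_of_error


ranges = {
    category: estimate_range(estimate, margin)
    for category, (estimate, margin) in median_family_income_ny.items()
}

for category, (low, high) in ranges.items():
    print(f"{category}: {low:,} to {high:,} dollars")
```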

  • Aggregate data describes high-level data combined from a multitude or combination of other individual data sources. For instance, the data of an entire sector of an economy is aggregated by merging that of all households in a city or region.
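A minimal sketch of that merging step, assuming hypothetical household records tagged with a region name, might look like the following. The regions, incomes, and summary fields are invented purely for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical household-level records: (region, household income in dollars).
households = [
    ("Albany", 52000), ("Albany", 61000), ("Albany", 58000),
    ("Buffalo", 43000), ("Buffalo", 47000),
]


def aggregate_by_region(rows):
    """Combine individual household rows into one summary per region."""
    grouped = defaultdict(list)
    for region, income in rows:
        grouped[region].append(income)
    return {
        region: {
            "households": len(incomes),
            "median_income": median(incomes),
            "total_income": sum(incomes),
        }
        for region, incomes in grouped.items()
    }


summary = aggregate_by_region(households)
```

The individual rows disappear from the result; only the region-level figures remain, which is what distinguishes aggregate data from the records it was built from.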

It therefore makes sense that error detection and correction are important in maintaining data integrity. Error detection enables the discovery of errors caused by “noise” or other impediments that arise during communication between sender and recipient. Once detected, error correction allows reconstruction of the original, error-free data.
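One of the simplest ways to see detection and correction working together is a repetition code, sketched below. It is chosen here only because it is easy to follow; real systems generally use more efficient schemes, and the function names are hypothetical. Each bit is sent three times, a disagreement within a group signals an error, and a majority vote restores the original bit as long as no more than one copy in the group was damaged.

```python
def encode(bits):
    """Repeat every bit three times before sending."""
    return [bit for bit in bits for _ in range(3)]


def decode(received):
    """Return (corrected bits, positions of bits where an error was detected)."""
    corrected, flagged = [], []
    for start in range(0, len(received), 3):
        group = received[start:start + 3]
        if len(set(group)) > 1:
            # The three copies disagree: an error has been detected.
            flagged.append(start // 3)
        # Majority vote reconstructs the bit if at most one copy was flipped.
        corrected.append(1 if sum(group) >= 2 else 0)
    return corrected, flagged


message = [1, 0, 1, 1]
sent = encode(message)

# Simulate "noise" by flipping one copy of the second bit in transit.
noisy = sent.copy()
noisy[4] ^= 1

recovered, error_positions = decode(noisy)
```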

II. Data Usage in Real Estate – Geographic Concepts

Web-based real estate data aggregators frequently utilize geographic concepts and terminology in their products and services. Such terms are vital in order to create a standardized “language” that can be used in dealing with real estate data. Building on an awareness of general data by becoming familiar with these more specific concepts will aid in overall navigation and comprehensibility of real estate Web sites.

GIS (Geographic Information System) refers to a system designed to integrate, store, edit, analyze, share, and display geographic information. By utilizing GIS programs, users can create interactive queries, analyze spatial information, edit data and maps, and display the results of these processes.

Although GIS is most commonly associated with maps, it has other functions as well. GIS can be used to create a geographic database, or geodatabase, which allows users to store, query, and manipulate the geographic information and spatial data being examined. Map views allow for depiction and editing of geographic features, and model views analyze information from different sets of data to create an integrated presentation.
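To give a sense of what a spatial query involves, the sketch below tests which listings fall inside a neighborhood boundary using the standard ray-casting method. It is a simplified stand-in for what a GIS or geodatabase would do internally: the coordinates are hypothetical planar values rather than real latitude and longitude, and points lying exactly on the boundary are not handled specially.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is the point (x, y) inside the polygon?"""
    inside = False
    count = len(polygon)
    for i in range(count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % count]
        # Only edges that straddle the point's height can be crossed
        # by a horizontal ray drawn from the point to the right.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical neighborhood boundary and listing locations.
neighborhood = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
listings = {"A": (1.0, 1.0), "B": (5.0, 1.0), "C": (3.5, 2.5)}

matches = [
    name for name, (x, y) in listings.items()
    if point_in_polygon(x, y, neighborhood)
]
```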

In using GIS tools to talk about the various locations dealt with in real estate, data will often be sorted and analyzed according to specific categories. While some of these — city, county, and neighborhood — are common in everyday speech, others — minor civil divisions, census tract — may be less familiar to a reader.

A city is a type of incorporated place with defined boundaries, found in all states and the District of Columbia.

A county is the primary legal division of every state except Alaska and Louisiana. A number of geographic entities are not legally designated as a county, but are recognized by the U.S. Census Bureau as equivalent to a county for data presentation purposes (the boroughs, city and boroughs, municipality, and census areas in Alaska; parishes in Louisiana; and cities that are independent of any county in Maryland, Missouri, Nevada, and Virginia). Because it contains no primary legal divisions, the Census Bureau treats the District of Columbia as equivalent to a county (and a state) for data purposes.

A place is a concentration of population that must have a name, be locally recognized, and not be part of any other place. Places typically have a residential nucleus, a closely spaced street pattern and frequently have commercial or other urban types of land use.

Census designated places (CDPs) are delineated for each decennial census as the statistical counterparts of incorporated places such as cities, towns and villages. They lack separate municipal governments, but otherwise physically resemble incorporated places. CDPs are delineated to provide data for settled concentrations of population that are identifiable by name but are not legally incorporated under the laws of the state in which they are located.

The boundaries of a CDP have no legal status and may change from one census to the next to reflect changes in settlement patterns. Further, as statistical entities, the boundaries of the CDP may not correspond with local understanding of the area with the same name.

For example, the CDP designation may apply to large military bases (or parts of a military base) that are not within the boundaries of any existing community, such as Fort Campbell North and Fort Knox in Kentucky.

  • An incorporated place is a type of governmental unit incorporated under state law as a city, town, borough, or village (exceptions in New England states, New York, Alaska, and Wisconsin) and having legally prescribed limits, powers, and functions. Requirements for incorporation vary widely among the states. Some states have few specific criteria, while others have established population thresholds and occasionally other conditions (minimum land area, population density, and distance from other existing incorporated places) that must be met for incorporation.

County subdivisions are the geographic units into which counties (or statistically equivalent entities) are divided. The two major types are minor civil divisions (MCDs) and census county divisions (CCDs). These subdivisions sometimes overlap with, or are exactly the same as, the area indicated by a CDP.

  • Minor civil divisions (MCDs) are the primary subcounty governmental or administrative units. They have legal boundaries and names as well as governmental functions or administrative purposes specified by state law. The most familiar types of MCDs are towns and townships.
  • Census county divisions (CCDs) are the statistical entities established in the 21 states where MCDs either do not exist or are unsatisfactory for the collection, presentation, and analysis of census statistics. They are designed to represent community areas focused on trading centers or major land use areas. They have visible, permanent, and easily described boundaries.

A neighborhood refers to the perceived boundaries used to distinguish an area with no real jurisdictional boundaries (as opposed to state and city lines, for instance).

A census tract is a small, relatively permanent statistical subdivision of a county or statistically equivalent unit, created for data presentation purposes by local census data users or the geographic staff of a regional census center in accordance with Census Bureau guidelines.

These tracts, which can include anywhere between 1,000 and 8,000 people, are designed to be comparatively uniform in terms of population characteristics, economic status, and living conditions at the time of their establishment.

Their boundaries tend to be determined by permanent features, since they are drawn with the intention of maintaining stability over many decades. The boundary of a state or county is always a census tract boundary.
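Because tracts never cross state or county lines, the Census Bureau's 11-digit tract identifier (GEOID) can carry that nesting directly: two digits for the state, three for the county, and six for the tract. The sketch below splits such an identifier into its parts. The function name and the sample identifier are used only for illustration.

```python
def split_tract_geoid(geoid):
    """Split an 11-digit census tract GEOID into state, county and tract codes."""
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError("a census tract GEOID is expected to be 11 digits")
    return {
        "state_fips": geoid[:2],    # state the tract belongs to
        "county_fips": geoid[2:5],  # county within that state
        "tract_code": geoid[5:],    # tract within that county
    }


# Illustrative identifier; 36 is the state code for New York.
parts = split_tract_geoid("36061000100")
```

Keeping the identifier as a string rather than a number preserves leading zeros, which are significant in these codes.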

A ZIP code area, which comprises all of the addresses that fall under a certain 5-digit ZIP code, is created by the U.S. Postal Service to expedite the delivery of mail. Most ZIP codes do not have specific boundaries, and their implied boundaries do not necessarily follow clearly identifiable map features. One ZIP code area may intertwine with those of one or more other ZIP codes, making these areas more conceptual than geographic.

A school district is a geographic area within which state, county, or local officials or the U.S. Department of Defense provides public educational services for residents. The U.S. Census Bureau provides data for three types of school districts: elementary, secondary, and unified.

This overview of basic data terms and the way in which they are applied can act as a glossary of sorts when engaging in dialogue about Web-based real estate. Having an understanding of this “language” equips a client seeking a data provider with the knowledge of what exactly is supplied by these companies. Still, in order to determine which of these providers will be the best, a client needs to be able to discern what makes that provider’s data better than the rest. Such necessity leads to yet another even more important question: What is “quality” data?