Data in Real Estate (Part 2): Creating Quality

Having established a foundational knowledge of data and its application to the geographic sphere of real estate, we can now turn to an even more important question: determining what sort of data will be most valuable for a company’s business ventures. In the most basic sense, data quality refers to how faithfully the data portrays the geographic “phenomena” being examined; in other words, it is a measure of the data’s fitness for use.

How can we say what makes “good” data? When you talk about a good book or a good movie, isn’t your judgment dependent on certain subjective qualities — interests, mood — that are individual to you? To a degree, yes, but there are also certain aspects that must be present without fail in order for a book or movie to be considered “quality.” A book must be free of unintentional spelling and grammatical errors, for instance, and a movie needs to have clearly identified characters and some form of plot.

The overall quality of data can be thought of in the same way. While the specifics of what makes good data will vary according to the type of data you’re seeking — real estate data as opposed to sports statistical data, for instance — there are non-negotiable elements that apply to data as a whole.

III. Taking a Closer Look at “Good” Data

Data quality assurance involves verifying the reliability and effectiveness of data. This requires going through the data periodically, typically in order to update, standardize, and remove duplicate records to create a single view of the data, including data stored in multiple systems. There are a lot of words people use to talk about data quality, but they can be broken down into several basic categories.

Timeliness

The frequency with which time-sensitive data is updated is a large contributor to data quality. Data should go through regular, visible updates, which include not only the addition of new data but the circulation of revised data.

The data collected should be reviewed periodically to make sure it is still relevant and fulfills current needs. Based on this examination, the content of data collected could be changed, necessitating collection of new types of data and the discontinuation of others. Things to consider include: Over what period of time was the data collected? When was the data last updated to reflect changes? How long is the data likely to remain current? (GBIF)

Timely data updates are essential to real estate websites. Data such as neighborhood demographics might be updated on a monthly basis, with more frequent updates for home sales, neighborhood resources, and the like. Such a schedule makes possible the highest level of accuracy for all information that is provided to consumers.

But is there a way to be sure that the data you are getting is up-to-date? In home value estimation, for instance, can recent renovations to homes or properties be taken into account? What if the information you gather from various sources contradicts itself? Is there a way to discover which source is most reliable and which is outdated? More in-depth research can be undertaken to corroborate one source or the other (see consistency below). If additional research for consistency proves unsuccessful in clearing up the conflict, trusting the source that has a history of better reliability may be effective.
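
To make that concrete, here is a minimal sketch of one way to reconcile conflicting values: prefer the most recently updated record and break ties with a reliability ranking. The source names, dates, and scores below are hypothetical.

```python
from datetime import date

# Hypothetical home-value estimates for the same property from three sources.
estimates = [
    {"source": "county_records", "value": 412000, "updated": date(2009, 11, 1)},
    {"source": "mls_feed",       "value": 425000, "updated": date(2010, 2, 15)},
    {"source": "third_party",    "value": 398000, "updated": date(2010, 2, 15)},
]

# Assumed reliability ranking (higher is more trusted); in practice this would
# come from a history of how often each source has proven correct.
reliability = {"mls_feed": 3, "county_records": 2, "third_party": 1}

def pick_estimate(records):
    """Prefer the freshest record; break ties with the reliability ranking."""
    return max(records, key=lambda r: (r["updated"], reliability.get(r["source"], 0)))

best = pick_estimate(estimates)
print(f"Using {best['value']} from {best['source']} (updated {best['updated']})")
```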

Accuracy & Completeness

Making collected data as accurate and complete as possible often involves locating missing values in a data set and ensuring the information provided makes sense. The data should also be relevant to the topic of research in order that it may best fulfill a user’s needs.

Data cleansing, also known as data scrubbing, involves correcting and sometimes removing corrupt or inaccurate records from a data set. The process is used to identify incomplete, incorrect, inaccurate, or irrelevant parts of the data, which can then be replaced, modified, or deleted to ensure maximum accuracy. Specifically, data cleansing might include fixing typos or validating and correcting data in question against a known list. The validation may be strict — rejecting any address that does not have a valid postal code — or approximate — correcting records that partially match existing, known records. (Wikipedia)
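
As a rough illustration of those two modes of validation, the sketch below rejects records outright when the ZIP code is not on a known list (strict) and snaps near-miss city names to the closest known value (approximate). The reference lists are invented for the example.

```python
import difflib

# Hypothetical reference data: ZIP codes we accept and the city names we know.
VALID_ZIPS = {"98101", "98052", "60614"}
KNOWN_CITIES = ["Seattle", "Redmond", "Chicago"]

def cleanse_record(record):
    """Strictly reject bad ZIP codes; approximately correct misspelled cities."""
    # Strict validation: any record whose ZIP is not on the known list is rejected.
    if record["zip"] not in VALID_ZIPS:
        return None

    # Approximate validation: snap a near-miss city name to the closest known value.
    matches = difflib.get_close_matches(record["city"], KNOWN_CITIES, n=1, cutoff=0.8)
    if matches:
        record["city"] = matches[0]
    return record

records = [
    {"city": "Seatle", "zip": "98101"},   # misspelled city, valid ZIP -> corrected
    {"city": "Chicago", "zip": "00000"},  # invalid ZIP -> rejected
]
cleaned = [r for r in (cleanse_record(dict(r)) for r in records) if r is not None]
print(cleaned)
```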

The simplest kind of data validation verifies that the data provided comes from a legitimate set. For example, telephone numbers should include digits and possibly characters such as plus, minus, and parentheses. Higher-level validations might confirm an acceptable country code (checking that the number of digits entered matches the convention for the country or area specified). This kind of data validation ensures that data is reasonable and secure before it is processed.
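
Here is a minimal sketch of that idea, using a simple pattern check plus a digit count. The expected digit counts per country code are illustrative only, not a complete rule set.

```python
import re

# Structural check: digits plus '+', '-', parentheses, and spaces.
PHONE_CHARS = re.compile(r"^[0-9+\-() ]+$")

# Illustrative only: expected number of national digits for a few country codes.
EXPECTED_DIGITS = {"1": 10, "44": 10, "33": 9}

def validate_phone(raw, country_code="1"):
    """Return True if the string looks like a phone number for the given country."""
    if not PHONE_CHARS.match(raw):
        return False                      # contains characters we never allow
    digits = re.sub(r"\D", "", raw)       # keep only the digits
    if digits.startswith(country_code):   # strip a leading country code if present
        digits = digits[len(country_code):]
    expected = EXPECTED_DIGITS.get(country_code)
    return expected is None or len(digits) == expected

print(validate_phone("+1 (206) 555-0147"))   # True
print(validate_phone("555-CALL-NOW"))        # False: letters are not allowed
```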

But what if the information in question — data about schools, homes, businesses, etc. — is very outdated or not available at all? Not every state or county discloses its home sales data, and some rural areas do not have statistics readily available. Rather than letting incorrect information remain, companies should minimize such gaps as much as possible.

For instance, in such cases of item non-response, the U.S. Census Bureau has methods — known as “assignment” and “allocation” — of determining acceptable answers. These can be assembled from similar housing units or the people who provided the original information. Assignment uses logic to ascertain where a response to one question implies the missing response to another question (first name can often be used to assign a value to sex). Allocation uses statistical procedures, such as within-household matrices, to generate missing values. (Census)
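
The sketch below illustrates the two ideas in miniature (it is not the Census Bureau’s actual procedure): assignment infers a missing value from a logically related field, while allocation borrows the most common value among similar records. The name-to-sex lookup and the household records are hypothetical.

```python
from collections import Counter

# Hypothetical lookup used for "assignment": first name implies a likely sex.
NAME_TO_SEX = {"John": "M", "Mary": "F", "Susan": "F"}

households = [
    {"first_name": "John", "sex": None, "bedrooms": 3},   # missing sex
    {"first_name": "Ava",  "sex": None, "bedrooms": 2},   # name not in lookup
    {"first_name": "Mary", "sex": "F",  "bedrooms": 2},
    {"first_name": "Lee",  "sex": "F",  "bedrooms": 2},
]

def impute(records):
    for rec in records:
        if rec["sex"] is not None:
            continue
        # Assignment: another answer (first name) logically implies the missing one.
        assigned = NAME_TO_SEX.get(rec["first_name"])
        if assigned:
            rec["sex"] = assigned
            continue
        # Allocation: borrow the most common answer among similar records
        # (here, "similar" means households with the same number of bedrooms).
        similar = [r["sex"] for r in records
                   if r["sex"] is not None and r["bedrooms"] == rec["bedrooms"]]
        if similar:
            rec["sex"] = Counter(similar).most_common(1)[0][0]
    return records

print(impute(households))
```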

Consistency

Data that falls under this designation will be consistent over time, or have variations that are reasonable for its subject matter. Such data should also be consistent across different sources in order to strengthen its reliability.

The view of the data should be clear, unambiguous and consistent; that is, the same data is always found in the same fields designated by pre-established categories, and is therefore easy to find. Data types and subsets should also have the same basic structure and format.

Problems can arise from both incorrect and inconsistent data. Some types of data validation work to ensure consistency. These processes can involve checking entry fields to ensure data in these fields corresponds (if the entered title is “Mr.,” then the entered gender should be “M”).
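
A cross-field check of that kind can be expressed as a small rule table. The sketch below uses the title-implies-gender example from above, with illustrative values only.

```python
# Illustrative cross-field rule: the title implies an expected gender code.
TITLE_TO_GENDER = {"Mr.": "M", "Mrs.": "F", "Ms.": "F"}

def check_consistency(record):
    """Return a list of consistency problems found in a single record."""
    problems = []
    expected = TITLE_TO_GENDER.get(record.get("title"))
    if expected and record.get("gender") != expected:
        problems.append(
            f"title {record['title']!r} implies gender {expected!r}, "
            f"but record says {record.get('gender')!r}"
        )
    return problems

print(check_consistency({"title": "Mr.", "gender": "F"}))  # one problem reported
print(check_consistency({"title": "Ms.", "gender": "F"}))  # no problems
```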

To provide consistency, data in different systems must also be compared. Though the same record may be present in both systems, the data may be represented differently (one system may store a customer name in a single Name field as “Doe, John Q,” while another may store it in separate fields: First_Name (“John”), Last_Name (“Doe”), and Middle_Name (“Quality”)). These representations may therefore need to be converted to a common format before they can be compared and checked for consistency. (Wikipedia)
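
One way to do that conversion, sketched below under the assumption that the two field layouts match the example above, is to map both representations onto a common (first, middle, last) structure before comparing.

```python
def normalize_name(record):
    """Convert either representation to one (first, middle, last) tuple."""
    if "Name" in record:
        # System A stores "Last, First Middle" in a single field.
        last, _, rest = record["Name"].partition(",")
        parts = rest.split()
        first = parts[0] if parts else ""
        middle = " ".join(parts[1:])
        return (first.strip(), middle.strip(), last.strip())
    # System B already stores the pieces separately.
    return (record.get("First_Name", ""), record.get("Middle_Name", ""),
            record.get("Last_Name", ""))

a = {"Name": "Doe, John Q"}
b = {"First_Name": "John", "Last_Name": "Doe", "Middle_Name": "Quality"}
print(normalize_name(a))   # ('John', 'Q', 'Doe')
print(normalize_name(b))   # ('John', 'Quality', 'Doe')
print(normalize_name(a)[0] == normalize_name(b)[0])  # first names agree
```

A fuller comparison would also reconcile initials against full middle names, but the common structure is what makes any such check possible.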

The use of a data dictionary can be helpful in providing consistency. A data dictionary is a standardized list of data entries in the form in which the creator wants to present them. It is used alongside a translation table, which holds the data records from all sources in their raw forms, to match up entries and identify exceptions: data from one source that does not match the dictionary. These exceptions can then be corrected by manually assigning them to a pre-established dictionary entry or adding them as new pieces of data, thereby avoiding duplications and other errors.
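
Here is a minimal sketch of that matching process, assuming an invented dictionary of property types and a small translation table.

```python
# Hypothetical data dictionary: the standardized form of each property type.
DICTIONARY = {"single family", "condominium", "townhouse"}

# Translation table: raw values from each source mapped to dictionary entries.
TRANSLATIONS = {"sfr": "single family", "single-family": "single family",
                "condo": "condominium"}

def standardize(raw_values):
    """Map raw entries onto the dictionary; collect exceptions for review."""
    standardized, exceptions = [], []
    for value in raw_values:
        key = value.strip().lower()
        if key in DICTIONARY:
            standardized.append(key)
        elif key in TRANSLATIONS:
            standardized.append(TRANSLATIONS[key])
        else:
            exceptions.append(value)   # flagged for manual assignment or addition
    return standardized, exceptions

clean, needs_review = standardize(["SFR", "Condo", "Townhouse", "Co-op"])
print(clean)          # ['single family', 'condominium', 'townhouse']
print(needs_review)   # ['Co-op']
```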

Reputability of Source

The source supplying the data should also be taken into consideration when data is analyzed for quality. A reliable source will often lead to data that meets requirements for timeliness, accuracy and correctness, and consistency.

The way in which the data was collected and its publication date should be publicly stated. Data collection should be impartial and objective, and any internal access or revisions prior to the data’s release should be indicated. If necessary, original and supporting sources of data should be documented. Such documentation also generally includes specification of the variables used, definitions of variables where appropriate, coverage or population issues, sampling errors, disclosure avoidance rules or techniques, confidentiality constraints, and data collection techniques.

Data that is acquired from a reputable source can also be described as having integrity. This means that the information is protected from unauthorized access and revision, and is therefore not susceptible to corruption or falsification. The surveys conducted by sources considered reputable use methodologies consistent with generally accepted professional standards. These include statistical design of the survey sample, questionnaire design and testing, data collection, sampling and coverage errors, non-response analysis, imputation of missing data, and weighting and variance estimation.

In some cases, staff at the source will review all data to ensure its accuracy and validity before results are published. (NSF)

Transparency

This designation refers to the inclusion of any identified and reported errors, documentation of validation and quality control procedures, and opportunity to provide feedback. Depending on confidentiality agreements, publication of the methods used to collect and analyze data may also be part of this transparency. Ultimately, transparency’s importance comes in strengthening the reliability of the data in the eyes of those using it. (GBIF)

Organization

Before being presented to users, data must undergo normalization to ensure it is streamlined and contains no redundancies. In addition, preventative measures must be put in place to identify and eliminate errors that may arise in the future.
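
As a loose illustration of that normalization step (the field names and figures are hypothetical), repeated neighborhood details can be pulled out of individual listing records so that each fact is stored exactly once.

```python
# Denormalized listings: the neighborhood's median income is repeated on every row.
listings = [
    {"id": 1, "neighborhood": "Maple Hill", "median_income": 72000, "price": 315000},
    {"id": 2, "neighborhood": "Maple Hill", "median_income": 72000, "price": 289000},
    {"id": 3, "neighborhood": "Riverside",  "median_income": 58000, "price": 240000},
]

def normalize(rows):
    """Split listings into a listing table and a neighborhood table."""
    neighborhoods, slim_listings = {}, []
    for row in rows:
        neighborhoods[row["neighborhood"]] = {"median_income": row["median_income"]}
        slim_listings.append({"id": row["id"], "neighborhood": row["neighborhood"],
                              "price": row["price"]})
    return slim_listings, neighborhoods

listings_table, neighborhood_table = normalize(listings)
print(neighborhood_table)   # each neighborhood's demographics stored exactly once
```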

The data must then be packaged and presented in a way that makes sense and is relevant to the consumer. By creating charts and graphs that are well-organized and easily readable, companies can ensure that their information will be comprehensible to potential clients. The same care applies to grammar, logical progression, and overall organization: all data should be edited and proofread before it is publicized in order to maximize clarity and coherence. Any charts or tables used to support or integrate the data must do so clearly, and their purpose should be evident from titles and other distinctive labels (units of measurement, etc.). Scales and other statistical techniques should be clearly indicated so as not to mislead readers. (NSF)

Ease of navigating information is increased by the incorporation of links to more detailed lists of data and visual aids, such as maps, wherever possible. Allowing clients to have interactive capabilities — customizing their view, for instance — enhances the user interface as well.

IV. Data Practices

Quality data, then, would incorporate conscientious use of the above practices. Rather than relying on a single source to determine authenticity, data should be verified by several reputable sources. The information taken from such sources also needs to be timely, accurate, consistent, and well-organized in order to be considered relevant and useful.

Real estate companies rely on the collection of data to create accurate records of the geographic areas in which their customers are interested. Using demographic data, companies can construct profiles of different residential communities detailing features such as the educational attainment and income of those currently living in a given area. School reports can be compiled from data gathered from individual schools and official standardized test scores.

QA, or quality assurance, involves examining data records to make sure that all entry fields are populated, to eliminate duplicate entries, to correct misspellings, and to identify outliers (abnormally high or low data entries that may be skewing the remaining data).
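
Here is a compact sketch of those QA checks on an invented set of home-sale records; the two-standard-deviation outlier rule is just one common heuristic, not the only option.

```python
from statistics import mean, stdev

sales = [
    {"address": "12 Elm St",      "price": 250000},
    {"address": "12 Elm St",      "price": 250000},    # duplicate entry
    {"address": "34 Oak Ave",     "price": 260000},
    {"address": "56 Pine Rd",     "price": None},      # unpopulated field
    {"address": "78 Birch Ln",    "price": 245000},
    {"address": "90 Cedar Ct",    "price": 255000},
    {"address": "11 Spruce Way",  "price": 248000},
    {"address": "22 Walnut Dr",   "price": 252000},
    {"address": "33 Aspen Pl",    "price": 258000},
    {"address": "44 Hilltop Rd",  "price": 2500000},   # possible outlier
]

# 1. Flag records with unpopulated fields.
missing = [s for s in sales if s["price"] is None]

# 2. Drop exact duplicates while preserving order.
seen, deduped = set(), []
for s in sales:
    key = (s["address"], s["price"])
    if key not in seen:
        seen.add(key)
        deduped.append(s)

# 3. Flag outliers more than two standard deviations from the mean price.
prices = [s["price"] for s in deduped if s["price"] is not None]
avg, sd = mean(prices), stdev(prices)
outliers = [s for s in deduped
            if s["price"] is not None and abs(s["price"] - avg) > 2 * sd]

print(f"{len(missing)} missing, {len(sales) - len(deduped)} duplicate(s) removed, "
      f"outliers: {outliers}")
```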

Real estate data aggregation also makes use of a process known as geocoding that converts geographic data to latitude and longitude points. These points can be matched to an exact address, the street area, or the centroid of a ZIP code area. Geocoding information allows individual locations to be depicted with a mapping application, often in the form of a mashup.
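
The sketch below shows that fallback logic in miniature, with made-up lookup tables standing in for a real geocoding service: match the exact address if possible, otherwise fall back to the centroid of the ZIP code area.

```python
# Hypothetical lookup tables; a real system would call a geocoding service.
ADDRESS_POINTS = {"500 Main St, Springfield, 62701": (39.7990, -89.6440)}
ZIP_CENTROIDS = {"62701": (39.8003, -89.6538)}

def geocode(address, zip_code):
    """Return (latitude, longitude, precision), falling back to the ZIP centroid."""
    if address in ADDRESS_POINTS:
        return (*ADDRESS_POINTS[address], "rooftop")       # exact address match
    if zip_code in ZIP_CENTROIDS:
        return (*ZIP_CENTROIDS[zip_code], "zip_centroid")  # approximate fallback
    return (None, None, "unmatched")

print(geocode("500 Main St, Springfield, 62701", "62701"))
print(geocode("123 Unknown Rd, Springfield, 62701", "62701"))
```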

The mashup application brings together data from multiple sources into a single integrated tool. For instance, by combining cartographic data from Google Maps and location information from real estate data, a user can create a new and distinct service not originally provided by either source. The Chicago Police Department utilizes this type of mashup to display crime statistics over different areas of the city.
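
One lightweight way to feed such a mashup, regardless of the mapping library used, is to convert geocoded listings into GeoJSON, a format most web mapping tools can overlay on a base map. The listings below are invented.

```python
import json

# Geocoded listings (hypothetical data) to be overlaid on a base map.
listings = [
    {"address": "500 Main St", "price": 315000, "lat": 39.7990, "lng": -89.6440},
    {"address": "21 River Rd", "price": 289000, "lat": 39.8011, "lng": -89.6502},
]

# GeoJSON expects [longitude, latitude] coordinate order for Point geometries.
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [l["lng"], l["lat"]]},
            "properties": {"address": l["address"], "price": l["price"]},
        }
        for l in listings
    ],
}

print(json.dumps(feature_collection, indent=2))
```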

Conclusion

Identifying and utilizing data that can be considered “quality” is a process vital to the credibility of information a company will provide. Each aspect of quality data is important and though they may be present in varying degrees — up-to-the-minute data may be sacrificed in favor of completeness, for instance — they all need to be taken into account.

Still, it stands to reason that the way in which that data is organized and presented is just as important as the way in which it is collected. A readable and intuitive user interface makes data more accessible to the consumer and therefore simplifies what can be a complex grouping of information. A collection of timely, accurate, reputable, and consistent data will still be ineffective if not then provided to clients in a format that is polished and easily navigable.

Quality data, then, is organized data. Having all of the information required to inform clients about community businesses and amenities, for instance, is indispensable, but if that information is not structured in a way that makes sense to clients and allows them to engage with it in a meaningful way, the data is useless.

When looking for a data provider, then, it is necessary to take all of this into account. Is the data they will supply updated frequently? Is it accurate and complete? Does it come from reputable sources? Is it consistent over time and across other data sources? But most importantly, does that data provider organize its information in a form that is clear, cohesive, and understandable for those who will interact with it? Data on its own is essential, but its usability is what will ultimately determine its value.

Sources:

http://www.census.gov/quality/P01-0_v1.3_Definition_of_Quality.pdf

http://www.census.gov/acs/www/UseData/sse/ita/ita_def.htm

http://www.census.gov/quality/quality_guidelines.htm

http://www.nsf.gov/policies/nsfinfoqual.pdf

http://www.gbif.org/prog/digit/data_quality/DataQuality

http://en.wikipedia.org/wiki/Data_validation