March 2, 2010
By William Sharp

This is a personal weblog. The opinions and information distributed represent my own and not those of my employer.

In addition, my opinions change from time to time…I consider it a reflection of learning over time.

This weblog is intended to provide a summary of my experiences, and as such, information and opinions expressed within out-of-date posts, blurbs or editorials may have changed at my discretion or convincing from others who are more “in-the-know”.

If I have offended you with what I have written … maybe that’s a good thing?  It proves someone is reading this stuff.  If I have really offended you … really?  It’s a blog.  Chill out.

If I have made you smile, laugh, or even think a little deeper … that’s great.  I am glad I have helped someone before I retire.

Data Profiling & Scorecarding with Informatica Data Quality

In my opinion, profiling and scoring data are fundamental parts of a sound data quality assessment.  I routinely use these processes to build my “current state” report for clients.  I recently used Informatica’s Data Quality developer and analyst tools to put together such a package, and I believe these tools represent the “best in breed” available for the job.  The learning curve is not steep, the functionality is easy to implement and, perhaps most of all, the solution is comprehensive.  In a matter of hours you go from raw data to a management reporting dashboard.

If you’ve used Informatica or another tool, let me hear your thoughts (leave a comment).

Thanks for taking the time to visit the weblog!
William Sharp
sharp@thedataqualitychronicle.org
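To make the idea concrete, here is a minimal sketch of the kind of column-completeness scorecard a profiling exercise produces. This is illustrative Python only, not Informatica's implementation; the field names and sample records are invented for the example.

```python
# Hedged sketch: a minimal completeness scorecard. Field names, sample
# records, and the scoring rule are illustrative, not a vendor's method.

def completeness_scorecard(records, fields):
    """Return the percentage of non-blank values for each field."""
    total = len(records)
    scores = {}
    for field in fields:
        filled = sum(1 for r in records if str(r.get(field, "") or "").strip())
        scores[field] = round(100.0 * filled / total, 1) if total else 0.0
    return scores

customers = [
    {"name": "Ann Lee", "email": "ann@example.com", "phone": ""},
    {"name": "Bob Ray", "email": "",                "phone": "555-0100"},
    {"name": "Cy Dole", "email": "cy@example.com",  "phone": "555-0101"},
]
print(completeness_scorecard(customers, ["name", "email", "phone"]))
# {'name': 100.0, 'email': 66.7, 'phone': 66.7}
```

Per-column percentages like these are exactly the raw material a management dashboard rolls up into an overall score.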

Soundex for String Matching

Soundex is a useful function for performing data matching.  While you can use a Soundex function in the process of identifying potential duplicate strings, I don’t recommend it.  Here’s why:

- The algorithm encodes consonants.
- Vowels are not encoded unless the vowel is the first letter.
- Vowels otherwise act only as separators between consonants.
- Similar sounding consonants share the same digit; C, G, J, K, Q, S, X and Z are all encoded with the same digit.

To illustrate the impact of this type of encoding, let’s look at the Soundex codes for variations of my first name, William: William encodes to W450, Will to W400, and Bill and Billy both to B400.  As this brief example shows, Soundex codes fall short of matching like strings.  One of my biggest issues with Soundex can be illustrated in the comparison of the typical nicknames for William.  Only Billy and Bill are similarly coded, while Will is not coded similar to Bill or William.  I plan to dig deeper into Soundex functions and their applicability in a future blog post.  In the meantime, I wanted to get this observation of mine out there for public consumption.

Thanks for taking the time to visit the weblog!
William Sharp
sharp@thedataqualitychronicle.org
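For readers who want to experiment, here is a small implementation of standard American Soundex in Python (the generic algorithm, not any particular vendor's variant), reproducing the codes for the nicknames discussed above.

```python
def soundex(name: str) -> str:
    """American Soundex: keep the first letter, encode the rest as digits."""
    digits = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
              **dict.fromkeys("DT", "3"), "L": "4",
              **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    code = name[0]
    prev = digits.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":             # H and W do not act as separators
            continue
        d = digits.get(ch, "")     # vowels map to "" and reset the prior digit
        if d and d != prev:
            code += d
        prev = d
    return (code + "000")[:4]     # pad or truncate to four characters

for n in ("William", "Will", "Bill", "Billy"):
    print(n, soundex(n))
# William W450, Will W400, Bill B400, Billy B400
```

Notice that the codes for Will and William differ, and that the first-letter rule keeps Will and Bill apart no matter what follows.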

Data Quality: where does it belong?

Data Quality is not a technology issue; it’s a business issue.  Here is my opinion on why people think it is about technology.  Business initiatives like MDM, BI, DQ and the like are being presented, sold and driven by technology experts.  Information technology has carried business forward to the point where we are the chauffeurs for change and progress.  Without the ability to integrate new technologies into a business, the business fails.  In this way, I believe that these disciplines are about technology.

To me, business issues are sales, budgeting, customer service and marketing.  Everything else, like operations and reporting, is so fused with technology now that it is, in a sense, a technology issue.

Let me take this theory of mine for a spin in that context.  Let’s say we are driving, or “chauffeuring”, a business executive towards a list of master product entries.  Does he or she know these entries off the top of their head?  Most likely not.  How would you “steer” them in the right direction?  Probably by querying the data and beginning with a list of values.  The decision of which ones are selected might even require a count of popular values.  In this way, technology is [...]

Data Quality: to whom does it belong?

How should data ownership be addressed?  In my opinion, a governance committee is the best option.  There should be at least one, probably two, representatives each from the business, from technology and from budgeting.  I’d suggest budgeting head the committee so that solid cost-based decisions can be made.  Business and technology can present their cases for why money should or should not be spent on a data management issue.

This content originally appeared as comments in reference to a blog post by Charles Blyth here.

Thanks for taking the time to visit the weblog!
William Sharp
sharp@thedataqualitychronicle.org

Data Cleansing every quarter?

@jschwa1 Data cleansing every 3 months? http://ow.ly/1i0vd - Someone’s not addressing the right problem!

This is a clip from a recent tweet by Julian Schwarzenbach of Data and Process Advantage Limited (DPA).  My response to his tweet was “I can see validity [of quarterly cleansing] esp. if the data is from external sources like customers”.  I can see where Julian and others might see quarterly cleansing as a lack of attention to the main issue.  His assertion is that if you need to cleanse your data every three months, maybe you have other issues you could address “up-stream” that would alleviate the need to perform cleansing so often.  I completely agree with this, especially when the data is created, maintained and distributed within an organization.  However, there are quite a few occasions when data is not created or even maintained “in-house”, and in those situations it is a good practice to cleanse the data at practical intervals.

An example of data created from outside the organization is customer data.  Frequently this data is entered directly by the customer into a database from web-enabled order entry and customer service forms.  Julian would interject to say increase data quality validation on the web [...]
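To illustrate what a periodic cleansing pass over externally captured customer data might look like, here is a hedged sketch. The field names, the controlled country list and the rules are all invented for the example; they are not drawn from any vendor tool.

```python
# Hedged sketch: cleansing externally sourced customer records.
# Rules, field names, and the country list are illustrative only.

VALID_COUNTRIES = {"US", "GB", "DE", "FR"}   # illustrative controlled list

def cleanse(record):
    """Return a cleaned copy of one record plus a list of issues found."""
    issues = []
    rec = dict(record)
    rec["email"] = rec.get("email", "").strip().lower()
    if "@" not in rec["email"]:
        issues.append("invalid email")
    rec["country"] = rec.get("country", "").strip().upper()
    if rec["country"] not in VALID_COUNTRIES:
        issues.append("unknown country code")
    return rec, issues

rec, issues = cleanse({"email": " Ann@Example.COM ", "country": "usa"})
print(rec["email"])   # ann@example.com
print(issues)         # ['unknown country code']
```

A batch of records arriving from web forms could be run through a pass like this at whatever interval the data's origin makes practical, which is the trade-off the post is debating.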

Data Quality & Cloud-based services

Software as a Service (SaaS) will help proliferate data quality solutions.  I agree with this assertion for a few reasons, not the least of which is the ease with which “front-end” data quality solutions can be included in the suite of services in a Service Oriented Architecture (SOA).  In my opinion, data quality’s true promise lies in a DQ service that can be integrated into any SOA.

Thanks for taking the time to visit the weblog!
William Sharp
sharp@thedataqualitychronicle.org
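One way to picture a DQ service in an SOA is as a stateless call that accepts a record and returns a verdict any consuming service can act on. The sketch below is a toy illustration of that shape only; the checks and field names are invented, and a real service would sit behind an actual endpoint.

```python
# Hedged sketch: a data quality check shaped as a stateless service call,
# the kind of function a SOA/SaaS endpoint could wrap. Checks are invented.

import json

def dq_service(request_json: str) -> str:
    """Accept a JSON record, return a JSON verdict with per-field checks."""
    record = json.loads(request_json)
    postcode = str(record.get("postcode", ""))
    checks = {
        "email_present": bool(record.get("email", "").strip()),
        "postcode_five_digits": postcode.isdigit() and len(postcode) == 5,
    }
    return json.dumps({"valid": all(checks.values()), "checks": checks})

print(dq_service('{"email": "ann@example.com", "postcode": "10001"}'))
```

Because the contract is just JSON in, JSON out, the same check logic can be reused by any service in the architecture, which is the integration promise discussed above.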

Data Quality ROI = Address Validation and Duplication Consolidation

I have had conversations recently with fellow data quality gurus that centered around DQ ROI.  We all know how important it is to tie a DQ initiative to a return on the investment.  This is even more true of an initiative with long-term implementation objectives.  During the course of the conversation I pointed out that I believe DQ ROI is all about validating addresses and consolidating duplicates, and there seemed to be a cathartic agreement that made us all feel like we weren’t crazy (even if it was only a brief feeling of sanity).

Address validation provides a return by increasing revenue assurance and target marketing delivery.  In short, mailing to a valid and deliverable address shortens the bill-to-cash cycle.  In addition, it provides cost avoidance on return mail charges and provides assurance on bulk mail delivery status.  Address validation also increases the potential for and accuracy of house-holding efforts, which can significantly reduce the cost of marketing initiatives.

Duplicate consolidation has a similar effect on cost, which in turn provides a return on investment.  Consolidating duplicates reduces billing errors incurred due to discrepancies between customer records (duplicate records do not always contain exactly the same data).  It also reduces the number of marketing [...]
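As a concrete illustration of duplicate consolidation, here is a hedged sketch that groups records on a normalized match key and merges each group into a single survivor. Matching on name alone is a deliberate simplification for the example; real consolidation would use richer keys and survivorship rules.

```python
# Hedged sketch: consolidating duplicate customer records by a normalized
# match key. The key (name only) and merge rule are illustrative.

from collections import defaultdict

def match_key(record):
    """Normalize the name: lowercase, collapse whitespace."""
    return " ".join(record["name"].lower().split())

def consolidate(records):
    """Group records by match key, then merge, preferring non-blank values."""
    groups = defaultdict(list)
    for r in records:
        groups[match_key(r)].append(r)
    merged = []
    for group in groups.values():
        survivor = {}
        for r in group:
            for field, value in r.items():
                if value and not survivor.get(field):
                    survivor[field] = value
        merged.append(survivor)
    return merged

dupes = [
    {"name": "Ann  Lee", "email": "", "phone": "555-0100"},
    {"name": "ann lee",  "email": "ann@example.com", "phone": ""},
]
print(consolidate(dupes))
# one record carrying both the email and the phone number
```

The merged record carries more complete data than either duplicate did alone, which is exactly where the billing-error and marketing savings come from.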

Data Discovery. The first step toward data management.

Introduction

Recently on a data discovery project I observed something that I wanted to share.  Data discovery efforts, and the tools that support them, are well suited for organizations that have had explosive data growth.  With this kind of growth the data landscape expands to the point where in-depth knowledge of data details, and more importantly metadata details, becomes unobtainable.  This is where a product suite like Global IDs’ data transparency suite can enable effective data management strategies.

Data Transparency

What’s in a data transparency suite?  The GIDS Data Transparency Product Suite is a suite of 15 applications that provides companies with a broad set of capabilities to scan and inventory their data landscape.  Using these applications, organizations can perform the following tasks:

- Scan their data environment (structured data, un-structured data, semi-structured data)
- Create and populate a metadata repository that can be searched by business and technical users
- Profile their structured databases to create a semantic understanding of their data

I can speak from experience when I say that these three functions present a complete picture of a data landscape.  With metadata, profiling results and semantic taxonomies, a master data management / data quality / data governance solution is within reach.  Now I’m [...]
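To show the kind of artifact a discovery scan produces, here is a hedged sketch of a tiny metadata inventory built by profiling one table. It is illustrative Python only, not the GIDS suite; the table, column names and sample rows are invented.

```python
# Hedged sketch: a tiny metadata inventory of the kind a discovery tool
# builds while scanning structured data. All names and rows are invented.

def profile_table(table_name, rows):
    """Record basic metadata per column: row, null, and distinct counts."""
    columns = sorted({col for row in rows for col in row})
    inventory = []
    for col in columns:
        values = [row.get(col) for row in rows]
        inventory.append({
            "table": table_name,
            "column": col,
            "rows": len(rows),
            "nulls": sum(v in (None, "") for v in values),
            "distinct": len({v for v in values if v not in (None, "")}),
        })
    return inventory

orders = [
    {"id": 1, "sku": "A1", "region": "east"},
    {"id": 2, "sku": "A1", "region": ""},
]
for entry in profile_table("orders", orders):
    print(entry)
```

Run across every table in an environment and loaded into a searchable repository, entries like these become the metadata layer that business and technical users query.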

Data Quality Resource

Recently a reader, Richard Ordowich, posted this resource in a comment, so I thought I’d pass it along:

“The most comprehensive list I have seen is in the book Managing Information Quality by Martin Eppler, in which he lists 70 typical information quality criteria compiled from various other sources (and referenced).”

Thanks for taking the time to visit the weblog!
William Sharp
sharp@thedataqualitychronicle.org


6 Responses to Editorials

  1. Jackie Roberts on March 12, 2010 at 4:32 pm

    William, excellent twitter snippets for discussion!!!

    Data cleansing every quarter – in my world of data cleansing we classify, profile, verify, enrich and translate before the data is exported to set up a material master, which naturally feeds downstream systems. We also have maintenance processes to re-verify and audit that product information is current. After a while, a relationship is developed with the team of analysts and the manufacturers / suppliers to provide feedback on manufacturer obsolescence or product updates.

    Data Quality & Cloud-based Services – It is imperative that data cleansing is a critical step at set-up. I am very interested in ongoing data quality maintenance tools and data reporting. From what I can see, there isn’t much thought given to data matching, to reconciling inconsistent data structures, or to reporting in “Cloud-based Services”.

    Data Quality: where does it belong & to whom does it belong? Data Quality and Governance need to be an enterprise solution with a steering committee on which the cross-functional core disciplines are represented. A budget is a must to ensure that data governance and data cleansing are standard business processes, as data is the foundation of information quality. The enterprise will have cost-saving opportunities that arise out of a cleansed data environment but that will also require funding to implement streamlined business processes, such as a virtual inventory sharing program, improved data extract processes to improve data processing cost or throughput, etc.

  2. William Sharp on March 12, 2010 at 7:40 pm

    Thanks Jackie! Comments are blogging’s sweet reward! I was thinking of you and our recent discussion when I was writing about quarterly data cleansing. I think DQ service advocacy, no matter what the periodicity, is good. Like I said, I see Julian’s point that there is a root cause that is potentially being ignored; however, there are scenarios where the root is outside the organization. Most often it is not possible to require DQ services in these scenarios.
    As for DQ services in the cloud, Informatica has made strides there. You should check out @infacloud on twitter for more info.
    Thanks again, Jackie. I look forward to more discussions with you about the nitty-gritty of DQ, cloud based or otherwise.

    • Julian Schwarzenbach on March 25, 2010 at 7:59 am


      Unfortunately, the 140 character limit in Twitter means that messages are sometimes truncated or don’t cover every angle. I accept that where data is coming into an organisation from external sources, you can be less rigorous about validation. I also recognise that where customers enter data, validation may detract from the ‘customer experience’; however, that should not prevent a reasonable level of drop-down lists, check boxes etc. being used for data entry. Many web sites are still over-reliant on free text entry, even for standard items such as country codes.

      Even then, I am still not sure that full data cleansing every quarter is the correct answer. For example, if a customer database contains 10 million entries and is growing at a rate of 100,000 entries per month, then once you have minimised the likely causes of error, cleansing should only be required on those 100,000 new records each month. Since any other data changes through internal processes should be appropriately controlled and validated, the vast majority of those 10 million records should not need cleansing. Surely running a full quarterly cleanse is a waste of business resources? And what about BI generated immediately prior to a cleanse cycle? Surely this will not be giving the correct answer.

      I appreciate we all have different backgrounds and perspectives, so there may be other things I have missed. However, clients should still make sure they understand why a vendor is suggesting a quarterly data cleanse (because that is in the vendors interest) and check that validation processes are suitable to reduce cleansing to the optimum level.


      • William Sharp on March 25, 2010 at 2:27 pm

        So glad you elaborated on this! And although we’ve privately discussed this, let me state that this editorial quip was not intended to slight you, your firm, or your years of domain expertise. In fact, I often learn and gain new perspectives as I read your writings.
        I am also glad you highlighted something I failed to address; incremental cleansing. I agree that cleansing should only be required incrementally. There is one exception and that is when a new cleaning requirement is developed/discovered.
        Now traditionally cleansing does not infer duplicate consolidation. However, it is worth noting that duplicate consolidation would need to be performed on the entire recordset each quarter. I do feel as though this would be a proper best practice to recommend as well.
        Thanks for the comment, Julian. That is exactly what I am aiming for with this page! I find that healthy, respectful debate is often the path to insight.
        Thanks again!

  3. Ivan Chong on May 16, 2010 at 10:15 pm

    Well written post – thanks for doing a great job of educating. Informatica has customers that derive address cleansing ROI in the way you mention. They measure revenue assurance via DSO and can directly tie reduced DSO to better billing address quality. One customer remarked that customers who never receive invoices tend not to pay their bills.

    Other customers measure duplicates and easily relate those DQ issues to process inefficiency. My favorite example is where a customer measured duplicate inbound invoices for AP. Not surprisingly, vendors do not complain when they receive multiple payments on the same invoice. Our customer saves millions per quarter just by deduplicating their AP records.

  4. Alastair McKeating on August 31, 2010 at 9:56 pm

    I agree that Greenplum will raise a storm of DQ issues, and that’s a good thing. Having spent many years as a data architect, I think the bane of my existence was debating the one true master record. A federated view is a more accurate reflection of reality: quality can be enhanced in the context of a specific use while a central governance process aggregates the individual “correct in their context” views into a master view (deliberately using the concept of a view rather than the more restrictive concept of a single physical record).

    Also, my understanding of Greenplum is that it emphasizes the value of collaborative technology, which may be a better, less formal, more collective way to flag at-risk inconsistencies in any decision made on the basis of said aggregation, as a complement to the formal record structure.

Leave a Reply

Your email address will not be published. Required fields are marked *