This is a personal weblog. The opinions and information distributed represent my own and not those of my employer.
In addition, my opinions change from time to time…I consider it a reflection of learning over time.
This weblog is intended to provide a summary of my experiences. As such, information and opinions expressed within out-of-date posts, blurbs or editorials may have changed, either at my discretion or after convincing from others who are more “in-the-know”.
If I have offended you with what I have written … maybe that’s a good thing? It proves someone is reading this stuff. If I have really offended you … really? It’s a blog. Chill out.
If I have made you smile, laugh, or even think a little deeper … that’s great. I am glad I have helped someone before I retire.
In my opinion, profiling and scoring data is a fundamental part of a sound data quality assessment. I routinely use these processes to build my “current state” report for clients. I recently used Informatica’s Data Quality developer and analyst tools to put together such a package. I am of the opinion that these tools represent the “best in breed” available to do so. The learning curve is not steep, the functionality is easy to implement and, perhaps most of all, the solution is comprehensive. In a matter of hours you go from raw data to a management reporting dashboard. If you’ve used Informatica or another tool, let me hear your thoughts … (leave a comment)

Thanks for taking the time to visit the weblog!
William Sharp
sharp@thedataqualitychronicle.org
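To make the profiling-and-scoring idea concrete, here is a minimal sketch in Python of what a “current state” column profile boils down to: counts, completeness, distinct values and top values. The dataset and field names are hypothetical, and real tools like Informatica Data Quality compute far richer metrics than this.

```python
from collections import Counter

def profile(rows, column):
    """Return basic profile stats for one column of a list-of-dicts dataset."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "non_null": len(non_null),
        "distinct": len(set(non_null)),
        # Completeness is a simple, common scoring metric: share of populated values
        "completeness": len(non_null) / len(values) if values else 0.0,
        "top_values": Counter(non_null).most_common(3),
    }

# Illustrative records only
rows = [
    {"name": "Ann", "city": "Boston"},
    {"name": "Bob", "city": ""},
    {"name": "Cy",  "city": "Boston"},
]
print(profile(rows, "city"))
```

A handful of metrics like these, rolled up per column and per table, is usually the backbone of the management dashboard.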
Soundex is a useful function for performing data matching. While you can use a Soundex function in the process of identifying potential duplicate strings, I don’t recommend it. Here’s why …

- The algorithm encodes consonants only
- Vowels are not encoded unless the vowel is the first letter
- H and W are ignored, and adjacent consonants sharing a code are collapsed into a single digit
- Similar sounding consonants share the same digit: C, G, J, K, Q, S, X and Z are all encoded with the same digit

To illustrate the impact of this type of encoding, let’s look at an example of Soundex codes for deviations of my first name, William. As you can see from the brief example above, Soundex codes fall short of matching like strings. One of my biggest issues with Soundex can be illustrated in the comparison of the typical nicknames for William. Only Billy and Bill are similarly coded, while Will is not coded similar to Bill or William. I plan to dig deeper into Soundex functions and their applicability in a future blog post. In the meantime, I wanted to get this observation of mine out there for public consumption.
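The nickname problem above is easy to reproduce. Here is a compact sketch of the classic American Soundex algorithm (it glosses over some edge cases, but handles these names correctly):

```python
def soundex(name: str) -> str:
    """Simplified American Soundex: keep the first letter, code the rest,
    drop vowels/H/W/Y, collapse adjacent duplicate codes, pad to 4 chars."""
    mapping = {"b": "1", "f": "1", "p": "1", "v": "1",
               "c": "2", "g": "2", "j": "2", "k": "2",
               "q": "2", "s": "2", "x": "2", "z": "2",
               "d": "3", "t": "3", "l": "4",
               "m": "5", "n": "5", "r": "6"}
    name = name.lower()
    code = name[0].upper()
    prev = mapping.get(name[0], "")
    for ch in name[1:]:
        digit = mapping.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # H and W do not break a run of duplicate codes
            prev = digit
    return (code + "000")[:4]

for n in ["William", "Will", "Bill", "Billy"]:
    print(n, soundex(n))
# William → W450, Will → W400, Bill → B400, Billy → B400
```

So Bill and Billy match each other, but Will matches neither William (the dropped “iam” changes the code) nor Bill (the first letter is kept verbatim). That is exactly the weakness described above.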
Data quality is not a technology issue; it’s a business issue. Here is my opinion on why people think it is about technology. Business initiatives like MDM/BI/DQ and the like are being presented, sold on, and driven by technology experts. Information technology has carried business forward to the point where we are the chauffeurs for change and progress. Without the ability to integrate new technologies into a business, the business fails. In this way, I believe that these disciplines are about technology. To me, business issues are sales, budgeting, customer service and marketing. Everything else, like operations and reporting, is so fused with technology now that it is, in a sense, a technology issue. Let me take this theory of mine for a spin in that context. Let’s say we are driving, or “chauffeuring”, a business executive towards a list of master product entries. Does he/she know these entries off the top of their head? Most likely not. How would you “steer” them in the right direction? Probably by querying the data and beginning with a list of values. The decision of which ones are selected might even require a count of popular values. In this way, technology is [...]
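That “count of popular values” step is itself a small piece of technology. A sketch, using made-up product entries, of how you would steer the conversation toward candidate master values:

```python
from collections import Counter

# Hypothetical product descriptions as they were actually keyed in.
entries = ["Widget A", "widget a", "Widget A", "Widget B", "Gadget C", "Widget B"]

# Normalize case and whitespace before counting so trivial keying
# variants group together, then rank by popularity.
counts = Counter(e.strip().lower() for e in entries)
for value, n in counts.most_common():
    print(f"{value}: {n}")
```

The executive picks the master entries; the query that puts the ranked list in front of them is pure technology.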
How should data ownership be addressed? In my opinion a governance committee is the best option. There should be at least one, probably two, representatives from the business, from technology and from budgeting. I’d suggest budgeting head the committee so that solid cost-based decisions can be made. Business and technology can present their case for why money should or should not be spent on a data management issue. This content originally appeared as a comment in reference to a blog post by Charles Blyth here.
@jschwa1 Data cleansing every 3 months? http://ow.ly/1i0vd - Someone’s not addressing the right problem! This is a clip from a recent tweet from Julian Schwarzenbach of Data and Process Advantage Limited (DPA). My response to his tweet was “I can see validity [of quarterly cleansing] esp. if the data is from external sources like customers”. I can see where Julian and others might see quarterly cleansing as a lack of attention to the main issue. His assertion is that if you need to cleanse your data every three months, maybe you have other issues you could address “up-stream” that would alleviate the need to perform cleansing so often. I completely agree with this, especially when the data is created, maintained and distributed within an organization. However, there are quite a few occasions when data is not created or even maintained “in-house”, and in this situation it is a good practice to cleanse this data at practical intervals. An example of data created from outside the organization is customer data. Frequently this data is entered directly by the customer into a database from web-enabled order entry and customer service forms. Julian would interject to say increase data quality validation on the web [...]
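The “up-stream” validation Julian would argue for can be as simple as rejecting bad values at the web form before they ever land in the database. A minimal sketch, with hypothetical field names and a deliberately loose email pattern:

```python
import re

# Loose illustrative check: something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_customer(record):
    """Return a list of field-level problems; an empty list means acceptable.
    Run at data entry, this shrinks the downstream cleansing workload."""
    problems = []
    if not record.get("name", "").strip():
        problems.append("name is blank")
    if not EMAIL_RE.match(record.get("email", "")):
        problems.append("email is malformed")
    return problems

print(validate_customer({"name": "Ann", "email": "ann@example.com"}))  # []
print(validate_customer({"name": "", "email": "not-an-email"}))
```

Even with checks like these at the point of entry, externally created data still drifts (people move, companies rename), which is why periodic cleansing of such data remains a good practice.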
Software as a Service (SaaS) will help proliferate data quality solutions. I agree with this assertion for a few reasons, not the least of which is the ease with which “front-end” data quality solutions will be included in the suite of services in a Service Oriented Architecture (SOA). In my opinion, data quality’s true promise lies in a DQ service that can be integrated into any SOA.
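What would such a DQ service look like at its core? A sketch of the idea: JSON request in, JSON verdict out, so any SOA endpoint (HTTP, message queue) can wrap it. The checks and field names here are illustrative only.

```python
import json

def dq_service(request_json: str) -> str:
    """Toy data quality check with a service-style contract: JSON in, JSON out."""
    record = json.loads(request_json)
    issues = []
    if "@" not in record.get("email", ""):
        issues.append("invalid email")
    if not record.get("postal_code", "").strip():
        issues.append("missing postal code")
    return json.dumps({"valid": not issues, "issues": issues})

print(dq_service('{"email": "ann@example.com", "postal_code": "02108"}'))
```

Because the contract is plain JSON, the same check can be called from an order entry form, a batch load, or a partner feed, which is exactly the proliferation SaaS promises.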
I have had conversations recently with fellow data quality gurus which centered around DQ ROI. We all know how important it is to tie a DQ initiative to a return on the investment. This is even more true of an initiative with long-term implementation objectives. During the course of the conversation I pointed out that I believe DQ ROI is all about validating addresses and consolidating duplicates, and there seemed to be a cathartic agreement that made us all feel like we weren’t crazy (even if it was only a brief feeling of sanity). Address validation provides a return by increasing revenue assurance and target marketing delivery. In short, mailing to a valid and deliverable address shortens the bill-to-cash cycle. In addition, it provides a cost avoidance on return mail charges and provides assurance on bulk mail delivery status. Address validation also increases the potential for and accuracy of house-holding efforts, which can significantly reduce the cost of marketing initiatives. Duplicate consolidation has a similar effect on cost, which in turn provides a return on investment. Consolidating duplicates reduces billing errors incurred due to discrepancies between customer data (duplicate records do not always contain exactly the same data). It also reduces the number of marketing [...]
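To show what duplicate consolidation means mechanically, here is a sketch that groups customer records on a normalized name-plus-address key and keeps one survivor per group. The matching rule (exact match after normalization) and survivorship rule (most populated fields wins) are deliberately simplistic; real consolidation uses fuzzier matching and richer survivorship logic.

```python
def consolidate(records):
    """Group records by a normalized key; keep the most populated record per group."""
    def key(r):
        return (r["name"].strip().lower(), r["address"].strip().lower())

    survivors = {}
    for r in records:
        k = key(r)
        filled = sum(1 for v in r.values() if v)  # crude survivorship score
        if k not in survivors or filled > survivors[k][0]:
            survivors[k] = (filled, r)
    return [r for _, r in survivors.values()]

# Illustrative records: the two Ann rows are the same customer
customers = [
    {"name": "Ann Smith", "address": "1 Main St", "phone": ""},
    {"name": "ann smith", "address": "1 Main St ", "phone": "555-0100"},
    {"name": "Bob Jones", "address": "2 Oak Ave", "phone": "555-0101"},
]
print(consolidate(customers))  # 2 survivors; Ann's surviving record keeps the phone
```

One mail piece and one bill per surviving record instead of per raw record is where the cost reduction, and hence the ROI, comes from.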
Introduction

Recently, on a data discovery project, I observed something that I wanted to share. Data discovery efforts, and the tools that support them, are well suited for those organizations that have had explosive data growth. With this kind of growth the data landscape expands to the point where in-depth knowledge of data details, and more importantly metadata details, becomes unobtainable. This is where a product suite like Global IDs’ data transparency suite can enable effective data management strategies.

Data Transparency

What’s in a data transparency suite? The GIDS Data Transparency Product Suite is a suite of 15 applications that provides companies with a broad set of capabilities to scan and inventory their data landscape. Using these applications, organizations can perform the following tasks:

- Scan their data environment (structured data, un-structured data, semi-structured data)
- Create and populate a metadata repository that can be searched by business and technical users
- Profile their structured databases to create a semantic understanding of their data

I can speak from experience when I say that these three functions present a complete picture of a data landscape. With metadata, profiling results and semantic taxonomies, a master data management / data quality / data governance solution is within reach. Now I’m [...]
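The first two tasks, scanning a source and populating a searchable metadata repository, reduce to something like the sketch below. This is not how the GIDS suite works internally; it is a toy illustration against a single SQLite database, where a real suite spans many source types at enterprise scale.

```python
import sqlite3

def inventory(conn):
    """Scan one SQLite database into a tiny metadata inventory:
    table names, column names, and row counts."""
    meta = {}
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for t in tables:
        cols = [r[1] for r in conn.execute(f"PRAGMA table_info({t})")]
        count = conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
        meta[t] = {"columns": cols, "rows": count}
    return meta

# Hypothetical scan target
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, email TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'ann@example.com')")
print(inventory(conn))
```

Persist a structure like this for every source and make it searchable, and you have the skeleton of a metadata repository; layer profiling results on top and the semantic picture starts to emerge.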
Recently a reader, Richard Ordowich, posted this resource in a comment, so I thought I’d pass it along: “The most comprehensive list I have seen is in the book Managing Information Quality by Martin Eppler, in which he lists 70 typical information quality criteria, compiled from various other sources (and referenced).”

Thanks for taking the time to visit the weblog!
William Sharp
sharp@thedataqualitychronicle.org