Here’s where I’d like to add those little tidbits of information that typically come up on data quality projects. In an effort to produce more content, I’ve decided to write these tidbits in a more informal manner. If there is an interest in more details, please comment and I can follow up with a deeper dive.
What’s an Orphan?

An orphan transaction is a transaction in a “child” table without an associated transaction in a “parent” table. For instance, an address record in the Address table without a link to a customer record in the Customer table. Orphaned records lead to a variety of issues in business operations such as marketing and business analytics.

Challenges orphans pose to the business

An address without an associated customer record poses a challenge to marketing initiatives when trying to engage the customer. How do you address the marketing mailer, determine previous purchasing history, or perform gender-based segmentation? In order to confront this issue it is crucial to understand whether you have applications that are creating orphans. There are various ways to conduct this analysis, either manually with SQL or in an automated fashion using data quality tools.

Orphan Analysis Methods

To manually investigate orphans, a series of SQL queries needs to be created. Since these queries need to be executed periodically, it is probably best if they are set up as stored procedures. For the purposes of this post, let’s just concentrate on the substance of the queries. Orphans are all about linkages, so the basic information required is to [...]
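The core of that linkage check can be sketched with a LEFT JOIN. Here is a minimal, self-contained illustration using Python’s sqlite3 and hypothetical Customer/Address tables (the table and column names are invented for the example, not taken from any particular system):

```python
import sqlite3

# Hypothetical parent (Customer) and child (Address) tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Address  (address_id INTEGER PRIMARY KEY,
                           customer_id INTEGER,
                           street TEXT);
    INSERT INTO Customer VALUES (1, 'Acme Corp');
    INSERT INTO Address  VALUES (10, 1,  '100 Main St');   -- linked
    INSERT INTO Address  VALUES (11, 99, '200 Elm St');    -- orphan
""")

# LEFT JOIN the child table to the parent; rows where the parent key
# comes back NULL have no matching parent record -- the orphans.
orphans = conn.execute("""
    SELECT a.address_id, a.street
    FROM Address a
    LEFT JOIN Customer c ON c.customer_id = a.customer_id
    WHERE c.customer_id IS NULL
""").fetchall()

print(orphans)  # -> [(11, '200 Elm St')]
```

The same query shape works in any SQL dialect, and is the kind of statement you would wrap in a stored procedure for periodic execution.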
Introduction

Data integration is nothing new. Since the advent of data warehousing, data integration has been a major initiative for most large organizations. One of the most common obstacles to integrating data into a warehouse has been that assumptions about the state of the source data were either false or flawed at best. One of the reasons for this is that very little investigation, or data profiling, is performed on the source data prior to the design and execution of the data extraction routines. With all there is to plan for and deliver on data warehousing projects, this oversight is understandable. However, I believe it is time for data quality to assume the role of reporting on and remediating the current state of all source data to be migrated into a warehouse.

Turning assumptions into facts

If source data is profiled, what were once assumptions about the data can be transformed into facts about its state. Data profiling consists of scanning data and typically delivers measures such as the frequency of nulls, the extent to which data is unique, and the ranges of actual values within each field included. With data quality tools such as Informatica’s Data [...]
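As a rough illustration of those measures, here is a small Python sketch that computes null frequency, distinct counts, and value ranges for a field (the sample rows and field names are made up for the example; real profiling tools do far more):

```python
# Hypothetical sample rows standing in for scanned source data.
rows = [
    {"name": "Ann", "zip": "02139"},
    {"name": None,  "zip": "02139"},
    {"name": "Bob", "zip": "10001"},
]

def profile(rows, field):
    """Basic profile of one field: null frequency, uniqueness, value range."""
    values = [r[field] for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "null_pct": (len(values) - len(non_null)) / len(values),
        "distinct": len(set(non_null)),
        "min": min(non_null),
        "max": max(non_null),
    }

print(profile(rows, "zip"))
# -> {'null_pct': 0.0, 'distinct': 2, 'min': '02139', 'max': '10001'}
```

Even a crude scan like this turns "we assume the ZIP field is populated" into a measured fact before extraction routines are designed.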
I’ve started using the date-related functions in the Data Quality Developer tool. I’ve found some fun ways to implement them and wanted to share.

Is_Date

Before you use any date function you need to be sure you’re dealing with a date string. The Is_Date function, available in the Expression transform, is how you test a string for a proper date format. The syntax is simple: Is_Date(Input Port). The result is 1 if the string is indeed a date and 0 if it is not. Refer to the graphic below for an illustration of how to implement the Is_Date function.

To_Date

If the Is_Date test passes, you’ll still need to convert the string to an actual date to use the other date functions. To do this you’ll use the To_Date function. Again, the syntax is fairly simple: To_Date(Input Port, ‘DATE FORMAT’). Refer to the graphic below for an illustration of how to implement the To_Date function.

Date_Diff

Now that you know you have date strings to analyze, you can move on to more advanced date functions. In many cases, the time period between dates is a critical metric when performing [...]
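To make the behavior of these functions concrete outside of IDQ, here is a rough Python analogue of Is_Date, To_Date, and a day-based Date_Diff using the standard library (the date format string is an assumption for the example; IDQ’s own syntax and format codes differ):

```python
from datetime import datetime

FMT = "%m/%d/%Y"  # assumed format for the example, e.g. 01/31/2011

def is_date(s, fmt=FMT):
    """Return 1 if s parses as a date in the given format, else 0."""
    try:
        datetime.strptime(s, fmt)
        return 1
    except (ValueError, TypeError):
        return 0

def to_date(s, fmt=FMT):
    """Convert a validated date string into an actual date value."""
    return datetime.strptime(s, fmt)

def date_diff_days(s1, s2, fmt=FMT):
    """Whole days between two date strings."""
    return (to_date(s2, fmt) - to_date(s1, fmt)).days

print(is_date("02/29/2011"))                       # 2011 is not a leap year -> 0
print(date_diff_days("01/01/2011", "01/31/2011"))  # -> 30
```

Note how the leap-year case shows why a format check alone isn’t enough: the string looks like a date but doesn’t parse as one.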
Why?

There are plenty of aspects of address validation to write about. Validating addresses can be done with many different tools, each with its own specific details on how to do it. There are various ways to validate addresses within each tool to produce different outcomes. And there are various ways to manage and integrate this data back into enterprise operational systems. While all of this content is very helpful and important to convey, I believe it is essential to understand, write about, and discuss why address validation is important to an organization and its master data management (MDM) efforts. Here are a few reasons why address validation is so important to any organization (feel free to comment and suggest others!):

- Without valid address information, return mail can impact an organization’s bulk mail status and lead to increased mailing costs
- Without valid address information, billing operations generate costly cycles of collections and corrections, which has a negative impact on revenue assurance
- Without valid address information, marketing campaigns are not fully leveraged
- Without valid address information, marketing techniques like house-holding cannot fully realize their potential
- Without valid address information, customer care operations are impeded
- Without valid address information, customer [...]
Why?

Address information, in particular customer address information, is a core asset of any business. It plays a pivotal role in two fundamental business operations: revenue assurance and revenue generation. Without valid, deliverable customer address information, collecting payment for services or products is often a process that, at best, requires repetitive efforts that cost the business labor and resources (and dollars). At worst, the process fails to collect, creating an obvious issue costing time and resources (and dollars). Without valid, deliverable customer address information, marketing to existing and potential customers is not possible and will, again, cost the business labor and resources (and dollars).

What?

So what exactly needs to be validated in order to prevent the failure of revenue collection and generation events? While it is not harmful to have the full complement of customer address data collected, stored, and validated, there are a few pieces of address information that are essential. Postal Code is an absolute must-have in order to ensure the mailer is delivered. Postal Code is the core element that the United States Postal Service uses to route mail. Without it, deliverability is unachievable. Street Number and Street Name are also essential pieces of information to [...]
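As a sketch of those essentials, here is a minimal format check for a US Postal Code and a street line in Python. To be clear, this only checks format, which is an assumption-level stand-in: actual deliverability requires USPS reference data or a validation service, as discussed above:

```python
import re

# 5-digit ZIP or ZIP+4 format only -- a format check, not a deliverability check.
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")

def has_valid_zip_format(postal_code):
    """True if the value looks like a US ZIP or ZIP+4 code."""
    return bool(postal_code and ZIP_RE.match(postal_code.strip()))

def has_street_basics(street_line):
    """Crude check for the two essentials named above:
    a leading street number followed by a street name."""
    parts = (street_line or "").split()
    return len(parts) >= 2 and parts[0].isdigit()

print(has_valid_zip_format("02139-4307"))  # -> True
print(has_valid_zip_format("2139"))        # -> False
print(has_street_basics("100 Main St"))    # -> True
```

A check like this is useful as a first-pass screen at the point of capture, before the record ever reaches a full validation tool.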
Introduction

I’ve recently made the shift from Informatica Data Quality / Data Explorer version 8.x to Informatica v9. In the process I have discovered quite a few shifts in how certain tasks are performed. Notice I called them shifts. I did so because, after learning them, I didn’t feel as though they were fundamental changes. A prime example, and the focus of this post, is the source-to-target process. This process is so fundamental to ETL and data quality that it is often the starting point for all development. As such, I have chosen it as my starting point in mapping out the shifts in v9.

Shifting from the start …

In IDQ v8.x there was a source component that you added to the plan and defined either with custom SQL or by selecting the required data attributes. In v9 you’ll have a hard time finding that component. I know I did. That’s because Informatica has shifted away from source/target components and toward physical/logical data objects. For simplicity’s sake we’ll use physical data objects in our example. However, a logical data object is a construct that merits mention here, and one I’ll follow up on in a later posting.

Logically speaking

Briefly, [...]
I’ve been in the clouds lately, in more ways than one. I’ve been on the road performing another data quality assessment on an island in the Pacific, which means I’m gaining status on multiple airlines and becoming increasingly appreciative of noise-canceling technologies.
I’m also gaining an appreciation for another technology: cloud-based data quality solutions! I am leveraging Informatica’s latest data quality platform, IDQ v9. IDQ v9 brings to mind a favorite ’80s commercial of mine where peanut butter and chocolate are combined into one tasty treat! For sure there is a little PowerCenter in your Data Quality and a little Data Quality in your PowerCenter …
Introduction

While I am doing research for my next matching algorithm post, on the Jaro-Winkler algorithm, I have decided to throw together some of my favorite “lessons learned” from my practice with Informatica Data Quality (IDQ) Workbench. This eclectic bunch of tricks has helped me carry out various tasks such as more comprehensive data profiling and more accurate matching. Some of the tips are generic and apply to any data quality software; however, some require detailed knowledge of IDQ. If you’d like more details about the tips below, please feel free to leave a comment and I’ll follow up with you directly.

Data profiling and the use of delimiters

One of the most valuable profiling and analysis components Informatica supplies is the Token Labeler. This component categorizes the data format of the values in a given field. It does so through the use of tokens and delimiters. These tokens are essentially labels assigned to each string in a field, where the strings are separated by a delimiter. The significance of the delimiter is that it is often an indication of the “noise” characters that need to be removed from or converted in the data. For instance, on a recent engagement [...]
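The idea behind token labeling can be sketched in a few lines of Python: split each value on delimiters, label each token, and emit a format pattern. The labels and delimiter set here are simplifications for illustration, not IDQ’s actual token set:

```python
import re

DELIMS = r"[ ,\-/]"  # assumed delimiter set for the sketch

def label(token):
    """Assign a coarse label to one token."""
    if token.isdigit():
        return "NUM"
    if token.isalpha():
        return "WORD"
    return "MIXED"

def token_pattern(value):
    """Categorize a value's format as a pattern of token labels."""
    tokens = [t for t in re.split(DELIMS, value) if t]
    return "+".join(label(t) for t in tokens)

print(token_pattern("100 Main St"))  # -> NUM+WORD+WORD
print(token_pattern("02139-4307"))   # -> NUM+NUM
```

Grouping a field’s values by pattern like this quickly surfaces the outlier formats, and the delimiters you had to split on are often exactly the “noise” characters that need cleanup.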
Among Richard Hamming’s many accomplishments is the development of an algorithm to compare various types of strings of the same length to determine how different they are. Due to the requirement of equal length, the algorithm is primarily used to detect differences in numeric strings but can be used with textual data as well.
Informatica has incorporated the Hamming algorithm into the Data Quality Workbench tool in order to produce a match score. The Hamming component requires the selection of at least two inputs; it can be configured to handle data with nulls, and it outputs a match score. In IDQ, a Hamming match score of one (1) indicates a perfect match, while a Hamming match score of zero (0) indicates that there was no correlation between the two values being analyzed.
I’ve used the Hamming component in IDQ to analyze match possibilities in telephone numbers and postal codes. I’ve found it to be reliable in detecting true positive matches and sensitive enough to detect even slight differences (as indicated in the sample data above). I hope this review will help those of you interested in using the Hamming component in IDQ or those just interested in developing knowledge of the algorithm.
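For those interested in the mechanics, here is a minimal Python sketch of a normalized Hamming score on the 0-to-1 scale described above. This mirrors the scoring convention, not IDQ’s internal implementation:

```python
def hamming_score(a, b):
    """Normalized Hamming score: 1.0 = identical strings,
    0.0 = every position differs. Requires equal lengths,
    as the algorithm itself does."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return 1 - mismatches / len(a)

# Telephone-number style strings, a natural fit for a positional algorithm.
print(hamming_score("555-1234", "555-1234"))  # -> 1.0
print(hamming_score("555-1234", "555-1243"))  # transposed digits -> 0.75
```

Notice how even a single transposition costs two positions, which is why the score is sensitive to slight differences in positional data like phone numbers and postal codes.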
I’d like to begin a multi-part series of postings where I detail the various algorithms available in Informatica Data Quality (IDQ) Workbench. In this post I’ll start by giving a quick overview of the algorithms available and some typical uses for each. In subsequent postings I’ll get more detailed and outline the math behind each algorithm. Finally, I’d like to finish up with some baseline comparisons using a single set of data.

IDQ Workbench enables the data quality professional to select from several algorithms in order to perform matching analysis. Each of these serves a different purpose or is tailored toward a specific type of matching. These algorithms include the following:

- Hamming Distance
- Jaro-Winkler
- Edit Distance
- Bigram (or bigram frequency)

Let’s look at the differences and the main purpose of each of these algorithms. The Hamming Distance algorithm, for instance, is particularly useful when the position of the characters in the string is important. Examples of such strings are telephone numbers, dates, and postal codes. The Hamming Distance algorithm measures the minimum number of substitutions required to change one string into the other, or the number of errors that transformed one string into the other. The Jaro-Winkler algorithm is well suited for matching strings where [...]
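As a preview of the Bigram algorithm in that list, here is a minimal Python sketch of bigram similarity using the Dice coefficient, one common formulation (IDQ’s exact formula may differ):

```python
def bigrams(s):
    """All adjacent two-character slices of a string."""
    return [s[i:i + 2] for i in range(len(s) - 1)]

def bigram_similarity(a, b):
    """Dice coefficient over bigrams: 2 * shared / (total in both)."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    matches = 0
    pool = list(bb)          # consume each bigram at most once
    for g in ba:
        if g in pool:
            pool.remove(g)
            matches += 1
    return 2 * matches / (len(ba) + len(bb))

# "night" vs "nacht": only the bigram "ht" is shared.
print(bigram_similarity("night", "nacht"))  # -> 0.25
print(bigram_similarity("night", "night"))  # -> 1.0
```

Because it compares character pairs rather than positions, bigram matching tolerates insertions and reorderings better than a positional algorithm like Hamming, which is why it suits free-text fields.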
Recently on a data quality project I stumbled across an anomaly that I thought I’d share with the data quality / Informatica community. It involves the use of Informatica Data Quality (IDQ) and certain types of queries.
With these basic switches you can deploy any IDQ plan regardless of the query required to source the data. I hope this post helps someone avoid hours of debugging!