Selecting Match Codes
Designing match criteria for a deduplication strategy requires an intimate understanding of the data, and to that end, STEP Data Profiles can be of great assistance. Data profiles show the extent to which relevant attributes are populated and highlight the most frequent and rare values and patterns. For more information, see the Data Profiling documentation here.
Prerequisites
Configure the Matching component model as defined in Configuring Matching Component Model topic here.
Data Profile Analysis
In the following example, a profile is generated from the 'External Products' node to find missing values for both OEM and OEM Part Number. This ability to highlight missing values should be accounted for in the deduplication strategy. Furthermore, as illustrated below, the profile shows that the OEM values include obvious duplicates like 'Craft Parts' / 'Craft parts' and 'Weller' / 'WELLER INC,' indicating that some form of normalization is required.
For OEM Part Number, there are more than one hundred distinct values, and thus, the data profile in workbench does not provide exact statistics with the default settings. In Web UI, the exact statistics could be found, but that is actually not necessary in this case. It is possible to see that both uppercase and lowercase letters are used, and that punctuation is used in some values and not in others. Again, this indicates that normalization will be required. Still, it is possible to see that both uppercase and lowercase letters are used, and that punctuation is used in some values and not in others. Again, this indicates that normalization will be required.
Notice, that when looking at the frequent patterns info, there are no clear, distinct patterns in the values.
With two 'matching' attributes, it would be possible to generate two match codes per object, but for this case, this is likely not the best strategy because the number of different OEM values is quite low, especially if they are normalized. Further, comparing all items from the same OEM would result in too many comparisons.
As there is a significant spread in OEM Part Number values, generating match codes based solely on these values could work. Additionally, the OEM value should be used as a basis for matching since a specific OEM Part Number value pattern to an OEM cannot be assumed. For example, a match on OEM Part Number is not necessarily a true match as these values are reused. However, this approach would require that the matching algorithm logic inspect the OEMs later to determine if there was a match or not.
A possible solution is to normalize the values before generating the match codes, and generate composite match codes that include information from both attributes.