Configuring Match Codes
Before configuring match codes, consider the following:
- Examine the data closely before configuring a match code. The data profiling tool can provide valuable information here. If you plan to use a specific attribute in the match code, always check how fully the attribute is populated. If values are missing on many objects, the attribute is likely not a good candidate, or at least should not be used alone, because objects with 'empty' match codes will not be included in the database table.
- When working with match codes composed of several pieces of data, always put the most significant data first. For instance, when deduplicating address objects, put the ZIP code before the street and street number. ZIP codes are geographic, standardized, and mutually exclusive, so they most effectively separate your addresses into discrete groups.
- Be sure to normalize the data used in match codes. If, for instance, a manufacturer name is often abbreviated, your match code definition should handle this so that the name is represented the same way in every match code, whether or not it is abbreviated on the source object.
- The match code can be a single piece of data, such as an EAN number. If you are only interested in comparing objects that have identical EAN numbers, use a Window Size of 1, which means that only objects with identical match code values are compared.
- Several match codes can be generated per source object. STEP functions can resolve to a list of multiple match codes, and in JavaScript, an array can be returned. In both cases, each element becomes a separate match code. As a simple example, suppose you want to identify duplicates among customer entity objects that each have a name and an address attribute. The match code could be a concatenation of address and name, but with this approach you would not find duplicates for customers who have moved, as their match codes would likely be placed too far from each other. Instead, each object could be represented by two match codes, one for 'Name' and one for 'Address', so that objects are compared both when they have similar names and when they have similar addresses. Add a hardcoded prefix at the start of each match code to prevent comparisons across the two domains.
- Ideally, generate match codes that allow you to perform matching with a Window Size of 1 without too many objects sharing the same match code.
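The considerations on ordering, normalization, and multiple match codes can be illustrated with a minimal sketch. It is written in plain JavaScript and does not use the STEP API. The customer object shape, the abbreviation table, and the `NAME:` and `ADDR:` prefixes are all hypothetical and chosen only for illustration.

```javascript
// Hypothetical abbreviation table. A real one would be built from profiling the data.
const ABBREVIATIONS = new Map([
  ["mfg", "manufacturing"],
  ["corp", "corporation"],
  ["intl", "international"],
]);

// Lowercase the value, replace punctuation with spaces, collapse whitespace,
// and expand known abbreviations. Simplification: only ASCII letters and
// digits are kept.
function normalize(value) {
  if (value === null || value === undefined) return "";
  return String(value)
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .trim()
    .split(/\s+/)
    .filter((word) => word.length > 0)
    .map((word) => (ABBREVIATIONS.has(word) ? ABBREVIATIONS.get(word) : word))
    .join(" ");
}

// Name match code, prefixed so it is never compared with address codes.
function nameMatchCode(customer) {
  const name = normalize(customer.name);
  return name === "" ? null : "NAME:" + name;
}

// Address match code with the most significant part (ZIP) first.
// No code is produced when the ZIP is missing.
function addressMatchCode(customer) {
  const parts = [customer.zip, customer.street, customer.streetNumber].map(normalize);
  if (parts[0] === "") return null;
  return "ADDR:" + parts.join("|");
}

// Return an array in which each element is a separate match code.
function matchCodes(customer) {
  return [nameMatchCode(customer), addressMatchCode(customer)].filter(
    (code) => code !== null
  );
}

// The same (made-up) customer before and after moving.
const beforeMove = { name: "Acme Mfg. Corp", zip: "30301", street: "Main St.", streetNumber: "12" };
const afterMove = { name: "ACME Manufacturing Corporation", zip: "98101", street: "Pine Street", streetNumber: "4" };

console.log(matchCodes(beforeMove));
// [ 'NAME:acme manufacturing corporation', 'ADDR:30301|main st|12' ]
console.log(matchCodes(afterMove));
// [ 'NAME:acme manufacturing corporation', 'ADDR:98101|pine street|4' ]
```

In this sketch the two name match codes are identical even though the addresses differ, so the two records could still be compared after the customer has moved.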
Note: Some of the considerations listed may not be relevant for all solutions.
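To judge whether a match code works well with a Window Size of 1, it can help to count how many objects share each match code value. The following is a minimal sketch in plain JavaScript, not the STEP API. The entry shape (`id`, `matchCode`) and the sample values are made up, and the sketch assumes that with a Window Size of 1 every pair of objects sharing a match code is compared.

```javascript
// Group object IDs by match code. Entries with an empty match code are
// skipped, mirroring the fact that such objects are not included.
function groupByMatchCode(entries) {
  const groups = new Map();
  for (const { id, matchCode } of entries) {
    if (!matchCode) continue;
    if (!groups.has(matchCode)) groups.set(matchCode, []);
    groups.get(matchCode).push(id);
  }
  return groups;
}

// For each match code, report the number of objects and the number of
// pairwise comparisons: n objects give n * (n - 1) / 2 pairs.
// The result is sorted with the largest groups first.
function comparisonCounts(groups) {
  const counts = [];
  for (const [matchCode, ids] of groups) {
    const n = ids.length;
    counts.push({ matchCode, objects: n, pairs: (n * (n - 1)) / 2 });
  }
  return counts.sort((x, y) => y.objects - x.objects);
}

// Made-up sample data using EAN-like values as match codes.
const sample = [
  { id: "P1", matchCode: "5701234567899" },
  { id: "P2", matchCode: "5701234567899" },
  { id: "P3", matchCode: "5709876543210" },
  { id: "P4", matchCode: "" },
  { id: "P5", matchCode: "5701234567899" },
];

console.log(comparisonCounts(groupByMatchCode(sample)));
// [
//   { matchCode: '5701234567899', objects: 3, pairs: 3 },
//   { matchCode: '5709876543210', objects: 1, pairs: 0 }
// ]
```

Because the pair count grows quadratically with group size, a few very large groups at the top of this list suggest that the match code is too coarse.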
For more information on match codes, see the Matching, Linking, and Merging Elements documentation here.
For more information on data profiling, see the Data Profiles section of the Data Profiling documentation here.
For the automated process as part of an embedded matching algorithm, see the Configuring a Match Code Generator topic in this documentation here.
For the manual process, see the Configuring a Manual Match Code topic in this documentation here.