Matching Algorithm Tuning
Algorithm tuning begins during the build phase of the implementation. Early in this process, it is common to find numerous invalid matches scoring above the auto-merge threshold while valid matches fall short of the clerical review threshold. The goal of the algorithm tuning sessions is therefore to improve the accuracy of the matching logic so that matches between golden records score within the appropriate thresholds.
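To make the role of the two thresholds concrete, the following Python sketch places a match score into one of three bands. It is an illustration only and does not use any product API: the function name, the 0-100 score scale, the inclusive comparison at each threshold, and the example threshold values are all assumptions.

```python
from enum import Enum


class MatchBand(Enum):
    AUTO_MERGE = "auto-merge"
    CLERICAL_REVIEW = "clerical review"
    NO_MATCH = "no match"


def classify_score(score: float, auto_merge: float, clerical_review: float) -> MatchBand:
    """Place a match score into one of three bands.

    Assumes scores are percentages (0-100) and that a score equal to a
    threshold counts as meeting it. Both are assumptions of this sketch.
    """
    if clerical_review > auto_merge:
        raise ValueError("clerical review threshold must not exceed the auto-merge threshold")
    if score >= auto_merge:
        return MatchBand.AUTO_MERGE
    if score >= clerical_review:
        return MatchBand.CLERICAL_REVIEW
    return MatchBand.NO_MATCH


# Hypothetical thresholds chosen for illustration, not product defaults.
AUTO_MERGE = 90.0
CLERICAL_REVIEW = 70.0

for example_score in (95.0, 82.5, 40.0):
    band = classify_score(example_score, AUTO_MERGE, CLERICAL_REVIEW)
    print(example_score, band.value)
```

In these terms, the problem described above is an invalid match landing in the auto-merge band, or a valid match landing in the no-match band.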
Two approaches to match tuning are available:
-
Match Tuning using Match Tuning Configuration (on sample data)
This pre‑import approach enables data owners to tune match criteria using sample datasets before data is loaded into the system. Tuning is performed by manually adjusting matching logic and evaluating the results through basic pair exports. This method is best suited for entity‑based Match & Merge implementations and supports early validation of matching behavior prior to production data loads.
-
Match Tuning using Matching Rulesets (on golden record data)
This approach is currently an early‑adopter capability that allows data owners to manage and maintain match decisions for rule combinations through a dedicated user interface. Super users support this process by configuring matchers and exporting match pair data as needed. Ruleset-based tuning enables incremental adjustment, validation, and activation directly in development or production environments. It relies on statistics‑driven tuning and decision maintenance, and activation requires republishing golden records to apply updated matching logic.
Note: This section focuses on Match Tuning using Match Tuning Configuration (on sample data). For information about Match Tuning using Matching Rulesets, refer to the topic Match Tuning using Matching Rulesets in the Matching, Linking, and Merging documentation.
Considerations
It is expected that anyone working with matching algorithm tuning is familiar with how to create a match tuning configuration. For more information on match tuning and creating a match tuning configuration, refer to the Match Tuning topic in the Matching, Linking, and Merging documentation.
The following considerations can improve the output of the match tuning process:
-
Match Tuning General Considerations - Broad, conceptual factors to consider before the match tuning process.
-
Match Tuning Pair Export Considerations - Specific to pair exports used for manual or offline confirmation and rejection of matched pairs, before and during the match tuning process.
Process
The matching algorithm tuning process is as follows:
-
Configure: Use a match tuning configuration to generate a data profile. Using this data profile, identify key data points to consider when configuring a baseline algorithm (matching algorithm and match codes).
For more information on match tuning and creating a match tuning configuration, refer to the Match Tuning topic in the Matching, Linking, and Merging documentation.
-
Generate Sample Pair: Once the baseline algorithm is configured, generate the random sample pair spreadsheet via a match tuning configuration. This baseline configuration is a ‘best-guess’ configuration based on the analysis of the data so far.
Before the sample pair review can begin, the raw data from the output file should be formatted for readability. The sample pair formatter Excel sheet can optionally be used on the output file.
-
Review Sample Pair: Review the sample pairs with the client. Each pair is marked ‘Yes’, ‘No’, or ‘Not Sure’ to indicate whether the algorithm should consider the two records the same entity and link them together.
The sample pair review can be time-consuming, but it is critical to getting the algorithm tuned to meet requirements. Typically, 1,000 or more sample pairs are reviewed with the stakeholders in each cycle. For some iterations, a pair export may be as large as 1,000 records per percentage point of interest.
Once the random sample pair spreadsheet is generated and formatted, it is vital to review the sample pairs to determine how the algorithm evaluates them. The primary purpose of the review is to assess the confidence of each merge and modify the thresholds if the scores appear inaccurate. During the review process, it is important to consider the following:
-
Each pair should be marked with a decision as to whether the records are considered the same entity (based on the data available).
-
It is best to approach this task from a ‘human’ standpoint as opposed to creating logic to help achieve a certain score.
-
This is not a data cleaning task.
The goal by the end of each sample pair review session is to improve the quality of the matches found. It is much easier to identify false positives than false negatives in the pair export. Therefore, it is recommended to start with a broadly defined algorithm and narrow the match criteria during tuning. For more information on false positives and false negatives, refer to the Match Tuning Pair Export Considerations topic.
-
Tune Algorithm: Tune the algorithm based on feedback from the sample pair review, and generate a new set of sample pairs based on the updated algorithm. Tuning can involve:
-
Adjusting the scoring method and weight of each scored attribute.
-
Adjusting the relative weight of scoring across all the scored attributes.
-
Adjusting the auto-merge and clerical review thresholds.
Repeat the Generate Sample Pair and Review Sample Pair steps (steps 2 and 3) for two more cycles (or more, as needed).
-
Finalize: Decide on the final auto-merge threshold and clerical review threshold.
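When comparing candidate thresholds during the final cycles, it can help to summarize how the reviewed pairs fall around them. The Python sketch below is one possible way to do this offline and is not part of the product: the `ReviewedPair` structure, the `evaluate_thresholds` function, the choice to leave ‘Not Sure’ pairs out of the counts, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewedPair:
    score: float    # match score from the pair export (assumed to be 0-100)
    decision: str   # reviewer verdict: "Yes", "No", or "Not Sure"


def evaluate_thresholds(pairs, auto_merge, clerical_review):
    """Count how reviewed pairs fall around two candidate thresholds.

    "Not Sure" pairs are excluded from every count (an assumption of
    this sketch, not a product rule).
    """
    if clerical_review > auto_merge:
        raise ValueError("clerical review threshold must not exceed the auto-merge threshold")

    decided = [p for p in pairs if p.decision in ("Yes", "No")]
    merged = [p for p in decided if p.score >= auto_merge]
    below = [p for p in decided if p.score < clerical_review]

    return {
        # Pairs that would be merged without human review.
        "auto_merged": len(merged),
        # False positives: merged automatically but rejected by reviewers.
        "false_merges": sum(1 for p in merged if p.decision == "No"),
        # False negatives: confirmed by reviewers but scored too low to surface.
        "missed_matches": sum(1 for p in below if p.decision == "Yes"),
        # Pairs that would be routed to clerical review.
        "clerical_queue": len(decided) - len(merged) - len(below),
    }


# Made-up review results, for illustration only.
sample = [
    ReviewedPair(97.0, "Yes"),
    ReviewedPair(93.0, "No"),
    ReviewedPair(88.0, "Yes"),
    ReviewedPair(76.0, "Not Sure"),
    ReviewedPair(72.0, "No"),
    ReviewedPair(65.0, "Yes"),
    ReviewedPair(40.0, "No"),
]

for auto, clerical in ((90.0, 70.0), (95.0, 60.0)):
    print(auto, clerical, evaluate_thresholds(sample, auto, clerical))
```

On this made-up sample, widening the band between the two thresholds removes the false merge and the missed match but doubles the clerical review queue, which is the trade-off the final threshold decision has to balance.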