Handling Duplicate Images

Handling duplicate images involves marking them for deletion and transferring their references to a master image that is retained in the system. The stages for handling duplicates are the same when using the auto-handling method and when using the clerical review workflow method. The difference is that for auto-handling, all action is taken without user interaction; for clerical review, a user can manually override the system actions.

A combination of the two methods provides the most effective means of identifying and removing duplicate images, as defined in the Deduplication Strategy section below.

For both methods, the complete process is defined below, and involves the following stages:

  1. Preparing for deduplication
  2. Identifying duplicates
  3. Selecting the master
  4. Processing images
  5. Troubleshooting errors

Deduplication Strategy

The most effective means of identifying and removing duplicate images is to use both the auto-handling and the clerical review methods. With this strategy, pixel-to-pixel matches are identified and handled automatically first, leaving less obvious potential duplicates to be handled manually by a user.

Important: As with any deduplication task that deletes redundant data, it is vital to first test the process thoroughly on a non-production system, such as a test environment. Metadata can and intentionally will be lost as a result of the deduplication handling process. There is no undo option, nor is there a recovery function. While restoring from a backup can be acceptable in a test environment, it is likely to cause an unacceptable amount of lost data in a production system.

For the initial deduplication run, set the configuration 'Auto-Handling Threshold' parameter to 'Yes' and the 'Clerical Review Threshold' parameter to 'No Clerical Review.' Auto-handling only considers pixel-to-pixel matches, so with this configuration the system selects a master image from each set of potential duplicates and compares every other image in the group to that master. If all images match the master pixel-to-pixel, then all images are auto-handled. If one or more images do not match the master pixel-to-pixel, then all are sent to clerical review. This configuration is intended to handle the bulk of the pixel-to-pixel matches up front, reducing the number of images for an end user to process in clerical review. However, because pixel-to-pixel matches are only identified relative to the selected master of the group, some subsets of identical images may not be found by this method (for example, two identical images in a larger group will not be auto-handled if neither is a match to the master).
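The routing decision for a single group can be sketched as follows. This is only an illustration of the rule described above, not the STEP implementation: the function name `route_group`, the tuple-of-integers stand-in for pixel data, and the return labels are all hypothetical, and the sketch assumes that any mismatch against the master sends the whole group to clerical review.

```python
from typing import Dict, Tuple

# Hypothetical stand-in for decoded pixel data; real images would be far larger.
Pixels = Tuple[int, ...]


def route_group(master_id: str, images: Dict[str, Pixels]) -> str:
    """Return 'auto-handle' if every image equals the master pixel-to-pixel,
    otherwise 'clerical-review' (assumed labels, for illustration only)."""
    master_pixels = images[master_id]
    for image_id, pixels in images.items():
        # Images are compared to the master only, never to each other.
        if image_id != master_id and pixels != master_pixels:
            return "clerical-review"
    return "auto-handle"


# Every image matches the master img1.
group_a = {"img1": (0, 1, 2), "img2": (0, 1, 2), "img3": (0, 1, 2)}

# img5 and img6 are identical to each other but not to the master img4.
group_b = {"img4": (0, 1, 2), "img5": (9, 9, 9), "img6": (9, 9, 9)}

print(route_group("img1", group_a))
print(route_group("img4", group_b))
```

The second group shows the limitation noted above: because comparison is always against the master, the identical pair img5 and img6 is not auto-handled.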

Modify the configuration for subsequent runs with the 'Auto-Handling Threshold' parameter set to 'Yes' and the 'Clerical Review Threshold' parameter set to 'Near Matches.' The process described above will still take place, but with this configuration, the group of images that are determined to be very close to the master will be sent to clerical review. In subsequent runs, the master from the auto-handled group will likely be grouped with the master from the clerical review group for further comparison.

When the configuration no longer produces groups of potential duplicates, consider changing the 'Clerical Review Threshold' parameter to include matches that are less close than near matches, which can further reduce potential duplicate images.

Important: Once an image is marked as a duplicate (its 'Deduplication Delete Flag' metadata attribute is set to 'true') it is ignored by the deduplication functionality, and the final processing should be performed manually. That may include using a workflow to verify and then delete it from STEP, or move it to a hierarchy node outside of the one selected in the configuration, or searching to find all images marked for deletion and then deleting them from STEP as a group. The final processing should also include removing the IDs of the deleted images from the 'Confirmed Duplicates' metadata attribute.

Example

To illustrate this strategy, consider that images 1-3 are identified as a potential duplicate group, and image 1 is selected as the master. Image 1 is a pixel-to-pixel match to images 2 and 3, so images 2 and 3 will be automatically confirmed as duplicates, marked for deletion, and have their references moved to the master. Next, images 4-6 are identified as a potential duplicate group, and image 4 is selected as the master. Images 5 and 6 are not pixel-to-pixel matches with image 4, so the three images will be sent as a group to clerical review and a master within the group will be selected, for example image 4. Images 5-6 will be marked as duplicate or non-duplicate based on the user selections, and confirmed duplicates will be handled the same as described for the auto-handling scenario. In a subsequent deduplication run, confirmed duplicates are not considered, but the two masters from a previously split group (images 1 and 4 in this example) may be presented for clerical review against one another.

Preparing for Deduplication

The foundation of the deduplication process uses perceptual hashing, which produces a numeric string representing each image, known as the pHash. The pHash values of images are compared to determine their Hamming distance, which is the number of positions in the string at which the numbers differ. A Hamming distance of zero does not necessarily mean that two images are identical, but it does indicate that they are likely quite similar. Before duplicates can be identified, a pHash value must be assigned to the images that will be evaluated. For more information on pHash, search the web.
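As a minimal sketch of the Hamming distance comparison described above, the following function counts the positions at which two equal-length hash strings differ. The function name and the sample hash values are made up for illustration; they do not reflect the actual pHash format or length used by the system.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("pHash values must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


# Illustrative values only; real pHash strings will look different.
print(hamming_distance("1011010011", "1011010011"))  # 0
print(hamming_distance("1011010011", "1001010111"))  # 2
```

A result of 0 means the two hashes are identical, which indicates likely similarity but, as noted above, does not prove that the images match pixel-to-pixel.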

For more information, see the Preparing Images for Deduplication section of the Running the Image Deduplication Process topic here.

Identifying Duplicates

The premise of the deduplication algorithm is 'when images look the same, they are the same.' This definition allows you to determine a level of variation that is acceptable, while potentially sending variations outside that range to the clerical review workflow.

Only elements that can be visually observed affect the outcome of the algorithm. Non-observable ways to compare images do not affect the outcome of the algorithm, such as STEP metadata on the asset object (description attributes), keywords, EXIF, or other embedded data (like photographer or location). Images that appear identical but use different color models (CMYK and RGB) will likely be sent to clerical review (if enabled).

When setting up an image deduplication configuration, the Hamming Distance is taken into account by both the 'Auto-Handling Threshold' and the 'Clerical Review Threshold' parameters. These parameters work together to determine how duplicates are identified and processed. The possible settings are defined in the Threshold Settings section of the Creating an Image Deduplication Configuration topic here. For more information on Hamming Distance, search the web.

For the clerical review process, the user manually selects duplicate images as defined in the Managing Duplicates section of the Using Image Deduplication Clerical Review topic here.

For the auto-handling process, duplicates are images that have a pHash and that match the master pixel-to-pixel.

Results

When the image deduplication process completes successfully the following updates are made to a duplicate image:

Important: Once an image is marked as a duplicate (its 'Deduplication Delete Flag' metadata attribute is set to 'true') it is ignored by the deduplication functionality, and the final processing should be performed manually. That may include using a workflow to verify and then delete it from STEP, or move it to a hierarchy node outside of the one selected in the configuration, or searching to find all images marked for deletion and then deleting them from STEP as a group. The final processing should also include removing the IDs of the deleted images from the 'Confirmed Duplicates' metadata attribute.

Selecting the Master

The system selects a 'master' image based on the evaluation criteria defined below. The master is the image that should be kept and be updated with classification and product references from the duplicates. If a single image cannot be determined as the master (because multiple images meet the criteria), one is selected at random from the images that remain after the last criterion is evaluated. For details, see the Managing Duplicates section of the Using Image Deduplication Clerical Review topic here.

When possible, the auto-handling process selects a single master image based on the following evaluation criteria. When no single image can be selected, the image set is sent to clerical review so the user can manually confirm or override the selected master.

Evaluation Criteria for Auto-Handling Master Selection

The evaluation criteria use the following checkpoints, in the order defined, in an attempt to find the image where the most information is retained.

For reference, 'lossy' = JPEG and 'non-lossy' = TIFF, PNG, EPS (assuming the TIFF images are not stored using JPEG compression).

For example, generally the most information is indicated by the largest image in terms of pixels. But if there is a non-lossy image format that is greater than 80% as large as a lossy image format, the non-lossy is prioritized over an absolute pixel size. If that fails to lead to a unique master image, the color depth is considered, with a preference for keeping the larger depth. Finally, if that fails to lead to a master image, the color space is considered, knowing that RGB is a larger space than CMYK, so the RGB image has priority.

  1. Find the subset of assets in the set that have the highest pixel count (height x width).
  2. From the set of candidate assets remaining after criterion 1 is evaluated, find the subset of assets with the highest color depth.
  3. Sort the set of assets remaining after criterion 2 is evaluated by color space, with RGB > CMYK.
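The three numbered checkpoints above can be sketched as a chain of filters. This is a hedged illustration rather than the product's code: the `Asset` class, its field names, and the `rank_master_candidates` function are hypothetical, and the preference for non-lossy formats that are more than 80% as large as a lossy image is deliberately not modeled, because the list does not state where that check sits in the sequence.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Asset:
    # Hypothetical fields; the real asset object exposes different metadata.
    asset_id: str
    width: int
    height: int
    color_depth: int  # bits per pixel
    color_space: str  # "RGB" or "CMYK"


# RGB ranks above CMYK, as stated in checkpoint 3.
COLOR_SPACE_RANK = {"RGB": 1, "CMYK": 0}


def _keep_best(assets: List[Asset], key: Callable[[Asset], int]) -> List[Asset]:
    """Keep only the assets that share the highest value for the given key."""
    best = max(key(a) for a in assets)
    return [a for a in assets if key(a) == best]


def rank_master_candidates(assets: List[Asset]) -> List[Asset]:
    """Apply the three checkpoints in order and return the surviving assets."""
    if not assets:
        raise ValueError("at least one asset is required")
    candidates = _keep_best(assets, lambda a: a.width * a.height)
    candidates = _keep_best(candidates, lambda a: a.color_depth)
    # Unknown color spaces rank lowest (an assumption of this sketch).
    return _keep_best(candidates, lambda a: COLOR_SPACE_RANK.get(a.color_space, -1))


group = [
    Asset("A", 100, 100, 24, "RGB"),
    Asset("B", 100, 100, 24, "CMYK"),
    Asset("C", 50, 50, 48, "RGB"),
]
print([a.asset_id for a in rank_master_candidates(group)])  # ['A']
```

The function returns a list rather than a single asset so that a tie stays visible: when more than one candidate remains, the tie still has to be resolved as the surrounding text describes.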

Results

When the image deduplication process completes successfully, the master image is updated as follows:

Processing Images

Once a master image and the duplicates are identified, and the image deduplication process completes successfully, the system updates the metadata attributes on the images and moves product-to-asset and product-to-classification references from the duplicates to the master. Moving references / links allows the duplicates to be deleted without losing reference / link data.
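The effect of moving references can be pictured with a small in-memory model. The sketch below is purely illustrative: the dictionary of asset IDs to product IDs and the `move_references` function are hypothetical and do not correspond to any STEP API.

```python
from typing import Dict, Iterable, Set


def move_references(
    references: Dict[str, Set[str]],
    master_id: str,
    duplicate_ids: Iterable[str],
) -> None:
    """Move every product reference from the duplicates to the master, in place."""
    master_refs = references.setdefault(master_id, set())
    for dup_id in duplicate_ids:
        if dup_id == master_id:
            continue  # the master is never treated as its own duplicate
        master_refs |= references.get(dup_id, set())
        references[dup_id] = set()  # the duplicate no longer holds references


refs = {"img1": {"P1"}, "img2": {"P2", "P3"}, "img3": {"P1"}}
move_references(refs, "img1", ["img2", "img3"])
print(sorted(refs["img1"]))  # ['P1', 'P2', 'P3']
print(refs["img2"], refs["img3"])  # set() set()
```

Because the model stores only product IDs, anything attached to a duplicate's reference beyond the target ID is dropped when the reference is merged into the master.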

Important: Metadata attributes on images hold IDs of confirmed duplicates and confirmed non-duplicates. Modifying these attribute values will cause errors with future image deduplication comparisons.

If images being processed by image deduplication are in more than one classification, or if an image is moved while included in an image deduplication workflow task, there can be impacts outside of the selected classification. When deduplication is run, any workflow tasks in which the system-selected master is a child of the classification selected in the image deduplication configuration are removed from the workflow.

Configuration

To ensure the best performance when writing values to the confirmed duplicate metadata attribute, the maximum number of values that will be written is limited to 3,000 by default. When the number of values exceeds the limit, the image is filtered out of future processing. For example, with the default limit, an image that already displays 3,000 confirmed duplicate IDs is no longer evaluated during image deduplication.

Increasing the maximum number of values decreases performance. However, the default can be changed via the sharedconfig.properties file on the STEP application server using the case-sensitive ImageDeduplication.ImageDeduplicationDuplicateAttributesValuesMax property, up to a maximum size of 30,000. When this property is absent from the file, the default is used. Any number entered above 30,000 is ignored and the 3,000 max is used.

For example, you could use the following text to increase the limit to 4,000:

ImageDeduplication.ImageDeduplicationDuplicateAttributesValuesMax = 4000
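The rules above for the effective limit (default of 3,000 when the property is absent, fallback to 3,000 when the value is above 30,000, case-sensitive property name) can be sketched as follows. This is not how STEP reads sharedconfig.properties; the `effective_max` function and its simplified `key = value` parsing are hypothetical, and the handling of non-numeric or non-positive values is an assumption, since the text does not cover those cases.

```python
DEFAULT_MAX = 3000
UPPER_BOUND = 30000
PROPERTY = "ImageDeduplication.ImageDeduplicationDuplicateAttributesValuesMax"


def effective_max(properties_text: str) -> int:
    """Return the limit that would apply for the given properties file content."""
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        # The comparison is case-sensitive, matching the property description.
        if sep and key.strip() == PROPERTY:
            try:
                requested = int(value.strip())
            except ValueError:
                return DEFAULT_MAX  # assumption: unparsable values use the default
            if requested < 1:
                return DEFAULT_MAX  # assumption: non-positive values use the default
            if requested > UPPER_BOUND:
                return DEFAULT_MAX  # values above 30,000 are ignored
            return requested
    return DEFAULT_MAX  # property absent: the default is used


print(effective_max(PROPERTY + " = 4000"))   # 4000
print(effective_max(PROPERTY + " = 30001"))  # 3000
print(effective_max(""))                     # 3000
```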

When an image is filtered out because the number of values has been exceeded, a message is included in the execution report and in the logs with the following text:

The image with ID [Asset ID] has been excluded from the deduplication process as it has exceeded the max number of values set by the ImageDeduplication.ImageDeduplicationDuplicateAttributesValuesMax property for the number of confirmed duplicates. Please clean up confirmed duplicate data by removing the IDs of previously handled confirmed duplicates, or increase the maximum values allowed for the confirmed duplicates attribute.

Results

When the process completes successfully, the user will notice that the metadata and references have been updated.

Important: This handling may result in loss of data from duplicate asset objects, for example, metadata on the asset, or metadata on references to or from a duplicate asset.

Images identified as duplicates are handled as follows:

Important: All changes made by the handling process are auto-approved, resulting in partial approval for products and images. Depending on the settings in relevant OIEPs, these partial approvals can generate a large number of events.

All images handled are recorded in the step.0 log, which can be accessed via the STEP System Administration link on the STEP Start Page. The log includes errors due to conflicts that cause the deduplication process to fail and allows a user to identify issues so that a manual resolution can be provided. For more information, see the Administration Portal documentation here.

Troubleshooting Errors

The first error encountered by the deduplication process causes the processing to stop for the group, while the overall process continues. Within the group that includes an error, all handling is rolled back and the group is sent to clerical review (or remains in clerical review if that is where the error occurred).
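The per-group behavior described above (stop at the first error, roll back that group, continue with the next group) can be sketched with a snapshot-and-restore loop. This is a conceptual illustration only: the `process_groups` function, the dictionary used as state, and the `handle` callback are hypothetical and say nothing about how STEP implements its rollback.

```python
import copy
from typing import Callable, Dict, List

State = Dict[str, str]  # hypothetical stand-in for the data changed by handling


def process_groups(
    groups: List[List[str]],
    handle: Callable[[State, str], None],
    state: State,
) -> List[List[str]]:
    """Handle each group; on the first error in a group, undo that group's
    changes and collect the group for clerical review."""
    needs_review: List[List[str]] = []
    for group in groups:
        snapshot = copy.deepcopy(state)  # state before this group is touched
        try:
            for image_id in group:
                handle(state, image_id)
        except Exception:
            # Roll back everything done for this group, then move on.
            state.clear()
            state.update(snapshot)
            needs_review.append(group)
    return needs_review
```

With a handler that fails on one image, only the group containing that image is rolled back and returned for review, and the changes made for the other groups remain in place.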

Errors are stored in the workflow variable 'ImageDeduplicationErrors' and are reported differently, based on their location: