Graduate image deduplication to Brain plugin #238

Open: PeterOttoMoeller wants to merge 5 commits into main

Conversation

@PeterOttoMoeller PeterOttoMoeller commented Jul 14, 2025

This takes functionality implemented here: https://github.com/jacobmarks/image-deduplication-plugin/tree/main and makes it part of the brain plugin.

Overview

Find exact (using file hashes) or approximate (using embeddings) image duplicates and display them. Remove all but one image in each group of duplicates ("deduplication").
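
For readers unfamiliar with the exact-duplicate case, here is a minimal standalone sketch of the idea of grouping files by content hash and keeping one file per group. It uses only the Python standard library and is not the plugin's or `fiftyone.brain`'s implementation; the function names `file_digest`, `find_exact_duplicates`, and `deduplicate` are hypothetical.

```python
import hashlib
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(paths):
    """Group paths whose file contents are byte-for-byte identical.

    Returns a list of groups; each group has at least two paths.
    """
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path)].append(str(path))
    return [group for group in groups.values() if len(group) > 1]


def deduplicate(paths):
    """Return the paths to keep: the first path seen for each distinct content."""
    seen = set()
    keep = []
    for path in paths:
        digest = file_digest(path)
        if digest not in seen:
            seen.add(digest)
            keep.append(str(path))
    return keep
```

Here "deduplication" keeps the first file of each group, which mirrors the "remove all but one" behavior described above without claiming anything about which sample the plugin itself keeps.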

Comments

  • Some of the long variable names are due to the fact that they are directly displayed in the app (like the different views). I thought descriptiveness was more important than conciseness here.
  • The plugin also had functionality to remove all (!) images of a duplicate group, as opposed to removing all but one (“deduplication”). I did not include this here since I did not think it was a relevant use case.
  • The plugin included separate operators to display duplicates. I did not include those here since displaying duplicates can be achieved using the different views generated by the find_..._duplicate_images operators.

General questions

  • Is this the correct repo/place?
  • This works locally, using local source installs of fo, fob, fop – how do I test if this works with the correct build artifacts after regular install?
  • Do I have to test this with fo-e in addition to tests with fo?
  • The brain plugin does not have any tests in this repository – should any be added here?

Functionality that could be added

  • make displaying approximate duplicates nicer so that there is no ambiguity about which images belong to which duplicate group (how ambiguous the display looks depends on the dataset and the model used for generating embeddings; one-off observation: clip embeddings seem to produce less ambiguous approximate duplicate clusters than resnet18 embeddings)
  • let user choose which image among approximate duplicates to keep during deduplication (might be impractical for any dataset larger than a toy example)
  • make user aware of computation happening under the hood (e.g. display message that embeddings are being calculated)
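
To illustrate why approximate duplicate groups can look ambiguous, here is a small sketch of threshold-based grouping on embeddings. "Within distance `thresh`" is not a transitive relation, so an image can be close to two images that are not close to each other, and its group then depends on processing order. The greedy grouping and the name `find_near_duplicates` below are assumptions for illustration only, not the algorithm used by `fiftyone.brain`.

```python
import math


def find_near_duplicates(embeddings, thresh):
    """Greedily group embeddings that lie within `thresh` of a representative.

    Returns a dict mapping each representative's index to the indices of its
    near duplicates. Every index is assigned to at most one group.
    """
    assigned = set()
    groups = {}
    for i, emb in enumerate(embeddings):
        if i in assigned:
            continue
        neighbors = [
            j
            for j in range(i + 1, len(embeddings))
            if j not in assigned and math.dist(emb, embeddings[j]) <= thresh
        ]
        if neighbors:
            groups[i] = neighbors
            assigned.add(i)
            assigned.update(neighbors)
    return groups


# Three 1-D "embeddings": the middle one is within 1.0 of both ends,
# but the two ends are 2.0 apart.
points = [[0.0], [1.0], [2.0]]
forward = find_near_duplicates(points, thresh=1.0)
backward = find_near_duplicates(points[::-1], thresh=1.0)
```

In `forward` the middle point is grouped with the first point, while in `backward` it is grouped with the last one, so the same data yields different groups depending on order. Raising `thresh` to 2.0 merges all three into one group.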

@PeterOttoMoeller PeterOttoMoeller marked this pull request as draft July 14, 2025 14:16
@PeterOttoMoeller PeterOttoMoeller requested a review from ritch July 14, 2025 19:44
@brimoor
Contributor

brimoor commented Jul 15, 2025

Do I have to test this with fo-e in addition to tests with fo?

Core plugins do need to work in FOE. Unless the operators are doing something unique/special, this generally works "for free" with no additional effort required.

I haven't run these operators locally myself, but upon static review I don't see anything that would require special consideration to make them work in FOE.

The brain plugin does not have any tests in this repository – should any be added here?

We don't currently have automated tests for plugins/operators. We only add tests to the underlying core fiftyone and fiftyone.brain methods that they call.

@PeterOttoMoeller
Author

Got it. Just to make doubly sure ;-) no additional tests are necessary here, other than "does this work locally for me with a source install"?

@PeterOttoMoeller PeterOttoMoeller marked this pull request as ready for review July 17, 2025 16:55
Contributor

@manushreegangwar manushreegangwar left a comment

@PeterOttoMoeller I was able to try all the new operators. They work well! I noticed a couple of issues:

  1. When there are no exact duplicates, the output shows:
     [Screenshot from 2025-07-21 13-35-53]
     Can you change this to show 0? The same is true for "approximate" duplicates.
  2. Let's say you compute "approximate" duplicates and delete them. This works fine. If you then compute approximate duplicates again with a higher distance threshold, representatives-of-approximate-duplicates-view still shows the images from the first run of the operator, whereas approximate-duplicates-view shows the correct result from the second run.
     Can you try to reproduce this and find a fix?

@PeterOttoMoeller
Author

Regarding point 1), I traced this to here: https://github.com/voxel51/fiftyone/blob/173aa7dd1b05b4c530d240d542f3b6367371f631/app/packages/core/src/plugins/SchemaIO/components/LabelValueView.tsx#L20 and have opened a separate PR here: voxel51/fiftyone#6173
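
A common cause of a count of 0 not being displayed is a truthiness check that treats 0 like a missing value. Whether that is what happens in the linked component is an assumption on my part; the thread does not say. The sketch below shows the general pitfall in Python. The function names and the placeholder string are hypothetical.

```python
def render_value_buggy(value, placeholder="No value provided"):
    """Falls back to the placeholder for ANY falsy value, including 0."""
    return str(value or placeholder)


def render_value_fixed(value, placeholder="No value provided"):
    """Falls back to the placeholder only when the value is actually missing."""
    return placeholder if value is None else str(value)


buggy_zero = render_value_buggy(0)   # the placeholder, not "0"
fixed_zero = render_value_fixed(0)   # "0"
```

The fix is to test explicitly for a missing value rather than relying on truthiness.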

@brimoor brimoor changed the title from "Graduate image deduplication plugin to fob" to "Graduate image deduplication to Brain plugin" Jul 24, 2025
@@ -180,6 +180,64 @@ fob.compute_hardness(dataset_or_view, label_field, ...)

where the operator's form allows you to configure all relevant parameters.

### find_exact_duplicates + deduplicate_exact_duplicates
Contributor

I added documentation to the README for consistency with other operators in @voxel51/brain plugin


find_exact_duplicates_inputs(ctx, inputs)

notice = types.Notice(
Contributor

I added a notice that explains what the operator will do when you execute it:

Screenshot 2025-07-24 at 2 15 02 AM



def find_exact_duplicates_inputs(ctx, inputs):
    get_target_view(ctx, inputs)
Contributor

I also added support for specifying a target view for exact duplicate computation, for consistency with how near duplicates works:

Screenshot 2025-07-24 at 2 15 17 AM

"representatives of exact duplicates"
)

warning = types.Warning(
Contributor

I added a warning explaining what will be deleted if you execute this method

else:
deduplicate_exact_duplicates_inputs(ctx, inputs)

warning = types.Warning(
Contributor

In the case where there is no existing exact duplicates scan, I added a warning explaining what triggering this method will do:

Screenshot 2025-07-24 at 2 15 48 AM


find_near_duplicates_inputs(ctx, inputs)

notice = types.Notice(
Contributor

@brimoor brimoor Jul 24, 2025

I added a notice explaining what the method will do:

Screenshot 2025-07-24 at 2 15 31 AM

def find_near_duplicates_inputs(ctx, inputs):
    target_view = get_target_view(ctx, inputs)

    similarity_runs = ctx.dataset.list_brain_runs(type="similarity")
Contributor

Note that this syntax is available to check for similarity runs:

ctx.dataset.list_brain_runs(type="similarity")

"representatives of near duplicates"
)

warning = types.Warning(
Contributor

@brimoor brimoor Jul 24, 2025

I added a warning explaining what samples will be deleted if you execute this method

else:
find_near_duplicates_inputs(ctx, inputs)

warning = types.Warning(
Contributor

@brimoor brimoor Jul 24, 2025

I added a warning for the case where there is no existing near duplicates scan explaining what will happen:

Screenshot 2025-07-24 at 2 16 10 AM

Contributor

@brimoor brimoor left a comment

LGTM 🚀

I tested all variations of this method after my commit on the following dataset:

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Add a copy of every sample (with new IDs) so that each image has an exact duplicate
dataset.add_collection(dataset, new_ids=True)

session = fo.launch_app(dataset)

dup_ids.append(rep_id)
dup_ids.extend(neighbor_ids)

dups_view = dataset.select(dup_ids, ordered=True)

This select stage is bound by MongoDB's 16 MB stage size limit, meaning this function will only work for datasets with fewer than roughly 1 million samples.

Contributor

More precisely, it works for datasets with fewer than ~1M duplicates, which is a reasonable assumption.
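
As a rough sanity check on the ~1 million figure, the sketch below estimates how many IDs fit in a 16 MiB BSON array. It assumes each ID is stored as a 12-byte BSON ObjectId array element (1 type byte, the element's index as a null-terminated decimal string key, then the 12-byte value) and ignores everything else in the stage document. How FiftyOne actually serializes the select stage is not stated in this thread, so treat the result as an order-of-magnitude estimate only. The function name is hypothetical.

```python
MAX_BSON_BYTES = 16 * 1024 * 1024  # MongoDB's maximum BSON document size
OBJECT_ID_BYTES = 12               # size of a BSON ObjectId value


def max_object_ids_in_bson_array(limit=MAX_BSON_BYTES):
    """Estimate how many ObjectIds fit in a single BSON array under `limit` bytes."""
    used = 5  # array document overhead: 4-byte length prefix + 1-byte terminator
    count = 0
    while True:
        # 1 type byte + index key as decimal string + null terminator + value
        element = 1 + len(str(count)) + 1 + OBJECT_ID_BYTES
        if used + element > limit:
            return count
        used += element
        count += 1


estimate = max_object_ids_in_bson_array()
```

Under these assumptions the estimate lands somewhat below one million IDs, which is consistent with the limit quoted above. If IDs were stored as 24-character strings instead, the ceiling would be lower.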

reps_view = dataset.select(rep_ids)

dataset.save_view("exact duplicates", dups_view, overwrite=True)
dataset.save_view(

For large datasets (large list of ids) this could significantly slow down dataset loading/reloading

@brimoor
Contributor

brimoor commented Jul 29, 2025

@PeterOttoMoeller this is ready to merge from my seat. Are you working on anything else here?

4 participants