Graduate image deduplication to Brain plugin #238

PeterOttoMoeller · 2025-07-14T14:14:28Z

This takes functionality implemented here: https://github.com/jacobmarks/image-deduplication-plugin/tree/main and makes it part of the brain plugin.

Overview

Find exact (using file hashes) or approximate (using embeddings) image duplicates and display them. Remove all but one image in each group of duplicates ("deduplication").
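
As an illustration of the exact-duplicate case described above, here is a minimal standard-library sketch of hash-based grouping followed by "keep one per group" deduplication. It is not the plugin's implementation: the function names `file_hash`, `find_exact_duplicate_groups`, and `split_keep_and_remove` are hypothetical, and the choice of SHA-256 is an assumption.

```python
import hashlib
from collections import defaultdict


def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicate_groups(paths):
    """Group paths by content hash and keep only groups with 2+ files."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_hash(path)].append(str(path))
    return [sorted(group) for group in groups.values() if len(group) > 1]


def split_keep_and_remove(groups):
    """Keep the first path of each group; mark the rest for removal."""
    keep = [group[0] for group in groups]
    remove = [path for group in groups for path in group[1:]]
    return keep, remove
```

Files with identical bytes land in the same group regardless of filename; re-encoded or resized copies do not, which is where the embedding-based approximate search comes in.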

Comments

Some of the long variable names are due to the fact that they are directly displayed in the app (like the different views). I thought descriptiveness was more important than conciseness here.
The plugin also had functionality to remove all (!) images of a duplicate group, as opposed to removing all but one (“deduplication”). I did not include this here since I did not think it was a relevant use case.
The plugin included separate operators to display duplicates. I did not include those here since displaying duplicates can be achieved using the different views generated by the find_..._duplicate_images operators.

General questions

Is this the correct repo/place?
This works locally, using local source installs of fo, fob, fop – how do I test if this works with the correct build artifacts after regular install?
Do I have to test this with fo-e in addition to tests with fo?
The brain plugin does not have any tests in this repository – should any be added here?

Functionality that could be added

make the display of approximate duplicates clearer so that it is unambiguous which images belong to which duplicate group (how ambiguous the display looks depends on the dataset and the model used for generating embeddings; one-off observation: CLIP embeddings seem to produce less ambiguous approximate duplicate clusters than ResNet-18 embeddings)
let user choose which image among approximate duplicates to keep during deduplication (might be impractical for any dataset larger than a toy example)
make user aware of computation happening under the hood (e.g. display message that embeddings are being calculated)
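
To make the ambiguity mentioned in the first item concrete, here is a small sketch of threshold-based grouping on toy embeddings. It is a greedy illustration only, not how the plugin or `fiftyone.brain` computes near duplicates; `group_near_duplicates` and `ambiguous_ids` are hypothetical names, and Euclidean distance is an assumption.

```python
import math


def group_near_duplicates(embeddings, threshold):
    """Greedily assign each item to the first representative within `threshold`.

    `embeddings` maps a sample id to a vector. Returns a dict mapping each
    representative id to the list of ids treated as its near duplicates.
    """
    groups = {}
    for sample_id, vector in embeddings.items():
        for rep_id in groups:
            if math.dist(vector, embeddings[rep_id]) <= threshold:
                groups[rep_id].append(sample_id)
                break
        else:
            # No representative is close enough: start a new group
            groups[sample_id] = []
    return groups


def ambiguous_ids(embeddings, groups, threshold):
    """Return non-representative ids within `threshold` of 2+ representatives."""
    result = []
    for sample_id, vector in embeddings.items():
        if sample_id in groups:
            continue
        close = [
            rep_id
            for rep_id in groups
            if math.dist(vector, embeddings[rep_id]) <= threshold
        ]
        if len(close) > 1:
            result.append(sample_id)
    return result


embeddings = {"a": (0.0, 0.0), "b": (0.1, 0.0), "c": (1.0, 0.0), "d": (0.5, 0.0)}
groups = group_near_duplicates(embeddings, threshold=0.6)
print(groups)  # {'a': ['b', 'd'], 'c': []}
print(ambiguous_ids(embeddings, groups, threshold=0.6))  # ['d']
```

Here `d` is equally close to both representatives, so which group it is shown in is an artifact of processing order. Raising the threshold to 1.0 collapses everything into a single group under `a`.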

package.json

plugins/brain/__init__.py

brimoor · 2025-07-15T05:11:55Z

> Do I have to test this with fo-e in addition to tests with fo?

Core plugins do need to work in FOE. Unless the operators are doing something unique/special, this generally works "for free" with no additional effort required.

I haven't run these operators locally myself, but upon static review I don't see anything that would require special consideration to make them work in FOE.

> The brain plugin does not have any tests in this repository – should any be added here?

We don't currently have automated tests for plugins/operators. We only add tests to the underlying core fiftyone and fiftyone.brain methods that they call.

PeterOttoMoeller · 2025-07-15T15:40:23Z

> Do I have to test this with fo-e in addition to tests with fo?
>
> Core plugins do need to work in FOE. Unless the operators are doing something unique/special, this generally works "for free" with no additional effort required.
>
> I haven't run these operators locally myself, but upon static review I don't see anything that would require special consideration to make them work in FOE.
>
> The brain plugin does not have any tests in this repository – should any be added here?
>
> We don't currently have automated tests for plugins/operators. We only add tests to the underlying core fiftyone and fiftyone.brain methods that they call.

Got it, so, just to make doubly sure ;-) -- no additional tests are necessary here, other than "does this work locally for me with source install?" ?

manushreegangwar

@PeterOttoMoeller I was able to try all the new operators. They work well! I noticed a couple of issues:

When there are no exact duplicates, the output shows:

Can you change this to show 0? The same is true for "approximate" duplicates.

Let's say you compute "approximate" duplicates and delete them. This works fine. If you then try to compute approximate duplicates with a higher distance threshold, `representatives-of-approximate-duplicates-view` shows the images from the first time you ran the operator. However, `approximate-duplicates-view` shows the correct result from the second operator run.
Can you try to reproduce this and find a fix?

PeterOttoMoeller · 2025-07-23T19:54:47Z

> @PeterOttoMoeller I was able to try all the new operators. They work well! I noticed a couple of issues:
>
> 1. When there are no `exact` duplicates, the output shows:
>
> Can you change this to show 0? The same is true for "approximate" duplicates.
>
> 2. Let's say you compute "approximate" duplicates and delete them. This works fine. If you then try to compute approximate duplicates with a higher distance threshold, `representatives-of-approximate-duplicates-view` shows the images from the first time you ran the operator. However, `approximate-duplicates-view` shows the correct result from the second operator run.
>    Can you try to reproduce this and find a fix?

Regarding point 1), I traced this to here: https://github.com/voxel51/fiftyone/blob/173aa7dd1b05b4c530d240d542f3b6367371f631/app/packages/core/src/plugins/SchemaIO/components/LabelValueView.tsx#L20 and have opened a separate PR here: voxel51/fiftyone#6173

brimoor · 2025-07-24T06:24:52Z

plugins/brain/README.md

@@ -180,6 +180,64 @@ fob.compute_hardness(dataset_or_view, label_field, ...)

 where the operator's form allows you to configure all relevant parameters.

+### find_exact_duplicates + deduplicate_exact_duplicates


I added documentation to the README for consistency with the other operators in the @voxel51/brain plugin

brimoor · 2025-07-24T06:27:04Z

plugins/brain/__init__.py

+
+        find_exact_duplicates_inputs(ctx, inputs)
+
+        notice = types.Notice(


I added a notice that explains what the operator will do when you execute it:

brimoor · 2025-07-24T06:27:53Z

plugins/brain/__init__.py

+
+
+def find_exact_duplicates_inputs(ctx, inputs):
+    get_target_view(ctx, inputs)


I also added support for specifying a target view for exact duplicate computation, for consistency with how near duplicates works:

brimoor · 2025-07-24T06:28:24Z

plugins/brain/__init__.py

+                "representatives of exact duplicates"
+            )
+
+            warning = types.Warning(


I added a warning explaining what will be deleted if you execute this method

brimoor · 2025-07-24T06:29:11Z

plugins/brain/__init__.py

+        else:
+            deduplicate_exact_duplicates_inputs(ctx, inputs)
+
+            warning = types.Warning(


In the case where there is no existing exact duplicates scan, I added a warning explaining what triggering this method will do:

brimoor · 2025-07-24T06:29:39Z

plugins/brain/__init__.py

+
+        find_near_duplicates_inputs(ctx, inputs)
+
+        notice = types.Notice(


I added a notice explaining what the method will do:

brimoor · 2025-07-24T06:30:09Z

plugins/brain/__init__.py

+def find_near_duplicates_inputs(ctx, inputs):
+    target_view = get_target_view(ctx, inputs)
+
+    similarity_runs = ctx.dataset.list_brain_runs(type="similarity")


Note that this syntax is available to check for similarity runs:

ctx.dataset.list_brain_runs(type="similarity")

brimoor · 2025-07-24T06:30:51Z

plugins/brain/__init__.py

+                "representatives of near duplicates"
+            )
+
+            warning = types.Warning(


I added a warning explaining what samples will be deleted if you execute this method

brimoor · 2025-07-24T06:31:50Z

plugins/brain/__init__.py

+        else:
+            find_near_duplicates_inputs(ctx, inputs)
+
+            warning = types.Warning(


I added a warning for the case where there is no existing near duplicates scan explaining what will happen:

brimoor

LGTM 🚀

I tested all variations of this method after my commit on the following dataset:

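import fiftyone as fo
import fiftyone.zoo as foz

# Load the quickstart dataset from the FiftyOne Dataset Zoo
dataset = foz.load_zoo_dataset("quickstart")

# Add a copy of every sample (with new IDs) so each image has an exact duplicate
dataset.add_collection(dataset, new_ids=True)

# Launch the App to run the operators
session = fo.launch_app(dataset)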

kaixi-wang · 2025-07-28T14:12:31Z

plugins/brain/__init__.py

+        dup_ids.append(rep_id)
+        dup_ids.extend(neighbor_ids)
+
+    dups_view = dataset.select(dup_ids, ordered=True)


This select stage is bound by MongoDB's 16 MB stage limit, meaning this function will only work for datasets with fewer than ~1 million samples

Datasets with <1M duplicates, which is a reasonable assumption.
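
A rough back-of-the-envelope check of the ~1 million figure, assuming the IDs reach MongoDB as a BSON array of ObjectIds (how FiftyOne actually serializes the stage is not shown in this thread, so treat that as an assumption). `bson_objectid_array_size` is a hypothetical helper; it follows the BSON layout, where an array is a document keyed by "0", "1", ... and each ObjectId element costs one type byte, the key as a NUL-terminated string, and 12 bytes of payload.

```python
MAX_BSON_BYTES = 16 * 1024 * 1024  # MongoDB's 16 MiB BSON document size limit


def bson_objectid_array_size(n):
    """Size in bytes of a BSON array holding `n` ObjectIds."""
    size = 4 + 1  # int32 length prefix + trailing NUL of the array document
    for i in range(n):
        # type byte + decimal index key + NUL terminator + 12-byte ObjectId
        size += 1 + len(str(i)) + 1 + 12
    return size


for n in (800_000, 1_000_000):
    size = bson_objectid_array_size(n)
    print(n, size, size <= MAX_BSON_BYTES)
```

Under these assumptions the ID array alone crosses the limit somewhere between 800,000 and 1,000,000 entries, which is in line with the estimate above. The surrounding stage and pipeline add further overhead.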

kaixi-wang · 2025-07-28T14:15:37Z

plugins/brain/__init__.py

+    reps_view = dataset.select(rep_ids)
+
+    dataset.save_view("exact duplicates", dups_view, overwrite=True)
+    dataset.save_view(


For large datasets (large list of ids) this could significantly slow down dataset loading/reloading

brimoor · 2025-07-29T15:22:49Z

@PeterOttoMoeller this is ready to merge from my seat. Are you working on anything else here?

Initial commit, adding all relevant operators for image deduplication…

f1ba715

… to brain plugin

PeterOttoMoeller requested review from manushreegangwar and jacobsela July 14, 2025 14:15

PeterOttoMoeller marked this pull request as draft July 14, 2025 14:16

PeterOttoMoeller requested a review from ritch July 14, 2025 19:44

brimoor reviewed Jul 15, 2025

PeterOttoMoeller added 3 commits July 17, 2025 12:48

First round of PR comments

c6c8381

2nd round of PR comments

cf423ba

some cleanup

d510ee6

PeterOttoMoeller marked this pull request as ready for review July 17, 2025 16:55

manushreegangwar reviewed Jul 21, 2025

PeterOttoMoeller mentioned this pull request Jul 23, 2025

Display the value 0 correctly in LabelValueView voxel51/fiftyone#6173

Merged

5 tasks

support target views and add explanatory messages

3f34373

brimoor changed the title ~~Graduate image deduplication plugin to fob~~ Graduate image deduplication to Brain plugin Jul 24, 2025

brimoor reviewed Jul 24, 2025

brimoor approved these changes Jul 24, 2025

PeterOttoMoeller mentioned this pull request Jul 28, 2025

When saving dataset views, change overwrite to update, preserving view ID voxel51/fiftyone#6193

Closed

7 tasks

kaixi-wang reviewed Jul 28, 2025
