Graduate image deduplication to Brain plugin #238
Core plugins do need to work in FOE. Unless the operators are doing something unique/special, this generally works "for free" with no additional effort required. I haven't run these operators locally myself, but upon static review I don't see anything that would require special consideration to make them work in FOE.
We don't currently have automated tests for plugins/operators. We only add tests to the underlying core
Got it, so, just to make doubly sure ;-) -- no additional tests are necessary here, other than "does this work locally for me with source install?"
@PeterOttoMoeller I was able to try all the new operators. They work well! I noticed a couple of issues:
- When there are no exact duplicates, the output shows:

- Let's say you compute "approximate" duplicates and delete them. This works fine. If you then compute approximate duplicates again with a higher distance threshold, `representatives-of-approximate-duplicates-view` still shows the images from the first time you ran the operator. However, `approximate-duplicates-view` shows the correct result from the second operator run.

Can you try to reproduce this and find a fix?
Regarding point 1), I traced this to here: https://github.com/voxel51/fiftyone/blob/173aa7dd1b05b4c530d240d542f3b6367371f631/app/packages/core/src/plugins/SchemaIO/components/LabelValueView.tsx#L20 and have opened a separate PR here: voxel51/fiftyone#6173
@@ -180,6 +180,64 @@ fob.compute_hardness(dataset_or_view, label_field, ...)

where the operator's form allows you to configure all relevant parameters.

### find_exact_duplicates + deduplicate_exact_duplicates
I added documentation to the README for consistency with the other operators in the @voxel51/brain plugin.
find_exact_duplicates_inputs(ctx, inputs)

notice = types.Notice(
def find_exact_duplicates_inputs(ctx, inputs):
    get_target_view(ctx, inputs)
"representatives of exact duplicates"
)

warning = types.Warning(
I added a warning explaining what will be deleted if you execute this method
else:
    deduplicate_exact_duplicates_inputs(ctx, inputs)

warning = types.Warning(
find_near_duplicates_inputs(ctx, inputs)

notice = types.Notice(
def find_near_duplicates_inputs(ctx, inputs):
    target_view = get_target_view(ctx, inputs)

    similarity_runs = ctx.dataset.list_brain_runs(type="similarity")
Note that this syntax is available to check for similarity runs:
ctx.dataset.list_brain_runs(type="similarity")
"representatives of near duplicates"
)

warning = types.Warning(
I added a warning explaining what samples will be deleted if you execute this method
else:
    find_near_duplicates_inputs(ctx, inputs)

warning = types.Warning(
LGTM 🚀
I tested all variations of this method after my commit on the following dataset:
import fiftyone as fo
import fiftyone.zoo as foz
dataset = foz.load_zoo_dataset("quickstart")
dataset.add_collection(dataset, new_ids=True)
session = fo.launch_app(dataset)
dup_ids.append(rep_id)
dup_ids.extend(neighbor_ids)

dups_view = dataset.select(dup_ids, ordered=True)
This select stage is bound by MongoDB's 16 MB stage limit, meaning this function will only work for datasets with fewer than roughly 1 million samples.
More precisely, datasets with fewer than ~1M duplicates, which is a reasonable assumption.
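For reference, the "roughly 1 million" figure above can be sanity-checked with a little arithmetic. The sketch below is not part of the PR; it assumes the selected IDs are sent to MongoDB as a BSON array of ObjectIds and ignores everything else in the stage, so it only gives an upper bound. The helper names are hypothetical.

```python
# Back-of-envelope bound on how many ObjectIds fit in one 16 MB BSON array.
# Assumption: only the id array is counted; the rest of the stage is ignored.

BSON_MAX_BYTES = 16 * 1024 * 1024  # MongoDB's maximum BSON document size
OBJECT_ID_BYTES = 12               # an ObjectId value is 12 bytes


def element_size(index):
    """Bytes used by one array element: type byte, index key as a
    null-terminated string, then the ObjectId value itself."""
    return 1 + len(str(index)) + 1 + OBJECT_ID_BYTES


def bson_objectid_array_size(n):
    """Total bytes for a BSON array holding n ObjectIds."""
    # 4-byte length prefix + elements + 1 trailing null byte
    return 4 + sum(element_size(i) for i in range(n)) + 1


def max_objectids(limit=BSON_MAX_BYTES):
    """Largest number of ObjectIds whose array encoding stays within limit."""
    size = 5  # length prefix + trailing null byte
    n = 0
    while size + element_size(n) <= limit:
        size += element_size(n)
        n += 1
    return n


if __name__ == "__main__":
    print(max_objectids())
```

Under these assumptions the bound comes out somewhat below one million IDs, which is consistent with the estimate in the comment above.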
reps_view = dataset.select(rep_ids)

dataset.save_view("exact duplicates", dups_view, overwrite=True)
dataset.save_view(
For large datasets (i.e., a large list of IDs), this could significantly slow down dataset loading/reloading.
@PeterOttoMoeller this is ready to merge from my seat. Are you working on anything else here?
This takes functionality implemented here: https://github.com/jacobmarks/image-deduplication-plugin/tree/main and makes it part of the brain plugin.
Overview
Find exact (using file hashes) or approximate (using embeddings) image duplicates and display them. Remove all but one image in each group of duplicates ("deduplication").
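To make the overview concrete, here is a minimal, library-free sketch of the two ideas: grouping files by content hash for exact duplicates, and grouping embeddings within a distance threshold for approximate ones. It is not the plugin's implementation. The function names, the choice of SHA-256, the Euclidean metric, and the greedy grouping strategy are all assumptions made for illustration.

```python
import hashlib
import math
from collections import defaultdict


def file_hash(path, chunk_size=1 << 20):
    """Hash a file's bytes (SHA-256 is an assumption, any stable hash works)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def exact_duplicate_groups(paths):
    """Group paths with identical content; the first path in each group
    is treated as the representative."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_hash(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]


def near_duplicate_groups(embeddings, threshold):
    """Greedy grouping of ids whose embeddings lie within `threshold`
    (Euclidean distance) of a representative.

    embeddings: dict mapping id -> sequence of floats.
    Returns lists of the form [representative, neighbor, ...].
    """
    ids = list(embeddings)
    assigned = set()
    groups = []
    for i, rep in enumerate(ids):
        if rep in assigned:
            continue
        neighbors = [
            other
            for other in ids[i + 1:]
            if other not in assigned
            and math.dist(embeddings[rep], embeddings[other]) <= threshold
        ]
        if neighbors:
            assigned.update(neighbors)
            groups.append([rep] + neighbors)
    return groups


def items_to_delete(groups):
    """Deduplication: keep the representative, drop the rest of each group."""
    return [item for group in groups for item in group[1:]]
```

In this sketch, raising `threshold` can only merge more items into a group, which is why re-running with a higher threshold is expected to change both the duplicates and the set of representatives.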
Comments
find_..._duplicate_images operators.
General questions
Functionality that could be added