
Commit f8e07cc

Merge branch 'release/v0.6.0'
2 parents e1d28ef + e26eee2


50 files changed: +2285, -1031 lines

.bumpversion.cfg

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 0.5.0
+current_version = 0.6.0
 commit = False
 tag = False
 allow_dirty = False

.gitattributes

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+notebooks/*.ipynb -linguist-detectable

CHANGELOG.md

Lines changed: 32 additions & 0 deletions
@@ -1,5 +1,37 @@
 # Changelog
 
+## 0.6.0 - 🆕 New algorithms, cleanup and bug fixes 🏗
+
+- Fixes in `ValuationResult`: bugs around data names, semantics of `empty()`,
+  new method `zeros()` and normalised random values
+  [PR #327](https://github.com/appliedAI-Initiative/pyDVL/pull/327)
+- **New method**: Implements generalised semi-values for data valuation,
+  including Data Banzhaf and Beta Shapley, with configurable sampling strategies
+  [PR #319](https://github.com/appliedAI-Initiative/pyDVL/pull/319)
+- Adds a `kwargs` parameter to the `from_array` and `from_sklearn` class
+  methods of `Dataset` and `GroupedDataset`
+  [PR #316](https://github.com/appliedAI-Initiative/pyDVL/pull/316)
+- PEP-561 conformance: added `py.typed`
+  [PR #307](https://github.com/appliedAI-Initiative/pyDVL/pull/307)
+- Removed the default non-negativity constraint on the least core subsidy and
+  added a `non_negative_subsidy` boolean flag instead. Renamed `options` to
+  `solver_options`, which is now passed as a dict. Changed the default
+  least-core solver to SCS with 10000 max_iters.
+  [PR #304](https://github.com/appliedAI-Initiative/pyDVL/pull/304)
+- Cleanup: removed the unnecessary decorator `@unpackable`
+  [PR #233](https://github.com/appliedAI-Initiative/pyDVL/pull/233)
+- Stopping criteria: fixed a problem with `StandardError` and enabled proper
+  composition of index convergence statuses. Fixed a bug with `n_jobs` in
+  `truncated_montecarlo_shapley`.
+  [PR #300](https://github.com/appliedAI-Initiative/pyDVL/pull/300) and
+  [PR #305](https://github.com/appliedAI-Initiative/pyDVL/pull/305)
+- Moved code around to allow for simpler user imports, with some cleanup and
+  documentation fixes.
+  [PR #284](https://github.com/appliedAI-Initiative/pyDVL/pull/284)
+- **Bug fix**: Warn instead of raising an error when `n_iterations` is less
+  than the size of the dataset in Monte Carlo Least Core
+  [PR #281](https://github.com/appliedAI-Initiative/pyDVL/pull/281)
+
 ## 0.5.0 - 💥 Fixes, nicer interfaces and... more breaking changes 😒
 
 - Fixed parallel and antithetic Owen sampling for Shapley values. Simplified
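
To illustrate the least-core changes from PR #304, here is a minimal sketch of
the renamed options. The keyword names follow the changelog entry above; the
exact signature is an assumption and should be checked against the 0.6.0 API
reference.

```python
# Sketch only: keyword names taken from the PR #304 changelog entry, not
# verified against the API reference.
from pydvl.utils import Dataset, Utility
from pydvl.value import compute_least_core_values

model = ...
data = Dataset(...)
utility = Utility(model, data)
values = compute_least_core_values(
    utility,
    mode="montecarlo",
    n_iterations=1000,
    non_negative_subsidy=False,           # constraint is now opt-in via this flag
    solver_options={"max_iters": 10000},  # formerly `options`; now a plain dict
)
```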

CONTRIBUTING.md

Lines changed: 2 additions & 2 deletions
@@ -21,10 +21,10 @@ Consider installing any of [black's IDE
 integrations](https://black.readthedocs.io/en/stable/integrations/editors.html)
 to make your life easier.
 
-Run the following command to set up the pre-commit git hook:
+Run the following to set up the pre-commit git hook to run before pushes:
 
 ```shell script
-pre-commit install
+pre-commit install --hook-type pre-push
 ```
 
 ## Setting up your environment

README.md

Lines changed: 16 additions & 12 deletions
@@ -54,6 +54,13 @@ methods from the following papers:
 [Towards Efficient Data Valuation Based on the Shapley Value](http://proceedings.mlr.press/v89/jia19a.html).
 In 22nd International Conference on Artificial Intelligence and Statistics,
 1167–76. PMLR, 2019.
+- Wang, Jiachen T., and Ruoxi Jia.
+  [Data Banzhaf: A Robust Data Valuation Framework for Machine Learning](https://doi.org/10.48550/arXiv.2205.15466).
+  arXiv, October 22, 2022.
+- Kwon, Yongchan, and James Zou.
+  [Beta Shapley: A Unified and Noise-Reduced Data Valuation Framework for Machine Learning](http://arxiv.org/abs/2110.14049).
+  In Proceedings of the 25th International Conference on Artificial Intelligence
+  and Statistics (AISTATS) 2022, Vol. 151. Valencia, Spain: PMLR, 2022.
 
 Influence Functions compute the effect that single points have on an estimator /
 model. We implement methods from the following papers:

@@ -97,21 +104,18 @@ This is how it looks for *Truncated Montecarlo Shapley*, an efficient method for
 Data Shapley values:
 
 ```python
-import numpy as np
-from pydvl.utils import Dataset, Utility
+from sklearn.datasets import load_breast_cancer
+from sklearn.linear_model import LogisticRegression
 from pydvl.value import *
-from sklearn.linear_model import LinearRegression
-from sklearn.model_selection import train_test_split
 
-X, y = np.arange(100).reshape((50, 2)), np.arange(50)
-X_train, X_test, y_train, y_test = train_test_split(
-    X, y, test_size=0.5, random_state=16
-)
-dataset = Dataset(X_train, y_train, X_test, y_test)
-model = LinearRegression()
-utility = Utility(model, dataset)
+data = Dataset.from_sklearn(load_breast_cancer(), train_size=0.7)
+model = LogisticRegression()
+u = Utility(model, data, Scorer("accuracy", default=0.0))
 values = compute_shapley_values(
-    u=utility, mode="truncated_montecarlo", done=MaxUpdates(100)
+    u,
+    mode=ShapleyMode.TruncatedMontecarlo,
+    done=MaxUpdates(100) | AbsoluteStandardError(threshold=0.01),
+    truncation=RelativeTruncation(u, rtol=0.01),
 )
 ```

docs/30-data-valuation.rst

Lines changed: 119 additions & 23 deletions
@@ -241,6 +241,7 @@ v_u(x_i) = \frac{1}{n} \sum_{S \subseteq D \setminus \{x_i\}}
 .. code-block:: python
 
    from pydvl.value import compute_shapley_values
+
    utility = Utility(...)
    values = compute_shapley_values(utility, mode="combinatorial_exact")
    df = values.to_dataframe(column='value')

@@ -264,7 +265,8 @@ same pattern:
 .. code-block:: python
 
    from pydvl.utils import Dataset, Utility
-   from pydvl.value.shapley import compute_shapley_values
+   from pydvl.value import compute_shapley_values
+
    model = ...
    data = Dataset(...)
    utility = Utility(model, data)

@@ -303,7 +305,8 @@ values in pyDVL. First construct the dataset and utility, then call
 .. code-block:: python
 
    from pydvl.utils import Dataset, Utility
-   from pydvl.value.shapley import compute_shapley_values
+   from pydvl.value import compute_shapley_values
+
    model = ...
    dataset = Dataset(...)
    utility = Utility(model, dataset)

@@ -329,11 +332,11 @@ It uses permutations over indices instead of subsets:
 
 $$
 v_u(x_i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)}
-[u(\sigma_{i-1} \cup {i}) − u(\sigma_{i})]
+[u(\sigma_{:i} \cup \{i\}) − u(\sigma_{:i})]
 ,$$
 
-where $\sigma_i$ denotes the set of indices in permutation $\sigma$ up until the
-position of index $i$. To approximate this sum (with $\mathcal{O}(n!)$ terms!)
+where $\sigma_{:i}$ denotes the set of indices in permutation $\sigma$ before
+the position where $i$ appears. To approximate this sum (with $\mathcal{O}(n!)$ terms!)
 one uses Monte Carlo sampling of permutations, something which has surprisingly
 low sample complexity. By adding early stopping, the result is the so-called
 **Truncated Monte Carlo Shapley** (:footcite:t:`ghorbani_data_2019`), which is
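
As an illustration of the permutation formula in the hunk above, here is a
minimal, dependency-free sketch of Monte Carlo permutation sampling. It is not
pyDVL's implementation; `u` is assumed to be a black-box utility on index
tuples.

```python
# Toy Monte Carlo estimator of v_u(x_i) via random permutations.
import random

def permutation_shapley(u, n_indices: int, n_permutations: int) -> list[float]:
    values = [0.0] * n_indices
    for _ in range(n_permutations):
        sigma = random.sample(range(n_indices), n_indices)  # random permutation
        previous = u(())
        for position, i in enumerate(sigma):
            # Marginal contribution of i w.r.t. the indices before it in sigma
            current = u(tuple(sigma[: position + 1]))
            values[i] += current - previous
            previous = current
    return [v / n_permutations for v in values]
```

Truncation (the "truncated" part) would additionally stop each inner loop once
the running utility stops changing appreciably.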
@@ -342,7 +345,7 @@ efficient enough to be useful in some applications.
 .. code-block:: python
 
    from pydvl.utils import Dataset, Utility
-   from pydvl.value.shapley import compute_shapley_values
+   from pydvl.value import compute_shapley_values
 
    model = ...
    data = Dataset(...)

@@ -364,7 +367,7 @@ and can be used in pyDVL with:
 .. code-block:: python
 
    from pydvl.utils import Dataset, Utility
-   from pydvl.value.shapley import compute_shapley_values
+   from pydvl.value import compute_shapley_values
   from sklearn.neighbors import KNeighborsClassifier
 
    model = KNeighborsClassifier(n_neighbors=5)

@@ -410,7 +413,7 @@ its variance.
 .. code-block:: python
 
    from pydvl.utils import Dataset, Utility
-   from pydvl.value.shapley import compute_shapley_values
+   from pydvl.value import compute_shapley_values
 
    model = ...
    data = Dataset(...)

@@ -449,7 +452,7 @@ It satisfies the following 2 properties:
    The sum of payoffs to the agents in any coalition $S$ is at
    least as large as the amount that these agents could earn by
    forming a coalition on their own.
-   $$\sum_{x_i\in S} v_u(x_i) \geq u(S), \forall S \subseteq D\,$$
+   $$\sum_{x_i\in S} v_u(x_i) \geq u(S), \forall S \subset D\,$$
 
 The second property states that the sum of payoffs to the agents
 in any subcoalition $S$ is at least as large as the amount that

@@ -463,7 +466,7 @@ By relaxing the coalitional rationality property by a subsidy $e \gt 0$,
 we are then able to find approximate payoffs:
 
 $$
-\sum_{x_i\in S} v_u(x_i) + e \geq u(S), \forall S \subseteq D\
+\sum_{x_i\in S} v_u(x_i) + e \geq u(S), \forall S \subset D, S \neq \emptyset \
 ,$$
 
 The least core value $v$ of the $i$-th sample in dataset $D$ w.r.t.
@@ -473,7 +476,7 @@
 \begin{array}{lll}
 \text{minimize} & e & \\
 \text{subject to} & \sum_{x_i\in D} v_u(x_i) = u(D) & \\
-& \sum_{x_i\in S} v_u(x_i) + e \geq u(S) &, \forall S \subseteq D \\
+& \sum_{x_i\in S} v_u(x_i) + e \geq u(S) &, \forall S \subset D, S \neq \emptyset \\
 \end{array}
 $$
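
For concreteness, the linear program above can be solved directly for a tiny
hand-written game. This sketch uses `scipy.optimize.linprog` and is purely
illustrative; pyDVL builds and solves the problem internally.

```python
# Least-core LP for a toy 3-player game: minimize e over x = [v_1, v_2, v_3, e].
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

players = (0, 1, 2)
u = {(): 0.0, (0,): 1.0, (1,): 1.0, (2,): 1.0,
     (0, 1): 3.0, (0, 2): 3.0, (1, 2): 3.0, (0, 1, 2): 6.0}
n = len(players)

c = np.zeros(n + 1)
c[-1] = 1.0  # objective: minimize the subsidy e

A_ub, b_ub = [], []
for k in range(1, n):  # all proper, non-empty coalitions S
    for S in combinations(players, k):
        row = np.zeros(n + 1)
        row[list(S)] = -1.0
        row[-1] = -1.0
        A_ub.append(row)   # encodes -sum_{i in S} v_i - e <= -u(S)
        b_ub.append(-u[S])

A_eq = [np.append(np.ones(n), 0.0)]  # efficiency: sum_i v_i = u(D)
b_eq = [u[players]]

result = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                 bounds=[(None, None)] * n + [(0, None)])
print(result.x)  # [v_1, v_2, v_3, e*]
```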

@@ -487,11 +490,12 @@ As such it returns as exact a value as the utility function allows
 .. code-block:: python
 
    from pydvl.utils import Dataset, Utility
-   from pydvl.value.least_core import exact_least_core
+   from pydvl.value import compute_least_core_values
+
    model = ...
    dataset = Dataset(...)
    utility = Utility(model, dataset)
-   values = exact_least_core(utility)
+   values = compute_least_core_values(utility, mode="exact")
 
 Monte Carlo Least Core
 ----------------------
@@ -515,16 +519,20 @@ where $e^{*}$ is the optimal least core subsidy.
 .. code-block:: python
 
    from pydvl.utils import Dataset, Utility
-   from pydvl.value.least_core import montecarlo_least_core
+   from pydvl.value import compute_least_core_values
+
    model = ...
    dataset = Dataset(...)
    n_iterations = ...
    utility = Utility(model, dataset)
-   values = montecarlo_least_core(utility, n_iterations=n_iterations)
+   values = compute_least_core_values(
+       utility, mode="montecarlo", n_iterations=n_iterations
+   )
 
 .. note::
 
-   ``n_iterations`` needs to be at least equal to the number of data points.
+   Although any number is supported, it is best to choose ``n_iterations`` to be
+   at least equal to the number of data points.
 
 Because computing the Least Core values requires the solution of a linear and a
 quadratic problem *after* computing all the utility values, we offer the
@@ -538,6 +546,7 @@ list of problems to solve, then solve them in parallel with
 
    from pydvl.utils import Dataset, Utility
    from pydvl.value.least_core import mclc_prepare_problem, lc_solve_problems
+
    model = ...
    dataset = Dataset(...)
    n_iterations = ...
@@ -548,15 +557,102 @@ list of problems to solve, then solve them in parallel with
    values = lc_solve_problems(problems)
 
 
-Other methods
-=============
+Semi-values
+===========
+
+Shapley values are a particular case of a more general concept called
+semi-value, which generalizes them to different weighting schemes. A
+**semi-value** is any valuation function of the form:
+
+$$
+v_\text{semi}(i) = \sum_{k=1}^n w(k)
+\sum_{S \subset D_{-i}^{(k)}} [U(S_{+i})-U(S)],
+$$
+
+where the coefficients $w(k)$ satisfy the property:
+
+$$\sum_{k=1}^n w(k) = 1.$$
+
+Two instances of this are **Banzhaf indices** (:footcite:t:`wang_data_2022`),
+and **Beta Shapley** (:footcite:t:`kwon_beta_2022`), with better numerical and
+rank stability in certain situations.
+
+.. note::
+
+   Shapley values are a particular case of semi-values and can therefore also
+   be computed with the methods described here. However, as of version 0.6.0,
+   we recommend using :func:`~pydvl.value.shapley.compute_shapley_values`
+   instead, in particular because it implements truncated Monte Carlo sampling
+   for faster computation.
+
+
+Beta Shapley
+^^^^^^^^^^^^
+
+For some machine learning applications, where the utility is typically the
+performance when trained on a set $S \subset D$, diminishing returns are often
+observed when computing the marginal utility of adding a new data point.
+
+Beta Shapley is a weighting scheme that uses the Beta function to place more
+weight on subsets deemed to be more informative. The weights are defined as:
+
+$$
+w(k) := \frac{B(k+\beta, n-k+1+\alpha)}{B(\alpha, \beta)},
+$$
+
+where $B$ is the `Beta function <https://en.wikipedia.org/wiki/Beta_function>`_,
+and $\alpha$ and $\beta$ are parameters that control the weighting of the
+subsets. Setting both to 1 recovers Shapley values, and setting $\alpha = 1$
+and $\beta = 16$ is reported in :footcite:t:`kwon_beta_2022` to be a good
+choice for some applications. See however :ref:`banzhaf indices` for an
+alternative choice of weights which is reported to work better.
+
+.. code-block:: python
+
+   from pydvl.utils import Dataset, Utility
+   from pydvl.value import compute_semivalues
+
+   model = ...
+   data = Dataset(...)
+   utility = Utility(model, data)
+   values = compute_semivalues(
+       u=utility, mode="beta_shapley", done=MaxUpdates(500), alpha=1, beta=16
+   )
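
The weights $w(k)$ added in this hunk are cheap to evaluate directly. A small
illustrative sketch with `scipy.special.beta` (not part of the commit):

```python
# Evaluate w(k) = B(k+beta, n-k+1+alpha) / B(alpha, beta) for k = 1..n
# to see how alpha and beta shift weight across subset sizes.
import numpy as np
from scipy.special import beta as beta_function

def beta_shapley_weights(n: int, alpha: float, beta: float) -> np.ndarray:
    k = np.arange(1, n + 1)
    return beta_function(k + beta, n - k + 1 + alpha) / beta_function(alpha, beta)

print(beta_shapley_weights(10, alpha=1.0, beta=16.0))  # one weight per size k
```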
+
+.. _banzhaf indices:
 
-There are other game-theoretic concepts in pyDVL's roadmap, based on the notion
-of semivalue, which is a generalization to different weighting schemes:
-in particular **Banzhaf indices** and **Beta Shapley**, with better numerical
-and rank stability in certain situations.
+Banzhaf indices
+^^^^^^^^^^^^^^^
 
-Contributions are welcome!
+As noted below in :ref:`problems of data values`, the Shapley value can be very
+sensitive to variance in the utility function. For machine learning applications,
+where the utility is typically the performance when trained on a set $S \subset
+D$, this variance is often largest for smaller subsets $S$. It is therefore
+reasonable to try reducing the relative contribution of these subsets with
+adequate weights.
+
+One such choice of weights is the Banzhaf index, which is defined as the
+constant:
+
+$$w(k) := 2^{-(n-1)},$$
+
+for all set sizes $k$. The intuition for picking a constant weight is that for
+any choice of weight function $w$, one can always construct a utility with
+higher variance where $w$ is greater. Therefore, in a worst-case sense, the best
+one can do is to pick a constant weight.
+
+The authors of :footcite:t:`wang_data_2022` show that Banzhaf indices are more
+robust to variance in the utility function than Shapley and Beta Shapley values.
+
+.. code-block:: python
+
+   from pydvl.utils import Dataset, Utility
+   from pydvl.value import compute_semivalues
+
+   model = ...
+   data = Dataset(...)
+   utility = Utility(model, data)
+   values = compute_semivalues(u=utility, mode="banzhaf", done=MaxUpdates(500))
 
 
 .. _problems of data values:
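
As a sanity check of the constant-weight definition above, here is a
brute-force Banzhaf computation over all subsets (exponential cost,
illustrative only; `u` is any callable on index tuples):

```python
# Exact Banzhaf values by enumerating all subsets of the remaining indices.
from itertools import combinations

def banzhaf_values(u, n_indices: int) -> list[float]:
    values = []
    for i in range(n_indices):
        rest = [j for j in range(n_indices) if j != i]
        total = 0.0
        for k in range(len(rest) + 1):
            for S in combinations(rest, k):
                total += u(S + (i,)) - u(S)
        values.append(total / 2 ** (n_indices - 1))  # constant weight 2^{-(n-1)}
    return values

# Symmetric toy utility: every point should get the same value.
print(banzhaf_values(lambda S: float(len(S) >= 2), 4))
```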

docs/conf.py

Lines changed: 11 additions & 2 deletions
@@ -43,7 +43,6 @@
     "sphinx.ext.extlinks",
     "sphinx_math_dollar",
     "sphinx.ext.todo",
-    "sphinx_rtd_theme",
     "hoverxref.extension",  # This only works on read the docs
     "sphinx_design",
     "sphinxcontrib.bibtex",

@@ -98,6 +97,16 @@
 .nboutput .prompt {
     display: none;
 }
+@media not print {
+    [data-theme='dark'] .output_area img {
+        filter: invert(0.9);
+    }
+    @media (prefers-color-scheme: dark) {
+        :root:not([data-theme="light"]) .output_area img {
+            filter: invert(0.9);
+        }
+    }
+}
 </style>
 """

@@ -325,7 +334,7 @@ def lineno_from_object_name(source_file, object_name):
 
 # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
 html_show_copyright = True
-copyright = "2022 AppliedAI Institute gGmbH"
+copyright = "AppliedAI Institute gGmbH"
 
 # If true, an OpenSearch description file will be output, and all pages will
 # contain a <link> tag referring to it. The value of this option must be the
