@@ -241,6 +241,7 @@ v_u(x_i) = \frac{1}{n} \sum_{S \subseteq D \setminus \{x_i\}}
.. code-block:: python

   from pydvl.value import compute_shapley_values
+
   utility = Utility(...)
   values = compute_shapley_values(utility, mode="combinatorial_exact")
   df = values.to_dataframe(column='value')
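As a concrete illustration of the combinatorial formula in the hunk header
above, here is a small from-scratch sketch for a toy three-point "dataset"
with a hand-made utility table. It is not pyDVL code and the utility values
are invented; it only spells out the double sum over subsets:

.. code-block:: python

   from itertools import combinations
   from math import comb

   def exact_shapley(indices, u):
       """Exact Shapley values by enumerating all subsets: O(2^n) utility calls."""
       n = len(indices)
       values = {}
       for i in indices:
           rest = [j for j in indices if j != i]
           acc = 0.0
           for k in range(n):  # subset sizes 0 .. n-1
               for S in combinations(rest, k):
                   acc += (u(set(S) | {i}) - u(set(S))) / comb(n - 1, k)
           values[i] = acc / n
       return values

   # Hypothetical "scores" of a model trained on each subset.
   scores = {frozenset(): 0.0,
             frozenset({0}): 0.5, frozenset({1}): 0.6, frozenset({2}): 0.4,
             frozenset({0, 1}): 0.8, frozenset({0, 2}): 0.7,
             frozenset({1, 2}): 0.75, frozenset({0, 1, 2}): 0.9}
   print(exact_shapley([0, 1, 2], lambda S: scores[frozenset(S)]))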
@@ -264,7 +265,8 @@ same pattern:
.. code-block:: python

   from pydvl.utils import Dataset, Utility
-  from pydvl.value.shapley import compute_shapley_values
+  from pydvl.value import compute_shapley_values
+
   model = ...
   data = Dataset(...)
   utility = Utility(model, data)
@@ -303,7 +305,8 @@ values in pyDVL. First construct the dataset and utility, then call
.. code-block:: python

   from pydvl.utils import Dataset, Utility
-  from pydvl.value.shapley import compute_shapley_values
+  from pydvl.value import compute_shapley_values
+
   model = ...
   dataset = Dataset(...)
   utility = Utility(model, dataset)
@@ -329,11 +332,11 @@ It uses permutations over indices instead of subsets:
$$
v_u(x_i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)}
-          [u(\sigma_{i-1} \cup {i}) - u(\sigma_{i})]
+          [u(\sigma_{:i} \cup \{i\}) - u(\sigma_{:i})]
,$$

-where $\sigma_i$ denotes the set of indices in permutation $\sigma$ up until the
-position of index $i$. To approximate this sum (with $\mathcal{O}(n!)$ terms!)
+where $\sigma_{:i}$ denotes the set of indices in permutation $\sigma$ before the
+position where $i$ appears. To approximate this sum (with $\mathcal{O}(n!)$ terms!)
one uses Monte Carlo sampling of permutations, something which has surprisingly
low sample complexity. By adding early stopping, the result is the so-called
**Truncated Monte Carlo Shapley** (:footcite:t:`ghorbani_data_2019`), which is
@@ -342,7 +345,7 @@ efficient enough to be useful in some applications.
.. code-block:: python

   from pydvl.utils import Dataset, Utility
-  from pydvl.value.shapley import compute_shapley_values
+  from pydvl.value import compute_shapley_values

   model = ...
   data = Dataset(...)
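To make the permutation idea above concrete, here is a minimal sketch of
truncated permutation sampling, assuming only a utility callable on sets of
indices. The names and the fixed truncation tolerance are illustrative;
pyDVL's implementation uses proper stopping criteria and parallelization:

.. code-block:: python

   import random

   def permutation_shapley(indices, u, n_permutations=500, truncation_tol=1e-3):
       """Average marginal contributions of each index over random permutations."""
       total_utility = u(set(indices))
       values = {i: 0.0 for i in indices}
       for _ in range(n_permutations):
           sigma = random.sample(indices, len(indices))
           prefix, prev = set(), u(set())
           for i in sigma:
               # Truncation: once the prefix utility is close to u(D), the
               # remaining marginal contributions are taken to be zero.
               if abs(total_utility - prev) < truncation_tol:
                   marginal, curr = 0.0, prev
               else:
                   prefix.add(i)
                   curr = u(prefix)
                   marginal = curr - prev
               values[i] += marginal
               prev = curr
       return {i: v / n_permutations for i, v in values.items()}

   # Toy run with a made-up utility exhibiting diminishing returns.
   print(permutation_shapley(list(range(5)), lambda S: (len(S) / 5) ** 0.5))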
@@ -364,7 +367,7 @@ and can be used in pyDVL with:
.. code-block:: python

   from pydvl.utils import Dataset, Utility
-  from pydvl.value.shapley import compute_shapley_values
+  from pydvl.value import compute_shapley_values
   from sklearn.neighbors import KNeighborsClassifier

   model = KNeighborsClassifier(n_neighbors=5)
@@ -410,7 +413,7 @@ its variance.
.. code-block:: python

   from pydvl.utils import Dataset, Utility
-  from pydvl.value.shapley import compute_shapley_values
+  from pydvl.value import compute_shapley_values

   model = ...
   data = Dataset(...)
@@ -449,7 +452,7 @@ It satisfies the following 2 properties:
The sum of payoffs to the agents in any coalition $S$ is at
least as large as the amount that these agents could earn by
forming a coalition on their own.
-$$\sum_{x_i\in S} v_u(x_i) \geq u(S), \forall S \subseteq D\,$$
+$$\sum_{x_i\in S} v_u(x_i) \geq u(S), \forall S \subset D\,$$

The second property states that the sum of payoffs to the agents
in any subcoalition $S$ is at least as large as the amount that
@@ -463,7 +466,7 @@ By relaxing the coalitional rationality property by a subsidy $e \gt 0$,
we are then able to find approximate payoffs:

$$
-\sum_{x_i\in S} v_u(x_i) + e \geq u(S), \forall S \subseteq D \
+\sum_{x_i\in S} v_u(x_i) + e \geq u(S), \forall S \subset D, S \neq \emptyset \
,$$

The least core value $v$ of the $i$-th sample in dataset $D$ wrt. the utility
$u$ is computed by solving the following linear program:

$$
\begin{array}{lll}
\text{minimize} & e & \\
\text{subject to} & \sum_{x_i\in D} v_u(x_i) = u(D) & \\
-& \sum_{x_i\in S} v_u(x_i) + e \geq u(S) &, \forall S \subseteq D \\
+& \sum_{x_i\in S} v_u(x_i) + e \geq u(S) &, \forall S \subset D, S \neq \emptyset \\
\end{array}
$$
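To see the program in action, here is a small sketch that solves the least
core LP for a toy three-point game with ``scipy``. The utility table is
invented, and this is not how pyDVL structures the problem internally; it
only makes the constraints above tangible:

.. code-block:: python

   from itertools import combinations

   import numpy as np
   from scipy.optimize import linprog

   players = (0, 1, 2)
   utility = {  # hypothetical u(S) for every coalition S
       (): 0.0, (0,): 1.0, (1,): 2.0, (2,): 2.0,
       (0, 1): 4.0, (0, 2): 4.0, (1, 2): 5.0, (0, 1, 2): 8.0,
   }

   # Variables x = (v_0, v_1, v_2, e); objective: minimize e.
   c = np.array([0.0, 0.0, 0.0, 1.0])
   # Efficiency constraint: v_0 + v_1 + v_2 = u(D).
   A_eq, b_eq = np.array([[1.0, 1.0, 1.0, 0.0]]), np.array([utility[players]])

   # Rationality with subsidy, rewritten as -sum_{i in S} v_i - e <= -u(S).
   A_ub, b_ub = [], []
   for k in (1, 2):  # all nonempty proper coalitions
       for S in combinations(players, k):
           row = np.zeros(4)
           row[list(S)] = -1.0
           row[3] = -1.0
           A_ub.append(row)
           b_ub.append(-utility[S])

   result = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                    A_eq=A_eq, b_eq=b_eq,
                    bounds=[(None, None)] * 3 + [(0, None)])
   print("values:", result.x[:3], "subsidy e*:", result.x[3])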
@@ -487,11 +490,12 @@ As such it returns as exact a value as the utility function allows
.. code-block:: python

   from pydvl.utils import Dataset, Utility
-  from pydvl.value.least_core import exact_least_core
+  from pydvl.value import compute_least_core_values
+
   model = ...
   dataset = Dataset(...)
   utility = Utility(model, dataset)
-  values = exact_least_core(utility)
+  values = compute_least_core_values(utility, mode="exact")


Monte Carlo Least Core
----------------------
@@ -515,16 +519,20 @@ where $e^{*}$ is the optimal least core subsidy.
.. code-block:: python

   from pydvl.utils import Dataset, Utility
-  from pydvl.value.least_core import montecarlo_least_core
+  from pydvl.value import compute_least_core_values
+
   model = ...
   dataset = Dataset(...)
   n_iterations = ...
   utility = Utility(model, dataset)
-  values = montecarlo_least_core(utility, n_iterations=n_iterations)
+  values = compute_least_core_values(
+      utility, mode="montecarlo", n_iterations=n_iterations
+  )

.. note::

-   ``n_iterations`` needs to be at least equal to the number of data points.
+   Although any number is supported, it is best to choose ``n_iterations`` to be
+   at least equal to the number of data points.

Because computing the Least Core values requires the solution of a linear and a
quadratic problem *after* computing all the utility values, we offer the
@@ -538,6 +546,7 @@ list of problems to solve, then solve them in parallel with

   from pydvl.utils import Dataset, Utility
   from pydvl.value.least_core import mclc_prepare_problem, lc_solve_problems
+
   model = ...
   dataset = Dataset(...)
   n_iterations = ...
@@ -548,15 +557,102 @@ list of problems to solve, then solve them in parallel with
   values = lc_solve_problems(problems)


-Other methods
-=============
+Semi-values
+===========
+
+Shapley values are a particular case of a more general concept called
+semi-value, which generalizes them to different weighting schemes. A
+**semi-value** is any valuation function of the form:
+
+$$
+v_\text{semi}(i) = \sum_{k=1}^n w(k)
+\sum_{S \subset D_{-i}^{(k)}} [U(S_{+i}) - U(S)],
+$$
+
+where the coefficients $w(k)$ satisfy the property:
+
+$$\sum_{k=1}^n w(k) = 1.$$
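Spelled out for a toy game, the definition amounts to the following sketch,
where the whole method is determined by the choice of ``w``. Note that
indexing conventions for $k$ and the exact normalization of the weights vary
across papers; here $k$ is simply the size of $S$, and per-subset Shapley
weights are used as an example. This is illustrative code, not pyDVL's API:

.. code-block:: python

   from itertools import combinations
   from math import comb

   def semivalue(i, indices, u, w):
       """Sum over k of w(k) times the marginal contributions of i to sets of size k."""
       rest = [j for j in indices if j != i]
       return sum(w(k) * sum(u(set(S) | {i}) - u(set(S))
                             for S in combinations(rest, k))
                  for k in range(len(indices)))

   n = 3
   shapley_w = lambda k: 1 / (n * comb(n - 1, k))  # per-subset Shapley weights
   u = lambda S: (len(S) / n) ** 0.5               # a made-up utility
   print([semivalue(i, list(range(n)), u, shapley_w) for i in range(n)])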
+
+Two instances of this are **Banzhaf indices** (:footcite:t:`wang_data_2022`)
+and **Beta Shapley** (:footcite:t:`kwon_beta_2022`), which offer better
+numerical and rank stability in certain situations.
+
+.. note::
+
+   Shapley values are a particular case of semi-values and can therefore also
+   be computed with the methods described here. However, as of version 0.6.0,
+   we recommend using :func:`~pydvl.value.shapley.compute_shapley_values`
+   instead, in particular because it implements truncated Monte Carlo sampling
+   for faster computation.
+
+
+Beta Shapley
+^^^^^^^^^^^^
+
+For some machine learning applications, where the utility is typically the
+performance when trained on a set $S \subset D$, diminishing returns are often
+observed when computing the marginal utility of adding a new data point.
+
+Beta Shapley is a weighting scheme that uses the Beta function to place more
+weight on subsets deemed to be more informative. The weights are defined as:
+
+$$
+w(k) := \frac{B(k+\beta, n-k+1+\alpha)}{B(\alpha, \beta)},
+$$
+
+where $B$ is the `Beta function <https://en.wikipedia.org/wiki/Beta_function>`_,
+and $\alpha$ and $\beta$ are parameters that control the weighting of the
+subsets. Setting both to 1 recovers Shapley values, while $\alpha = 1$ and
+$\beta = 16$ is reported in :footcite:t:`kwon_beta_2022` to be a good choice
+for some applications. See however :ref:`banzhaf indices` for an alternative
+choice of weights which is reported to work better.
+
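As a quick way to build intuition for $\alpha$ and $\beta$, the following
sketch evaluates the formula above directly and prints the normalized weights
for a few parameter choices, showing how the mass shifts between small and
large subsets. The function name is hypothetical; this is not pyDVL code:

.. code-block:: python

   from scipy.special import beta as B

   def beta_shapley_weight(k: int, n: int, alpha: float, beta: float) -> float:
       """w(k) as defined above, for subsets of size k out of n points."""
       return B(k + beta, n - k + 1 + alpha) / B(alpha, beta)

   n = 10
   for alpha, beta in [(1, 1), (1, 16), (16, 1)]:
       w = [beta_shapley_weight(k, n, alpha, beta) for k in range(1, n + 1)]
       total = sum(w)
       print(f"alpha={alpha}, beta={beta}:", [round(x / total, 3) for x in w])

pyDVL exposes this weighting scheme through ``compute_semivalues``, as shown
below: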
+.. code-block:: python
+
+   from pydvl.utils import Dataset, Utility
+   from pydvl.value import compute_semivalues, MaxUpdates
+
+   model = ...
+   data = Dataset(...)
+   utility = Utility(model, data)
+   values = compute_semivalues(
+       u=utility, mode="beta_shapley", done=MaxUpdates(500), alpha=1, beta=16
+   )
+
+.. _banzhaf indices:

-There are other game-theoretic concepts in pyDVL's roadmap, based on the notion
-of semivalue, which is a generalization to different weighting schemes:
-in particular **Banzhaf indices** and **Beta Shapley**, with better numerical
-and rank stability in certain situations.

+Banzhaf indices
+^^^^^^^^^^^^^^^

-Contributions are welcome!
+As noted below in :ref:`problems of data values`, the Shapley value can be very
+sensitive to variance in the utility function. For machine learning
+applications, where the utility is typically the performance when trained on a
+set $S \subset D$, this variance is often largest for smaller subsets $S$. It
+is therefore reasonable to try reducing the relative contribution of these
+subsets with adequate weights.
+
+One such choice of weights is the Banzhaf index, which is defined as the
+constant:
+
+$$w(k) := \frac{1}{2^{n-1}},$$
+
+for all set sizes $k$. The intuition for picking a constant weight is that for
+any choice of weight function $w$, one can always construct a utility with
+higher variance where $w$ is greater. Therefore, in a worst-case sense, the
+best one can do is to pick a constant weight.
+
+The authors of :footcite:t:`wang_data_2022` show that Banzhaf indices are more
+robust to variance in the utility function than Shapley and Beta Shapley
+values.
+
+.. code-block:: python
+
+   from pydvl.utils import Dataset, Utility
+   from pydvl.value import compute_semivalues, MaxUpdates
+
+   model = ...
+   data = Dataset(...)
+   utility = Utility(model, data)
+   values = compute_semivalues(u=utility, mode="banzhaf", done=MaxUpdates(500))
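Because the weight is constant, estimating Banzhaf values amounts to averaging
marginal contributions over subsets drawn uniformly at random, which can be
done by flipping a fair coin for every other point. A minimal sketch of this
idea, with an invented utility and without pyDVL's stopping criteria:

.. code-block:: python

   import random

   def montecarlo_banzhaf(i, indices, u, n_samples=1000):
       """Mean marginal contribution of i over uniformly sampled subsets."""
       rest = [j for j in indices if j != i]
       total = 0.0
       for _ in range(n_samples):
           # Including each point with probability 1/2 samples S uniformly
           # from the 2^(n-1) subsets of D minus {i}.
           S = {j for j in rest if random.random() < 0.5}
           total += u(S | {i}) - u(S)
       return total / n_samples

   print(montecarlo_banzhaf(0, list(range(5)), lambda S: (len(S) / 5) ** 0.5))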


.. _problems of data values: