@@ -118,6 +118,34 @@ is implemented, it is important not to reuse `Utility` objects for different
datasets. You can read more about :ref:`caching setup` in the installation guide
and the documentation of the :mod:`pydvl.utils.caching` module.

+ Using custom scorers
+ ^^^^^^^^^^^^^^^^^^^^
+
+ The `scoring` argument of :class:`~pydvl.utils.utility.Utility` can be used to
+ specify a custom :class:`~pydvl.utils.utility.Scorer` object. This is a simple
+ wrapper for a callable that takes a model and test data, and returns a score.
+
+ More importantly, the object provides information about the range of the score,
+ which is used by some methods to estimate the number of samples necessary, and
+ about what default value to use when the model fails to train.
+
+ .. note::
+    The most important property of a `Scorer` is its default value. Because many
+    models will fail to fit on small subsets of the data, it is important to
+    provide a sensible default value for the score.
+
+ It is possible to skip the construction of the :class:`~pydvl.utils.utility.Scorer`
+ when constructing the `Utility` object. The following two calls are equivalent:
+
+ .. code-block:: python
+
+     utility = Utility(
+         model, dataset, "explained_variance", score_range=(-np.inf, 1), default_score=0.0
+     )
+     utility = Utility(
+         model, dataset, Scorer("explained_variance", range=(-np.inf, 1), default=0.0)
+     )
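+
+ For illustration, a custom callable can also be wrapped in a `Scorer`. This is
+ only a sketch, assuming that `Scorer` accepts any callable with the signature
+ `(model, X, y) -> float` as described above, and that a finite default is a
+ sensible choice here:
+
+ .. code-block:: python
+
+     import numpy as np
+
+     def negative_mae(model, X, y):
+         # Toy scorer: negative mean absolute error, so that higher is better.
+         return -float(np.abs(model.predict(X) - y).mean())
+
+     utility = Utility(
+         model, dataset, Scorer(negative_mae, range=(-np.inf, 0), default=-1.0)
+     )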

Learning the utility
^^^^^^^^^^^^^^^^^^^^
@@ -174,7 +202,7 @@ definitions, but other methods are typically preferable.
    values = naive_loo(utility)

The return value of all valuation functions is an object of type
- :class:`~pydvl.value.results.ValuationResult`. This can be iterated over,
+ :class:`~pydvl.value.result.ValuationResult`. This can be iterated over,
indexed with integers, slices and Iterables, as well as converted to a
`pandas DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_.
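
For illustration only, these access patterns might look as follows (a sketch;
variable and column names are examples, not taken from this section):

.. code-block:: python

    values = naive_loo(utility)
    first = values[0]         # integer indexing
    a_few = values[:3]        # slicing
    for v in values:          # iteration over individual values
        print(v)
    df = values.to_dataframe(column="loo")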
@@ -217,11 +245,11 @@ v_u(x_i) = \frac{1}{n} \sum_{S \subseteq D \setminus \{x_i\}}
    values = compute_shapley_values(utility, mode="combinatorial_exact")
    df = values.to_dataframe(column='value')

- We convert the return value to a
+ We can convert the return value to a
`pandas DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_
and name the column with the results as `value`. Please refer to the
documentation in :mod:`pydvl.value.shapley` and
- :class:`~pydvl.value.results.ValuationResult` for more information.
+ :class:`~pydvl.value.result.ValuationResult` for more information.

Monte Carlo Combinatorial Shapley
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -240,12 +268,19 @@ same pattern:
    model = ...
    data = Dataset(...)
    utility = Utility(model, data)
-     values = compute_shapley_values(utility, mode="combinatorial_montecarlo")
+     values = compute_shapley_values(
+         utility, mode="combinatorial_montecarlo", done=MaxUpdates(1000)
+     )
    df = values.to_dataframe(column='cmc')

The DataFrames returned by most Monte Carlo methods will contain approximate
standard errors as an additional column, in this case named `cmc_stderr`.

+ Note the use of :class:`~pydvl.value.stopping.MaxUpdates` as the stopping
+ condition. This is an instance of a
+ :class:`~pydvl.value.stopping.StoppingCriterion`. Other examples are
+ :class:`~pydvl.value.stopping.MaxTime` and
+ :class:`~pydvl.value.stopping.StandardError`.
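+
+ Stopping criteria can also bound wall-clock time. The following is only a
+ sketch, assuming that criteria can be combined with the `|` operator; the
+ numbers are arbitrary:
+
+ .. code-block:: python
+
+     from pydvl.value.stopping import MaxTime, MaxUpdates
+
+     # Stop after 1000 updates or one hour, whichever happens first.
+     values = compute_shapley_values(
+         utility,
+         mode="combinatorial_montecarlo",
+         done=MaxUpdates(1000) | MaxTime(3600),
+     )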

Owen sampling
^^^^^^^^^^^^^
@@ -281,6 +316,10 @@ sampling, and its variant *Antithetic Owen Sampling* in the documentation for th
function doing the work behind the scenes:
:func:`~pydvl.value.shapley.montecarlo.owen_sampling_shapley`.

+ Note that in this case we do not pass a
+ :class:`~pydvl.value.stopping.StoppingCriterion` to the function, but instead
+ the number of iterations and the maximum number of samples to use in the
+ integration.
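+
+ A sketch of such a call follows; the parameter names `n_iterations` and
+ `max_q` are illustrative, not taken from this section:
+
+ .. code-block:: python
+
+     values = compute_shapley_values(
+         u=utility, mode="owen", n_iterations=4, max_q=200
+     )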

Permutation Shapley
^^^^^^^^^^^^^^^^^^^
@@ -309,7 +348,7 @@ efficient enough to be useful in some applications.
    data = Dataset(...)
    utility = Utility(model, data)
    values = compute_shapley_values(
-         u=utility, mode="truncated_montecarlo", n_iterations=100
+         u=utility, mode="truncated_montecarlo", done=MaxUpdates(1000)
    )
@@ -358,14 +397,15 @@
but we don't advocate its use because of the speed and memory cost. Despite
our best efforts, the number of samples required in practice for convergence
can be several orders of magnitude worse than with e.g. Truncated Monte Carlo.
+ Additionally, the CSP can sometimes turn out to be infeasible.

Usage follows the same pattern as every other Shapley method, but with the
- addition of an ``eps`` parameter required for the solution of the CSP. It should
- be the same value used to compute the minimum number of samples required. This
- can be done with :func:`~pydvl.value.shapley.gt.num_samples_eps_delta`, but note
- that the number returned will be huge! In practice, fewer samples can be enough,
- but the actual number will strongly depend on the utility, in particular its
- variance.
+ addition of an ``epsilon`` parameter required for the solution of the CSP. It
+ should be the same value used to compute the minimum number of samples required.
+ This can be done with :func:`~pydvl.value.shapley.gt.num_samples_eps_delta`, but
+ note that the number returned will be huge! In practice, fewer samples can be
+ enough, but the actual number will strongly depend on the utility, in particular
+ its variance.

.. code-block:: python
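
    # Sketch following the pattern described above. The exact signature of
    # num_samples_eps_delta is not shown here, so the arguments below are
    # illustrative only.
    from pydvl.utils import Dataset, Utility
    from pydvl.value.shapley import compute_shapley_values
    from pydvl.value.shapley.gt import num_samples_eps_delta

    model = ...
    data = Dataset(...)
    utility = Utility(model, data)
    eps, delta = 0.1, 0.05
    n_iterations = num_samples_eps_delta(eps, delta, len(data))  # huge!
    values = compute_shapley_values(
        u=utility, mode="group_testing", n_iterations=n_iterations, epsilon=eps
    )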
@@ -459,29 +499,18 @@ Monte Carlo Least Core
Because the number of subsets $S \subseteq D \setminus \{x_i\}$ is
$2^{|D| - 1}$, one typically must resort to approximations.

- The simplest approximation consists of two relaxations of the Least Core
- (:footcite:t:`yan_if_2021`):
-
- - Further relaxing the coalitional rationality property by
-   a constant value $\epsilon > 0$:
-
-   $$
-   \sum_{x_i\in S} v_u(x_i) + e + \epsilon \geq u(S)
-   $$
-
- - Using a fraction of all subsets instead of all possible subsets.
-
- Combined, this gives us the $(\epsilon, \delta)$-*probably approximate
- least core* that satisfies the following property:
+ The simplest approximation consists in using a fraction of all subsets for the
+ constraints. :footcite:t:`yan_if_2021` show that a number of samples of order
+ $\mathcal{O}((n - \log \Delta) / \delta^2)$ is enough to obtain a so-called
+ $\delta$-*approximate least core* with high probability. I.e. the following
+ property holds with probability $1-\Delta$ over the choice of subsets:

$$
- P_{S\sim D}\left[\sum_{x_i\in S} v_u(x_i) + e^{*} + \epsilon \geq u(S)\right]
- \geq 1 - \delta
+ \mathbb{P}_{S\sim D}\left[\sum_{x_i\in S} v_u(x_i) + e^{*} \geq u(S)\right]
+ \geq 1 - \delta,
$$

- Where $e^{*}$ is the optimal least core subsidy.
-
- With these relaxations, we obtain a polynomial running time.
+ where $e^{*}$ is the optimal least core subsidy.

.. code-block:: python
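
    # Illustrative usage following the pattern of the other methods shown
    # here; the function name and arguments are assumptions, not taken from
    # this section.
    from pydvl.utils import Dataset, Utility
    from pydvl.value.least_core import montecarlo_least_core

    model = ...
    data = Dataset(...)
    utility = Utility(model, data)
    n_iterations = ...
    values = montecarlo_least_core(utility, n_iterations=n_iterations)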
@@ -497,6 +526,28 @@ With these relaxations, we obtain a polynomial running time.

``n_iterations`` needs to be at least equal to the number of data points.

+ Because computing the Least Core values requires the solution of a linear and a
+ quadratic problem *after* computing all the utility values, we offer the
+ possibility of splitting the computation of the utilities from the solution of
+ the optimization problems. This is useful when running multiple experiments: use
+ :func:`~pydvl.value.least_core.montecarlo.mclc_prepare_problem` to prepare a
+ list of problems to solve, then solve them in parallel with
+ :func:`~pydvl.value.least_core.common.lc_solve_problems`.
+
+ .. code-block:: python
+
+     from pydvl.utils import Dataset, Utility
+     from pydvl.value.least_core import mclc_prepare_problem, lc_solve_problems
+
+     model = ...
+     dataset = Dataset(...)
+     n_iterations = ...
+     utility = Utility(model, dataset)
+     n_experiments = 10
+     problems = [mclc_prepare_problem(utility, n_iterations=n_iterations)
+                 for _ in range(n_experiments)]
+     values = lc_solve_problems(problems)

Other methods
=============
@@ -528,7 +579,7 @@ nature of every (non-trivial) ML problem can have an effect:

pyDVL offers a dedicated :func:`function composition
<pydvl.utils.types.compose_score>` for scorer functions which can be used to
- squash a score. The following is defined in module :mod:`~pydvl.utils.numeric`:
+ squash a score. The following is defined in module :mod:`~pydvl.utils.scorer`:

.. code-block:: python
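
    # A sketch of score squashing by composition. The exact signature of
    # compose_score is not shown here, so the argument names are assumptions.
    import numpy as np

    from pydvl.utils.types import compose_score

    def sigmoid(x: float) -> float:
        return float(1 / (1 + np.exp(-x)))

    squashed_r2 = compose_score("r2", sigmoid, range=(0, 1), name="squashed r2")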