Commit e1d28ef
Merge branch 'release/v0.5.0'
2 parents: 3dedb5a + 52a6e61


63 files changed: +3758 additions, -1888 deletions

.bumpversion.cfg

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 0.4.0
+current_version = 0.5.0
 commit = False
 tag = False
 allow_dirty = False

.github/workflows/publish.yaml

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ jobs:
         id: get_branch_name
         if: github.ref_type == 'tag'
         run: |
-          export BRANCH_NAME=$(git log -1 --format='%D' $GITHUB_REF | | sed -e 's/.*origin\/\(.*\),.*/\1/')
+          export BRANCH_NAME=$(git log -1 --format='%D' $GITHUB_REF | sed -e 's/.*origin\/\(.*\),.*/\1/')
          echo ::set-output name=branch_name::${BRANCH_NAME}
        shell: bash
      - name: Fail if tag is not on 'master' branch
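The fix above removes a stray second pipe from the pipeline that extracts a branch name from git's `%D` decoration output. A quick sketch of what the `sed` expression does, using a hypothetical decoration string for a tag that also sits on `origin/master`:

```shell
# Hypothetical decoration string, as printed by `git log -1 --format='%D'`
# for a tagged commit that is also on origin/master:
decoration="tag: v0.5.0, origin/master, master"

# The sed expression keeps whatever follows "origin/" up to the next comma:
branch_name=$(echo "$decoration" | sed -e 's/.*origin\/\(.*\),.*/\1/')
echo "$branch_name"   # master
```

As an aside, `::set-output` (used on the following line of the workflow) has since been deprecated by GitHub Actions in favour of writing to `$GITHUB_OUTPUT`.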

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ repos:
       - id: black-jupyter
         language_version: python3
   - repo: https://github.com/PyCQA/isort
-    rev: 5.10.1
+    rev: 5.12.0
     hooks:
       - id: isort
   - repo: https://github.com/kynan/nbstripout

CHANGELOG.md

Lines changed: 30 additions & 2 deletions
@@ -1,5 +1,33 @@
 # Changelog

+## 0.5.0 - 💥 Fixes, nicer interfaces and... more breaking changes 😒
+
+- Fixed parallel and antithetic Owen sampling for Shapley values. Simplified
+  and extended tests.
+  [PR #267](https://github.com/appliedAI-Initiative/pyDVL/pull/267)
+- Added `Scorer` class for a cleaner interface. Fixed minor bugs around
+  Group-Testing Shapley, added more tests and switched to cvxpy for the solver.
+  [PR #264](https://github.com/appliedAI-Initiative/pyDVL/pull/264)
+- Generalised stopping criteria for valuation algorithms. Improved classes
+  `ValuationResult` and `Status` with more operations. Some minor issues fixed.
+  [PR #252](https://github.com/appliedAI-Initiative/pyDVL/pull/250)
+- Fixed a bug whereby `compute_shapley_values` would only spawn one process when
+  using `n_jobs=-1` and Monte Carlo methods.
+  [PR #270](https://github.com/appliedAI-Initiative/pyDVL/pull/270)
+- Bugfix in `RayParallelBackend`: wrong semantics for `kwargs`.
+  [PR #268](https://github.com/appliedAI-Initiative/pyDVL/pull/268)
+- Splitting of problem preparation and solution in Least-Core computation.
+  Umbrella function for LC methods.
+  [PR #257](https://github.com/appliedAI-Initiative/pyDVL/pull/257)
+- Operations on `ValuationResult` and `Status` and some cleanup
+  [PR #248](https://github.com/appliedAI-Initiative/pyDVL/pull/248)
+- **Bug fix and minor improvements**: Fixes bug in TMCS with remote Ray cluster,
+  raises an error for dummy sequential parallel backend with TMCS, clones model
+  inside `Utility` before fitting by default, with flag `clone_before_fit`
+  to disable it, catches all warnings in `Utility` when `show_warnings` is
+  `False`. Adds Miner and Gloves toy games utilities
+  [PR #247](https://github.com/appliedAI-Initiative/pyDVL/pull/247)
+
 ## 0.4.0 - 🏭💥 New algorithms and more breaking changes

 - GH action to mark issues as stale
@@ -11,8 +39,8 @@
 - **Breaking change:** Introduces a class ValuationResult to gather and inspect
   results from all valuation algorithms
   [PR #214](https://github.com/appliedAI-Initiative/pyDVL/pull/214)
-- Fixes bug in Influence calculation with multi-dimensional input and adds
-  new example notebook
+- Fixes bug in Influence calculation with multidimensional input and adds new
+  example notebook
   [PR #195](https://github.com/appliedAI-Initiative/pyDVL/pull/195)
 - **Breaking change**: Passes the input to `MapReduceJob` at initialization,
   removes `chunkify_inputs` argument from `MapReduceJob`, removes `n_runs`

README.md

Lines changed: 2 additions & 2 deletions
@@ -99,7 +99,7 @@ Data Shapley values:
 ```python
 import numpy as np
 from pydvl.utils import Dataset, Utility
-from pydvl.value.shapley import compute_shapley_values
+from pydvl.value import *
 from sklearn.linear_model import LinearRegression
 from sklearn.model_selection import train_test_split

@@ -111,7 +111,7 @@ dataset = Dataset(X_train, y_train, X_test, y_test)
 model = LinearRegression()
 utility = Utility(model, dataset)
 values = compute_shapley_values(
-    u=utility, n_iterations=100, mode="truncated_montecarlo"
+    u=utility, mode="truncated_montecarlo", done=MaxUpdates(100)
 )
 ```

build_scripts/update_docs.py

Lines changed: 0 additions & 3 deletions
@@ -24,9 +24,6 @@ def module_template(module_qualname: str):
     :undoc-members:

 ----
-
-Module members
-==============

 .. footbibliography::

docs/30-data-valuation.rst

Lines changed: 82 additions & 31 deletions
@@ -118,6 +118,34 @@ is implemented, it is important not to reuse `Utility` objects for different
 datasets. You can read more about :ref:`caching setup` in the installation guide
 and the documentation of the :mod:`pydvl.utils.caching` module.

+Using custom scorers
+^^^^^^^^^^^^^^^^^^^^
+
+The `scoring` argument of :class:`~pydvl.utils.utility.Utility` can be used to
+specify a custom :class:`~pydvl.utils.utility.Scorer` object. This is a simple
+wrapper for a callable that takes a model and test data and returns a score.
+
+More importantly, the object provides information about the range of the score,
+which is used by some methods to estimate the number of samples necessary, and
+about what default value to use when the model fails to train.
+
+.. note::
+   The most important property of a `Scorer` is its default value. Because many
+   models will fail to fit on small subsets of the data, it is important to
+   provide a sensible default value for the score.
+
+It is possible to skip the construction of the :class:`~pydvl.utils.utility.Scorer`
+when constructing the `Utility` object. The two following calls are equivalent:
+
+.. code-block:: python
+
+   utility = Utility(
+       model, dataset, "explained_variance", score_range=(-np.inf, 1), default_score=0.0
+   )
+   utility = Utility(
+       model, dataset, Scorer("explained_variance", range=(-np.inf, 1), default=0.0)
+   )
+
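The contract described in the added docs can be illustrated with a minimal stand-in. This is not pyDVL's `Scorer` implementation; the class below is a hypothetical sketch of the idea only: a callable plus a score range and a default value used when fitting or scoring fails.

```python
class ScorerSketch:
    """Hypothetical stand-in for the Scorer contract described above."""

    def __init__(self, score_fn, range=(float("-inf"), float("inf")), default=0.0):
        self.score_fn = score_fn  # callable(model, X, y) -> float
        self.range = range        # used to estimate required sample counts
        self.default = default    # fallback when the model fails to fit or score

    def __call__(self, model, X, y):
        try:
            return float(self.score_fn(model, X, y))
        except Exception:
            return self.default

# A model of None makes the callable raise, which triggers the default:
scorer = ScorerSketch(lambda m, X, y: m.score(X, y), range=(float("-inf"), 1), default=0.0)
print(scorer(None, None, None))  # 0.0
```

The key design point, as the note in the docs stresses, is that a sensible default makes valuation robust to models that cannot be fit on tiny subsets.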
 Learning the utility
 ^^^^^^^^^^^^^^^^^^^^

@@ -174,7 +202,7 @@ definitions, but other methods are typically preferable.
    values = naive_loo(utility)

 The return value of all valuation functions is an object of type
-:class:`~pydvl.value.results.ValuationResult`. This can be iterated over,
+:class:`~pydvl.value.result.ValuationResult`. This can be iterated over,
 indexed with integers, slices and Iterables, as well as converted to a
 `pandas DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_.

@@ -217,11 +245,11 @@ v_u(x_i) = \frac{1}{n} \sum_{S \subseteq D \setminus \{x_i\}}
    values = compute_shapley_values(utility, mode="combinatorial_exact")
    df = values.to_dataframe(column='value')

-We convert the return value to a
+We can convert the return value to a
 `pandas DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_
 and name the column with the results as `value`. Please refer to the
 documentation in :mod:`pydvl.value.shapley` and
-:class:`~pydvl.value.results.ValuationResult` for more information.
+:class:`~pydvl.value.result.ValuationResult` for more information.

 Monte Carlo Combinatorial Shapley
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -240,12 +268,19 @@ same pattern:
    model = ...
    data = Dataset(...)
    utility = Utility(model, data)
-   values = compute_shapley_values(utility, mode="combinatorial_montecarlo")
+   values = compute_shapley_values(
+       utility, mode="combinatorial_montecarlo", done=MaxUpdates(1000)
+   )
    df = values.to_dataframe(column='cmc')

 The DataFrames returned by most Monte Carlo methods will contain approximate
 standard errors as an additional column, in this case named `cmc_stderr`.

+Note the usage of the object :class:`~pydvl.value.stopping.MaxUpdates` as the
+stop condition. This is an instance of a
+:class:`~pydvl.value.stopping.StoppingCriterion`. Other examples are
+:class:`~pydvl.value.stopping.MaxTime` and :class:`~pydvl.value.stopping.StandardError`.
+
 Owen sampling
 ^^^^^^^^^^^^^
@@ -281,6 +316,10 @@ sampling, and its variant *Antithetic Owen Sampling* in the documentation for the
 function doing the work behind the scenes:
 :func:`~pydvl.value.shapley.montecarlo.owen_sampling_shapley`.

+Note that in this case we do not pass a
+:class:`~pydvl.value.stopping.StoppingCriterion` to the function, but instead
+the number of iterations and the maximum number of samples to use in the
+integration.

 Permutation Shapley
 ^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^
@@ -309,7 +348,7 @@ efficient enough to be useful in some applications.
309348
data = Dataset(...)
310349
utility = Utility(model, data)
311350
values = compute_shapley_values(
312-
u=utility, mode="truncated_montecarlo", n_iterations=100
351+
u=utility, mode="truncated_montecarlo", done=MaxUpdates(1000)
313352
)
314353
315354
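The permutation sampling behind truncated Monte Carlo Shapley can be sketched on a toy game. Here the utility is a simple set function (a "glove game", in the spirit of the toy-game utilities mentioned in the changelog) rather than a fitted model's score, and a fixed permutation budget plays the role of a `MaxUpdates`-style stopping criterion. This is an illustration only, not pyDVL code:

```python
import numpy as np

def toy_utility(coalition: frozenset) -> float:
    # Glove game: players 0 and 1 hold left gloves, player 2 a right glove.
    # A coalition's worth is the number of complete pairs it can form.
    left = len(coalition & {0, 1})
    right = len(coalition & {2})
    return float(min(left, right))

rng = np.random.default_rng(42)
n, n_permutations = 3, 2000
values = np.zeros(n)

for _ in range(n_permutations):          # fixed budget, akin to MaxUpdates(2000)
    prev, coalition = 0.0, set()
    for i in rng.permutation(n):
        coalition.add(i)
        score = toy_utility(frozenset(coalition))
        values[i] += score - prev        # marginal contribution of player i
        prev = score

values /= n_permutations
print(values)  # close to the exact Shapley values [1/6, 1/6, 2/3]
```

The truncation in TMCS adds one refinement on top of this scheme: a permutation is cut short once the running score is close enough to the total utility.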
@@ -358,14 +397,15 @@
 but we don't advocate its use because of the speed and memory cost. Despite
 our best efforts, the number of samples required in practice for convergence
 can be several orders of magnitude worse than with e.g. Truncated Monte Carlo.
+Additionally, the CSP can sometimes turn out to be infeasible.

 Usage follows the same pattern as every other Shapley method, but with the
-addition of an ``eps`` parameter required for the solution of the CSP. It should
-be the same value used to compute the minimum number of samples required. This
-can be done with :func:`~pydvl.value.shapley.gt.num_samples_eps_delta`, but note
-that the number returned will be huge! In practice, fewer samples can be enough,
-but the actual number will strongly depend on the utility, in particular its
-variance.
+addition of an ``epsilon`` parameter required for the solution of the CSP. It
+should be the same value used to compute the minimum number of samples required.
+This can be done with :func:`~pydvl.value.shapley.gt.num_samples_eps_delta`, but
+note that the number returned will be huge! In practice, fewer samples can be
+enough, but the actual number will strongly depend on the utility, in particular
+its variance.

 .. code-block:: python
@@ -459,29 +499,18 @@ Monte Carlo Least Core
 Because the number of subsets $S \subseteq D \setminus \{x_i\}$ is
 $2^{ | D | - 1 }$, one typically must resort to approximations.

-The simplest approximation consists of two relaxations of the Least Core
-(:footcite:t:`yan_if_2021`):
-
-- Further relaxing the coalitional rationality property by
-  a constant value $\epsilon > 0$:
-
-  $$
-  \sum_{x_i\in S} v_u(x_i) + e + \epsilon \geq u(S)
-  $$
-
-- Using a fraction of all subsets instead of all possible subsets.
-
-Combined, this gives us the $(\epsilon, \delta)$-*probably approx-
-imate least core* that satisfies the following property:
+The simplest approximation consists in using a fraction of all subsets for the
+constraints. :footcite:t:`yan_if_2021` show that a quantity of order
+$\mathcal{O}((n - \log \Delta ) / \delta^2)$ is enough to obtain a so-called
+$\delta$-*approximate least core* with high probability. I.e. the following
+property holds with probability $1-\Delta$ over the choice of subsets:

 $$
-P_{S\sim D}\left[\sum_{x_i\in S} v_u(x_i) + e^{*} + \epsilon \geq u(S)\right]
-\geq 1 - \delta
+\mathbb{P}_{S\sim D}\left[\sum_{x_i\in S} v_u(x_i) + e^{*} \geq u(S)\right]
+\geq 1 - \delta,
 $$

-Where $e^{*}$ is the optimal least core subsidy.
-
-With these relaxations, we obtain a polynomial running time.
+where $e^{*}$ is the optimal least core subsidy.

 .. code-block:: python
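For intuition about the constraints in the hunk above, the least core of a small game can be solved exactly as a linear program with off-the-shelf tools. The sketch below is not pyDVL code: it uses `scipy.optimize.linprog` on a hypothetical three-player glove game, minimizing the subsidy $e$ subject to coalitional rationality and efficiency:

```python
from itertools import chain, combinations

import numpy as np
from scipy.optimize import linprog

def u(coalition):
    # Toy glove game: players 0 and 1 hold left gloves, player 2 a right glove.
    return float(min(len(set(coalition) & {0, 1}), len(set(coalition) & {2})))

n = 3
players = range(n)
subsets = list(chain.from_iterable(combinations(players, k) for k in range(1, n + 1)))

# Variables x = (v_0, v_1, v_2, e); objective: minimize e.
c = np.array([0.0, 0.0, 0.0, 1.0])
# Coalitional rationality: sum_{i in S} v_i + e >= u(S)  <=>  -sum v_i - e <= -u(S)
A_ub = np.array([[-(1.0 if i in S else 0.0) for i in players] + [-1.0] for S in subsets])
b_ub = np.array([-u(S) for S in subsets])
# Efficiency: the values must sum to the utility of the grand coalition.
A_eq, b_eq = np.array([[1.0, 1.0, 1.0, 0.0]]), np.array([u(tuple(players))])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(None, None)] * n + [(0, None)])
print(res.x)  # here the core is the single point (0, 0, 1), with subsidy e = 0
```

The Monte Carlo variant discussed above keeps only a random fraction of these exponentially many constraints.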
@@ -497,6 +526,28 @@ With these relaxations, we obtain a polynomial running time.

 ``n_iterations`` needs to be at least equal to the number of data points.

+Because computing the Least Core values requires the solution of a linear and a
+quadratic problem *after* computing all the utility values, we offer the
+possibility of splitting the latter from the former. This is useful when running
+multiple experiments: use
+:func:`~pydvl.value.least_core.montecarlo.mclc_prepare_problem` to prepare a
+list of problems to solve, then solve them in parallel with
+:func:`~pydvl.value.least_core.common.lc_solve_problems`.
+
+.. code-block:: python
+
+   from pydvl.utils import Dataset, Utility
+   from pydvl.value.least_core import mclc_prepare_problem, lc_solve_problems
+
+   model = ...
+   dataset = Dataset(...)
+   n_iterations = ...
+   utility = Utility(model, dataset)
+   n_experiments = 10
+   problems = [mclc_prepare_problem(utility, n_iterations=n_iterations)
+               for _ in range(n_experiments)]
+   values = lc_solve_problems(problems)
+
 Other methods
 =============
@@ -528,7 +579,7 @@ nature of every (non-trivial) ML problem can have an effect:

 pyDVL offers a dedicated :func:`function composition
 <pydvl.utils.types.compose_score>` for scorer functions which can be used to
-squash a score. The following is defined in module :mod:`~pydvl.utils.numeric`:
+squash a score. The following is defined in module :mod:`~pydvl.utils.scorer`:

 .. code-block:: python
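The idea behind such a score composition is to squash an unbounded score into a bounded interval. A minimal sketch of the concept, using our own stand-in rather than pyDVL's `compose_score`, composes a scorer with a sigmoid:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def squash(score_fn):
    # Compose a scorer with sigmoid so its output always lies in (0, 1).
    def wrapper(model, X, y):
        return sigmoid(score_fn(model, X, y))
    return wrapper

# An unbounded score (e.g. a negative MSE) becomes a value in (0, 1):
raw = lambda model, X, y: -3.7
print(squash(raw)(None, None, None))  # ≈ 0.024
```

A bounded score also gives valuation methods a usable score range, which, as noted earlier for `Scorer`, some methods use to estimate sample counts.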

requirements-dev.txt

Lines changed: 5 additions & 4 deletions
@@ -1,13 +1,14 @@
 black[jupyter] == 22.10.0
-isort == 5.10.1
+isort == 5.12.0
 jupyter
 mypy == 0.982
-nbconvert
+nbconvert>=7.2.9
 nbstripout == 0.6.1
 bump2version
-pre-commit == 2.20.0
-pytest
+pre-commit==3.0.4
+pytest==7.2.1
 pytest-cov
+pytest-docker==0.12.0
 pytest-mock
 pytest-timeout
 ray[default] >= 0.8

requirements-notebooks.txt

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 torch==1.13.1
 torchvision==0.14.1
 datasets==2.6.1
-Pillow==9.2.0
+pillow==9.3.0

setup.py

Lines changed: 4 additions & 6 deletions
@@ -11,7 +11,7 @@
     package_dir={"": "src"},
     packages=find_packages(where="src"),
     include_package_data=True,
-    version="0.4.0",
+    version="0.5.0",
     description="The Python Data Valuation Library",
     install_requires=[
         line
@@ -20,9 +20,7 @@
     ],
     setup_requires=["wheel"],
     tests_require=["pytest"],
-    extras_require={
-        "influence": ["torch"],
-    },
+    extras_require={"influence": ["torch"]},
     author="appliedAI Institute gGmbH",
     long_description=long_description,
     long_description_content_type="text/markdown",
@@ -41,8 +39,8 @@
         "License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)",
     ],
     project_urls={
-        "Source": "https://appliedAI-Initiative/pydvl",
-        "Documentation": "https://appliedai-initiative.github.io/pyDVL/",
+        "Source": "https://github.com/appliedAI-Initiative/pydvl",
+        "Documentation": "https://appliedai-initiative.github.io/pyDVL",
         "TransferLab": "https://transferlab.appliedai.de",
     },
 )
