
Commit 0e929ae

Merge branch 'release/v0.6.1'
2 parents f8e07cc + ce29e0c

31 files changed: +1311 -680 lines

.bumpversion.cfg

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 0.6.0
+current_version = 0.6.1
 commit = False
 tag = False
 allow_dirty = False
```
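The only change here is the version bump that bump2version applies from this config. As an aside, the mechanics can be sketched as follows; this is a simplified illustration, not the tool's actual code, and the `bump` helper is hypothetical (the real tool also rewrites the files listed in its config):

```python
import configparser

# Read current_version from a .bumpversion.cfg, increment the requested
# part, and return the new version string.
CFG = """\
[bumpversion]
current_version = 0.6.0
commit = False
tag = False
allow_dirty = False
"""

def bump(cfg_text: str, part: str = "patch") -> str:
    cp = configparser.ConfigParser()
    cp.read_string(cfg_text)
    major, minor, patch = map(int, cp["bumpversion"]["current_version"].split("."))
    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    else:
        patch += 1
    return f"{major}.{minor}.{patch}"

print(bump(CFG))  # 0.6.1, the bump recorded in this commit
```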

.github/workflows/publish.yaml

Lines changed: 31 additions & 11 deletions
```diff
@@ -1,4 +1,4 @@
-name: Upload Python Package to PyPI
+name: Publish Python Package to PyPI
 
 on:
   push:
@@ -10,32 +10,52 @@ on:
         description: Why did you trigger the pipeline?
         required: False
         default: Check if it runs again due to external changes
+      tag:
+        description: Tag for which a package should be published
+        type: string
+        required: false
 
 env:
   PY_COLORS: 1
 
 jobs:
-  deploy:
+  publish:
     runs-on: ubuntu-latest
     concurrency:
-      group: deploy
+      group: publish
     steps:
       - uses: actions/checkout@v3
         with:
          fetch-depth: 0
-      - name: Fail if manually triggered workflow is not on 'master' branch
-        if: github.event_name == 'workflow_dispatch' && github.ref_name != 'master'
-        run: exit -1
+      - name: Fail if manually triggered workflow does not have 'tag' input
+        if: github.event_name == 'workflow_dispatch' && inputs.tag == ''
+        run: |
+          echo "Input 'tag' should not be empty"
+          exit -1
+      - name: Extract branch name from input
+        id: get_branch_name_input
+        if: github.event_name == 'workflow_dispatch'
+        run: |
+          export BRANCH_NAME=$(git log -1 --format='%D' ${{ inputs.tag }} | sed -e 's/.*origin\/\(.*\).*/\1/')
+          echo "branch_name=${BRANCH_NAME}" >> $GITHUB_OUTPUT
       - name: Extract branch name from tag
-        id: get_branch_name
+        id: get_branch_name_tag
         if: github.ref_type == 'tag'
         run: |
-          export BRANCH_NAME=$(git log -1 --format='%D' $GITHUB_REF | sed -e 's/.*origin\/\(.*\),.*/\1/')
-          echo ::set-output name=branch_name::${BRANCH_NAME}
+          export BRANCH_NAME=$(git log -1 --format='%D' $GITHUB_REF | sed -e 's/.*origin\/\(.*\).*/\1/')
+          echo "branch_name=${BRANCH_NAME}" >> $GITHUB_OUTPUT
         shell: bash
       - name: Fail if tag is not on 'master' branch
-        if: github.ref_type == 'tag' && steps.get_branch_name.outputs.branch_name != 'master'
-        run: exit -1
+        if: ${{ steps.get_branch_name_tag.outputs.branch_name != 'master' && steps.get_branch_name_input.outputs.branch_name != 'master' }}
+        run: |
+          echo "Tag is on branch ${{ steps.get_branch_name.outputs.branch_name }}"
+          echo "Should be on Master branch instead"
+          exit -1
+      - name: Fail if running locally
+        if: ${{ !github.event.act }} # skip during local actions testing
+        run: |
+          echo "Running action locally. Failing"
+          exit -1
       - name: Set up Python 3.8
         uses: actions/setup-python@v4
         with:
```
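The branch-check steps above pipe `git log -1 --format='%D'` (the ref decorations of the tagged commit) through `sed` to extract the branch name after `origin/`, then publish it by appending a `key=value` line to the `$GITHUB_OUTPUT` file, which replaces the deprecated `::set-output` command. A rough local simulation of both mechanisms; the sample decoration string is made up for illustration:

```shell
# Made-up example of `git log -1 --format='%D' <tag>` output for a tagged
# commit that is also the tip of origin/master:
DECORATION="HEAD -> master, tag: v0.6.1, origin/master"

# The updated sed expression no longer requires a comma after the branch
# name, so it also matches when origin/<branch> is the last decoration:
BRANCH_NAME=$(echo "$DECORATION" | sed -e 's/.*origin\/\(.*\).*/\1/')
echo "$BRANCH_NAME"   # master

# GitHub Actions exposes a file path as $GITHUB_OUTPUT; steps publish
# outputs by appending key=value lines to it. Simulate with a temp file:
GITHUB_OUTPUT=$(mktemp)
echo "branch_name=${BRANCH_NAME}" >> "$GITHUB_OUTPUT"
cat "$GITHUB_OUTPUT"  # branch_name=master
```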

CHANGELOG.md

Lines changed: 18 additions & 2 deletions
```diff
@@ -1,5 +1,21 @@
 # Changelog
 
+## 0.6.1 - 🏗 Bug fixes and small improvement
+
+- Fix parsing keyword arguments of `compute_semivalues` dispatch function
+  [PR #333](https://github.com/appliedAI-Initiative/pyDVL/pull/333)
+- Create new `RayExecutor` class based on the concurrent.futures API,
+  use the new class to fix an issue with Truncated Monte Carlo Shapley
+  (TMCS) starting too many processes and dying, plus other small changes
+  [PR #329](https://github.com/appliedAI-Initiative/pyDVL/pull/329)
+- Fix creation of GroupedDataset objects using the `from_arrays`
+  and `from_sklearn` class methods
+  [PR #324](https://github.com/appliedAI-Initiative/pyDVL/pull/334)
+- Fix release job not triggering on CI when a new tag is pushed
+  [PR #331](https://github.com/appliedAI-Initiative/pyDVL/pull/331)
+- Added alias `ApproShapley` from Castro et al. 2009 for permutation Shapley
+  [PR #332](https://github.com/appliedAI-Initiative/pyDVL/pull/332)
+
 ## 0.6.0 - 🆕 New algorithms, cleanup and bug fixes 🏗
 
 - Fixes in `ValuationResult`: bugs around data names, semantics of
@@ -8,8 +24,8 @@
 - **New method**: Implements generalised semi-values for data valuation,
   including Data Banzhaf and Beta Shapley, with configurable sampling strategies
   [PR #319](https://github.com/appliedAI-Initiative/pyDVL/pull/319)
-- Adds kwargs parameter to `from_array` and `from_sklearn`
-  Dataset and GroupedDataset class methods
+- Adds kwargs parameter to `from_array` and `from_sklearn` Dataset and
+  GroupedDataset class methods
   [PR #316](https://github.com/appliedAI-Initiative/pyDVL/pull/316)
 - PEP-561 conformance: added `py.typed`
   [PR #307](https://github.com/appliedAI-Initiative/pyDVL/pull/307)
```

CONTRIBUTING.md

Lines changed: 103 additions & 7 deletions
````diff
@@ -261,6 +261,102 @@ sizeable amount of time, so care must be taken not to overdo it:
 2. We try not to trigger CI pipelines when unnecessary (see [Skipping CI
    runs](#skipping-ci-runs)).
 
+### Running Github Actions locally
+
+To run Github Actions locally we use [act](https://github.com/nektos/act).
+It uses the workflows defined in `.github/workflows` and determines
+the set of actions that need to be run. It uses the Docker API
+to either pull or build the necessary images, as defined
+in our workflow files, and finally determines the execution path
+based on the dependencies that were defined.
+
+Once it has the execution path, it then uses the Docker API
+to run containers for each action based on the images prepared earlier.
+The [environment variables](https://docs.github.com/en/actions/learn-github-actions/variables#default-environment-variables)
+and [filesystem](https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#file-systems)
+are all configured to match what GitHub provides.
+
+You can install it manually using:
+
+```shell
+curl -s https://raw.githubusercontent.com/nektos/act/master/install.sh | sudo bash -s -- -d -b ~/bin
+```
+
+And then simply add it to your PATH variable: `PATH=~/bin:$PATH`
+
+Refer to its official
+[readme](https://github.com/nektos/act#installation-through-package-managers)
+for more installation options.
+
+#### Cheatsheet
+
+```shell
+# List all actions for all events:
+act -l
+
+# List the actions for a specific event:
+act workflow_dispatch -l
+
+# List the actions for a specific job:
+act -j lint -l
+
+# Run the default (`push`) event:
+act
+
+# Run a specific event:
+act pull_request
+
+# Run a specific job:
+act -j lint
+
+# Collect artifacts to the /tmp/artifacts folder:
+act --artifact-server-path /tmp/artifacts
+
+# Run a job in a specific workflow (useful if you have duplicate job names):
+act -j lint -W .github/workflows/tox.yml
+
+# Run in dry-run mode:
+act -n
+
+# Enable verbose logging (can be used with any of the above commands):
+act -v
+```
+
+#### Example
+
+To run the `publish` job (the toughest one to test) with tag 'v0.6.0',
+you would simply use:
+
+```shell
+act push -j publish --eventpath events.json
+```
+
+With `events.json` containing:
+
+```json
+{
+  "ref": "refs/tags/v0.6.0"
+}
+```
+
+To run it as if it had been manually triggered (i.e. `workflow_dispatch`),
+you would instead use:
+
+```shell
+act workflow_dispatch -j publish --eventpath events.json
+```
+
+With `events.json` containing:
+
+```json
+{
+  "inputs": {
+    "tag": "v0.6.0"
+  }
+}
+```
+
 ### Skipping CI runs
 
 One sometimes would like to skip CI for certain commits (e.g. updating the
@@ -348,10 +444,10 @@ create a new release manually by following these steps:
 8. Pour yourself a cup of coffee, you earned it! :coffee: :sparkles:
 9. A package will be automatically created and published from CI to PyPI.
 
-### CI and requirements for releases
+### CI and requirements for publishing
 
-In order to release new versions of the package from the development branch, the
-CI pipeline requires the following secret variables set up:
+In order to publish new versions of the package from the development branch,
+the CI pipeline requires the following secret variables set up:
 
 ```
 TEST_PYPI_USERNAME
@@ -367,13 +463,13 @@ The last 2 are used in the [publish.yaml](.github/workflows/publish.yaml) CI
 workflow to publish packages to [PyPI](https://pypi.org/) from `develop` after
 a GitHub release.
 
-#### Release to TestPyPI
+#### Publish to TestPyPI
 
-We use [bump2version](https://pypi.org/project/bump2version/) to bump the build
-part of the version number, create a tag and push it from CI.
+We use [bump2version](https://pypi.org/project/bump2version/) to bump
+the build part of the version number and publish a package to TestPyPI from CI.
 
 To do that, we use 2 different tox environments:
 
 - **bump-dev-version**: Uses bump2version to bump the dev version,
-  without committing the new version or creating a corresponding git tag.
+  without committing the new version or creating a corresponding git tag.
 - **publish-test-package**: Builds and publishes a package to TestPyPI
````

README.md

Lines changed: 29 additions & 24 deletions
```diff
@@ -32,42 +32,47 @@ Data Valuation is the task of estimating the intrinsic value of a data point
 wrt. the training set, the model and a scoring function. We currently implement
 methods from the following papers:
 
-- Ghorbani, Amirata, and James Zou.
-  [Data Shapley: Equitable Valuation of Data for Machine Learning](http://proceedings.mlr.press/v97/ghorbani19c.html).
-  In International Conference on Machine Learning, 2242–51. PMLR, 2019.
+- Castro, Javier, Daniel Gómez, and Juan Tejada. [Polynomial Calculation of the
+  Shapley Value Based on Sampling](https://doi.org/10.1016/j.cor.2008.04.004).
+  Computers & Operations Research, Selected papers presented at the Tenth
+  International Symposium on Locational Decisions (ISOLDE X), 36, no. 5 (May 1,
+  2009): 1726–30.
+- Ghorbani, Amirata, and James Zou. [Data Shapley: Equitable Valuation of Data
+  for Machine Learning](http://proceedings.mlr.press/v97/ghorbani19c.html). In
+  International Conference on Machine Learning, 2242–51. PMLR, 2019.
 - Wang, Tianhao, Yu Yang, and Ruoxi Jia.
-  [Improving Cooperative Game Theory-Based Data Valuation via Data Utility Learning](https://doi.org/10.48550/arXiv.2107.06336).
-  arXiv, 2022.
-- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li,
-  Ce Zhang, Costas Spanos, and Dawn Song.
-  [Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms](https://doi.org/10.14778/3342263.3342637).
+  [Improving Cooperative Game Theory-Based Data Valuation via Data Utility
+  Learning](https://doi.org/10.48550/arXiv.2107.06336). arXiv, 2022.
+- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo
+  Li, Ce Zhang, Costas Spanos, and Dawn Song. [Efficient Task-Specific Data
+  Valuation for Nearest Neighbor Algorithms](https://doi.org/10.14778/3342263.3342637).
   Proceedings of the VLDB Endowment 12, no. 11 (1 July 2019): 1610–23.
-- Okhrati, Ramin, and Aldo Lipani.
-  [A Multilinear Sampling Algorithm to Estimate Shapley Values](https://doi.org/10.1109/ICPR48806.2021.9412511).
-  In 25th International Conference on Pattern Recognition (ICPR 2020), 7992–99.
-  IEEE, 2021.
-- Yan, T., & Procaccia, A. D.
-  [If You Like Shapley Then You’ll Love the Core]().
-  Proceedings of the AAAI Conference on Artificial Intelligence, 35(6) (2021): 5751-5759.
+- Okhrati, Ramin, and Aldo Lipani. [A Multilinear Sampling Algorithm to Estimate
+  Shapley Values](https://doi.org/10.1109/ICPR48806.2021.9412511). In 25th
+  International Conference on Pattern Recognition (ICPR 2020), 7992–99. IEEE,
+  2021.
+- Yan, T., & Procaccia, A. D. [If You Like Shapley Then You’ll Love the
+  Core](https://ojs.aaai.org/index.php/AAAI/article/view/16721). Proceedings of
+  the AAAI Conference on Artificial Intelligence, 35(6) (2021): 5751-5759.
 - Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve
-  Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J. Spanos.
-  [Towards Efficient Data Valuation Based on the Shapley Value](http://proceedings.mlr.press/v89/jia19a.html).
+  Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J. Spanos. [Towards Efficient
+  Data Valuation Based on the Shapley Value](http://proceedings.mlr.press/v89/jia19a.html).
   In 22nd International Conference on Artificial Intelligence and Statistics,
   1167–76. PMLR, 2019.
-- Wang, Jiachen T., and Ruoxi Jia.
-  [Data Banzhaf: A Robust Data Valuation Framework for Machine Learning](https://doi.org/10.48550/arXiv.2205.15466).
+- Wang, Jiachen T., and Ruoxi Jia. [Data Banzhaf: A Robust Data Valuation
+  Framework for Machine Learning](https://doi.org/10.48550/arXiv.2205.15466).
   arXiv, October 22, 2022.
-- Kwon, Yongchan, and James Zou.
-  [Beta Shapley: A Unified and Noise-Reduced Data Valuation Framework for Machine Learning](http://arxiv.org/abs/2110.14049).
+- Kwon, Yongchan, and James Zou. [Beta Shapley: A Unified and Noise-Reduced Data
+  Valuation Framework for Machine Learning](http://arxiv.org/abs/2110.14049).
   In Proceedings of the 25th International Conference on Artificial Intelligence
   and Statistics (AISTATS) 2022, Vol. 151. Valencia, Spain: PMLR, 2022.
 
 Influence Functions compute the effect that single points have on an estimator /
 model. We implement methods from the following papers:
 
-- Koh, Pang Wei, and Percy Liang.
-  [Understanding Black-Box Predictions via Influence Functions](http://proceedings.mlr.press/v70/koh17a.html).
-  In Proceedings of the 34th International Conference on Machine Learning,
+- Koh, Pang Wei, and Percy Liang. [Understanding Black-Box Predictions via
+  Influence Functions](http://proceedings.mlr.press/v70/koh17a.html). In
+  Proceedings of the 34th International Conference on Machine Learning,
   70:1885–94. Sydney, Australia: PMLR, 2017.
 
 # Installation
```

docs/30-data-valuation.rst

Lines changed: 15 additions & 10 deletions
```diff
@@ -314,9 +314,8 @@ values in pyDVL. First construct the dataset and utility, then call
     u=utility, mode="owen", n_iterations=4, max_q=200
 )
 
-There are more details on Owen
-sampling, and its variant *Antithetic Owen Sampling* in the documentation for the
-function doing the work behind the scenes:
+There are more details on Owen sampling, and its variant *Antithetic Owen
+Sampling* in the documentation for the function doing the work behind the scenes:
 :func:`~pydvl.value.shapley.montecarlo.owen_sampling_shapley`.
 
 Note that in this case we do not pass a
@@ -327,20 +326,26 @@ integration.
 Permutation Shapley
 ^^^^^^^^^^^^^^^^^^^
 
-An equivalent way of computing Shapley values appears often in the literature.
-It uses permutations over indices instead of subsets:
+An equivalent way of computing Shapley values (``ApproShapley``) appeared in
+:footcite:t:`castro_polynomial_2009` and is the basis for the method most often
+used in practice. It uses permutations over indices instead of subsets:
 
 $$
 v_u(x_i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)}
 [u(\sigma_{:i} \cup \{i\}) − u(\sigma_{:i})]
 ,$$
 
 where $\sigma_{:i}$ denotes the set of indices in permutation sigma before the
-position where $i$ appears. To approximate this sum (with $\mathcal{O}(n!)$ terms!)
-one uses Monte Carlo sampling of permutations, something which has surprisingly
-low sample complexity. By adding early stopping, the result is the so-called
-**Truncated Monte Carlo Shapley** (:footcite:t:`ghorbani_data_2019`), which is
-efficient enough to be useful in some applications.
+position where $i$ appears. To approximate this sum (which has $\mathcal{O}(n!)$
+terms!) one uses Monte Carlo sampling of permutations, something which has
+surprisingly low sample complexity. One notable difference wrt. the
+combinatorial approach above is that the approximations always fulfill the
+efficiency axiom of Shapley, namely $\sum_{i=1}^n \hat{v}_i = u(D)$ (see
+:footcite:t:`castro_polynomial_2009`, Proposition 3.2).
+
+By adding early stopping, the result is the so-called **Truncated Monte Carlo
+Shapley** (:footcite:t:`ghorbani_data_2019`), which is efficient enough to be
+useful in applications.
 
 .. code-block:: python
 
```
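The permutation formula in the hunk above lends itself to a compact Monte Carlo estimator: shuffle the indices, walk the permutation, and average the marginal contributions. The sketch below is a generic illustration of that estimator, not pyDVL's actual implementation; `permutation_shapley` and the toy additive utility are made up for the example:

```python
import random

def permutation_shapley(n, u, n_permutations=200, seed=0):
    """Monte Carlo estimate of Shapley values by sampling permutations."""
    rng = random.Random(seed)
    values = [0.0] * n
    indices = list(range(n))
    for _ in range(n_permutations):
        rng.shuffle(indices)
        prefix = set()
        u_prev = u(prefix)                 # utility of the empty coalition
        for i in indices:
            prefix.add(i)
            u_curr = u(prefix)
            values[i] += u_curr - u_prev   # marginal contribution of i
            u_prev = u_curr
    return [v / n_permutations for v in values]

# Toy additive utility: u(S) = sum of per-point "worth". For additive games
# the Shapley value of each point is exactly its own worth, so the estimate
# is exact here regardless of how many permutations are sampled.
worth = [1.0, 2.0, 3.0]
u = lambda s: sum(worth[i] for i in s)

estimates = permutation_shapley(3, u)
print(estimates)                       # [1.0, 2.0, 3.0]
# The efficiency axiom holds for every sampled permutation, hence for the mean:
print(sum(estimates) == u({0, 1, 2}))  # True
```

Note how each permutation's marginal contributions telescope to $u(D) - u(\emptyset)$, which is why the estimates satisfy the efficiency axiom mentioned in the diff.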
