Skip to content

Add cosmic database download wrapper #345

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions bio/cosmic-db/environment.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
channels:
- bioconda
- conda-forge
- defaults
dependencies:
- curl =7
- python >=3.6.2
- requests =2.25
9 changes: 9 additions & 0 deletions bio/cosmic-db/meta.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
name: cosmic-db
description: |
Download cosmic databases from https://cancer.sanger.ac.uk/cosmic/download
authors:
- Till Hartmann
input:
- genome build, cosmic release version, dataset and file name
Comment on lines +6 to +7
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Restructure the input section for clarity and completeness.

The current input description lacks structure and specificity. Consider restructuring it as a YAML list with more detailed information about each parameter, including possible values where applicable.

Here's a suggested improvement based on the information provided in the PR objectives:

input:
  - build:
      description: Genome build version
      choices: [GRCh37, GRCh38]
  - dataset:
      description: COSMIC dataset to download
      choices: [cosmic, cell_lines]
  - version:
      description: COSMIC release version
      choices: [v83, v84, v85, v86, v87, v88, v89, v90, v91, v92, latest]
  - file:
      description: Specific file to download (uses {db} wildcard)

Also, consider mentioning the required environment variables COSMIC_EMAIL and COSMIC_PW in the description or in a separate 'requirements' section.

output:
- database
16 changes: 16 additions & 0 deletions bio/cosmic-db/test/Snakefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
envvars:
"COSMIC_EMAIL",
"COSMIC_PW"

rule cosmic_download:
output:
"resources/{db}"
params:
build="GRCh38",
dataset="cosmic",
version="v92",
Comment on lines +9 to +11
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider making 'build', 'dataset', and 'version' params more flexible.

The current implementation hardcodes these values, which might limit the reusability of this rule. Consider making these parameters more flexible by using config values or wildcards.

Here's a suggestion for improvement:

params:
    build=config.get("cosmic_build", "GRCh38"),
    dataset=config.get("cosmic_dataset", "cosmic"),
    version=config.get("cosmic_version", "v92"),

This allows users to override these values in their config file while still providing sensible defaults.

file=lambda wc: wc.db # e.g. "CosmicHGNC.tsv.gz"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add some comment here explaining the meaning of file? And where to get the possible values from?

log:
"logs/cosmic-db/{db}.log"
wrapper:
"master/bio/cosmic-db"
69 changes: 69 additions & 0 deletions bio/cosmic-db/wrapper.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
"""Snakemake wrapper for trimming paired-end reads using cutadapt."""

Comment on lines +1 to +2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Update the docstring to reflect the correct functionality

The current docstring mentions trimming paired-end reads using cutadapt, which doesn't match the functionality of this wrapper. It should describe downloading files from the COSMIC database.

Apply this diff to correct the docstring:

-"""Snakemake wrapper for trimming paired-end reads using cutadapt."""
+"""Snakemake wrapper for downloading files from the COSMIC database."""
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"""Snakemake wrapper for trimming paired-end reads using cutadapt."""
"""Snakemake wrapper for downloading files from the COSMIC database."""

__author__ = "Till Hartmann"
__copyright__ = "Copyright 2021, Till Hartmann"
__email__ = "till.hartmann@udo.edu"
__license__ = "MIT"

import requests
import os
from typing import List
from snakemake.shell import shell

email = os.environ["COSMIC_EMAIL"]
password = os.environ["COSMIC_PW"]
assert email, "$COSMIC_EMAIL is not set"
assert password, "$COSMIC_PW is not set"
Comment on lines +13 to +16
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Handle missing environment variables to avoid KeyError

Accessing environment variables with os.environ["VAR"] will raise a KeyError if the variable is not set. Use os.environ.get("VAR") and handle the case where it returns None or an empty string.

Apply this diff to improve error handling:

-email = os.environ["COSMIC_EMAIL"]
-password = os.environ["COSMIC_PW"]
-assert email, "$COSMIC_EMAIL is not set"
-assert password, "$COSMIC_PW is not set"
+email = os.environ.get("COSMIC_EMAIL")
+password = os.environ.get("COSMIC_PW")
+if not email:
+    raise EnvironmentError("$COSMIC_EMAIL environment variable is not set")
+if not password:
+    raise EnvironmentError("$COSMIC_PW environment variable is not set")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
email = os.environ["COSMIC_EMAIL"]
password = os.environ["COSMIC_PW"]
assert email, "$COSMIC_EMAIL is not set"
assert password, "$COSMIC_PW is not set"
email = os.environ.get("COSMIC_EMAIL")
password = os.environ.get("COSMIC_PW")
if not email:
raise EnvironmentError("$COSMIC_EMAIL environment variable is not set")
if not password:
raise EnvironmentError("$COSMIC_PW environment variable is not set")


COSMIC_URL = "https://cancer.sanger.ac.uk/cosmic/file_download"


def available_builds() -> List[str]:
builds = requests.get(COSMIC_URL).json()
return builds

Comment on lines +21 to +24
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Add error handling to available_builds function

The available_builds function performs an HTTP request without checking for errors. If the request fails or returns a non-200 status code, it could cause exceptions or return invalid data.

Apply this diff to handle potential errors:

 def available_builds() -> List[str]:
-    builds = requests.get(COSMIC_URL).json()
+    response = requests.get(COSMIC_URL)
+    response.raise_for_status()
+    builds = response.json()
     return builds
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def available_builds() -> List[str]:
builds = requests.get(COSMIC_URL).json()
return builds
def available_builds() -> List[str]:
response = requests.get(COSMIC_URL)
response.raise_for_status()
builds = response.json()
return builds


def available_datasets(build: str) -> List[str]:
datasets = requests.get(f"{COSMIC_URL}/{build}").json()
return [d.rpartition("/")[-1] for d in datasets]

Comment on lines +26 to +29
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Add error handling to available_datasets function

Similar to available_builds, the available_datasets function should handle potential HTTP errors to ensure robustness.

Apply this diff:

 def available_datasets(build: str) -> List[str]:
-    datasets = requests.get(f"{COSMIC_URL}/{build}").json()
+    response = requests.get(f"{COSMIC_URL}/{build}")
+    response.raise_for_status()
+    datasets = response.json()
     return [d.rpartition("/")[-1] for d in datasets]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def available_datasets(build: str) -> List[str]:
datasets = requests.get(f"{COSMIC_URL}/{build}").json()
return [d.rpartition("/")[-1] for d in datasets]
def available_datasets(build: str) -> List[str]:
response = requests.get(f"{COSMIC_URL}/{build}")
response.raise_for_status()
datasets = response.json()
return [d.rpartition("/")[-1] for d in datasets]


def available_versions(build: str, dataset: str) -> List[str]:
versions = requests.get(f"{COSMIC_URL}/{build}/{dataset}").json()
return [v.rpartition("/")[-1] for v in versions]

Comment on lines +31 to +34
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Add error handling to available_versions function

The available_versions function should check the HTTP response for errors to prevent unexpected exceptions.

Apply this diff:

 def available_versions(build: str, dataset: str) -> List[str]:
-    versions = requests.get(f"{COSMIC_URL}/{build}/{dataset}").json()
+    response = requests.get(f"{COSMIC_URL}/{build}/{dataset}")
+    response.raise_for_status()
+    versions = response.json()
     return [v.rpartition("/")[-1] for v in versions]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def available_versions(build: str, dataset: str) -> List[str]:
versions = requests.get(f"{COSMIC_URL}/{build}/{dataset}").json()
return [v.rpartition("/")[-1] for v in versions]
def available_versions(build: str, dataset: str) -> List[str]:
response = requests.get(f"{COSMIC_URL}/{build}/{dataset}")
response.raise_for_status()
versions = response.json()
return [v.rpartition("/")[-1] for v in versions]


def available_files(build: str, dataset: str, version: str) -> List[str]:
files = requests.get(f"{COSMIC_URL}/{build}/{dataset}/{version}").json()
return [f.rpartition("/")[-1] for f in files]

Comment on lines +36 to +39
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Add error handling to available_files function

Include error handling for HTTP requests in the available_files function to handle potential request failures.

Apply this diff:

 def available_files(build: str, dataset: str, version: str) -> List[str]:
-    files = requests.get(f"{COSMIC_URL}/{build}/{dataset}/{version}").json()
+    response = requests.get(f"{COSMIC_URL}/{build}/{dataset}/{version}")
+    response.raise_for_status()
+    files = response.json()
     return [f.rpartition("/")[-1] for f in files]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def available_files(build: str, dataset: str, version: str) -> List[str]:
files = requests.get(f"{COSMIC_URL}/{build}/{dataset}/{version}").json()
return [f.rpartition("/")[-1] for f in files]
def available_files(build: str, dataset: str, version: str) -> List[str]:
response = requests.get(f"{COSMIC_URL}/{build}/{dataset}/{version}")
response.raise_for_status()
files = response.json()
return [f.rpartition("/")[-1] for f in files]


def download_path(build: str, dataset: str, version: str, file: str) -> str:
return f"{COSMIC_URL}/{build}/{dataset}/{version}/{file}"


build = snakemake.params.get("build", "")
dataset = snakemake.params.get("dataset", "")
version = snakemake.params.get("version", "")
file = snakemake.params.get("file", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

builds = available_builds()
assert build in builds, f"{build} is not available. Choose one of: {builds}."

datasets = available_datasets(build)
assert dataset in datasets, f"{dataset} is not available. Choose one of: {datasets}."

versions = available_versions(build, dataset)
assert version in versions, f"{version} is not available. Choose one of: {versions}."

files = available_files(build, dataset, version)
assert file in files, f"{file} is not available. Choose one of: {files}."

download_url = (
requests.get(download_path(build, dataset, version, file), auth=(email, password))
.json()["url"]
.strip()
)
Comment on lines +63 to +67
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Add error handling when retrieving the download URL

The code assumes the HTTP request is successful and the response contains a 'url' key. If the request fails or the 'url' key is missing, it could cause an exception. Please add error handling to manage these cases.

Apply this diff to improve robustness:

 download_url = (
-    requests.get(download_path(build, dataset, version, file), auth=(email, password))
-    .json()["url"]
-    .strip()
+    response = requests.get(download_path(build, dataset, version, file), auth=(email, password))
+    response.raise_for_status()
+    json_response = response.json()
+    download_url = json_response.get("url", "").strip()
+    if not download_url:
+        raise Exception("Failed to retrieve download URL from the response.")
 )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
download_url = (
requests.get(download_path(build, dataset, version, file), auth=(email, password))
.json()["url"]
.strip()
)
download_url = (
response = requests.get(download_path(build, dataset, version, file), auth=(email, password))
response.raise_for_status()
json_response = response.json()
download_url = json_response.get("url", "").strip()
if not download_url:
raise Exception("Failed to retrieve download URL from the response.")
)


shell('curl "{download_url}" -o {snakemake.output[0]} {log}')
7 changes: 7 additions & 0 deletions test.py
Original file line number Diff line number Diff line change
Expand Up @@ -125,6 +125,13 @@ def run(wrapper, cmd, check_log=None):
os.chdir(origdir)


@skip_if_not_modified
def test_download_cosmic_db():
run(
"bio/cosmic-db",
["snakemake", "--cores", "1", "--use-conda", "resources/CosmicHGNC.tsv.gz"],
)

@skip_if_not_modified
def test_biobambam2_bamsormadup():
run(
Expand Down
Loading