Usage#

After installation, the workflow can be run via the ATAC command. To get an overview of the available options, run:

ATAC -h

Required files#

The workflow requires at least the following files/paths to run:

  • A directory containing (assumed to be deduplicated) BAM or CRAM files.

  • A GTF file containing the gene models.

  • A FASTA file containing the reference genome sequence.

  • A bed file containing read attracting regions. This should at least contain the mitochondrial chromosome, but can also contain other regions that you want to be filtered out.

All other command line options are depicted in the section below <all-command-line-options_>.

The default output (which will always be generated) is as such:

  • Directory bw containing RPKM and scalefactor (calculated on signal only) bigwig files for each sample.

  • Directory figures, containing the read statistics before and after filtering (fragmentsizes.png and alignmentsieve.png, respectively). The FRIP scores (fripscores.png), fraction of mitochondrial reads (mitofraction.png) and a PCA plot of the samples.

  • Directory peaks, containing the called peaks for each sample.

  • Directory peakset, containing the unionized peak set, a countmatrix, the calculated scalefactors and the peak-to-gene annotations.

  • Directory qc, contains the numeric data underlying the figures

  • Directory sieve, contains the filtered (blacklist / NFR) BAM files.

Differential accessibility#

Samplesheet#

Differential accessibility analysis can be performed by specifying a samplesheet and a comparison file. The samplesheet should be tab-separated and should (aside from the sample column), contain at least one column with a covariate of interest. Note that per unique combination of covariates, at least two instances (i.e. replicates) need to be present.

sample

factor1

time

s1

WT

0

s2

WT

0

s3

WT

1

s4

WT

1

s5

KO

0

s6

KO

0

s7

KO

1

s8

KO

1

Keep in mind that when a samplesheet is provided, only samples that are present in that samplesheet are considered for the analysis. If no samplesheet is provided, all BAM/CRAM files in the specified directory are included in the analysis.

Comparison#

The actual comparison/analysis to be made needs to be specified in a separate YAML file. It is possible to request multiple analyses in one file. Multiple comparison files in a single run are not possible. Three different types of analyses can be requested, simple two-group comparisons, an LRT test, and a timecourse analysis. Note that all the requested analyses need to have a unique comparison_name. If not, only the last entry will be performed.

Twogroup comparisons#

The twogroup comparison is he most basic comparison to be made. It can be specified as such:

comparison_name:
  type: twogroup
  design: ~factor1*time
  groups:
    my_control_group:
      factor1: WT
    my_treatment_group:
      factor1: KO

design specification in comparison entries is optional. If not specified, the default design will include all factors in the samplesheet, without any interaction terms. In the above example, the design includes an interaction term between factor1 and time (and expands to ~factor1 + time + factor1:time). Allowed characters in the design specification are (aside from the tilde ~) +, * and :. It’s possible to have multiple values for a factor, and it’s also possible to have multiple factors specified per group.

For example this is valid:

comparison_name:
  type: twogroup
  design: ~celltype*drug
  groups:
    celltypes12:
      celltype:
        - celltype1
        - celltype2
    celltypes34:
      celltype:
        - celltype3
        - celltype4

This will detect the difference between celltype1 and celltype2 on one hand, and celltype3 and celltype4 on the other hand. Keep in mind that in the above example, drug is included in the design but not specified in the groups. This means that the effect will be averaged over the different drug conditions. In case you want to restrict the comparison to, for example, DMSO treated samples only, you can specify the following:

comparison_name:
  type: twogroup
  design: ~celltype*drug
  groups:
    celltypes12:
      celltype:
        - celltype1
        - celltype2
      drug: DMSO
    celltypes34:
      celltype:
        - celltype3
        - celltype4
      drug: DMSO

The output for the differential testing will be stored in a folder of their respective type (in this case, twogroup), with subfolder named after the comparison_name (first key in an entry). Included are a table with the results of the differential testing (standard edgeR output), which also includes a column with the assigned group label (based on the sign of the estimated log fold change) and the actual peak ID. Keep in mind that the group label is set irrespective of the FDR, so filter first before you interpret the group labels. Additionally, an MAplot is generated, and potentially a heatmap too (depending on the number of differential sites). The cut offs to call something a differential peak can be set by arguments in the ATAC command, with –fdr_cutoff (1e-3 by default), and –lfc_cutoff (0 by default). An lfc_cutoff of 2 would classify peaks with their estimated fold change either higher than 2 or lower than -2 as differential.

LRT#

A second comparison type that can be requested in the comparison YAML file is an LRT test. This allows a simple way to test multiple factors at the same time. Requesting an LRT can be done as such:

comparison_name:
  type: lrt
  design: ~celltype*drug
  reduced: ~drug

Above design will test for peaks that are differentially accessible between celltypes. If celltype itself has only two levels, this is equivalent to a two-group comparison. However, if there are more than two celltypes, this will test for any difference between the celltypes, without specifying which celltypes are different from each other. Keep in mind that the reduced design should be nested within the full design, meaning that all factors and interaction terms in the reduced design should also be present in the full design.

The output here is similar to the two-group comparison output (though under the lrt subfolder), with the difference that there is no group label in the edgeR output. Instead, a rudimentary k-means clustering is ran on the significant peaks (calculating k based on the inertia metrics). Note that here –fdr_cutoff is relevant, but –lfc_cutoff is not, even though fold changes are estimated per dropped factor level and available in the output table too. By default, if at least 1000 peaks (–min_sigpeaks) are scored differential under the LRT test, postprocessing is ran. This number can be controlled with the –lrt-peaks flag.

Timecourse#

A third comparison type is a timecourse analysis. Two different timecourse modes are possible, with both of them based on Gaussian process regression. Note that timecourse analyses could also be performed by using the LRT comparison type, and dropping the time factor from the reduced design. The assumptions made there are however different: if time is encoded as a continuous variable, the effect is presumed to be monotonic with how time is encoded in the design (linear, quadratic, etc.). If time is encoded as a categorical variable, then the effect is not presumed to be monotonic, but the inherent ordering is not taking into account. To tackle this, the timecourse comparison allows a setup with time being treated as a continuous variable, but without presuming monotonicity. This can be specified as such:

comparison_name:
  type: timecourse
  time: 'time'
  time_type: 'continuous'

Note that here you cannot specify a design, and all covariates in the samplesheet will be included in the analysis by default. Running the above comparison will run a Gaussian process regression per peak, and test whether the result is substantially different from a flat kernel by taking the marginal likelihood ratio between them. Since this cannot be directly tested with a ‘regular’ chi-squared test, a permutation-based approach is used to calculate a p-value, that is afterwards corrected with a Benjamini-Hochberg correction for multiple testing. The output is written to the gp folder, with the main results being written to a table with the likelihood ratio, p-value and FDR for each peak. Additionally, an array with predicted standardized accessibility scores (y_pred) and one with the actual standardized values (y_std) are also written out. A similar table, but now only with significant peaks (by default FDR < 0.01, set with –permutation_cutoff) also exists, with the addition that here the peaks are clustered (k-means setup, similar to the one performed in the LRT comparison type). A visualization of these clusters, and their individual peak trajectories (standardized) is included in the folder too.

A second timecourse mode can be used when time is not encoded as a continuous variable, but as an ordinal variable. Typical use cases for this mode are differentiation trajectories. Here it’s important that order needs to be specified, with the ‘earliest’ time point being the first. Note that not all possible values of the time factor listed in the samplesheet need to be included here (samples with missing levels are simply ignored).

comparison_name:
  type: timecourse
  time: 'celltype'
  time_type: 'ordinal'
  order:
    - 'preT_DN1'
    - 'preT_DN2b'
    - 'preT_DN3'
    - 'TDP'
    - 'T4'

A similar Gaussian process regression is ran here as in the continuous case, with the exception that the timepoints are re-encoded between 0 and 1, and the distance between subsequent levels is included in the kernel. The output is similar to the output in the continuous case, with the addition of the estimated relative distance between time points per peak in a separate table. Cut offs here are controlled by –permutation_cutoff (default 1e-2), which is performed on the bh-corrected permutation p-values. The number of iterations can be set with –permutation_iterations and defaults to 1000, per peak. Note that increasing this number could lead to exorbitant runtimes (and in fact runtimes with 1000 iterations are already quite long). In case you are running a continuous timecourse, the number of timepoints for predicted curve fits can be set with –gp_timesteps, and defaults to 10.

For both ordinal and continuous timecourse analyses, time*covariate interactions can be tested by including an interaction cell in the comparison yaml. For example:

comparison_name:
  type: timecourse
  time: 'time'
  time_type: 'continuous'
  interaction: 'treatment'

Will run the original analysis (which peaks are time-sensitive, while controlling for other covariates), but will also run a second analysis, which test for the interaction between time and treatment (which peaks respond differently over time for different covariate levels). The outputs are the same as the original analysis, with the addition that predictions are done per covariate level. In case you have more then one covariate to test, interaction can also be specified as a list. Here one default analysis will be ran for time, and two interaction analyses:

comparison_name:
  type: timecourse
  time: 'time'
  time_type: 'continuous'
  interaction:
    - 'treatment'
    - 'celltype'

All the output is written into the comparison_name folder inside the gp folder.

Post processing#

Regardless of what differential analysis is performed, post processing (motif enrichment and footprinting) can be performed on significant peak groups (either the up- and down peaks in a two-group comparison, or the clusters in an LRT or timecourse analysis), given that there are enough peaks to work with (–min_sigpeaks). Enabling this mode is done by specifying a –motifs file (in MEME format) upon runtime. This should contain all of the motifs of interest that you want to test for. Note that by default, motifs are clustered first to avoid redundant hits. Note that you cannot specify a motifs file without also providing a samplesheet. Motif enrichment is performed with ame (MEME suite). In all cases, enrichment of a group is performed against the same group shuffled (’–control –shuffle–’ argument in ame).

Per default, footprinting is also performed with TOBIAS. If a samplesheet is provided, then BAM files are subsampled to the minimum depth within a group (here defined as the unique combination of covariate levels) to avoid biases in footprinting due to differences in sequencing depth, and merged together. These merged bam files are then corrected (ATACorrect module) and scored (ScoreBigwig module). If additionally, motifs are provided and at least one comparison yielded significant peaks and subsequent enriched motifs, footprinting is performed for the enriched motifs (by searching for motif occurences with fimo (MEME suite)), and plotting the aggregated corrected signal over them (plotAggregate module).

Motif enrichment results are available under the motifs folder in the output directory, the footprinting results are under the footprints folder.

All command line options#

ATAC#

Usage

ATAC [OPTIONS]

Options

-i, --bamdir <bamdir>#

Required Specify directory that contains your bamfiles.

-o, --outputdir <outputdir>#

Required Specify output directory.

-g, --gtf <gtf>#

Required Specify a gtf file containing gene annotation. Will be used to extract TSS.

-r, --genomefasta <genomefasta>#

Required Specify a fasta file that contains the reference genome.

-b, --readattractingregions <readattractingregions>#

Required Specify a bed file containing read attracting regions. Should contain the mitochondrial genome at least.

-p, --snakemakeprofile <snakemakeprofile>#

specify the name of your snakemake profile.

-@, --threads <threads>#

specify the number of threads to use. Only relevant if no snakemake profile is given.

Default:

1

-m, --motifs <motifs>#

Specify a file containing motifs. Needs to be in meme format. If not provided, no motif enrichment/footprinting will be ran.

-f, --fragsize <fragsize>#

Specify the maximum fragment size (bps) to be considered for peak calling. Sits at 150 bps by default to only use reads from nucleosome-free regions.

Default:

150

--samplesheet <samplesheet>#

specify a samplesheet (as a tsv file). Refer to the documentation on how to format this file.

--comparison <comparison>#

specify yaml file with comparisons. Required if a samplesheet is given. Refer to the documentation on how to format this file.

--mitostring <mitostring>#

Name of the mitochondrial contig (as in the reference genome / BAM file). Defaults to MT.

Default:

'MT'

--upstreamuro <upstreamuro>#

Maximum permitted distance upstream of a feature (peak annotation).

Default:

50000

--downstreamuro <downstreamuro>#

Maximum permitted distance downstream of a feature (peak annotation).

Default:

50000

--featureuro <featureuro>#

the feature in the GTF file (column 3) to use for peak annotation.

Default:

'gene'

--pseudocount <pseudocount>#

Pseudocount to add to the count matrix prior to differential calling. Only relevant in two-group mode.

Default:

8

--peakset <peakset>#

Include an external peak file (bed format). If not provided, the union of all peaks across all samples will be used to generate a count matrix.

--permutation_cutoff <permutation_cutoff>#

A p-value cutoff to determine significance after permutation testing (relevant for timecourse mode).

Default:

0.01

--permutation_iterations <permutation_iterations>#

Number of permutations to perform for significance testing (relevant for timecourse mode).

Default:

1000

--fdr_cutoff <fdr_cutoff>#

The FDR cutoff to determine significance after wald/LRT test (relevant for two-group and LRT modes).

Default:

0.001

--lfc_cutoff <lfc_cutoff>#

The log2FoldChange cutoff used to determine significance (combined with fdr_cutoff, only relevant for two-group mode).

Default:

0

--min_sigpeaks <min_sigpeaks>#

Minimum number of significant peaks in LRT/GP mode to continue with downstream analysis.

Default:

100

--gp_timesteps <gp_timesteps>#

Number of timesteps to use for gaussian process regression.

Default:

10

--gp_alpha <gp_alpha>#

the variance for the additional noise term in the time-course mode. Higher values result in more smoothing (flatter curves). Lower values result in less smoothing (more jagged fits).

Default:

0.1