
Groovy Essentials for Nextflow Developers

Nextflow is built on Apache Groovy, a powerful dynamic language that runs on the Java Virtual Machine. While Nextflow provides the workflow orchestration framework, Groovy provides the programming language foundations that make your workflows flexible, maintainable, and powerful.

Understanding where Nextflow ends and Groovy begins is crucial for effective workflow development. Nextflow provides channels, processes, and workflow orchestration, while Groovy handles data manipulation, string processing, conditional logic, and general programming tasks within your workflow scripts.

Many Nextflow developers struggle with distinguishing when to use Nextflow versus Groovy features, processing file names and configurations, and handling errors gracefully. This side quest will bridge that gap by taking you on a journey from basic workflow concepts to production-ready pipeline mastery.

We'll transform a simple CSV-reading workflow into a sophisticated, production-ready bioinformatics pipeline that can handle any dataset thrown at it. Starting with a basic workflow that processes sample metadata, we'll evolve it step-by-step through realistic challenges you'll face in production:

  • Messy data? We'll add robust parsing and null-safe operators, learning to distinguish between Nextflow and Groovy constructs
  • Complex file naming schemes? We'll master regex patterns and string manipulation for bioinformatics file names
  • Need intelligent sample routing? We'll implement conditional logic and strategy selection, transforming file collections into command-line arguments
  • Worried about failures? We'll add comprehensive error handling and validation patterns
  • Code getting repetitive? We'll learn functional programming with closures and composition, mastering essential Groovy operators like safe navigation and Elvis
  • Processing thousands of samples? We'll leverage powerful collection operations for file path manipulations

0. Warmup

0.1. Prerequisites

Before taking on this side quest you should:

  • Complete the Hello Nextflow tutorial
  • Understand basic Nextflow concepts (processes, channels, workflows)
  • Have basic familiarity with Groovy syntax (variables, maps, lists)

This tutorial will explain Groovy concepts as we encounter them, so you don't need extensive prior Groovy knowledge. We'll start with fundamental concepts and build up to advanced patterns.

0.2. Starting Point

Let's move into the project directory and explore our working materials.

Navigate to project directory
cd side-quests/groovy_essentials

You'll find a data directory with sample files and a main workflow file that we'll evolve throughout this tutorial.

Directory contents
> tree
.
├── data
│   ├── metadata
│   │   └── analysis_parameters.yaml
│   ├── samples.csv
│   └── sequences
│       ├── sample_001.fastq
│       ├── sample_002.fastq
│       └── sample_003.fastq
├── main.nf
├── nextflow.config
├── README.md
└── templates
    └── analysis_script.sh

5 directories, 9 files

Our sample CSV contains information about biological samples that need different processing based on their characteristics:

samples.csv
sample_id,organism,tissue_type,sequencing_depth,file_path,quality_score
SAMPLE_001,human,liver,30000000,data/sequences/sample_001.fastq,38.5
SAMPLE_002,mouse,brain,25000000,data/sequences/sample_002.fastq,35.2
SAMPLE_003,human,kidney,45000000,data/sequences/sample_003.fastq,42.1

We'll use this realistic dataset to explore practical Groovy techniques that you'll encounter in real bioinformatics workflows.


1. Nextflow vs Groovy: Understanding the Boundaries

1.1. Identifying What's What

One of the most common sources of confusion for Nextflow developers is understanding when they're working with Nextflow constructs versus Groovy language features. Let's build a workflow step by step to see how they work together.

Step 1: Basic Nextflow Workflow

Start with a simple workflow that just reads the CSV file:

main.nf
workflow {
    ch_samples = Channel.fromPath("./data/samples.csv")
        .splitCsv(header: true)
        .view()
}

The workflow block defines our pipeline structure, while Channel.fromPath() creates a channel from a file path. The .splitCsv() operator processes the CSV file and converts each row into a map data structure.

Run this workflow to see the raw CSV data:

Test basic workflow
nextflow run main.nf

You should see output like:

Raw CSV data
[sample_id:SAMPLE_001, organism:human, tissue_type:liver, sequencing_depth:30000000, file_path:data/sequences/sample_001.fastq, quality_score:38.5]
[sample_id:SAMPLE_002, organism:mouse, tissue_type:brain, sequencing_depth:25000000, file_path:data/sequences/sample_002.fastq, quality_score:35.2]
[sample_id:SAMPLE_003, organism:human, tissue_type:kidney, sequencing_depth:45000000, file_path:data/sequences/sample_003.fastq, quality_score:42.1]

Step 2: Adding the Map Operator

Now let's add the .map() operator, which is a Nextflow channel operator (not to be confused with the map data structure we'll see below). This operator takes a closure where we can write Groovy code to transform each item.

A closure is a block of code that can be passed around and executed later. Think of it as a function that you define inline. In Groovy, closures are written with curly braces { } and can take parameters. They're fundamental to how Nextflow operators work.
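Here is a minimal, throwaway closure on its own, outside of any Nextflow operator (the name double_it is just for illustration):

A minimal Groovy closure
def double_it = { value -> value * 2 }
println double_it(5)    // prints 10
println double_it(21)   // prints 42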

main.nf
    ch_samples = Channel.fromPath("./data/samples.csv")
        .splitCsv(header: true)
        .map { row ->
            return row
        }
        .view()

The .map { row -> ... } operator takes a closure in which row represents each item from the channel. The parameter name is arbitrary - you can call it anything you want: .map { item -> ... } or .map { sample -> ... } would work exactly the same way.

When Nextflow processes each item in the channel, it passes that item to your closure as the parameter you named. So if your channel contains CSV rows, row will hold one complete row at a time.
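If a closure declares no parameter at all, Groovy exposes its single argument under the implicit name it - a shorthand you'll see in later examples:

Implicit closure parameter
// These two closures do the same thing
def explicit = { row -> row.sample_id }
def implicit = { it.sample_id }
println explicit([sample_id: 'SAMPLE_001'])   // SAMPLE_001
println implicit([sample_id: 'SAMPLE_001'])   // SAMPLE_001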

Apply this change and run the workflow:

Test map operator
nextflow run main.nf

You'll see the same output as before, because we're simply returning the input unchanged. This confirms that the map operator is working correctly. Now let's start transforming the data.

Step 3: Creating a Map Data Structure

Now we're going to write pure Groovy code inside our closure. Everything from this point forward is Groovy syntax and methods, not Nextflow operators.

main.nf
    ch_samples = Channel.fromPath("./data/samples.csv")
        .splitCsv(header: true)
        .map { row ->
            // This is all Groovy code now!
            def sample_meta = [
                id: row.sample_id.toLowerCase(),
                organism: row.organism,
                tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(),
                depth: row.sequencing_depth.toInteger(),
                quality: row.quality_score.toDouble()
            ]
            return sample_meta
        }
        .view()

Notice how we've left Nextflow syntax behind and are now writing pure Groovy code. A map is a key-value data structure similar to dictionaries in Python, objects in JavaScript, or hashes in Ruby. It lets us store related pieces of information together. In this map, we're storing the sample ID, organism, tissue type, sequencing depth, and quality score.

We use Groovy's string manipulation methods like .toLowerCase() and .replaceAll() to clean up our data, and type conversion methods like .toInteger() and .toDouble() to convert string data from the CSV into the appropriate numeric types.

Apply this change and run the workflow:

Test map data structure
nextflow run main.nf

You should see the refined map output like:

Transformed metadata
[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5]
[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2]
[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1]

Step 4: Adding Conditional Logic

Now let's add more Groovy logic - this time using a ternary operator to make decisions based on data values.

main.nf
    ch_samples = Channel.fromPath("./data/samples.csv")
        .splitCsv(header: true)
        .map { row ->
            def sample_meta = [
                id: row.sample_id.toLowerCase(),
                organism: row.organism,
                tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(),
                depth: row.sequencing_depth.toInteger(),
                quality: row.quality_score.toDouble()
            ]
            def priority = sample_meta.quality > 40 ? 'high' : 'normal'
            return sample_meta + [priority: priority]
        }
        .view()

The ternary operator is a shorthand for an if/else statement that follows the pattern condition ? value_if_true : value_if_false. This line means: "If the quality is greater than 40, use 'high', otherwise use 'normal'".

The map addition operator + creates a new map rather than modifying the existing one. This line creates a new map that contains all the key-value pairs from sample_meta plus the new priority key.
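Both behaviors are easy to verify with a couple of lines of plain Groovy:

Ternary and map addition in plain Groovy
def sample_meta = [id: 'sample_003', quality: 42.1]
def priority = sample_meta.quality > 40 ? 'high' : 'normal'
def enriched = sample_meta + [priority: priority]
println enriched      // [id:sample_003, quality:42.1, priority:high]
println sample_meta   // [id:sample_003, quality:42.1] - the original map is unchanged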

Apply this change and run the workflow:

Test conditional logic
nextflow run main.nf

You should see output like:

Metadata with priority
[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, priority:normal]
[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, priority:normal]
[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, priority:high]

Step 5: Combining Maps and Returning Results

Finally, let's use Groovy's map addition operator to combine our metadata, then return a tuple that follows Nextflow's standard pattern.

main.nf
    ch_samples = Channel.fromPath("./data/samples.csv")
        .splitCsv(header: true)
        .map { row ->
            def sample_meta = [
                id: row.sample_id.toLowerCase(),
                organism: row.organism,
                tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(),
                depth: row.sequencing_depth.toInteger(),
                quality: row.quality_score.toDouble()
            ]
            def priority = sample_meta.quality > 40 ? 'high' : 'normal'
            return [sample_meta + [priority: priority], file(row.file_path)]
        }
        .view()

This returns a tuple containing the enriched metadata and the file path, which is the standard pattern for passing data to processes in Nextflow.
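To see why this shape is convenient, here is a sketch of a hypothetical downstream process (not part of this tutorial's files) whose input declaration matches the [meta, file] tuple we just built:

Sketch: a process consuming the tuple
// Hypothetical process - its input matches the [meta, file] tuple shape
process EXAMPLE_QC {
    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("${meta.id}_qc.txt")

    script:
    """
    echo "QC for ${meta.id} (${meta.organism}, priority: ${meta.priority})" > ${meta.id}_qc.txt
    """
}

Calling EXAMPLE_QC(ch_samples) in the workflow body would hand each process invocation its metadata map alongside the corresponding FASTQ file.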

Apply this change and run the workflow:

Test complete workflow
nextflow run main.nf

You should see output like:

Complete workflow output
[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/sample_001.fastq]
[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/sample_002.fastq]
[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, priority:high], /workspaces/training/side-quests/groovy_essentials/data/sequences/sample_003.fastq]

Note

Key Pattern: Nextflow operators often take closures { ... } as parameters. Everything inside these closures is Groovy code. This is how Nextflow orchestrates workflows while Groovy handles the data processing logic.

Note

Maps and Metadata: Maps are fundamental to working with metadata in Nextflow. For a more detailed explanation of working with metadata maps, see the Working with metadata side quest.

Our workflow demonstrates the core pattern: Nextflow constructs (workflow, Channel.fromPath(), .splitCsv(), .map(), .view()) orchestrate data flow, while basic Groovy constructs (maps [key: value], string methods, type conversions, ternary operators) handle the data processing logic inside the .map() closure.

1.2. Distinguishing Nextflow operators from Groovy functions

Having a clear understanding of which parts of your code are using basic Groovy is especially important when syntax overlaps between the two languages.

A perfect example of this confusion is the collect operation, which exists in both contexts but does completely different things. Groovy's collect transforms each element, while Nextflow's collect gathers all channel elements into a single-item channel.

Let's demonstrate this with some sample data. Check out collect.nf:

collect.nf
// Demonstrate Groovy vs Nextflow collect
def sample_ids = ['sample_001', 'sample_002', 'sample_003']

println "=== GROOVY COLLECT (transforms each item, keeps same structure) ==="
// Groovy collect: transforms each element but maintains list structure
def formatted_ids = sample_ids.collect { id ->
    id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_')
}
println "Original list: ${sample_ids}"
println "Groovy collect result: ${formatted_ids}"
println "Groovy collect maintains structure: ${formatted_ids.size} items (same as original)"
println ""

println "\n=== NEXTFLOW COLLECT (groups multiple items into single emission) ==="
// Nextflow collect: groups channel elements into a single emission
ch_input = Channel.of('sample_001', 'sample_002', 'sample_003')

// Show individual items before collect
ch_input.view { "Individual channel item: ${it}" }

// Collect groups all items into a single emission
ch_collected = ch_input.collect()
ch_collected.view { "Nextflow collect result: ${it} (${it.size()} items grouped together)" }

Run the workflow to see both collect operations in action:

Test collect operations
nextflow run collect.nf
Different collect behaviors
 N E X T F L O W   ~  version 25.04.6

Launching `collect.nf` [silly_bhaskara] DSL2 - revision: 5ef004224c

=== GROOVY COLLECT (transforms each item, keeps same structure) ===
Original list: [sample_001, sample_002, sample_003]
Groovy collect result: [SPECIMEN_001, SPECIMEN_002, SPECIMEN_003]
Groovy collect maintains structure: 3 items (same as original)

=== NEXTFLOW COLLECT (groups multiple items into single emission) ===
Individual channel item: sample_001
Individual channel item: sample_002
Individual channel item: sample_003
Nextflow collect result: [sample_001, sample_002, sample_003] (3 items grouped together)

The key difference: Groovy's collect transforms items but preserves structure (like Nextflow's map), while Nextflow's collect() groups multiple channel emissions into a single list.

But collect really isn't the main point. The key lesson: always distinguish between Groovy constructs (data structures) and Nextflow constructs (channels/workflows). Operations can share names but behave completely differently.

Takeaway

In this section, you've learned:

  • Distinguishing Nextflow from Groovy: How to identify which language construct you're using
  • Context matters: The same operation name can have completely different behaviors
  • Workflow structure: Nextflow provides the orchestration, Groovy provides the logic
  • Data transformation patterns: When to use Groovy methods vs Nextflow operators

Understanding these boundaries is essential for debugging, documentation, and writing maintainable workflows.

Now that we can distinguish between Nextflow and Groovy constructs, let's enhance our sample processing pipeline with more sophisticated data handling capabilities.


2. Advanced String Processing for Bioinformatics

Our basic pipeline processes CSV metadata well, but this is just the beginning. In production bioinformatics, you'll encounter files from different sequencing centers with varying naming conventions, legacy datasets with non-standard formats, and the need to extract critical information from filenames themselves.

The difference between a brittle workflow that breaks on unexpected input and a robust pipeline that adapts gracefully often comes down to mastering Groovy's string processing capabilities. Let's transform our pipeline to handle the messy realities of real-world bioinformatics data.

2.1. Pattern Matching and Regular Expressions

Many bioinformatics workflows encounter files with complex naming conventions that encode important metadata. Let's see how Groovy's pattern matching can extract this information automatically.

Let's start with a simple example of extracting sample information from file names:

main.nf
// Pattern matching for sample file names
def sample_files = [
    'Human_Liver_001.fastq',
    'mouse_brain_002.fastq',
    'SRR12345678.fastq'
]

// Simple pattern to extract organism and tissue
def pattern = ~/^(\w+)_(\w+)_(\d+)\.fastq$/

sample_files.each { filename ->
    def matcher = filename =~ pattern
    if (matcher) {
        println "${filename} -> Organism: ${matcher[0][1]}, Tissue: ${matcher[0][2]}, ID: ${matcher[0][3]}"
    } else {
        println "${filename} -> No standard pattern match"
    }
}

This demonstrates key Groovy string processing concepts:

  1. Regular expression literals using ~/pattern/ syntax - this creates a regex pattern without needing to escape backslashes
  2. Pattern matching with the =~ operator - this attempts to match a string against a regex pattern
  3. Matcher objects that capture groups with [0][1], [0][2], etc. - [0] refers to the entire match, [1], [2], etc. refer to captured groups in parentheses

Run this to see the pattern matching in action:

Test pattern matching
nextflow run main.nf
Pattern matching results
Human_Liver_001.fastq -> Organism: Human, Tissue: Liver, ID: 001
mouse_brain_002.fastq -> Organism: mouse, Tissue: brain, ID: 002
SRR12345678.fastq -> No standard pattern match
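If the matcher indexing looks opaque, it can help to inspect a single match in isolation:

Inspecting a matcher
def m = 'Human_Liver_001.fastq' =~ /^(\w+)_(\w+)_(\d+)\.fastq$/
assert m[0][0] == 'Human_Liver_001.fastq'   // the entire match
assert m[0][1] == 'Human'                   // first captured group
assert m[0][2] == 'Liver'                   // second captured group
assert m[0][3] == '001'                     // third captured group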

2.2. Creating Reusable Parsing Functions

Let's create a simple function to parse sample names and return structured metadata:

main.nf
// Function to extract sample metadata from filename
def parseSampleName(String filename) {
    def pattern = ~/^(\w+)_(\w+)_(\d+)\.fastq$/
    def matcher = filename =~ pattern

    if (matcher) {
        return [
            organism: matcher[0][1].toLowerCase(),
            tissue: matcher[0][2].toLowerCase(),
            sample_id: matcher[0][3],
            valid: true
        ]
    } else {
        return [
            filename: filename,
            valid: false
        ]
    }
}

// Test the parser
sample_files.each { filename ->
    def parsed = parseSampleName(filename)
    println "File: ${filename}"
    if (parsed.valid) {
        println "  Organism: ${parsed.organism}, Tissue: ${parsed.tissue}, ID: ${parsed.sample_id}"
    } else {
        println "  Could not parse filename"
    }
}

This demonstrates key Groovy function patterns:

  • Function definitions with def functionName(parameters) - similar to other languages but with dynamic typing
  • Map creation and return for structured data - maps are Groovy's primary data structure for returning multiple values
  • Conditional returns based on pattern matching success - functions can return different data structures based on conditions

2.3. Dynamic Script Logic in Processes

In Nextflow, dynamic behavior comes from using Groovy logic within process script blocks, rather than from assembling script strings outside of them. Here are some realistic patterns:

main.nf
// Process with conditional script logic
process QUALITY_FILTER {
    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("${meta.id}_filtered.fastq")

    script:
    // Groovy logic to determine parameters based on metadata
    def quality_threshold = meta.organism == 'human' ? 30 :
                           meta.organism == 'mouse' ? 28 : 25
    def min_length = meta.priority == 'high' ? 75 : 50

    // Conditional script sections
    def extra_qc = meta.priority == 'high' ? '--strict-quality' : ''

    """
    echo "Processing ${meta.id} (${meta.organism}, priority: ${meta.priority})"

    # Dynamic quality filtering based on sample characteristics
    fastp \\
        --in1 ${reads} \\
        --out1 ${meta.id}_filtered.fastq \\
        --qualified_quality_phred ${quality_threshold} \\
        --length_required ${min_length} \\
        ${extra_qc}

    echo "Applied quality threshold: ${quality_threshold}"
    echo "Applied length threshold: ${min_length}"
    """
}

// Process with completely different scripts based on organism
process ALIGN_READS {
    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("${meta.id}.bam")

    script:
    if (meta.organism == 'human') {
        """
        echo "Using human-specific STAR alignment"
        STAR --runMode alignReads \\
            --genomeDir /refs/human/STAR \\
            --readFilesIn ${reads} \\
            --outSAMtype BAM SortedByCoordinate \\
            --outFileNamePrefix ${meta.id}
        mv ${meta.id}Aligned.sortedByCoord.out.bam ${meta.id}.bam
        """
    } else if (meta.organism == 'mouse') {
        """
        echo "Using mouse-specific bowtie2 alignment"
        bowtie2 -x /refs/mouse/genome \\
            -U ${reads} \\
            --sensitive \\
            | samtools sort -o ${meta.id}.bam -
        """
    } else {
        """
        echo "Using generic alignment for ${meta.organism}"
        minimap2 -ax sr /refs/generic/genome.fa ${reads} \\
            | samtools sort -o ${meta.id}.bam -
        """
    }
}

// Using templates (Nextflow's built-in templating)
process GENERATE_REPORT {
    input:
    tuple val(meta), path(results)

    output:
    path("${meta.id}_report.html")

    script:
    template 'report_template.sh'
}

Now let's look at the template file that would go with this:

templates/report_template.sh
#!/bin/bash

# This template has access to all variables from the process input
# Groovy expressions are evaluated at runtime

echo "Generating report for sample: ${meta.id}"
echo "Organism: ${meta.organism}"
echo "Quality score: ${meta.quality}"

# Conditional logic in template
<% if (meta.organism == 'human') { %>
echo "Including human-specific quality metrics"
human_qc_script.py --input ${results} --output ${meta.id}_report.html
<% } else { %>
echo "Using standard quality metrics for ${meta.organism}"
generic_qc_script.py --input ${results} --output ${meta.id}_report.html
<% } %>

# Groovy variables can be used for calculations
<%
def priority_bonus = meta.priority == 'high' ? 0.1 : 0.0
def adjusted_score = (meta.quality + priority_bonus).round(2)
%>

echo "Adjusted quality score: ${adjusted_score}"
echo "Report generation complete"

This demonstrates realistic Nextflow patterns:

  • Conditional script blocks using Groovy if/else in the script section
  • Variable interpolation directly in script blocks
  • Template files with Groovy expressions (using <% %> and ${})
  • Dynamic parameter calculation based on metadata

2.4. Transforming File Collections into Command Arguments

A particularly powerful pattern is using Groovy logic in the script block to transform collections of files into properly formatted command-line arguments. This is essential when tools expect multiple files as separate arguments:

main.nf
// Process that needs to handle multiple input files
process JOINT_ANALYSIS {
    input:
    path sample_files  // This will be a list of files
    path reference

    output:
    path "joint_results.txt"

    script:
    // Transform file list into command arguments
    def file_args = sample_files.collect { file -> "--input ${file}" }.join(' ')
    def sample_names = sample_files.collect { file ->
        file.baseName.replaceAll(/\..*$/, '')
    }.join(',')

    """
    echo "Processing ${sample_files.size()} samples"
    echo "Sample names: ${sample_names}"

    # Use the transformed arguments in the actual command
    analysis_tool \\
        ${file_args} \\
        --reference ${reference} \\
        --output joint_results.txt \\
        --samples ${sample_names}
    """
}

// Process that builds complex command based on file characteristics
process VARIABLE_COMMAND {
    input:
    tuple val(meta), path(files)

    output:
    path "${meta.id}_processed.txt"

    script:
    // Complex command building based on file types and metadata
    def input_flags = files.collect { file ->
        def extension = file.getExtension()
        switch(extension) {
            case 'bam':
                return "--bam-input ${file}"
            case 'vcf':
                return "--vcf-input ${file}"
            case 'bed':
                return "--intervals ${file}"
            default:
                return "--data-input ${file}"
        }
    }.join(' ')

    // Additional flags based on metadata
    def extra_flags = meta.quality > 35 ? '--high-quality' : ''

    """
    echo "Building command for ${meta.id}"

    variant_caller \\
        ${input_flags} \\
        ${extra_flags} \\
        --output ${meta.id}_processed.txt
    """
}

Key patterns demonstrated:

  • File collection transformation: Using .collect{} to transform each file into a command argument
  • String joining: Using .join(' ') to combine arguments with spaces
  • File name manipulation: Using .baseName and .replaceAll() for sample names
  • Conditional argument building: Using switch statements or conditionals to build different arguments based on file types
  • Multiple transformations: Building both file arguments and sample name lists from the same collection

Takeaway

In this section, you've learned:

  • Regular expression patterns for bioinformatics file name parsing
  • Reusable parsing functions that return structured metadata
  • Process script logic with conditional parameter selection
  • File collection transformation into command-line arguments using .collect{} and .join()
  • Command building patterns based on file types and metadata

These string processing techniques form the foundation for handling complex data pipelines that need to adapt to different input formats and generate appropriate commands for bioinformatics tools.

With our pipeline now capable of extracting rich metadata from both CSV files and file names, we can make intelligent decisions about how to process different samples. Let's add conditional logic to route samples through appropriate analysis strategies.


3. Conditional Logic and Process Control

3.1. Strategy Selection Based on Sample Characteristics

Now that our pipeline can extract comprehensive sample metadata, we can use this information to automatically select the most appropriate analysis strategy for each sample. Different organisms, sequencing depths, and quality scores require different processing approaches.

main.nf
// Dynamic process selection based on sample characteristics
def selectAnalysisStrategy(Map sample_meta) {
    def strategy = [:]

    // Sequencing depth determines processing approach
    if (sample_meta.depth < 10_000_000) {
        strategy.approach = 'low_depth'
        strategy.processes = ['quality_check', 'simple_alignment']
        strategy.sensitivity = 'high'
    } else if (sample_meta.depth < 50_000_000) {
        strategy.approach = 'standard'
        strategy.processes = ['quality_check', 'trimming', 'alignment', 'variant_calling']
        strategy.sensitivity = 'standard'
    } else {
        strategy.approach = 'high_depth'
        strategy.processes = ['quality_check', 'trimming', 'alignment', 'variant_calling', 'structural_variants']
        strategy.sensitivity = 'sensitive'
    }

    // Organism-specific adjustments
    switch(sample_meta.organism) {
        case 'human':
            strategy.reference = 'GRCh38'
            strategy.known_variants = 'dbSNP'
            break
        case 'mouse':
            strategy.reference = 'GRCm39'
            strategy.known_variants = 'mgp_variants'
            break
        default:
            strategy.reference = 'custom'
            strategy.known_variants = null
    }

    // Quality-based modifications
    if (sample_meta.quality < 30) {
        strategy.extra_qc = true
        strategy.processes = ['extensive_qc'] + strategy.processes
    }

    return strategy
}

// Test strategy selection
ch_samples
    .map { meta, file ->
        def strategy = selectAnalysisStrategy(meta)
        println "\nSample: ${meta.id}"
        println "  Strategy: ${strategy.approach}"
        println "  Processes: ${strategy.processes.join(' -> ')}"
        println "  Reference: ${strategy.reference}"
        println "  Extra QC: ${strategy.extra_qc ?: false}"

        return [meta + strategy, file]
    }
    .view { meta, file -> "Ready for processing: ${meta.id} (${meta.approach})" }

This demonstrates several Groovy patterns commonly used in Nextflow workflows:

  • Numeric literals with underscores for readability (10_000_000) - underscores can be used in numbers to improve readability
  • Switch statements for multi-way branching - cleaner than multiple if/else statements
  • List concatenation with + operator - combines two lists into one
  • Elvis operator ?: for null handling - provides a default value if the left side is null or false
  • Map merging to combine metadata with strategy - the + operator merges two maps, with the right map taking precedence
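The last point is worth seeing in isolation, because it determines which value wins when both maps define the same key:

Map merge precedence
def defaults = [reference: 'custom', sensitivity: 'standard']
def overrides = [reference: 'GRCh38']
println defaults + overrides   // [reference:GRCh38, sensitivity:standard]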

3.2. Conditional Process Execution

In Nextflow, you control which processes run for which samples using when conditions and channel routing:

main.nf
// Different processes for different strategies
process BASIC_QC {
    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("${meta.id}_basic_qc.html")

    when:
    meta.approach == 'low_depth'

    script:
    """
    fastqc --quiet ${reads} -o ./
    mv *_fastqc.html ${meta.id}_basic_qc.html
    """
}

process COMPREHENSIVE_QC {
    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("${meta.id}_comprehensive_qc.html")

    when:
    meta.approach in ['standard', 'high_depth']

    script:
    def sensitivity = meta.sensitivity == 'high' ? '--strict' : ''
    """
    fastqc ${sensitivity} ${reads} -o ./
    # Additional QC for comprehensive analysis
    seqtk fqchk ${reads} > sequence_stats.txt
    mv *_fastqc.html ${meta.id}_comprehensive_qc.html
    """
}

process SIMPLE_ALIGNMENT {
    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("${meta.id}.bam")

    when:
    meta.approach == 'low_depth'

    script:
    """
    minimap2 -ax sr ${meta.reference} ${reads} \\
        | samtools sort -o ${meta.id}.bam -
    """
}

process SENSITIVE_ALIGNMENT {
    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("${meta.id}.bam")

    when:
    meta.approach in ['standard', 'high_depth']

    script:
    // use a distinct name rather than shadowing the built-in `params` object
    def aligner_opts = meta.sensitivity == 'sensitive' ? '--very-sensitive' : '--sensitive'
    """
    bowtie2 ${aligner_opts} -x ${meta.reference} -U ${reads} \\
        | samtools sort -o ${meta.id}.bam -
    """
}

// Workflow logic that routes to appropriate processes
workflow ANALYSIS_PIPELINE {
    take:
    samples_ch

    main:
    // All samples go through appropriate QC
    basic_qc_results = BASIC_QC(samples_ch)
    comprehensive_qc_results = COMPREHENSIVE_QC(samples_ch)

    // Combine QC results
    qc_results = basic_qc_results.mix(comprehensive_qc_results)

    // All samples go through appropriate alignment
    simple_alignment_results = SIMPLE_ALIGNMENT(samples_ch)
    sensitive_alignment_results = SENSITIVE_ALIGNMENT(samples_ch)

    // Combine alignment results
    alignment_results = simple_alignment_results.mix(sensitive_alignment_results)

    emit:
    qc = qc_results
    alignments = alignment_results
}

This shows realistic Nextflow patterns:

  • Separate processes for different strategies rather than dynamic generation
  • When conditions to control which processes run for which samples
  • Mix operator to combine results from different conditional processes
  • Process parameterization using metadata in script blocks

3.3. Channel-based Workflow Routing

The realistic way to handle conditional workflow assembly is through channel routing and filtering:

main.nf
workflow {
    // Read and enrich sample data with strategy
    ch_samples = Channel.fromPath(params.input)
        .splitCsv(header: true)
        .map { row ->
            def meta = [
                id: row.sample_id,
                organism: row.organism,
                depth: row.sequencing_depth.toInteger(),
                quality: row.quality_score.toDouble()
            ]

            // Add strategy information using our selectAnalysisStrategy function
            def strategy = selectAnalysisStrategy(meta)

            return [meta + strategy, file(row.file_path)]
        }

    // Split channel based on strategy requirements
    ch_samples
        .branch { meta, reads ->
            low_depth: meta.approach == 'low_depth'
                return [meta, reads]
            standard: meta.approach == 'standard'
                return [meta, reads]
            high_depth: meta.approach == 'high_depth'
                return [meta, reads]
        }
        .set { samples_by_strategy }

    // Route the branched samples through the analysis pipeline.
    // A named workflow can only be invoked once, so we recombine the branches;
    // the `when:` conditions inside its processes still apply the right strategy.
    ANALYSIS_PIPELINE(
        samples_by_strategy.low_depth
            .mix(samples_by_strategy.standard, samples_by_strategy.high_depth)
    )

    // For high-depth samples, also run structural variant calling
    high_depth_alignments = ANALYSIS_PIPELINE.out.alignments
        .filter { meta, bam -> meta.approach == 'high_depth' }

    STRUCTURAL_VARIANTS(high_depth_alignments)

    // Collect and organize all results
    all_qc = ANALYSIS_PIPELINE.out.qc.collect()
    all_alignments = ANALYSIS_PIPELINE.out.alignments.collect(flat: false)  // keep [meta, bam] pairs intact

    // Generate summary report based on what was actually run
    all_alignments
        .map { alignments ->
            def strategies = alignments.collect { meta, bam -> meta.approach }.unique()
            def total_samples = alignments.size()

            println "Pipeline Summary:"
            println "  Total samples processed: ${total_samples}"
            println "  Strategies used: ${strategies.join(', ')}"

            strategies.each { strategy ->
                def count = alignments.count { meta, bam -> meta.approach == strategy }
                println "    ${strategy}: ${count} samples"
            }
        }
        .view()
}

// Additional process for high-depth samples
process STRUCTURAL_VARIANTS {
    input:
    tuple val(meta), path(bam)

    output:
    tuple val(meta), path("${meta.id}.vcf")

    script:
    """
    delly call -g ${meta.reference} ${bam} -o ${meta.id}.vcf
    """
}

Key Nextflow patterns demonstrated:

  • Channel branching with .branch{} to split samples by strategy
  • Conditional process execution using when: directives and filtering
  • Channel routing to send different samples through different processes
  • Result collection and summary generation
  • Process reuse - the same workflow processes different sample types

Takeaway

In this section, you've learned:

  • Strategy selection using Groovy conditional logic
  • Process control with when conditions and channel routing
  • Workflow branching using channel operators like .branch() and .filter()
  • Metadata enrichment to drive process selection

These patterns help you write workflows that process different sample types appropriately while keeping your code organized and maintainable.

Our pipeline now intelligently routes samples through appropriate processes, but production workflows need to handle invalid data gracefully. Let's add validation and error handling to make our pipeline robust enough for real-world use.


4. Error Handling and Validation Patterns

4.1. Basic Input Validation

Before our pipeline processes samples through complex conditional logic, we should validate that the input data meets our requirements. Let's create validation functions that check sample metadata and provide useful error messages:

main.nf
// Simple validation function
def validateSample(Map sample) {
    def errors = []

    // Check required fields
    if (!sample.sample_id) {
        errors << "Missing sample_id"
    }

    if (!sample.organism) {
        errors << "Missing organism"
    }

    // Validate organism
    def valid_organisms = ['human', 'mouse', 'rat']
    if (sample.organism && !valid_organisms.contains(sample.organism.toLowerCase())) {
        errors << "Invalid organism: ${sample.organism}"
    }

    // Check sequencing depth is numeric
    if (sample.sequencing_depth) {
        try {
            def depth = sample.sequencing_depth as Integer
            if (depth < 1000000) {
                errors << "Sequencing depth too low: ${depth}"
            }
        } catch (NumberFormatException e) {
            errors << "Invalid sequencing depth: ${sample.sequencing_depth}"
        }
    }

    return errors
}

// Test validation
def test_samples = [
    [sample_id: 'SAMPLE_001', organism: 'human', sequencing_depth: '30000000'],
    [sample_id: '', organism: 'alien', sequencing_depth: 'invalid'],
    [sample_id: 'SAMPLE_003', organism: 'mouse', sequencing_depth: '500000']
]

test_samples.each { sample ->
    def errors = validateSample(sample)
    if (errors) {
        println "Sample ${sample.sample_id}: ${errors.join(', ')}"
    } else {
        println "Sample ${sample.sample_id}: Valid"
    }
}

4.2. Try-Catch Error Handling

Let's implement simple try-catch patterns for handling errors:

main.nf
// Process sample with error handling
def processSample(Map sample) {
    try {
        // Validate first
        def errors = validateSample(sample)
        if (errors) {
            throw new RuntimeException("Validation failed: ${errors.join(', ')}")
        }

        // Simulate processing
        def result = [
            id: sample.sample_id,
            organism: sample.organism,
            processed: true
        ]

        println "✓ Successfully processed ${sample.sample_id}"
        return result

    } catch (Exception e) {
        println "✗ Error processing ${sample.sample_id}: ${e.message}"

        // Return partial result
        return [
            id: sample.sample_id ?: 'unknown',
            organism: sample.organism ?: 'unknown',
            processed: false,
            error: e.message
        ]
    }
}

// Test error handling
test_samples.each { sample ->
    def result = processSample(sample)
    println "Result for ${result.id}: processed = ${result.processed}"
}

4.3. Setting Defaults and Validation

Let's create a simple function that provides defaults and validates configuration:

main.nf
// Simple configuration with defaults
def getConfig(Map user_params) {
    // Set defaults
    def defaults = [
        quality_threshold: 30,
        max_cpus: 4,
        output_dir: './results'
    ]

    // Merge user params with defaults
    def config = defaults + user_params

    // Simple validation
    if (config.quality_threshold < 0 || config.quality_threshold > 40) {
        println "Warning: Quality threshold ${config.quality_threshold} out of range, using default"
        config.quality_threshold = defaults.quality_threshold
    }

    if (config.max_cpus < 1) {
        println "Warning: Invalid CPU count ${config.max_cpus}, using default"
        config.max_cpus = defaults.max_cpus
    }

    return config
}

// Test configuration
def test_configs = [
    [:], // Empty - should get defaults
    [quality_threshold: 35, max_cpus: 8], // Valid values
    [quality_threshold: -5, max_cpus: 0] // Invalid values
]

test_configs.each { user_config ->
    def config = getConfig(user_config)
    println "Input: ${user_config} -> Output: ${config}"
}

Takeaway

In this section, you've learned:

  • Basic validation functions that check required fields and data types
  • Try-catch error handling for graceful failure handling
  • Configuration with defaults using map merging and validation

These patterns help you write workflows that handle invalid input gracefully and provide useful feedback to users.

Before diving into advanced closures, let's master some essential Groovy language features that make code more concise and null-safe. These operators and patterns are used throughout production Nextflow workflows and will make your code more robust and readable.


5. Essential Groovy Operators and Patterns

With our pipeline now handling complex conditional logic, we need to make it more robust against missing or malformed data. Bioinformatics workflows often deal with incomplete metadata, optional configuration parameters, and varying input formats. Let's enhance our pipeline with essential Groovy operators that handle these challenges gracefully.

5.1. Safe Navigation and Elvis Operators in Workflows

Note

Safe Navigation (?.) and Elvis (?:) Operators: These are essential for null-safe programming. Safe navigation returns null instead of throwing an exception if the object is null, while the Elvis operator provides a default value if the left side is null, empty, or false.

Both operators come up constantly when processing real-world biological data, where fields are frequently missing or empty:
main.nf
workflow {
    ch_samples = Channel.fromPath(params.input)
        .splitCsv(header: true)
        .map { row ->
            // Safe navigation prevents crashes on missing fields
            def sample_id = row.sample_id?.toLowerCase() ?: 'unknown_sample'
            def organism = row.organism?.toLowerCase() ?: 'unknown'

            // Elvis operator provides defaults
            def quality = (row.quality_score as Double) ?: 30.0
            def depth = (row.sequencing_depth as Integer) ?: 1_000_000

            // Chain operators for conditional defaults
            def reference = row.reference ?: (organism == 'human' ? 'GRCh38' : 'custom')

            // Groovy Truth - empty strings and nulls are false
            def priority = row.priority ?: (quality > 40 ? 'high' : 'normal')

            return [
                id: sample_id,
                organism: organism,
                quality: quality,
                depth: depth,
                reference: reference,
                priority: priority
            ]
        }
        .view { meta ->
            "Sample: ${meta.id} (${meta.organism}) - Quality: ${meta.quality}, Priority: ${meta.priority}"
        }
}

5.2. String Patterns and Multi-line Templates

Groovy provides powerful string features for parsing filenames and generating dynamic commands:

main.nf
workflow {
    // Demonstrate slashy strings for regex (no need to escape backslashes)
    def parseFilename = { filename ->
        // Slashy string - compare to the regular string form: "^(\\w+)_(\\w+)_(\\d+)\\.fastq$"
        // Slashy strings don't require escaping backslashes, making regex patterns much cleaner
        def pattern = /^(\w+)_(\w+)_(\d+)\.fastq$/
        def matcher = filename =~ pattern

        if (matcher) {
            return [
                organism: matcher[0][1].toLowerCase(),
                tissue: matcher[0][2].toLowerCase(),
                sample_id: matcher[0][3]
            ]
        } else {
            return [organism: 'unknown', tissue: 'unknown', sample_id: 'unknown']
        }
    }

    // Multi-line strings with interpolation for command generation
    def generateCommand = { meta ->
        def depth_category = meta.depth > 10_000_000 ? 'high' : 'standard'
        def db_path = meta.organism == 'human' ? '/db/human' : '/db/other'

        // Multi-line string with variable interpolation
        """
        echo "Processing ${meta.organism} sample: ${meta.sample_id}"
        analysis_tool \\
            --sample ${meta.sample_id} \\
            --depth-category ${depth_category} \\
            --database ${db_path} \\
            --threads ${params.max_cpus ?: 4}
        """
    }

    // Test the patterns
    ch_files = Channel.of('Human_Liver_001.fastq', 'Mouse_Brain_002.fastq')
        .map { filename ->
            def parsed = parseFilename(filename)
            def command = generateCommand([sample_id: parsed.sample_id, organism: parsed.organism, depth: 15_000_000])
            return [parsed, command]
        }
        .view { parsed, command -> "Parsed: ${parsed}, Command: ${command.trim().split('\n')[0]}..." }
}

5.3. Combining Operators for Robust Data Handling

Let's combine these operators in a realistic workflow scenario:

main.nf
workflow {
    ch_samples = Channel.fromPath(params.input)
        .splitCsv(header: true)
        .map { row ->
            // Combine safe navigation and Elvis operators
            def meta = [
                id: row.sample_id?.toLowerCase() ?: 'unknown',
                organism: row.organism ?: 'unknown',
                quality: (row.quality_score as Double) ?: 30.0,
                files: row.file_path ? [file(row.file_path)] : []
            ]

            // Use Groovy Truth for validation
            if (meta.files && meta.id != 'unknown') {
                return [meta, meta.files]
            } else {
                log.info "Skipping sample with missing data: ${meta.id}"
                return null
            }
        }
        .filter { it != null }  // Remove invalid samples using Groovy Truth
        .view { meta, files ->
            "Valid sample: ${meta.id} (${meta.organism}) - Quality: ${meta.quality}"
        }
}
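The validation above leans on Groovy Truth: when the path column is missing, meta.files is an empty list, and empty collections evaluate to false. A few standalone lines make the rule explicit:

Groovy Truth in practice
assert !null          // null is false
assert !''            // empty string is false
assert ![]            // empty list is false
assert !0             // zero is false
assert 'sample_001'   // non-empty string is true
assert [1, 2, 3]      // non-empty list is true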

Takeaway

In this section, you've learned:

  • Safe navigation operator (?.) for null-safe property access
  • Elvis operator (?:) for default values and null coalescing
  • Groovy Truth - how null, empty strings, and empty collections evaluate to false in boolean contexts
  • Slashy strings (/pattern/) for regex patterns without escaping
  • Multi-line string interpolation for command templates
  • Numeric literals with underscores for improved readability

Note

Groovy Truth: In Groovy, null, empty strings, empty collections, and zero are all considered "false" in boolean contexts. This is different from many other languages and is essential to understand for proper conditional logic.

These patterns make your code more resilient to missing data and easier to read, which is essential when processing diverse bioinformatics datasets.


6. Advanced Closures and Functional Programming

Our pipeline now handles missing data gracefully and processes complex input formats robustly. But as our workflow grows more sophisticated, we start seeing repeated patterns in our data transformation code. Instead of copy-pasting similar closures across different channel operations, let's learn how to create reusable, composable functions that make our code cleaner and more maintainable.

6.1. Named Closures for Reusability

Note

Closures: A closure is a block of code that can be assigned to a variable and executed later. Think of it as a function that can be passed around and reused. They're fundamental to Groovy's functional programming capabilities.

So far we've used anonymous closures defined inline within channel operations. When you find yourself repeating the same transformation logic across multiple processes or workflows, named closures can eliminate duplication and improve readability:


main.nf
// Define reusable closures for common transformations
def extractSampleInfo = { row ->
    [
        id: row.sample_id.toLowerCase(),
        organism: row.organism,
        quality: row.quality_score.toDouble(),
        depth: row.sequencing_depth.toInteger()
    ]
}

def addPriority = { meta ->
    meta + [priority: meta.quality > 40 ? 'high' : 'normal']
}

def formatForDisplay = { meta, file_path ->
    "Sample: ${meta.id} (${meta.organism}) - Quality: ${meta.quality}, Priority: ${meta.priority}"
}

workflow {
    // Use named closures in channel operations
    ch_samples = Channel.fromPath(params.input)
        .splitCsv(header: true)
        .map(extractSampleInfo)        // Named closure
        .map(addPriority)              // Named closure
        .map { meta -> [meta, file("./data/sequences/${meta.id}.fastq")] }
        .view(formatForDisplay)        // Named closure

    // Reuse the same closures elsewhere
    ch_filtered = ch_samples
        .filter { meta, file -> meta.quality > 30 }
        .map { meta, file -> addPriority(meta) }  // Reuse closure
        .view(formatForDisplay)                    // Reuse closure
}

6.2. Function Composition

Groovy closures can be composed together using the >> operator, allowing you to build complex transformations from simple, reusable pieces:

Function composition means chaining functions together so the output of one becomes the input of the next. The >> operator creates a new closure that applies multiple transformations in sequence.

main.nf
// Simple transformation closures
def normalizeId = { meta ->
    meta + [id: meta.id.toLowerCase().replaceAll(/[^a-z0-9_]/, '_')]
}

def addQualityCategory = { meta ->
    def category = meta.quality > 40 ? 'excellent' :
                  meta.quality > 30 ? 'good' :
                  meta.quality > 20 ? 'acceptable' : 'poor'
    meta + [quality_category: category]
}

def addProcessingFlags = { meta ->
    meta + [
        needs_extra_qc: meta.quality < 30,
        high_priority: meta.organism == 'human' && meta.quality > 35
    ]
}

// Compose transformations using >> operator
def enrichSample = normalizeId >> addQualityCategory >> addProcessingFlags

workflow {
    Channel.fromPath(params.input)
        .splitCsv(header: true)
        .map(extractSampleInfo)
        .map(enrichSample)          // Apply composed transformation
        .view { meta ->
            "Processed: ${meta.id} (${meta.quality_category}) - Extra QC: ${meta.needs_extra_qc}"
        }
}

6.3. Currying for Specialized Functions

Currying allows you to create specialized versions of general-purpose closures by fixing some of their parameters:

Currying is a technique where you take a function with multiple parameters and create a new function with some of those parameters "fixed" or "pre-filled". This creates specialized versions of general-purpose functions.

main.nf
// General-purpose filtering closure
def qualityFilter = { threshold, meta -> meta.quality >= threshold }

// Create specialized filters using currying
def highQualityFilter = qualityFilter.curry(40)
def standardQualityFilter = qualityFilter.curry(30)

workflow {
    ch_samples = Channel.fromPath(params.input)
        .splitCsv(header: true)
        .map(extractSampleInfo)

    // Use the specialized filters in different channel operations
    ch_high_quality = ch_samples.filter(highQualityFilter)
    ch_standard_quality = ch_samples.filter(standardQualityFilter)

    // Both channels can be processed differently
    ch_high_quality.view { meta -> "High quality: ${meta.id}" }
    ch_standard_quality.view { meta -> "Standard quality: ${meta.id}" }
}

6.4. Closures Accessing External Variables

Closures can access and modify variables from their defining scope, which is useful for collecting statistics:

main.nf
workflow {
    // Variable in the workflow scope
    def sample_count = 0
    def human_samples = 0

    // Closure that accesses and modifies external variables
    def countSamples = { meta ->
        sample_count++  // Modifies external variable
        if (meta.organism == 'human') {
            human_samples++  // Modifies another external variable
        }
        return meta  // Pass data through unchanged
    }

    Channel.fromPath(params.input)
        .splitCsv(header: true)
        .map(extractSampleInfo)
        .map(countSamples)          // Closure modifies external variables
        .collect()                  // Wait for all samples to be processed
        .view {
            "Processing complete: ${sample_count} total samples, ${human_samples} human samples"
        }
}

Takeaway

In this section, you've learned:

  • Named closures for eliminating code duplication and improving readability
  • Function composition with >> operator to build complex transformations
  • Currying to create specialized versions of general-purpose closures
  • Variable scope access in closures for collecting statistics and generating reports

These advanced patterns help you write more maintainable, reusable workflows that follow functional programming principles while remaining easy to understand and debug.

With our pipeline now capable of intelligent routing, robust error handling, and advanced functional programming patterns, we're ready for the final enhancement. As your workflows scale to process hundreds or thousands of samples, you'll need sophisticated data processing capabilities that can organize, filter, and analyze large collections efficiently.

The functional programming patterns we just learned work beautifully with Groovy's powerful collection methods. Instead of writing loops and conditional logic, you can chain together expressive operations that clearly describe what you want to accomplish.


7. Collection Operations and File Path Manipulations

7.1. Common Collection Methods in Channel Operations

When processing large datasets, channel operations often need to organize and analyze sample collections. Groovy's collection methods integrate seamlessly with Nextflow channels to provide powerful data processing capabilities:

Groovy provides many built-in methods for working with collections (lists, maps, etc.) that make data processing much more expressive than traditional loops.

main.nf
// Sample data with mixed quality and organisms
def samples = [
    [id: 'sample_001', organism: 'human', quality: 42, files: ['data1.txt', 'data2.txt']],
    [id: 'sample_002', organism: 'mouse', quality: 28, files: ['data3.txt']],
    [id: 'sample_003', organism: 'human', quality: 35, files: ['data4.txt', 'data5.txt', 'data6.txt']],
    [id: 'sample_004', organism: 'rat', quality: 45, files: ['data7.txt']],
    [id: 'sample_005', organism: 'human', quality: 30, files: ['data8.txt', 'data9.txt']]
]

// findAll - filter collections based on conditions
def high_quality_samples = samples.findAll { it.quality > 40 }
println "High quality samples: ${high_quality_samples.collect { it.id }.join(', ')}"

// groupBy - group samples by organism
def samples_by_organism = samples.groupBy { it.organism }
println "Grouping by organism:"
samples_by_organism.each { organism, sample_list ->
    println "  ${organism}: ${sample_list.size()} samples"
}

// unique - get unique organisms
def organisms = samples.collect { it.organism }.unique()
println "Unique organisms: ${organisms.join(', ')}"

// flatten - flatten nested file lists
def all_files = samples.collect { it.files }.flatten()
println "All files: ${all_files.take(5).join(', ')}... (${all_files.size()} total)"

// sort - sort samples by quality
def sorted_by_quality = samples.sort { it.quality }
println "Quality range: ${sorted_by_quality.first().quality} to ${sorted_by_quality.last().quality}"

// reverse - reverse the order
def reverse_quality = samples.sort { it.quality }.reverse()
println "Highest quality first: ${reverse_quality.collect { "${it.id}(${it.quality})" }.join(', ')}"

// count - count items matching condition
def human_samples = samples.count { it.organism == 'human' }
println "Human samples: ${human_samples} out of ${samples.size()}"

// any/every - check conditions across collection
def has_high_quality = samples.any { it.quality > 40 }
def all_have_files = samples.every { it.files.size() > 0 }
println "Has high quality samples: ${has_high_quality}"
println "All samples have files: ${all_have_files}"
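These methods also slot directly into channel operators, which is where the "seamless integration" mentioned above pays off. Below is a minimal sketch (with a trimmed copy of the samples list redeclared inside the workflow block so the sketch stands alone) that gathers all items with the collect channel operator and then applies Groovy's groupBy and findAll inside a map closure:

Collection methods inside a channel (sketch)
workflow {
    // Trimmed, hypothetical copy of the sample maps used above
    def samples = [
        [id: 'sample_001', organism: 'human', quality: 42],
        [id: 'sample_002', organism: 'mouse', quality: 28],
        [id: 'sample_004', organism: 'rat',   quality: 45]
    ]

    Channel
        .fromList(samples)          // emit each sample map as a channel item
        .collect()                  // Nextflow channel operator: gather all items into one list
        .map { all_samples ->
            def by_organism  = all_samples.groupBy { it.organism }      // Groovy collection method
            def high_quality = all_samples.findAll { it.quality > 40 }  // Groovy collection method
            "Organisms: ${by_organism.keySet().join(', ')} | high quality samples: ${high_quality.size()}"
        }
        .view()
}

Note the name clash: collect here is the Nextflow channel operator that waits for all items, not the Groovy list method used in the examples above.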

7.2. File Path Manipulations

Working with file paths is essential in bioinformatics workflows. Nextflow's file() function returns file objects with many useful methods for extracting information from paths, which combine naturally with Groovy's string and regex handling:

main.nf
// File path manipulation examples
def sample_files = [
    '/path/to/data/patient_001_R1.fastq.gz',
    '/path/to/data/patient_001_R2.fastq.gz',
    '/path/to/results/patient_002_analysis.bam',
    '/path/to/configs/experiment_setup.json'
]

sample_files.each { file_path ->
    def f = file(file_path)  // Create Nextflow file object

    println "\nFile: ${file_path}"
    println "  Name: ${f.getName()}"                    // Just filename
    println "  BaseName: ${f.getBaseName()}"            // Filename without extension
    println "  Extension: ${f.getExtension()}"          // File extension
    println "  Parent: ${f.getParent()}"                // Parent directory
    println "  Parent name: ${f.getParent().getName()}" // Just parent directory name

    // Extract sample ID from filename
    def matcher = f.getName() =~ /^(patient_\d+)/
    if (matcher) {
        println "  Sample ID: ${matcher[0][1]}"
    }
}

// Group files by sample ID using path manipulation
def files_by_sample = sample_files
    .findAll { it.contains('patient') }  // Only patient files
    .groupBy { file_path ->
        def filename = file(file_path).getName()
        def matcher = filename =~ /^(patient_\d+)/
        return matcher ? matcher[0][1] : 'unknown'
    }

println "\nFiles grouped by sample:"
files_by_sample.each { sample_id, files ->
    println "  ${sample_id}: ${files.size()} files"
}
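In a real workflow this path-parsing logic usually lives inside a map closure so that each file is tagged with its sample ID before grouping. Here is a minimal sketch, assuming paired FASTQ files named like the hypothetical paths above:

Grouping files by sample ID in a channel (sketch)
workflow {
    Channel
        .fromPath('/path/to/data/patient_*_R{1,2}.fastq.gz')
        .map { f ->
            def matcher   = f.getName() =~ /^(patient_\d+)/
            def sample_id = matcher ? matcher[0][1] : 'unknown'
            tuple(sample_id, f)                        // tag each file with its sample ID
        }
        .groupTuple()                                  // collect all files that share a sample ID
        .view { sample_id, files ->
            "${sample_id}: ${files.collect { it.getName() }.join(', ')}"
        }
}

For this particular read-pairing case Nextflow also offers Channel.fromFilePairs, but the manual version shows how the path methods and regex matching combine inside channel logic.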

7.3. The Spread Operator

The spread operator (*.) is a powerful Groovy shorthand for calling the same method on every element in a collection. It is equivalent to .collect { it.methodName() } but more concise:

main.nf
// Spread operator examples
def file_paths = [
    '/data/sample1.fastq',
    '/data/sample2.fastq',
    '/results/output1.bam',
    '/results/output2.bam'
]

// Convert to file objects
def files = file_paths.collect { file(it) }

// Using spread operator - equivalent to files.collect { it.getName() }
def filenames = files*.getName()
println "Filenames: ${filenames.join(', ')}"

// Get all parent directories
def parent_dirs = files*.getParent()*.getName()
println "Parent directories: ${parent_dirs.unique().join(', ')}"

// Get all extensions
def extensions = files*.getExtension().unique()
println "File types: ${extensions.join(', ')}"
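The spread operator is just as useful inside channel operators, for example when summarizing a collected set of files. A short sketch, assuming FASTQ files under a hypothetical /data directory:

Spread operator inside a channel operator (sketch)
workflow {
    Channel
        .fromPath('/data/*.fastq')
        .collect()     // gather all matching paths into a single list
        .view { fastqs -> "Staged ${fastqs.size()} files: ${fastqs*.getName().join(', ')}" }
}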

Takeaway

In this section, you've learned:

  • Collection filtering with findAll and conditional logic
  • Grouping and organizing data with groupBy and sort
  • File path manipulation using Nextflow's file object methods
  • Spread operator (*.) for concise collection operations

These patterns help you process and organize complex datasets efficiently, which is essential for handling real-world bioinformatics data.


Summary

Throughout this side quest, you've built a comprehensive sample processing pipeline that evolved from basic metadata handling to a sophisticated, production-ready workflow. Each section built upon the previous, demonstrating how Groovy transforms simple Nextflow workflows into powerful data processing systems.

Here's how we progressively enhanced our pipeline:

  1. Nextflow vs Groovy Boundaries: You learned to distinguish between workflow orchestration (Nextflow) and programming logic (Groovy), including the crucial difference between same-named constructs such as the collect channel operator and the collect list method.

  2. String Processing: You learned regular expressions, parsing functions, and file collection transformation for building dynamic command-line arguments.

  3. Conditional Logic: You added intelligent routing that automatically selects analysis strategies based on sample characteristics like organism, quality scores, and sequencing depth.

  4. Error Handling: You made the pipeline robust by adding validation functions, try-catch error handling, and configuration management with sensible defaults.

  5. Essential Groovy Operators: You mastered safe navigation (?.), Elvis (?:), Groovy Truth, slashy strings, and other key language features that make code more resilient and readable.

  6. Advanced Closures: You learned functional programming techniques including named closures, function composition, currying, and closures with variable scope access for building reusable, maintainable code.

  7. Collection Operations: You added sophisticated data processing capabilities using Groovy collection methods like findAll, groupBy, unique, flatten, and the spread operator to handle large-scale sample processing.

Key Benefits

  • Clearer code: Understanding when to use Nextflow vs Groovy helps you write more organized workflows
  • Better error handling: Basic validation and try-catch patterns help your workflows handle problems gracefully
  • Flexible processing: Conditional logic lets your workflows process different sample types appropriately
  • Configuration management: Using defaults and simple validation makes your workflows easier to use

From Simple to Sophisticated

The pipeline journey you completed demonstrates the evolution from basic data processing to production-ready bioinformatics workflows:

  1. Started simple: Basic CSV processing and metadata extraction with clear Nextflow vs Groovy boundaries
  2. Added intelligence: Dynamic file name parsing with regex patterns and conditional routing based on sample characteristics
  3. Made it robust: Null-safe operators, validation, error handling, and graceful failure management
  4. Made it maintainable: Advanced closure patterns, function composition, and reusable components that eliminate code duplication
  5. Scaled it efficiently: Collection operations for processing hundreds of samples with powerful data filtering and organization

This progression mirrors the real-world evolution of bioinformatics pipelines - from research prototypes handling a few samples to production systems processing thousands of samples across laboratories and institutions. Every challenge you solved and pattern you learned reflects actual problems developers face when scaling Nextflow workflows.

Next Steps

With these Groovy fundamentals mastered, you're ready to:

  • Write cleaner workflows with proper separation between Nextflow and Groovy logic
  • Transform file collections into properly formatted command-line arguments
  • Handle different file naming conventions and input formats gracefully
  • Build reusable, maintainable code using advanced closure patterns and functional programming
  • Process and organize complex datasets using collection operations
  • Add basic validation and error handling to make your workflows more user-friendly

Continue practicing these patterns in your own workflows, and refer to the Groovy documentation when you need to explore more advanced features.

Key Concepts Reference

  • Language Boundaries

    Nextflow vs Groovy examples
    // Nextflow: workflow orchestration
    Channel.fromPath('samples.csv').splitCsv(header: true)
    
    // Groovy: data processing
    sample_data.collect { it.toUpperCase() }
    

  • String Processing

    String processing examples
    // Pattern matching
    filename =~ /^(\w+)_(\w+)_(\d+)\.fastq$/
    
    // Function with conditional return
    def parseSample(filename) {
        def matcher = filename =~ pattern
        return matcher ? [valid: true, data: matcher[0]] : [valid: false]
    }
    
    // File collection to command arguments (in process script block)
    script:
    def file_args = input_files.collect { file -> "--input ${file}" }.join(' ')
    """
    analysis_tool ${file_args} --output results.txt
    """
    

  • Error Handling

    Error handling patterns
    try {
        def errors = validateSample(sample)
        if (errors) throw new RuntimeException("Invalid: ${errors.join(', ')}")
    } catch (Exception e) {
        println "Error: ${e.message}"
    }
    

  • Essential Groovy Operators

    Essential operators examples
    // Safe navigation and Elvis operators
    def id = data?.sample?.id ?: 'unknown'
    if (sample.files) println "Has files"  // Groovy Truth
    
    // Slashy strings for regex
    def pattern = /^\w+_R[12]\.fastq$/
    def script = """
    echo "Processing ${sample.id}"
    analysis --depth ${depth ?: 1_000_000}
    """
    

  • Advanced Closures

    Advanced closure patterns
    // Named closures and composition
    def enrichData = normalizeId >> addQualityCategory >> addFlags
    def processor = generalFunction.curry(fixedParam)
    
    // Closures with scope access
    def collectStats = { data -> stats.count++; return data }
    

  • Collection Operations

    Collection operations examples
    // Filter, group, and organize data
    def high_quality = samples.findAll { it.quality > 40 }
    def by_organism = samples.groupBy { it.organism }
    def file_names = files*.getName()  // Spread operator
    def all_files = nested_lists.flatten()
    

Resources