
Groovy Essentials for Nextflow Developers

Nextflow is built on Apache Groovy, a powerful dynamic language that runs on the Java Virtual Machine. While Nextflow provides the workflow orchestration framework, Groovy provides the programming language foundations that make your workflows flexible, maintainable, and powerful.

Understanding where Nextflow ends and Groovy begins is crucial for effective workflow development. Nextflow provides channels, processes, and workflow orchestration, while Groovy handles data manipulation, string processing, conditional logic, and general programming tasks within your workflow scripts.

Many Nextflow developers struggle with distinguishing when to use Nextflow versus Groovy features, processing file names and configurations, and handling errors gracefully. This side quest will bridge that gap by taking you on a journey from basic workflow concepts to production-ready pipeline mastery.

We'll transform a simple CSV-reading workflow into a sophisticated, production-ready bioinformatics pipeline that can handle any dataset thrown at it. Starting with a basic workflow that processes sample metadata, we'll evolve it step-by-step through realistic challenges you'll face in production:

  • Messy data? We'll add robust parsing and null-safe operators, learning to distinguish between Nextflow and Groovy constructs
  • Complex file naming schemes? We'll master regex patterns and string manipulation for bioinformatics file names
  • Need intelligent sample routing? We'll implement conditional logic and strategy selection, transforming file collections into command-line arguments
  • Worried about failures? We'll add comprehensive error handling and validation patterns
  • Code getting repetitive? We'll learn functional programming with closures and composition, mastering essential Groovy operators like safe navigation and Elvis
  • Processing thousands of samples? We'll leverage powerful collection operations for file path manipulations

0. Warmup

0.1. Prerequisites

Before taking on this side quest you should:

  • Complete the Hello Nextflow tutorial
  • Understand basic Nextflow concepts (processes, channels, workflows)
  • Have basic familiarity with Groovy syntax (variables, maps, lists)

This tutorial will explain Groovy concepts as we encounter them, so you don't need extensive prior Groovy knowledge. We'll start with fundamental concepts and build up to advanced patterns.

0.2. Starting Point

Let's move into the project directory and explore our working materials.

Navigate to project directory
cd side-quests/groovy_essentials

You'll find a data directory with sample files and a main workflow file that we'll evolve throughout this tutorial.

Directory contents
> tree
.
├── data
│   ├── metadata
│   │   └── analysis_parameters.yaml
│   ├── samples.csv
│   └── sequences
│       ├── sample_001.fastq
│       ├── sample_002.fastq
│       └── sample_003.fastq
├── main.nf
├── nextflow.config
├── README.md
└── templates
    └── analysis_script.sh

5 directories, 9 files

Our sample CSV contains information about biological samples that need different processing based on their characteristics:

samples.csv
sample_id,organism,tissue_type,sequencing_depth,file_path,quality_score
SAMPLE_001,human,liver,30000000,data/sequences/sample_001.fastq,38.5
SAMPLE_002,mouse,brain,25000000,data/sequences/sample_002.fastq,35.2
SAMPLE_003,human,kidney,45000000,data/sequences/sample_003.fastq,42.1

We'll use this realistic dataset to explore practical Groovy techniques that you'll encounter in real bioinformatics workflows.


1. Nextflow vs Groovy: Understanding the Boundaries

1.1. Identifying What's What

One of the most common sources of confusion for Nextflow developers is understanding when they're working with Nextflow constructs versus Groovy language features. Let's build a workflow step by step to see how they work together.

Step 1: Basic Nextflow Workflow

Start with a simple workflow that just reads the CSV file:

main.nf
workflow {
    ch_samples = Channel.fromPath("./data/samples.csv")
        .splitCsv(header: true)
        .view()
}

The workflow block defines our pipeline structure, while Channel.fromPath() creates a channel from a file path. The .splitCsv() operator processes the CSV file and converts each row into a map data structure.

Run this workflow to see the raw CSV data:

Test basic workflow
nextflow run main.nf

You should see output like:

Raw CSV data
[sample_id:SAMPLE_001, organism:human, tissue_type:liver, sequencing_depth:30000000, file_path:data/sequences/sample_001.fastq, quality_score:38.5]
[sample_id:SAMPLE_002, organism:mouse, tissue_type:brain, sequencing_depth:25000000, file_path:data/sequences/sample_002.fastq, quality_score:35.2]
[sample_id:SAMPLE_003, organism:human, tissue_type:kidney, sequencing_depth:45000000, file_path:data/sequences/sample_003.fastq, quality_score:42.1]

Step 2: Adding the Map Operator

Now let's add the .map() operator, which is a Nextflow channel operator (not to be confused with the map data structure we'll see below). This operator takes a closure where we can write Groovy code to transform each item.

A closure is a block of code that can be passed around and executed later. Think of it as a function that you define inline. In Groovy, closures are written with curly braces { } and can take parameters. They're fundamental to how Nextflow operators work.
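Here is a minimal, throwaway closure on its own, outside of any Nextflow operator (the name double_it is just for illustration):

A minimal Groovy closure
def double_it = { value -> value * 2 }
println double_it(5)    // prints 10
println double_it(21)   // prints 42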

main.nf
    ch_samples = Channel.fromPath("./data/samples.csv")
        .splitCsv(header: true)
        .map { row ->
            return row
        }
        .view()

The .map { row -> ... } operator takes a closure in which row represents each item from the channel. The parameter name is arbitrary - you can call it anything you want: .map { item -> ... } or .map { sample -> ... } would work exactly the same way.

When Nextflow processes each item in the channel, it passes that item to your closure as the parameter you named. So if your channel contains CSV rows, row will hold one complete row at a time.
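If a closure declares no parameter at all, Groovy exposes its single argument under the implicit name it - a shorthand you'll see in later examples:

Implicit closure parameter
// These two closures do the same thing
def explicit = { row -> row.sample_id }
def implicit = { it.sample_id }
println explicit([sample_id: 'SAMPLE_001'])   // SAMPLE_001
println implicit([sample_id: 'SAMPLE_001'])   // SAMPLE_001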

Apply this change and run the workflow:

Test map operator
nextflow run main.nf

You'll see the same output as before, because we're simply returning the input unchanged. This confirms that the map operator is working correctly. Now let's start transforming the data.

Step 3: Creating a Map Data Structure

Now we're going to write pure Groovy code inside our closure. Everything from this point forward is Groovy syntax and methods, not Nextflow operators.

main.nf
    ch_samples = Channel.fromPath("./data/samples.csv")
        .splitCsv(header: true)
        .map { row ->
            // This is all Groovy code now!
            def sample_meta = [
                id: row.sample_id.toLowerCase(),
                organism: row.organism,
                tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(),
                depth: row.sequencing_depth.toInteger(),
                quality: row.quality_score.toDouble()
            ]
            return sample_meta
        }
        .view()

Notice how we've left Nextflow syntax behind and are now writing pure Groovy code. A map is a key-value data structure similar to dictionaries in Python, objects in JavaScript, or hashes in Ruby. It lets us store related pieces of information together. In this map, we're storing the sample ID, organism, tissue type, sequencing depth, and quality score.

We use Groovy's string manipulation methods like .toLowerCase() and .replaceAll() to clean up our data, and type conversion methods like .toInteger() and .toDouble() to convert string data from the CSV into the appropriate numeric types.

Apply this change and run the workflow:

Test map data structure
nextflow run main.nf

You should see the refined map output like:

Transformed metadata
[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5]
[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2]
[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1]

Step 4: Adding Conditional Logic

Now let's add more Groovy logic - this time using a ternary operator to make decisions based on data values.

main.nf
    ch_samples = Channel.fromPath("./data/samples.csv")
        .splitCsv(header: true)
        .map { row ->
            def sample_meta = [
                id: row.sample_id.toLowerCase(),
                organism: row.organism,
                tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(),
                depth: row.sequencing_depth.toInteger(),
                quality: row.quality_score.toDouble()
            ]
            def priority = sample_meta.quality > 40 ? 'high' : 'normal'
            return sample_meta + [priority: priority]
        }
        .view()

The ternary operator is a shorthand for an if/else statement that follows the pattern condition ? value_if_true : value_if_false. This line means: "If the quality is greater than 40, use 'high', otherwise use 'normal'".

The map addition operator + creates a new map rather than modifying the existing one. This line creates a new map that contains all the key-value pairs from sample_meta plus the new priority key.
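Both behaviors are easy to verify with a couple of lines of plain Groovy:

Ternary and map addition in plain Groovy
def sample_meta = [id: 'sample_003', quality: 42.1]
def priority = sample_meta.quality > 40 ? 'high' : 'normal'
def enriched = sample_meta + [priority: priority]
println enriched      // [id:sample_003, quality:42.1, priority:high]
println sample_meta   // [id:sample_003, quality:42.1] - the original map is unchanged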

Apply this change and run the workflow:

Test conditional logic
nextflow run main.nf

You should see output like:

Metadata with priority
[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, priority:normal]
[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, priority:normal]
[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, priority:high]

Step 5: Combining Maps and Returning Results

Finally, let's use Groovy's map addition operator to combine our metadata, then return a tuple that follows Nextflow's standard pattern.

main.nf
    ch_samples = Channel.fromPath("./data/samples.csv")
        .splitCsv(header: true)
        .map { row ->
            def sample_meta = [
                id: row.sample_id.toLowerCase(),
                organism: row.organism,
                tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(),
                depth: row.sequencing_depth.toInteger(),
                quality: row.quality_score.toDouble()
            ]
            def priority = sample_meta.quality > 40 ? 'high' : 'normal'
            return [sample_meta + [priority: priority], file(row.file_path)]
        }
        .view()

This returns a tuple containing the enriched metadata and the file path, which is the standard pattern for passing data to processes in Nextflow.
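To see why this shape is convenient, here is a sketch of a hypothetical downstream process (not part of this tutorial's files) whose input declaration matches the [meta, file] tuple we just built:

Sketch: a process consuming the tuple
// Hypothetical process - its input matches the [meta, file] tuple shape
process EXAMPLE_QC {
    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("${meta.id}_qc.txt")

    script:
    """
    echo "QC for ${meta.id} (${meta.organism}, priority: ${meta.priority})" > ${meta.id}_qc.txt
    """
}

Calling EXAMPLE_QC(ch_samples) in the workflow body would hand each process invocation its metadata map alongside the corresponding FASTQ file.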

Apply this change and run the workflow:

Test complete workflow
nextflow run main.nf

You should see output like:

Complete workflow output
[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/sample_001.fastq]
[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/sample_002.fastq]
[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, priority:high], /workspaces/training/side-quests/groovy_essentials/data/sequences/sample_003.fastq]

Note

Key Pattern: Nextflow operators often take closures { ... } as parameters. Everything inside these closures is Groovy code. This is how Nextflow orchestrates workflows while Groovy handles the data processing logic.

Note

Maps and Metadata: Maps are fundamental to working with metadata in Nextflow. For a more detailed explanation of working with metadata maps, see the Working with metadata side quest.

Our workflow demonstrates the core pattern: Nextflow constructs (workflow, Channel.fromPath(), .splitCsv(), .map(), .view()) orchestrate data flow, while basic Groovy constructs (maps [key: value], string methods, type conversions, ternary operators) handle the data processing logic inside the .map() closure.

1.2. Distinguishing Nextflow operators from Groovy functions

Having a clear understanding of which parts of your code are using basic Groovy is especially important when syntax overlaps between the two languages.

A perfect example of this confusion is the collect operation, which exists in both contexts but does completely different things. Groovy's collect transforms each element, while Nextflow's collect gathers all channel elements into a single-item channel.

Let's demonstrate this with some sample data. Check out collect.nf:

collect.nf
// Demonstrate Groovy vs Nextflow collect
def sample_ids = ['sample_001', 'sample_002', 'sample_003']

println "=== GROOVY COLLECT (transforms each item, keeps same structure) ==="
// Groovy collect: transforms each element but maintains list structure
def formatted_ids = sample_ids.collect { id ->
    id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_')
}
println "Original list: ${sample_ids}"
println "Groovy collect result: ${formatted_ids}"
println "Groovy collect maintains structure: ${formatted_ids.size} items (same as original)"
println ""

println "\n=== NEXTFLOW COLLECT (groups multiple items into single emission) ==="
// Nextflow collect: groups channel elements into a single emission
ch_input = Channel.of('sample_001', 'sample_002', 'sample_003')

// Show individual items before collect
ch_input.view { "Individual channel item: ${it}" }

// Collect groups all items into a single emission
ch_collected = ch_input.collect()
ch_collected.view { "Nextflow collect result: ${it} (${it.size()} items grouped together)" }

Run the workflow to see both collect operations in action:

Test collect operations
nextflow run collect.nf
Different collect behaviors
 N E X T F L O W   ~  version 25.04.6

Launching `collect.nf` [silly_bhaskara] DSL2 - revision: 5ef004224c

=== GROOVY COLLECT (transforms each item, keeps same structure) ===
Original list: [sample_001, sample_002, sample_003]
Groovy collect result: [SPECIMEN_001, SPECIMEN_002, SPECIMEN_003]
Groovy collect maintains structure: 3 items (same as original)

=== NEXTFLOW COLLECT (groups multiple items into single emission) ===
Individual channel item: sample_001
Individual channel item: sample_002
Individual channel item: sample_003
Nextflow collect result: [sample_001, sample_002, sample_003] (3 items grouped together)

The key difference: Groovy's collect transforms items but preserves structure (like Nextflow's map), while Nextflow's collect() groups multiple channel emissions into a single list.

But collect really isn't the main point. The key lesson: always distinguish between Groovy constructs (data structures) and Nextflow constructs (channels/workflows). Operations can share names but behave completely differently.

Takeaway

In this section, you've learned:

  • Distinguishing Nextflow from Groovy: How to identify which language construct you're using
  • Context matters: The same operation name can have completely different behaviors
  • Workflow structure: Nextflow provides the orchestration, Groovy provides the logic
  • Data transformation patterns: When to use Groovy methods vs Nextflow operators

Understanding these boundaries is essential for debugging, documentation, and writing maintainable workflows.

Now that we can distinguish between Nextflow and Groovy constructs, let's enhance our sample processing pipeline with more sophisticated data handling capabilities.


2. Advanced String Processing for Bioinformatics

Our basic pipeline processes CSV metadata well, but this is just the beginning. In production bioinformatics, you'll encounter files from different sequencing centers with varying naming conventions, legacy datasets with non-standard formats, and the need to extract critical information from filenames themselves.

The difference between a brittle workflow that breaks on unexpected input and a robust pipeline that adapts gracefully often comes down to mastering Groovy's string processing capabilities. Let's transform our pipeline to handle the messy realities of real-world bioinformatics data.

2.1. Pattern Matching and Regular Expressions

Many bioinformatics workflows encounter files with complex naming conventions that encode important metadata. Let's see how Groovy's pattern matching can extract this information automatically.

Let's start with a simple example of extracting sample information from file names:

main.nf
// Pattern matching for sample file names
def sample_files = [
    'Human_Liver_001.fastq',
    'mouse_brain_002.fastq',
    'SRR12345678.fastq'
]

// Simple pattern to extract organism and tissue
def pattern = ~/^(\w+)_(\w+)_(\d+)\.fastq$/

sample_files.each { filename ->
    def matcher = filename =~ pattern
    if (matcher) {
        println "${filename} -> Organism: ${matcher[0][1]}, Tissue: ${matcher[0][2]}, ID: ${matcher[0][3]}"
    } else {
        println "${filename} -> No standard pattern match"
    }
}

This demonstrates key Groovy string processing concepts:

  1. Regular expression literals using ~/pattern/ syntax - this creates a regex pattern without needing to escape backslashes
  2. Pattern matching with the =~ operator - this attempts to match a string against a regex pattern
  3. Matcher objects that capture groups with [0][1], [0][2], etc. - [0] refers to the entire match, [1], [2], etc. refer to captured groups in parentheses

Run this to see the pattern matching in action:

Test pattern matching
nextflow run main.nf
Pattern matching results
Human_Liver_001.fastq -> Organism: Human, Tissue: Liver, ID: 001
mouse_brain_002.fastq -> Organism: mouse, Tissue: brain, ID: 002
SRR12345678.fastq -> No standard pattern match
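If the matcher indexing looks opaque, it can help to inspect a single match in isolation:

Inspecting a matcher
def m = 'Human_Liver_001.fastq' =~ /^(\w+)_(\w+)_(\d+)\.fastq$/
assert m[0][0] == 'Human_Liver_001.fastq'   // the entire match
assert m[0][1] == 'Human'                   // first captured group
assert m[0][2] == 'Liver'                   // second captured group
assert m[0][3] == '001'                     // third captured group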

2.2. Creating Reusable Parsing Functions

Let's create a simple function to parse sample names and return structured metadata:

main.nf
// Function to extract sample metadata from filename
def parseSampleName(String filename) {
    def pattern = ~/^(\w+)_(\w+)_(\d+)\.fastq$/
    def matcher = filename =~ pattern

    if (matcher) {
        return [
            organism: matcher[0][1].toLowerCase(),
            tissue: matcher[0][2].toLowerCase(),
            sample_id: matcher[0][3],
            valid: true
        ]
    } else {
        return [
            filename: filename,
            valid: false
        ]
    }
}

// Test the parser
sample_files.each { filename ->
    def parsed = parseSampleName(filename)
    println "File: ${filename}"
    if (parsed.valid) {
        println "  Organism: ${parsed.organism}, Tissue: ${parsed.tissue}, ID: ${parsed.sample_id}"
    } else {
        println "  Could not parse filename"
    }
}

This demonstrates key Groovy function patterns:

  • Function definitions with def functionName(parameters) - similar to other languages but with dynamic typing
  • Map creation and return for structured data - maps are Groovy's primary data structure for returning multiple values
  • Conditional returns based on pattern matching success - functions can return different data structures based on conditions

2.3. Dynamic Script Logic in Processes

In Nextflow, dynamic behavior comes from using Groovy logic within process script blocks, rather than from assembling script strings outside of them. Here are some realistic patterns:

main.nf
// Process with conditional script logic
process QUALITY_FILTER {
    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("${meta.id}_filtered.fastq")

    script:
    // Groovy logic to determine parameters based on metadata
    def quality_threshold = meta.organism == 'human' ? 30 :
                           meta.organism == 'mouse' ? 28 : 25
    def min_length = meta.priority == 'high' ? 75 : 50

    // Conditional script sections
    def extra_qc = meta.priority == 'high' ? '--strict-quality' : ''

    """
    echo "Processing ${meta.id} (${meta.organism}, priority: ${meta.priority})"

    # Dynamic quality filtering based on sample characteristics
    fastp \\
        --in1 ${reads} \\
        --out1 ${meta.id}_filtered.fastq \\
        --qualified_quality_phred ${quality_threshold} \\
        --length_required ${min_length} \\
        ${extra_qc}

    echo "Applied quality threshold: ${quality_threshold}"
    echo "Applied length threshold: ${min_length}"
    """
}

// Process with completely different scripts based on organism
process ALIGN_READS {
    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("${meta.id}.bam")

    script:
    if (meta.organism == 'human') {
        """
        echo "Using human-specific STAR alignment"
        STAR --runMode alignReads \\
            --genomeDir /refs/human/STAR \\
            --readFilesIn ${reads} \\
            --outSAMtype BAM SortedByCoordinate \\
            --outFileNamePrefix ${meta.id}
        mv ${meta.id}Aligned.sortedByCoord.out.bam ${meta.id}.bam
        """
    } else if (meta.organism == 'mouse') {
        """
        echo "Using mouse-specific bowtie2 alignment"
        bowtie2 -x /refs/mouse/genome \\
            -U ${reads} \\
            --sensitive \\
            | samtools sort -o ${meta.id}.bam -
        """
    } else {
        """
        echo "Using generic alignment for ${meta.organism}"
        minimap2 -ax sr /refs/generic/genome.fa ${reads} \\
            | samtools sort -o ${meta.id}.bam -
        """
    }
}

// Using templates (Nextflow's built-in templating)
process GENERATE_REPORT {
    input:
    tuple val(meta), path(results)

    output:
    path("${meta.id}_report.html")

    script:
    template 'report_template.sh'
}

Now let's look at the template file that would go with this:

templates/report_template.sh
#!/bin/bash

# This template has access to all variables from the process input
# Groovy expressions are evaluated at runtime

echo "Generating report for sample: ${meta.id}"
echo "Organism: ${meta.organism}"
echo "Quality score: ${meta.quality}"

# Conditional logic in template
<% if (meta.organism == 'human') { %>
echo "Including human-specific quality metrics"
human_qc_script.py --input ${results} --output ${meta.id}_report.html
<% } else { %>
echo "Using standard quality metrics for ${meta.organism}"
generic_qc_script.py --input ${results} --output ${meta.id}_report.html
<% } %>

# Groovy variables can be used for calculations
<%
def priority_bonus = meta.priority == 'high' ? 0.1 : 0.0
def adjusted_score = (meta.quality + priority_bonus).round(2)
%>

echo "Adjusted quality score: ${adjusted_score}"
echo "Report generation complete"

This demonstrates realistic Nextflow patterns:

  • Conditional script blocks using Groovy if/else in the script section
  • Variable interpolation directly in script blocks
  • Template files with Groovy expressions (using <% %> and ${})
  • Dynamic parameter calculation based on metadata

2.4. Transforming File Collections into Command Arguments

A particularly powerful pattern is using Groovy logic in the script block to transform collections of files into properly formatted command-line arguments. This is essential when tools expect multiple files as separate arguments:

main.nf
// Process that needs to handle multiple input files
process JOINT_ANALYSIS {
    input:
    path sample_files  // This will be a list of files
    path reference

    output:
    path "joint_results.txt"

    script:
    // Transform file list into command arguments
    def file_args = sample_files.collect { file -> "--input ${file}" }.join(' ')
    def sample_names = sample_files.collect { file ->
        file.baseName.replaceAll(/\..*$/, '')
    }.join(',')

    """
    echo "Processing ${sample_files.size()} samples"
    echo "Sample names: ${sample_names}"

    # Use the transformed arguments in the actual command
    analysis_tool \\
        ${file_args} \\
        --reference ${reference} \\
        --output joint_results.txt \\
        --samples ${sample_names}
    """
}

// Process that builds complex command based on file characteristics
process VARIABLE_COMMAND {
    input:
    tuple val(meta), path(files)

    output:
    path "${meta.id}_processed.txt"

    script:
    // Complex command building based on file types and metadata
    def input_flags = files.collect { file ->
        def extension = file.getExtension()
        switch(extension) {
            case 'bam':
                return "--bam-input ${file}"
            case 'vcf':
                return "--vcf-input ${file}"
            case 'bed':
                return "--intervals ${file}"
            default:
                return "--data-input ${file}"
        }
    }.join(' ')

    // Additional flags based on metadata
    def extra_flags = meta.quality > 35 ? '--high-quality' : ''

    """
    echo "Building command for ${meta.id}"

    variant_caller \\
        ${input_flags} \\
        ${extra_flags} \\
        --output ${meta.id}_processed.txt
    """
}

Key patterns demonstrated:

  • File collection transformation: Using .collect{} to transform each file into a command argument
  • String joining: Using .join(' ') to combine arguments with spaces
  • File name manipulation: Using .baseName and .replaceAll() for sample names
  • Conditional argument building: Using switch statements or conditionals to build different arguments based on file types
  • Multiple transformations: Building both file arguments and sample name lists from the same collection

Takeaway

In this section, you've learned:

  • Regular expression patterns for bioinformatics file name parsing
  • Reusable parsing functions that return structured metadata
  • Process script logic with conditional parameter selection
  • File collection transformation into command-line arguments using .collect{} and .join()
  • Command building patterns based on file types and metadata

These string processing techniques form the foundation for handling complex data pipelines that need to adapt to different input formats and generate appropriate commands for bioinformatics tools.

With our pipeline now capable of extracting rich metadata from both CSV files and file names, we can make intelligent decisions about how to process different samples. Let's add conditional logic to route samples through appropriate analysis strategies.


3. Conditional Logic and Process Control

3.1. Strategy Selection Based on Sample Characteristics

Now that our pipeline can extract comprehensive sample metadata, we can use this information to automatically select the most appropriate analysis strategy for each sample. Different organisms, sequencing depths, and quality scores require different processing approaches.

main.nf
// Dynamic process selection based on sample characteristics
def selectAnalysisStrategy(Map sample_meta) {
    def strategy = [:]

    // Sequencing depth determines processing approach
    if (sample_meta.depth < 10_000_000) {
        strategy.approach = 'low_depth'
        strategy.processes = ['quality_check', 'simple_alignment']
        strategy.sensitivity = 'high'
    } else if (sample_meta.depth < 50_000_000) {
        strategy.approach = 'standard'
        strategy.processes = ['quality_check', 'trimming', 'alignment', 'variant_calling']
        strategy.sensitivity = 'standard'
    } else {
        strategy.approach = 'high_depth'
        strategy.processes = ['quality_check', 'trimming', 'alignment', 'variant_calling', 'structural_variants']
        strategy.sensitivity = 'sensitive'
    }

    // Organism-specific adjustments
    switch(sample_meta.organism) {
        case 'human':
            strategy.reference = 'GRCh38'
            strategy.known_variants = 'dbSNP'
            break
        case 'mouse':
            strategy.reference = 'GRCm39'
            strategy.known_variants = 'mgp_variants'
            break
        default:
            strategy.reference = 'custom'
            strategy.known_variants = null
    }

    // Quality-based modifications
    if (sample_meta.quality < 30) {
        strategy.extra_qc = true
        strategy.processes = ['extensive_qc'] + strategy.processes
    }

    return strategy
}

// Test strategy selection
ch_samples
    .map { meta, file ->
        def strategy = selectAnalysisStrategy(meta)
        println "\nSample: ${meta.id}"
        println "  Strategy: ${strategy.approach}"
        println "  Processes: ${strategy.processes.join(' -> ')}"
        println "  Reference: ${strategy.reference}"
        println "  Extra QC: ${strategy.extra_qc ?: false}"

        return [meta + strategy, file]
    }
    .view { meta, file -> "Ready for processing: ${meta.id} (${meta.approach})" }

This demonstrates several Groovy patterns commonly used in Nextflow workflows:

  • Numeric literals with underscores for readability (10_000_000) - underscores can be used in numbers to improve readability
  • Switch statements for multi-way branching - cleaner than multiple if/else statements
  • List concatenation with + operator - combines two lists into one
  • Elvis operator ?: for null handling - provides a default value if the left side is null or false
  • Map merging to combine metadata with strategy - the + operator merges two maps, with the right map taking precedence
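The last point is worth seeing in isolation, because it determines which value wins when both maps define the same key:

Map merge precedence
def defaults = [reference: 'custom', sensitivity: 'standard']
def overrides = [reference: 'GRCh38']
println defaults + overrides   // [reference:GRCh38, sensitivity:standard]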

3.2. Conditional Process Execution

In Nextflow, you control which processes run for which samples using when conditions and channel routing:

main.nf
// Different processes for different strategies
process BASIC_QC {
    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("${meta.id}_basic_qc.html")

    when:
    meta.approach == 'low_depth'

    script:
    """
    fastqc --quiet ${reads} -o ./
    mv *_fastqc.html ${meta.id}_basic_qc.html
    """
}

process COMPREHENSIVE_QC {
    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("${meta.id}_comprehensive_qc.html")

    when:
    meta.approach in ['standard', 'high_depth']

    script:
    def sensitivity = meta.sensitivity == 'high' ? '--strict' : ''
    """
    fastqc ${sensitivity} ${reads} -o ./
    # Additional QC for comprehensive analysis
    seqtk fqchk ${reads} > sequence_stats.txt
    mv *_fastqc.html ${meta.id}_comprehensive_qc.html
    """
}

process SIMPLE_ALIGNMENT {
    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("${meta.id}.bam")

    when:
    meta.approach == 'low_depth'

    script:
    """
    minimap2 -ax sr ${meta.reference} ${reads} \\
        | samtools sort -o ${meta.id}.bam -
    """
}

process SENSITIVE_ALIGNMENT {
    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("${meta.id}.bam")

    when:
    meta.approach in ['standard', 'high_depth']

    script:
    // use a distinct name rather than shadowing the built-in `params` object
    def aligner_opts = meta.sensitivity == 'sensitive' ? '--very-sensitive' : '--sensitive'
    """
    bowtie2 ${aligner_opts} -x ${meta.reference} -U ${reads} \\
        | samtools sort -o ${meta.id}.bam -
    """
}

// Workflow logic that routes to appropriate processes
workflow ANALYSIS_PIPELINE {
    take:
    samples_ch

    main:
    // All samples go through appropriate QC
    basic_qc_results = BASIC_QC(samples_ch)
    comprehensive_qc_results = COMPREHENSIVE_QC(samples_ch)

    // Combine QC results
    qc_results = basic_qc_results.mix(comprehensive_qc_results)

    // All samples go through appropriate alignment
    simple_alignment_results = SIMPLE_ALIGNMENT(samples_ch)
    sensitive_alignment_results = SENSITIVE_ALIGNMENT(samples_ch)

    // Combine alignment results
    alignment_results = simple_alignment_results.mix(sensitive_alignment_results)

    emit:
    qc = qc_results
    alignments = alignment_results
}

This shows realistic Nextflow patterns:

  • Separate processes for different strategies rather than dynamic generation
  • When conditions to control which processes run for which samples
  • Mix operator to combine results from different conditional processes
  • Process parameterization using metadata in script blocks

3.3. Channel-based Workflow Routing

The realistic way to handle conditional workflow assembly is through channel routing and filtering:

main.nf
workflow {
    // Read and enrich sample data with strategy
    ch_samples = Channel.fromPath(params.input)
        .splitCsv(header: true)
        .map { row ->
            def meta = [
                id: row.sample_id,
                organism: row.organism,
                depth: row.sequencing_depth.toInteger(),
                quality: row.quality_score.toDouble()
            ]

            // Add strategy information using our selectAnalysisStrategy function
            def strategy = selectAnalysisStrategy(meta)

            return [meta + strategy, file(row.file_path)]
        }

    // Split channel based on strategy requirements
    ch_samples
        .branch { meta, reads ->
            low_depth: meta.approach == 'low_depth'
                return [meta, reads]
            standard: meta.approach == 'standard'
                return [meta, reads]
            high_depth: meta.approach == 'high_depth'
                return [meta, reads]
        }
        .set { samples_by_strategy }

    // Route the branched samples through the analysis pipeline.
    // A named workflow can only be invoked once, so we recombine the branches;
    // the `when:` conditions inside its processes still apply the right strategy.
    ANALYSIS_PIPELINE(
        samples_by_strategy.low_depth
            .mix(samples_by_strategy.standard, samples_by_strategy.high_depth)
    )

    // For high-depth samples, also run structural variant calling
    high_depth_alignments = ANALYSIS_PIPELINE.out.alignments
        .filter { meta, bam -> meta.approach == 'high_depth' }

    STRUCTURAL_VARIANTS(high_depth_alignments)

    // Collect and organize all results
    all_qc = ANALYSIS_PIPELINE.out.qc.collect()
    all_alignments = ANALYSIS_PIPELINE.out.alignments.collect(flat: false)  // keep [meta, bam] pairs intact

    // Generate summary report based on what was actually run
    all_alignments
        .map { alignments ->
            def strategies = alignments.collect { meta, bam -> meta.approach }.unique()
            def total_samples = alignments.size()

            println "Pipeline Summary:"
            println "  Total samples processed: ${total_samples}"
            println "  Strategies used: ${strategies.join(', ')}"

            strategies.each { strategy ->
                def count = alignments.count { meta, bam -> meta.approach == strategy }
                println "    ${strategy}: ${count} samples"
            }
        }
        .view()
}

// Additional process for high-depth samples
process STRUCTURAL_VARIANTS {
    input:
    tuple val(meta), path(bam)

    output:
    tuple val(meta), path("${meta.id}.vcf")

    script:
    """
    delly call -g ${meta.reference} ${bam} -o ${meta.id}.vcf
    """
}

Key Nextflow patterns demonstrated:

  • Channel branching with .branch{} to split samples by strategy
  • Conditional process execution using when: directives and filtering
  • Channel routing to send different samples through different processes
  • Result collection and summary generation
  • Process reuse - the same workflow processes different sample types

Takeaway

In this section, you've learned:

  • Strategy selection using Groovy conditional logic
  • Process control with when conditions and channel routing
  • Workflow branching using channel operators like .branch() and .filter()
  • Metadata enrichment to drive process selection

These patterns help you write workflows that process different sample types appropriately while keeping your code organized and maintainable.

Our pipeline now intelligently routes samples through appropriate processes, but production workflows need to handle invalid data gracefully. Let's add validation and error handling to make our pipeline robust enough for real-world use.


4. Error Handling and Validation Patterns

4.1. Basic Input Validation

Before our pipeline processes samples through complex conditional logic, we should validate that the input data meets our requirements. Let's create validation functions that check sample metadata and provide useful error messages:

main.nf
// Simple validation function
def validateSample(Map sample) {
    def errors = []

    // Check required fields
    if (!sample.sample_id) {
        errors << "Missing sample_id"
    }

    if (!sample.organism) {
        errors << "Missing organism"
    }

    // Validate organism
    def valid_organisms = ['human', 'mouse', 'rat']
    if (sample.organism && !valid_organisms.contains(sample.organism.toLowerCase())) {
        errors << "Invalid organism: ${sample.organism}"
    }

    // Check sequencing depth is numeric
    if (sample.sequencing_depth) {
        try {
            def depth = sample.sequencing_depth as Integer
            if (depth < 1000000) {
                errors << "Sequencing depth too low: ${depth}"
            }
        } catch (NumberFormatException e) {
            errors << "Invalid sequencing depth: ${sample.sequencing_depth}"
        }
    }

    return errors
}

// Test validation
def test_samples = [
    [sample_id: 'SAMPLE_001', organism: 'human', sequencing_depth: '30000000'],
    [sample_id: '', organism: 'alien', sequencing_depth: 'invalid'],
    [sample_id: 'SAMPLE_003', organism: 'mouse', sequencing_depth: '500000']
]

test_samples.each { sample ->
    def errors = validateSample(sample)
    if (errors) {
        println "Sample ${sample.sample_id}: ${errors.join(', ')}"
    } else {
        println "Sample ${sample.sample_id}: Valid"
    }
}

4.2. Try-Catch Error Handling

Let's implement simple try-catch patterns for handling errors:

main.nf
// Process sample with error handling
def processSample(Map sample) {
    try {
        // Validate first
        def errors = validateSample(sample)
        if (errors) {
            throw new RuntimeException("Validation failed: ${errors.join(', ')}")
        }

        // Simulate processing
        def result = [
            id: sample.sample_id,
            organism: sample.organism,
            processed: true
        ]

        println "✓ Successfully processed ${sample.sample_id}"
        return result

    } catch (Exception e) {
        println "✗ Error processing ${sample.sample_id}: ${e.message}"

        // Return partial result
        return [
            id: sample.sample_id ?: 'unknown',
            organism: sample.organism ?: 'unknown',
            processed: false,
            error: e.message
        ]
    }
}

// Test error handling
test_samples.each { sample ->
    def result = processSample(sample)
    println "Result for ${result.id}: processed = ${result.processed}"
}

4.3. Setting Defaults and Validation

Let's create a simple function that provides defaults and validates configuration:

main.nf
// Simple configuration with defaults
def getConfig(Map user_params) {
    // Set defaults
    def defaults = [
        quality_threshold: 30,
        max_cpus: 4,
        output_dir: './results'
    ]

    // Merge user params with defaults
    def config = defaults + user_params

    // Simple validation
    if (config.quality_threshold < 0 || config.quality_threshold > 40) {
        println "Warning: Quality threshold ${config.quality_threshold} out of range, using default"
        config.quality_threshold = defaults.quality_threshold
    }

    if (config.max_cpus < 1) {
        println "Warning: Invalid CPU count ${config.max_cpus}, using default"
        config.max_cpus = defaults.max_cpus
    }

    return config
}

// Test configuration
def test_configs = [
    [:], // Empty - should get defaults
    [quality_threshold: 35, max_cpus: 8], // Valid values
    [quality_threshold: -5, max_cpus: 0] // Invalid values
]

test_configs.each { user_config ->
    def config = getConfig(user_config)
    println "Input: ${user_config} -> Output: ${config}"
}

Takeaway

In this section, you've learned:

  • Basic validation functions that check required fields and data types
  • Try-catch error handling for graceful failure handling
  • Configuration with defaults using map merging and validation

These patterns help you write workflows that handle invalid input gracefully and provide useful feedback to users.

Before diving into advanced closures, let's master some essential Groovy language features that make code more concise and null-safe. These operators and patterns are used throughout production Nextflow workflows and will make your code more robust and readable.


5. Essential Groovy Operators and Patterns

With our pipeline now handling complex conditional logic, we need to make it more robust against missing or malformed data. Bioinformatics workflows often deal with incomplete metadata, optional configuration parameters, and varying input formats. Let's enhance our pipeline with essential Groovy operators that handle these challenges gracefully.

5.1. Safe Navigation and Elvis Operators in Workflows

Note

Safe Navigation (?.) and Elvis (?:) Operators: These are essential for null-safe programming. Safe navigation returns null instead of throwing an exception if the object is null, while the Elvis operator provides a default value if the left side is null, empty, or false.

Both operators come up constantly when processing real-world biological data, where fields are frequently missing or empty:
main.nf
workflow {
    ch_samples = Channel.fromPath(params.input)
        .splitCsv(header: true)
        .map { row ->
            // Safe navigation prevents crashes on missing fields
            def sample_id = row.sample_id?.toLowerCase() ?: 'unknown_sample'
            def organism = row.organism?.toLowerCase() ?: 'unknown'

            // Elvis operator provides defaults
            def quality = (row.quality_score as Double) ?: 30.0
            def depth = (row.sequencing_depth as Integer) ?: 1_000_000

            // Chain operators for conditional defaults
            def reference = row.reference ?: (organism == 'human' ? 'GRCh38' : 'custom')

            // Groovy Truth - empty strings and nulls are false
            def priority = row.priority ?: (quality > 40 ? 'high' : 'normal')

            return [
                id: sample_id,
                organism: organism,
                quality: quality,
                depth: depth,
                reference: reference,
                priority: priority
            ]
        }
        .view { meta ->
            "Sample: ${meta.id} (${meta.organism}) - Quality: ${meta.quality}, Priority: ${meta.priority}"
        }
}

5.2. String Patterns and Multi-line Templates

Groovy provides powerful string features for parsing filenames and generating dynamic commands:

main.nf
workflow {
    // Demonstrate slashy strings for regex (no need to escape backslashes)
    def parseFilename = { filename ->
        // Slashy string - compare to the regular string form: "^(\\w+)_(\\w+)_(\\d+)\\.fastq$"
        // Slashy strings don't require escaping backslashes, making regex patterns much cleaner
        def pattern = /^(\w+)_(\w+)_(\d+)\.fastq$/
        def matcher = filename =~ pattern

        if (matcher) {
            return [
                organism: matcher[0][1].toLowerCase(),
                tissue: matcher[0][2].toLowerCase(),
                sample_id: matcher[0][3]
            ]
        } else {
            return [organism: 'unknown', tissue: 'unknown', sample_id: 'unknown']
        }
    }

    // Multi-line strings with interpolation for command generation
    def generateCommand = { meta ->
        def depth_category = meta.depth > 10_000_000 ? 'high' : 'standard'
        def db_path = meta.organism == 'human' ? '/db/human' : '/db/other'

        // Multi-line string with variable interpolation
        """
        echo "Processing ${meta.organism} sample: ${meta.sample_id}"
        analysis_tool \\
            --sample ${meta.sample_id} \\
            --depth-category ${depth_category} \\
            --database ${db_path} \\
            --threads ${params.max_cpus ?: 4}
        """
    }

    // Test the patterns
    ch_files = Channel.of('Human_Liver_001.fastq', 'Mouse_Brain_002.fastq')
        .map { filename ->
            def parsed = parseFilename(filename)
            def command = generateCommand([sample_id: parsed.sample_id, organism: parsed.organism, depth: 15_000_000])
            return [parsed, command]
        }
        .view { parsed, command -> "Parsed: ${parsed}, Command: ${command.trim().split('\n')[0]}..." }
}

5.3. Combining Operators for Robust Data Handling

Let's combine these operators in a realistic workflow scenario:

main.nf
workflow {
    ch_samples = Channel.fromPath(params.input)
        .splitCsv(header: true)
        .map { row ->
            // Combine safe navigation and Elvis operators
            def meta = [
                id: row.sample_id?.toLowerCase() ?: 'unknown',
                organism: row.organism ?: 'unknown',
                quality: (row.quality_score as Double) ?: 30.0,
                files: row.file_path ? [file(row.file_path)] : []
            ]

            // Use Groovy Truth for validation
            if (meta.files && meta.id != 'unknown') {
                return [meta, meta.files]
            } else {
                log.info "Skipping sample with missing data: ${meta.id}"
                return null
            }
        }
        .filter { it != null }  // Remove invalid samples using Groovy Truth
        .view { meta, files ->
            "Valid sample: ${meta.id} (${meta.organism}) - Quality: ${meta.quality}"
        }
}
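The validation above leans on Groovy Truth: when the path column is missing, meta.files is an empty list, and empty collections evaluate to false. A few standalone lines make the rule explicit:

Groovy Truth in practice
assert !null          // null is false
assert !''            // empty string is false
assert ![]            // empty list is false
assert !0             // zero is false
assert 'sample_001'   // non-empty string is true
assert [1, 2, 3]      // non-empty list is true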

Takeaway

In this section, you've learned:

  • Safe navigation operator (?.) for null-safe property access
  • Elvis operator (?:) for default values and null coalescing
  • Groovy Truth - how null, empty strings, and empty collections evaluate to false in boolean contexts
  • Slashy strings (/pattern/) for regex patterns without escaping
  • Multi-line string interpolation for command templates
  • Numeric literals with underscores for improved readability

Note

Groovy Truth: In Groovy, null, empty strings, empty collections, and zero are all considered "false" in boolean contexts. This is different from many other languages and is essential to understand for proper conditional logic.

These patterns make your code more resilient to missing data and easier to read, which is essential when processing diverse bioinformatics datasets.


6. Advanced Closures and Functional Programming

Our pipeline now handles missing data gracefully and processes complex input formats robustly. But as our workflow grows more sophisticated, we start seeing repeated patterns in our data transformation code. Instead of copy-pasting similar closures across different channel operations, let's learn how to create reusable, composable functions that make our code cleaner and more maintainable.

6.1. Named Closures for Reusability

Note

Closures: A closure is a block of code that can be assigned to a variable and executed later. Think of it as a function that can be passed around and reused. They're fundamental to Groovy's functional programming capabilities.

So far we've used anonymous closures defined inline within channel operations. When you find yourself repeating the same transformation logic across multiple processes or workflows, named closures can eliminate duplication and improve readability:


main.nf
// Define reusable closures for common transformations
def extractSampleInfo = { row ->
    [
        id: row.sample_id.toLowerCase(),
        organism: row.organism,
        quality: row.quality_score.toDouble(),
        depth: row.sequencing_depth.toInteger()
    ]
}

def addPriority = { meta ->
    meta + [priority: meta.quality > 40 ? 'high' : 'normal']
}

def formatForDisplay = { meta, file_path ->
    "Sample: ${meta.id} (${meta.organism}) - Quality: ${meta.quality}, Priority: ${meta.priority}"
}

workflow {
    // Use named closures in channel operations
    ch_samples = Channel.fromPath(params.input)
        .splitCsv(header: true)
        .map(extractSampleInfo)        // Named closure
        .map(addPriority)              // Named closure
        .map { meta -> [meta, file("./data/sequences/${meta.id}.fastq")] }
        .view(formatForDisplay)        // Named closure

    // Reuse the same closures elsewhere
    ch_filtered = ch_samples
        .filter { meta, file -> meta.quality > 30 }
        .map { meta, file -> addPriority(meta) }  // Reuse closure
        .view(formatForDisplay)                    // Reuse closure
}

6.2. Function Composition

Groovy closures can be composed together using the >> operator, allowing you to build complex transformations from simple, reusable pieces:

Function composition means chaining functions together so the output of one becomes the input of the next. The >> operator creates a new closure that applies multiple transformations in sequence.

main.nf
// Simple transformation closures
def normalizeId = { meta ->
    meta + [id: meta.id.toLowerCase().replaceAll(/[^a-z0-9_]/, '_')]
}

def addQualityCategory = { meta ->
    def category = meta.quality > 40 ? 'excellent' :
                  meta.quality > 30 ? 'good' :
                  meta.quality > 20 ? 'acceptable' : 'poor'
    meta + [quality_category: category]
}

def addProcessingFlags = { meta ->
    meta + [
        needs_extra_qc: meta.quality < 30,
        high_priority: meta.organism == 'human' && meta.quality > 35
    ]
}

// Compose transformations using >> operator
def enrichSample = normalizeId >> addQualityCategory >> addProcessingFlags

workflow {
    Channel.fromPath(params.input)
        .splitCsv(header: true)
        .map(extractSampleInfo)
        .map(enrichSample)          // Apply composed transformation
        .view { meta ->
            "Processed: ${meta.id} (${meta.quality_category}) - Extra QC: ${meta.needs_extra_qc}"
        }
}

6.3. Currying for Specialized Functions

Currying allows you to create specialized versions of general-purpose closures by fixing some of their parameters:

Currying is a technique where you take a function with multiple parameters and create a new function with some of those parameters "fixed" or "pre-filled". This creates specialized versions of general-purpose functions.

main.nf
// General-purpose filtering closure
def qualityFilter = { threshold, meta -> meta.quality >= threshold }

// Create specialized filters using currying
def highQualityFilter = qualityFilter.curry(40)
def standardQualityFilter = qualityFilter.curry(30)

workflow {
    ch_samples = Channel.fromPath(params.input)
        .splitCsv(header: true)
        .map(extractSampleInfo)

    // Use the specialized filters in different channel operations
    ch_high_quality = ch_samples.filter(highQualityFilter)
    ch_standard_quality = ch_samples.filter(standardQualityFilter)

    // Both channels can be processed differently
    ch_high_quality.view { meta -> "High quality: ${meta.id}" }
    ch_standard_quality.view { meta -> "Standard quality: ${meta.id}" }
}

6.4. Closures Accessing External Variables

Closures can access and modify variables from their defining scope, which is useful for collecting statistics:

main.nf
workflow {
    // Variable in the workflow scope
    def sample_count = 0
    def human_samples = 0

    // Closure that accesses and modifies external variables
    def countSamples = { meta ->
        sample_count++  // Modifies external variable
        if (meta.organism == 'human') {
            human_samples++  // Modifies another external variable
        }
        return meta  // Pass data through unchanged
    }

    Channel.fromPath(params.input)
        .splitCsv(header: true)
        .map(extractSampleInfo)
        .map(countSamples)          // Closure modifies external variables
        .collect()                  // Wait for all samples to be processed
        .view {
            "Processing complete: ${sample_count} total samples, ${human_samples} human samples"
        }
}

Takeaway

In this section, you've learned:

  • Named closures for eliminating code duplication and improving readability
  • Function composition with >> operator to build complex transformations
  • Currying to create specialized versions of general-purpose closures
  • Variable scope access in closures for collecting statistics and generating reports

These advanced patterns help you write more maintainable, reusable workflows that follow functional programming principles while remaining easy to understand and debug.

With our pipeline now capable of intelligent routing, robust error handling, and advanced functional programming patterns, we're ready for the final enhancement. As your workflows scale to process hundreds or thousands of samples, you'll need sophisticated data processing capabilities that can organize, filter, and analyze large collections efficiently.

The functional programming patterns we just learned work beautifully with Groovy's powerful collection methods. Instead of writing loops and conditional logic, you can chain together expressive operations that clearly describe what you want to accomplish.


7. Collection Operations and File Path Manipulations

7.1. Common Collection Methods in Channel Operations

When processing large datasets, channel operations often need to organize and analyze sample collections. Groovy's collection methods integrate seamlessly with Nextflow channels to provide powerful data processing capabilities:

Groovy provides many built-in methods for working with collections (lists, maps, etc.) that make data processing much more expressive than traditional loops.

main.nf
// Sample data with mixed quality and organisms
def samples = [
    [id: 'sample_001', organism: 'human', quality: 42, files: ['data1.txt', 'data2.txt']],
    [id: 'sample_002', organism: 'mouse', quality: 28, files: ['data3.txt']],
    [id: 'sample_003', organism: 'human', quality: 35, files: ['data4.txt', 'data5.txt', 'data6.txt']],
    [id: 'sample_004', organism: 'rat', quality: 45, files: ['data7.txt']],
    [id: 'sample_005', organism: 'human', quality: 30, files: ['data8.txt', 'data9.txt']]
]

// findAll - filter collections based on conditions
def high_quality_samples = samples.findAll { it.quality > 40 }
println "High quality samples: ${high_quality_samples.collect { it.id }.join(', ')}"

// groupBy - group samples by organism
def samples_by_organism = samples.groupBy { it.organism }
println "Grouping by organism:"
samples_by_organism.each { organism, sample_list ->
    println "  ${organism}: ${sample_list.size()} samples"
}

// unique - get unique organisms
def organisms = samples.collect { it.organism }.unique()
println "Unique organisms: ${organisms.join(', ')}"

// flatten - flatten nested file lists
def all_files = samples.collect { it.files }.flatten()
println "All files: ${all_files.take(5).join(', ')}... (${all_files.size()} total)"

// sort - sort samples by quality
def sorted_by_quality = samples.sort { it.quality }
println "Quality range: ${sorted_by_quality.first().quality} to ${sorted_by_quality.last().quality}"

// reverse - reverse the order
def reverse_quality = samples.sort { it.quality }.reverse()
println "Highest quality first: ${reverse_quality.collect { "${it.id}(${it.quality})" }.join(', ')}"

// count - count items matching condition
def human_samples = samples.count { it.organism == 'human' }
println "Human samples: ${human_samples} out of ${samples.size()}"

// any/every - check conditions across collection
def has_high_quality = samples.any { it.quality > 40 }
def all_have_files = samples.every { it.files.size() > 0 }
println "Has high quality samples: ${has_high_quality}"
println "All samples have files: ${all_have_files}"
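These methods also slot directly into channel operators, which is where the "seamless integration" mentioned above pays off. Below is a minimal sketch (with a trimmed copy of the samples list redeclared inside the workflow block so the sketch stands alone) that gathers all items with the collect channel operator and then applies Groovy's groupBy and findAll inside a map closure:

Collection methods inside a channel (sketch)
workflow {
    // Trimmed, hypothetical copy of the sample maps used above
    def samples = [
        [id: 'sample_001', organism: 'human', quality: 42],
        [id: 'sample_002', organism: 'mouse', quality: 28],
        [id: 'sample_004', organism: 'rat',   quality: 45]
    ]

    Channel
        .fromList(samples)          // emit each sample map as a channel item
        .collect()                  // Nextflow channel operator: gather all items into one list
        .map { all_samples ->
            def by_organism  = all_samples.groupBy { it.organism }      // Groovy collection method
            def high_quality = all_samples.findAll { it.quality > 40 }  // Groovy collection method
            "Organisms: ${by_organism.keySet().join(', ')} | high quality samples: ${high_quality.size()}"
        }
        .view()
}

Note the name clash: collect here is the Nextflow channel operator that waits for all items, not the Groovy list method used in the examples above.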

7.2. File Path Manipulations

Working with file paths is essential in bioinformatics workflows. Nextflow's file() function returns file objects with many useful methods for extracting information from paths, which combine naturally with Groovy's string and regex handling:

main.nf
// File path manipulation examples
def sample_files = [
    '/path/to/data/patient_001_R1.fastq.gz',
    '/path/to/data/patient_001_R2.fastq.gz',
    '/path/to/results/patient_002_analysis.bam',
    '/path/to/configs/experiment_setup.json'
]

sample_files.each { file_path ->
    def f = file(file_path)  // Create Nextflow file object

    println "\nFile: ${file_path}"
    println "  Name: ${f.getName()}"                    // Just filename
    println "  BaseName: ${f.getBaseName()}"            // Filename without extension
    println "  Extension: ${f.getExtension()}"          // File extension
    println "  Parent: ${f.getParent()}"                // Parent directory
    println "  Parent name: ${f.getParent().getName()}" // Just parent directory name

    // Extract sample ID from filename
    def matcher = f.getName() =~ /^(patient_\d+)/
    if (matcher) {
        println "  Sample ID: ${matcher[0][1]}"
    }
}

// Group files by sample ID using path manipulation
def files_by_sample = sample_files
    .findAll { it.contains('patient') }  // Only patient files
    .groupBy { file_path ->
        def filename = file(file_path).getName()
        def matcher = filename =~ /^(patient_\d+)/
        return matcher ? matcher[0][1] : 'unknown'
    }

println "\nFiles grouped by sample:"
files_by_sample.each { sample_id, files ->
    println "  ${sample_id}: ${files.size()} files"
}
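In a real workflow this path-parsing logic usually lives inside a map closure so that each file is tagged with its sample ID before grouping. Here is a minimal sketch, assuming paired FASTQ files named like the hypothetical paths above:

Grouping files by sample ID in a channel (sketch)
workflow {
    Channel
        .fromPath('/path/to/data/patient_*_R{1,2}.fastq.gz')
        .map { f ->
            def matcher   = f.getName() =~ /^(patient_\d+)/
            def sample_id = matcher ? matcher[0][1] : 'unknown'
            tuple(sample_id, f)                        // tag each file with its sample ID
        }
        .groupTuple()                                  // collect all files that share a sample ID
        .view { sample_id, files ->
            "${sample_id}: ${files.collect { it.getName() }.join(', ')}"
        }
}

For this particular read-pairing case Nextflow also offers Channel.fromFilePairs, but the manual version shows how the path methods and regex matching combine inside channel logic.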

7.3. The Spread Operator

The spread operator (*.) is a powerful Groovy shorthand for calling the same method on every element in a collection. It is equivalent to .collect { it.methodName() } but more concise:

main.nf
// Spread operator examples
def file_paths = [
    '/data/sample1.fastq',
    '/data/sample2.fastq',
    '/results/output1.bam',
    '/results/output2.bam'
]

// Convert to file objects
def files = file_paths.collect { file(it) }

// Using spread operator - equivalent to files.collect { it.getName() }
def filenames = files*.getName()
println "Filenames: ${filenames.join(', ')}"

// Get all parent directories
def parent_dirs = files*.getParent()*.getName()
println "Parent directories: ${parent_dirs.unique().join(', ')}"

// Get all extensions
def extensions = files*.getExtension().unique()
println "File types: ${extensions.join(', ')}"
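The spread operator is just as useful inside channel operators, for example when summarizing a collected set of files. A short sketch, assuming FASTQ files under a hypothetical /data directory:

Spread operator inside a channel operator (sketch)
workflow {
    Channel
        .fromPath('/data/*.fastq')
        .collect()     // gather all matching paths into a single list
        .view { fastqs -> "Staged ${fastqs.size()} files: ${fastqs*.getName().join(', ')}" }
}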

Takeaway

In this section, you've learned:

  • Collection filtering with findAll and conditional logic
  • Grouping and organizing data with groupBy and sort
  • File path manipulation using Nextflow's file object methods
  • Spread operator (*.) for concise collection operations

These patterns help you process and organize complex datasets efficiently, which is essential for handling real-world bioinformatics data.


Summary

Throughout this side quest, you've built a comprehensive sample processing pipeline that evolved from basic metadata handling to a sophisticated, production-ready workflow. Each section built upon the previous, demonstrating how Groovy transforms simple Nextflow workflows into powerful data processing systems.

Here's how we progressively enhanced our pipeline:

  1. Nextflow vs Groovy Boundaries: You learned to distinguish between workflow orchestration (Nextflow) and programming logic (Groovy), including the crucial difference between same-named constructs such as the collect channel operator and the collect list method.

  2. String Processing: You learned regular expressions, parsing functions, and file collection transformation for building dynamic command-line arguments.

  3. Conditional Logic: You added intelligent routing that automatically selects analysis strategies based on sample characteristics like organism, quality scores, and sequencing depth.

  4. Error Handling: You made the pipeline robust by adding validation functions, try-catch error handling, and configuration management with sensible defaults.

  5. Essential Groovy Operators: You mastered safe navigation (?.), Elvis (?:), Groovy Truth, slashy strings, and other key language features that make code more resilient and readable.

  6. Advanced Closures: You learned functional programming techniques including named closures, function composition, currying, and closures with variable scope access for building reusable, maintainable code.

  7. Collection Operations: You added sophisticated data processing capabilities using Groovy collection methods like findAll, groupBy, unique, flatten, and the spread operator to handle large-scale sample processing.

Key Benefits

  • Clearer code: Understanding when to use Nextflow vs Groovy helps you write more organized workflows
  • Better error handling: Basic validation and try-catch patterns help your workflows handle problems gracefully
  • Flexible processing: Conditional logic lets your workflows process different sample types appropriately
  • Configuration management: Using defaults and simple validation makes your workflows easier to use

From Simple to Sophisticated

The pipeline journey you completed demonstrates the evolution from basic data processing to production-ready bioinformatics workflows:

  1. Started simple: Basic CSV processing and metadata extraction with clear Nextflow vs Groovy boundaries
  2. Added intelligence: Dynamic file name parsing with regex patterns and conditional routing based on sample characteristics
  3. Made it robust: Null-safe operators, validation, error handling, and graceful failure management
  4. Made it maintainable: Advanced closure patterns, function composition, and reusable components that eliminate code duplication
  5. Scaled it efficiently: Collection operations for processing hundreds of samples with powerful data filtering and organization

This progression mirrors the real-world evolution of bioinformatics pipelines - from research prototypes handling a few samples to production systems processing thousands of samples across laboratories and institutions. Every challenge you solved and pattern you learned reflects actual problems developers face when scaling Nextflow workflows.

Next Steps

With these Groovy fundamentals mastered, you're ready to:

  • Write cleaner workflows with proper separation between Nextflow and Groovy logic
  • Transform file collections into properly formatted command-line arguments
  • Handle different file naming conventions and input formats gracefully
  • Build reusable, maintainable code using advanced closure patterns and functional programming
  • Process and organize complex datasets using collection operations
  • Add basic validation and error handling to make your workflows more user-friendly

Continue practicing these patterns in your own workflows, and refer to the Groovy documentation when you need to explore more advanced features.

Key Concepts Reference

  • Language Boundaries

    Nextflow vs Groovy examples
    // Nextflow: workflow orchestration
    Channel.fromPath('samples.csv').splitCsv(header: true)
    
    // Groovy: data processing
    sample_data.collect { it.toUpperCase() }
    

  • String Processing

    String processing examples
    // Pattern matching
    filename =~ /^(\w+)_(\w+)_(\d+)\.fastq$/
    
    // Function with conditional return
    def parseSample(filename) {
        def matcher = filename =~ pattern
        return matcher ? [valid: true, data: matcher[0]] : [valid: false]
    }
    
    // File collection to command arguments (in process script block)
    script:
    def file_args = input_files.collect { file -> "--input ${file}" }.join(' ')
    """
    analysis_tool ${file_args} --output results.txt
    """
    

  • Error Handling

    Error handling patterns
    try {
        def errors = validateSample(sample)
        if (errors) throw new RuntimeException("Invalid: ${errors.join(', ')}")
    } catch (Exception e) {
        println "Error: ${e.message}"
    }
    

  • Essential Groovy Operators

    Essential operators examples
    // Safe navigation and Elvis operators
    def id = data?.sample?.id ?: 'unknown'
    if (sample.files) println "Has files"  // Groovy Truth
    
    // Slashy strings for regex
    def pattern = /^\w+_R[12]\.fastq$/
    def script = """
    echo "Processing ${sample.id}"
    analysis --depth ${depth ?: 1_000_000}
    """
    

  • Advanced Closures

    Advanced closure patterns
    // Named closures and composition
    def enrichData = normalizeId >> addQualityCategory >> addFlags
    def processor = generalFunction.curry(fixedParam)
    
    // Closures with scope access
    def collectStats = { data -> stats.count++; return data }
    

  • Collection Operations

    Collection operations examples
    // Filter, group, and organize data
    def high_quality = samples.findAll { it.quality > 40 }
    def by_organism = samples.groupBy { it.organism }
    def file_names = files*.getName()  // Spread operator
    def all_files = nested_lists.flatten()
    

Resources