
Sync Tasks

A comprehensive guide to using sync tasks in Qarion ETL flows to synchronize files and data between different storage locations.

Overview

Sync tasks allow you to synchronize files and data between different storage locations (S3 buckets, local filesystem, FTP, SFTP, etc.) using the connector system. This enables data replication, backup, and distribution across multiple storage backends.

Quick Start

Basic S3 to S3 Sync

[[tasks]]
id = "sync_s3_buckets"
type = "sync"
name = "Sync S3 Buckets"
description = "Sync data from source bucket to destination bucket"

[tasks.properties]
source_path = "s3://source-bucket/data/"
destination_path = "s3://destination-bucket/data/"
recursive = true

Sync with Pattern Matching

[[tasks]]
id = "sync_csv_files"
type = "sync"
name = "Sync CSV Files"

[tasks.properties]
source_path = "s3://bucket1/data/"
destination_path = "s3://bucket2/data/"
pattern = "*.csv"
recursive = true
overwrite = true

Configuration

Required Properties

  • source_path: Source storage path (e.g., s3://bucket1/path/, /local/path/, ftp://server/path/)
  • destination_path: Destination storage path (e.g., s3://bucket2/path/, /local/path/)

Optional Properties

  • pattern: File pattern to match (e.g., "*.csv", "data_*.parquet", "*.{csv,json}")
  • recursive: Whether to sync recursively (default: false)
  • overwrite: Whether to overwrite existing files (default: true)
  • sync_config: Additional configuration for source and destination connectors

Storage Backends

Sync tasks support all connector types:

S3 to S3

[[tasks]]
id = "s3_sync"
type = "sync"

[tasks.properties]
source_path = "s3://source-bucket/data/"
destination_path = "s3://dest-bucket/data/"
recursive = true

[tasks.properties.sync_config.source_config]
aws_access_key_id = "${credential:aws_access_key}"
aws_secret_access_key = "${credential:aws_secret_key}"
region_name = "us-east-1"

[tasks.properties.sync_config.destination_config]
aws_access_key_id = "${credential:aws_access_key_2}"
aws_secret_access_key = "${credential:aws_secret_key_2}"
region_name = "us-west-2"

Local to S3

[[tasks]]
id = "local_to_s3"
type = "sync"

[tasks.properties]
source_path = "/local/data/"
destination_path = "s3://backup-bucket/data/"
recursive = true
pattern = "*.parquet"

[tasks.properties.sync_config.destination_config]
aws_access_key_id = "${credential:aws_access_key}"
aws_secret_access_key = "${credential:aws_secret_key}"

S3 to Local

[[tasks]]
id = "s3_to_local"
type = "sync"

[tasks.properties]
source_path = "s3://source-bucket/data/"
destination_path = "/local/backup/"
recursive = true

[tasks.properties.sync_config.source_config]
aws_access_key_id = "${credential:aws_access_key}"
aws_secret_access_key = "${credential:aws_secret_key}"

FTP/SFTP Sync

[[tasks]]
id = "ftp_sync"
type = "sync"

[tasks.properties]
source_path = "ftp://ftp.example.com/data/"
destination_path = "s3://backup-bucket/data/"
recursive = true

[tasks.properties.sync_config.source_config]
host = "ftp.example.com"
username = "${credential:ftp_username}"
password = "${credential:ftp_password}"
port = 21

[tasks.properties.sync_config.destination_config]
aws_access_key_id = "${credential:aws_access_key}"
aws_secret_access_key = "${credential:aws_secret_key}"

Cross-Region S3 Sync

[[tasks]]
id = "cross_region_sync"
type = "sync"

[tasks.properties]
source_path = "s3://us-east-1-bucket/data/"
destination_path = "s3://eu-west-1-bucket/data/"
recursive = true

[tasks.properties.sync_config.source_config]
region_name = "us-east-1"
aws_access_key_id = "${credential:aws_access_key}"
aws_secret_access_key = "${credential:aws_secret_key}"

[tasks.properties.sync_config.destination_config]
region_name = "eu-west-1"
aws_access_key_id = "${credential:aws_access_key}"
aws_secret_access_key = "${credential:aws_secret_key}"

Pattern Matching

Glob Patterns

[tasks.properties]
pattern = "*.csv" # All CSV files
pattern = "data_*.parquet" # Parquet files starting with "data_"
pattern = "*.{csv,json}" # CSV or JSON files
pattern = "2024-*.csv" # CSV files starting with "2024-"
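The semantics of these glob patterns can be illustrated with Python's standard fnmatch module. Brace alternation such as *.{csv,json} is not part of basic fnmatch, so this sketch expands braces first; how the sync task itself evaluates patterns is an assumption here, and the helper names are illustrative:

```python
from fnmatch import fnmatch

def expand_braces(pattern):
    """Expand a single {a,b} alternation into plain glob patterns."""
    if "{" not in pattern:
        return [pattern]
    head, rest = pattern.split("{", 1)
    body, tail = rest.split("}", 1)
    return [head + alt + tail for alt in body.split(",")]

def matches(name, pattern):
    """True if the file name matches any expansion of the pattern."""
    return any(fnmatch(name, p) for p in expand_braces(pattern))

files = ["report.csv", "data_01.parquet", "notes.json", "image.png"]
print([f for f in files if matches(f, "*.{csv,json}")])    # report.csv, notes.json
print([f for f in files if matches(f, "data_*.parquet")])  # data_01.parquet
```

Note that each pattern line above is an alternative: a task's properties table holds a single pattern key.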

Recursive Sync

[tasks.properties]
source_path = "s3://bucket/data/"
destination_path = "s3://backup/data/"
recursive = true # Sync all subdirectories

Overwrite Behavior

Overwrite Existing Files (Default)

[tasks.properties]
overwrite = true # Always overwrite destination files

Skip Existing Files

[tasks.properties]
overwrite = false # Skip files that already exist in destination
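The overwrite flag reduces to a per-file decision: copy everything, or skip files already present at the destination. A minimal sketch of that logic (hypothetical helper, not the Qarion implementation):

```python
def plan_sync(source_files, existing_destination_files, overwrite=True):
    """Split source files into those to copy and those to skip."""
    to_copy, skipped = [], []
    for name in source_files:
        if not overwrite and name in existing_destination_files:
            skipped.append(name)  # file exists and overwrite = false
        else:
            to_copy.append(name)  # new file, or overwrite = true
    return to_copy, skipped

copy, skip = plan_sync(["a.csv", "b.csv"], {"b.csv"}, overwrite=False)
print(copy, skip)  # ['a.csv'] ['b.csv']
```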

Examples

Example 1: Daily Backup to S3

[[tasks]]
id = "daily_backup"
type = "sync"
name = "Daily Backup"
description = "Backup local data to S3 daily"

[tasks.properties]
source_path = "/data/production/"
destination_path = "s3://backup-bucket/daily/{{ execution_date }}/"
recursive = true
overwrite = true

[tasks.properties.sync_config.destination_config]
aws_access_key_id = "${credential:backup_aws_key}"
aws_secret_access_key = "${credential:backup_aws_secret}"
region_name = "us-west-2"

Example 2: Sync Specific File Types

[[tasks]]
id = "sync_parquet"
type = "sync"
name = "Sync Parquet Files"

[tasks.properties]
source_path = "s3://source-bucket/data/"
destination_path = "s3://analytics-bucket/data/"
pattern = "*.parquet"
recursive = true
overwrite = false # Don't overwrite existing files

Example 3: Multi-Bucket Replication

[[tasks]]
id = "replicate_to_regions"
type = "sync"
name = "Replicate to US West"

[tasks.properties]
source_path = "s3://primary-bucket/data/"
destination_path = "s3://us-west-bucket/data/"
recursive = true

[[tasks]]
id = "replicate_to_eu"
type = "sync"
name = "Replicate to EU"
dependencies = ["replicate_to_regions"]

[tasks.properties]
source_path = "s3://primary-bucket/data/"
destination_path = "s3://eu-bucket/data/"
recursive = true

Example 4: FTP to S3 Archive

[[tasks]]
id = "ftp_archive"
type = "sync"
name = "Archive FTP to S3"

[tasks.properties]
source_path = "ftp://legacy.example.com/data/"
destination_path = "s3://archive-bucket/ftp-backup/{{ execution_date }}/"
recursive = true
pattern = "*.csv"

[tasks.properties.sync_config.source_config]
host = "legacy.example.com"
username = "${credential:ftp_username}"
password = "${credential:ftp_password}"

[tasks.properties.sync_config.destination_config]
aws_access_key_id = "${credential:aws_access_key}"
aws_secret_access_key = "${credential:aws_secret_key}"

Path Handling

Directory Sync

When syncing directories, the relative path structure is preserved:

Source: s3://bucket1/data/
- file1.csv
- subdir/file2.csv

Destination: s3://bucket2/backup/
Result:
- backup/file1.csv
- backup/subdir/file2.csv
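One way to picture this mapping: the destination key is the destination prefix joined with the file's path relative to the source prefix. A sketch using pathlib (an illustration of the rule above, not the actual connector code):

```python
from pathlib import PurePosixPath

def destination_key(source_prefix, destination_prefix, source_key):
    """Map a source object key to its destination key, preserving structure."""
    relative = PurePosixPath(source_key).relative_to(source_prefix)
    return str(PurePosixPath(destination_prefix) / relative)

print(destination_key("data", "backup", "data/file1.csv"))         # backup/file1.csv
print(destination_key("data", "backup", "data/subdir/file2.csv"))  # backup/subdir/file2.csv
```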

Single File Sync

[tasks.properties]
source_path = "s3://bucket1/data/file.csv"
destination_path = "s3://bucket2/data/file.csv"

Best Practices

  1. Use Credential References:

    aws_access_key_id = "${credential:aws_access_key}"

    Never hardcode credentials in flow definitions.

  2. Use Patterns for Selective Sync:

    pattern = "*.parquet"  # Only sync Parquet files
  3. Set Overwrite Appropriately:

    • Use overwrite = true for backups and replication
    • Use overwrite = false for incremental syncs
  4. Use Recursive for Directories:

    recursive = true  # Sync all subdirectories
  5. Monitor Sync Results:

    • Check files_synced count
    • Review files_skipped for overwrite conflicts
    • Monitor errors in execution results
  6. Use Date Templates:

    destination_path = "s3://bucket/backup/{{ execution_date }}/"
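For instance, a {{ execution_date }} placeholder in the destination path expands to the flow's run date, giving each run its own prefix. The exact template engine is an assumption; this sketch uses a plain substitution to show the effect:

```python
from datetime import date

def render_path(template, execution_date):
    """Substitute the execution_date placeholder into a storage path."""
    return template.replace("{{ execution_date }}", execution_date.isoformat())

print(render_path("s3://backup-bucket/daily/{{ execution_date }}/", date(2024, 6, 1)))
# s3://backup-bucket/daily/2024-06-01/
```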

Troubleshooting

No Files Synced

  1. Check Source Path:

    • Verify source path exists
    • Check path format (trailing slash for directories)
    • Verify connector credentials
  2. Check Pattern:

    • Ensure pattern matches file names
    • Test pattern with list_files first
  3. Check Permissions:

    • Verify read access to source
    • Verify write access to destination

Files Skipped

  1. Check Overwrite Setting:

    overwrite = true  # Force overwrite
  2. Check Destination:

    • Verify destination path is correct
    • Check if files already exist

Performance Issues

  1. Use Patterns:

    • Sync only needed file types
    • Avoid syncing unnecessary files
  2. Batch Processing:

    • Sync in smaller batches for large directories
    • Use multiple sync tasks for parallel processing
  3. Direct Copy:

    • Syncs between the same connector type may use a direct copy (faster)
    • Syncs between different connector types download and re-upload each file (slower)