# Sync Tasks

A comprehensive guide to using sync tasks in Qarion ETL flows to synchronize files and data between different storage locations.

## Overview

Sync tasks synchronize files and data between different storage locations (S3 buckets, the local filesystem, FTP, SFTP, etc.) using the connector system. This enables data replication, backup, and distribution across multiple storage backends.

## Quick Start

### Basic S3 to S3 Sync
```toml
[[tasks]]
id = "sync_s3_buckets"
type = "sync"
name = "Sync S3 Buckets"
description = "Sync data from source bucket to destination bucket"

[tasks.properties]
source_path = "s3://source-bucket/data/"
destination_path = "s3://destination-bucket/data/"
recursive = true
```
### Sync with Pattern Matching
```toml
[[tasks]]
id = "sync_csv_files"
type = "sync"
name = "Sync CSV Files"

[tasks.properties]
source_path = "s3://bucket1/data/"
destination_path = "s3://bucket2/data/"
pattern = "*.csv"
recursive = true
overwrite = true
```
## Configuration

### Required Properties

- `source_path`: Source storage path (e.g., `s3://bucket1/path/`, `/local/path/`, `ftp://server/path/`)
- `destination_path`: Destination storage path (e.g., `s3://bucket2/path/`, `/local/path/`)

### Optional Properties

- `pattern`: File pattern to match (e.g., `"*.csv"`, `"data_*.parquet"`, `"*.{csv,json}"`)
- `recursive`: Whether to sync recursively (default: `false`)
- `overwrite`: Whether to overwrite existing files (default: `true`)
- `sync_config`: Additional configuration for the source and destination connectors
## Storage Backends

Sync tasks support all connector types:

### S3 to S3
```toml
[[tasks]]
id = "s3_sync"
type = "sync"

[tasks.properties]
source_path = "s3://source-bucket/data/"
destination_path = "s3://dest-bucket/data/"
recursive = true

[tasks.properties.sync_config.source_config]
aws_access_key_id = "${credential:aws_access_key}"
aws_secret_access_key = "${credential:aws_secret_key}"
region_name = "us-east-1"

[tasks.properties.sync_config.destination_config]
aws_access_key_id = "${credential:aws_access_key_2}"
aws_secret_access_key = "${credential:aws_secret_key_2}"
region_name = "us-west-2"
```
### Local to S3
```toml
[[tasks]]
id = "local_to_s3"
type = "sync"

[tasks.properties]
source_path = "/local/data/"
destination_path = "s3://backup-bucket/data/"
recursive = true
pattern = "*.parquet"

[tasks.properties.sync_config.destination_config]
aws_access_key_id = "${credential:aws_access_key}"
aws_secret_access_key = "${credential:aws_secret_key}"
```
### S3 to Local
```toml
[[tasks]]
id = "s3_to_local"
type = "sync"

[tasks.properties]
source_path = "s3://source-bucket/data/"
destination_path = "/local/backup/"
recursive = true

[tasks.properties.sync_config.source_config]
aws_access_key_id = "${credential:aws_access_key}"
aws_secret_access_key = "${credential:aws_secret_key}"
```
### FTP/SFTP Sync
```toml
[[tasks]]
id = "ftp_sync"
type = "sync"

[tasks.properties]
source_path = "ftp://ftp.example.com/data/"
destination_path = "s3://backup-bucket/data/"
recursive = true

[tasks.properties.sync_config.source_config]
host = "ftp.example.com"
username = "${credential:ftp_username}"
password = "${credential:ftp_password}"
port = 21

[tasks.properties.sync_config.destination_config]
aws_access_key_id = "${credential:aws_access_key}"
aws_secret_access_key = "${credential:aws_secret_key}"
```
### Cross-Region S3 Sync
```toml
[[tasks]]
id = "cross_region_sync"
type = "sync"

[tasks.properties]
source_path = "s3://us-east-1-bucket/data/"
destination_path = "s3://eu-west-1-bucket/data/"
recursive = true

[tasks.properties.sync_config.source_config]
region_name = "us-east-1"
aws_access_key_id = "${credential:aws_access_key}"
aws_secret_access_key = "${credential:aws_secret_key}"

[tasks.properties.sync_config.destination_config]
region_name = "eu-west-1"
aws_access_key_id = "${credential:aws_access_key}"
aws_secret_access_key = "${credential:aws_secret_key}"
```
## Pattern Matching

### Glob Patterns
These glob patterns are supported (set one `pattern` per task):

```toml
[tasks.properties]
pattern = "*.csv"           # All CSV files
pattern = "data_*.parquet"  # Parquet files starting with "data_"
pattern = "*.{csv,json}"    # CSV or JSON files
pattern = "2024-*.csv"      # CSV files starting with "2024-"
```
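As an illustration, the selection behavior of these patterns can be sketched with Python's `fnmatch`. The brace-set handling for forms like `*.{csv,json}` is an assumption here: plain `fnmatch` has no brace syntax, so the sketch expands the alternatives manually, and Qarion's actual matcher may differ.

```python
from fnmatch import fnmatch

def matches(name: str, pattern: str) -> bool:
    """Return True if `name` matches the glob `pattern`.

    Hypothetical sketch: brace sets ("*.{csv,json}") are expanded into
    plain fnmatch alternatives before matching.
    """
    if "{" in pattern:
        prefix, _, rest = pattern.partition("{")
        alts, _, suffix = rest.partition("}")
        return any(fnmatch(name, prefix + alt + suffix) for alt in alts.split(","))
    return fnmatch(name, pattern)

print(matches("report.csv", "*.csv"))                  # True
print(matches("data_2024.parquet", "data_*.parquet"))  # True
print(matches("notes.json", "*.{csv,json}"))           # True
print(matches("image.png", "*.{csv,json}"))            # False
```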
### Recursive Sync

```toml
[tasks.properties]
source_path = "s3://bucket/data/"
destination_path = "s3://backup/data/"
recursive = true  # Sync all subdirectories
```
## Overwrite Behavior

### Overwrite Existing Files (Default)

```toml
[tasks.properties]
overwrite = true  # Always overwrite destination files
```
### Skip Existing Files

```toml
[tasks.properties]
overwrite = false  # Skip files that already exist in the destination
```
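A minimal sketch of the decision these two settings imply. The `should_copy` helper and the `dest_exists` check are hypothetical, not Qarion's actual API:

```python
def should_copy(dest_exists: bool, overwrite: bool) -> bool:
    """Copy unless the file already exists and overwrite is disabled."""
    return overwrite or not dest_exists

print(should_copy(dest_exists=True, overwrite=True))    # True: always overwritten
print(should_copy(dest_exists=True, overwrite=False))   # False: counted as skipped
print(should_copy(dest_exists=False, overwrite=False))  # True: new file, copied
```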
## Examples

### Example 1: Daily Backup to S3
```toml
[[tasks]]
id = "daily_backup"
type = "sync"
name = "Daily Backup"
description = "Backup local data to S3 daily"

[tasks.properties]
source_path = "/data/production/"
destination_path = "s3://backup-bucket/daily/{{ execution_date }}/"
recursive = true
overwrite = true

[tasks.properties.sync_config.destination_config]
aws_access_key_id = "${credential:backup_aws_key}"
aws_secret_access_key = "${credential:backup_aws_secret}"
region_name = "us-west-2"
```
### Example 2: Sync Specific File Types
```toml
[[tasks]]
id = "sync_parquet"
type = "sync"
name = "Sync Parquet Files"

[tasks.properties]
source_path = "s3://source-bucket/data/"
destination_path = "s3://analytics-bucket/data/"
pattern = "*.parquet"
recursive = true
overwrite = false  # Don't overwrite existing files
```
### Example 3: Multi-Bucket Replication
```toml
[[tasks]]
id = "replicate_to_regions"
type = "sync"
name = "Replicate to US West"

[tasks.properties]
source_path = "s3://primary-bucket/data/"
destination_path = "s3://us-west-bucket/data/"
recursive = true

[[tasks]]
id = "replicate_to_eu"
type = "sync"
name = "Replicate to EU"
dependencies = ["replicate_to_regions"]

[tasks.properties]
source_path = "s3://primary-bucket/data/"
destination_path = "s3://eu-bucket/data/"
recursive = true
```
### Example 4: FTP to S3 Archive
```toml
[[tasks]]
id = "ftp_archive"
type = "sync"
name = "Archive FTP to S3"

[tasks.properties]
source_path = "ftp://legacy.example.com/data/"
destination_path = "s3://archive-bucket/ftp-backup/{{ execution_date }}/"
recursive = true
pattern = "*.csv"

[tasks.properties.sync_config.source_config]
host = "legacy.example.com"
username = "${credential:ftp_username}"
password = "${credential:ftp_password}"

[tasks.properties.sync_config.destination_config]
aws_access_key_id = "${credential:aws_access_key}"
aws_secret_access_key = "${credential:aws_secret_key}"
```
## Path Handling

### Directory Sync
When syncing directories, the relative path structure is preserved:

```
Source: s3://bucket1/data/
  - file1.csv
  - subdir/file2.csv

Destination: s3://bucket2/backup/

Result:
  - backup/file1.csv
  - backup/subdir/file2.csv
```
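The preserved-structure behavior above can be sketched as a pure path computation. `destination_key` is an illustrative helper, not part of the Qarion API, and the bucket-relative paths are placeholders:

```python
import posixpath

def destination_key(source_root: str, source_file: str, dest_root: str) -> str:
    """Map a source file to its destination key, keeping the relative structure."""
    rel = posixpath.relpath(source_file, source_root)
    return posixpath.join(dest_root, rel)

print(destination_key("bucket1/data", "bucket1/data/file1.csv", "bucket2/backup"))
# bucket2/backup/file1.csv
print(destination_key("bucket1/data", "bucket1/data/subdir/file2.csv", "bucket2/backup"))
# bucket2/backup/subdir/file2.csv
```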
### Single File Sync

```toml
[tasks.properties]
source_path = "s3://bucket1/data/file.csv"
destination_path = "s3://bucket2/data/file.csv"
```
## Best Practices

1. **Use Credential References:**

   ```toml
   aws_access_key_id = "${credential:aws_access_key}"
   ```

   Never hardcode credentials in flow definitions.

2. **Use Patterns for Selective Sync:**

   ```toml
   pattern = "*.parquet"  # Only sync Parquet files
   ```

3. **Set Overwrite Appropriately:**
   - Use `overwrite = true` for backups and replication
   - Use `overwrite = false` for incremental syncs

4. **Use Recursive for Directories:**

   ```toml
   recursive = true  # Sync all subdirectories
   ```

5. **Monitor Sync Results:**
   - Check the `files_synced` count
   - Review `files_skipped` for overwrite conflicts
   - Monitor errors in execution results

6. **Use Date Templates:**

   ```toml
   destination_path = "s3://bucket/backup/{{ execution_date }}/"
   ```
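As an illustration of the date-template practice, rendering `{{ execution_date }}` can be sketched as a simple substitution. Qarion's actual templating engine is not specified here, so the `render` helper is purely hypothetical:

```python
from datetime import date

def render(path_template: str, execution_date: date) -> str:
    """Substitute the {{ execution_date }} placeholder (illustrative only)."""
    return path_template.replace("{{ execution_date }}", execution_date.isoformat())

print(render("s3://bucket/backup/{{ execution_date }}/", date(2024, 6, 1)))
# s3://bucket/backup/2024-06-01/
```

With a daily schedule, each run then lands in its own dated prefix, which keeps backups from overwriting each other.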
## Troubleshooting

### No Files Synced
1. **Check Source Path:**
   - Verify the source path exists
   - Check the path format (trailing slash for directories)
   - Verify connector credentials

2. **Check Pattern:**
   - Ensure the pattern matches the file names
   - Test the pattern with `list_files` first

3. **Check Permissions:**
   - Verify read access to the source
   - Verify write access to the destination
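One way to test a pattern before running the sync is to list the source files and apply the pattern locally. `preview_matches` is a hypothetical helper here, and it assumes you can obtain the file list (for example via the connector's `list_files`):

```python
from fnmatch import fnmatch

def preview_matches(files: list[str], pattern: str) -> list[str]:
    """Show which of the listed files the glob pattern would select."""
    return [f for f in files if fnmatch(f, pattern)]

files = ["data_a.csv", "data_b.csv", "readme.txt"]
print(preview_matches(files, "*.csv"))  # ['data_a.csv', 'data_b.csv']
```

An empty result here explains a "no files synced" run before any data moves.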
### Files Skipped

1. **Check Overwrite Setting:**

   ```toml
   overwrite = true  # Force overwrite
   ```

2. **Check Destination:**
   - Verify the destination path is correct
   - Check whether the files already exist
### Performance Issues

1. **Use Patterns:**
   - Sync only the file types you need
   - Avoid syncing unnecessary files

2. **Batch Processing:**
   - Sync large directories in smaller batches
   - Use multiple sync tasks for parallel processing

3. **Direct Copy:**
   - Syncs between the same connector type may use a direct copy (faster)
   - Syncs between different connector types use download/upload (slower)
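The direct-copy distinction can be sketched as a comparison of the two paths' URL schemes. This is an assumption about how backends are identified, not Qarion's actual implementation:

```python
def transfer_strategy(source_path: str, destination_path: str) -> str:
    """Guess the transfer strategy from the path schemes (illustrative only)."""
    def scheme(p: str) -> str:
        # Paths without "://" are treated as local filesystem paths.
        return p.split("://", 1)[0] if "://" in p else "file"

    if scheme(source_path) == scheme(destination_path):
        return "direct-copy"      # e.g. an S3 server-side copy, no local round-trip
    return "download-upload"      # data streams through the sync worker

print(transfer_strategy("s3://a/data/", "s3://b/data/"))  # direct-copy
print(transfer_strategy("/local/data/", "s3://b/data/"))  # download-upload
```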