Overview
This manual provides reference material for Swift: the SwiftScript language and the Swift runtime system. For introductory material, consult the Swift tutorial.
Swift is a data-oriented, coarse-grained scripting language that supports dataset typing and mapping, dataset iteration, conditional branching, and procedural composition.
Swift programs (or workflows) are written in a language called SwiftScript.
SwiftScript programs are dataflow-oriented: they are primarily concerned with processing (possibly large) collections of data files by invoking programs to do that processing. Swift handles the execution of such programs on remote sites by choosing sites, staging input and output files to and from the chosen sites, and executing the program code remotely.
Collective Data Management
Overview
The basic idea with Collective Data Management (CDM) is to reconfigure the way a Swift script accesses data by:
-
Moving user data using efficient, possibly site-specific techniques; and
-
Dynamically renaming user data files so that the Swift script works without modification.
The data movement technique is called the CDM "policy". Renaming and policy selection are handled in an optional file called the CDM file.
Key usage points
-
The user specifies a CDM policy in a file, conventionally called fs.data.
-
fs.data is given to Swift on the command line.
-
At job launch time, the Swift job launch procedures query the CDM policy,
-
altering the file staging phase, and
-
sending fs.data to the compute site.
-
At job run time, the Swift wrapper script
-
consults a Perl script to obtain the policy, and
-
uses wrapper extensions to modify data movement.
-
Similarly, stage out can be changed.
Example command line
swift -sites.file sites.xml -tc.file tc.data -cdm.file fs.data stream.swift
CDM policy file format
Example:
# Describe CDM for my job
property GATHER_LIMIT 1
rule .*input.txt DIRECT /gpfs/homes/wozniak/data
rule .*xfile*.data BROADCAST /dev/shm
rule .* DEFAULT
The lines contain:
-
A directive, either rule or property
-
A rule has:
-
A regular expression
-
A policy token
-
Additional policy-specific arguments
-
A property has
-
A policy property token
-
The token value
-
Comments begin with #.
-
Blank lines are ignored.
Notes:
-
The policy file is used as a lookup database by Swift and Perl methods.
-
For example, a lookup with the database above given the argument input.txt would result in the DIRECT policy.
-
If the lookup does not succeed, the result is DEFAULT.
-
Policies are listed as subclasses of org.globus.swift.data.policy.Policy.
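The lookup behavior described in these notes can be sketched in Python. This is a minimal illustration, not Swift's implementation: the names parse_cdm and lookup are hypothetical, and the sketch assumes that rules are tried in file order, that the first rule whose regular expression matches the whole file name wins, and that # starts a comment anywhere on a line.

```python
import re

# The example policy file from this section.
EXAMPLE_CDM = """\
# Describe CDM for my job
property GATHER_LIMIT 1
rule .*input.txt DIRECT /gpfs/homes/wozniak/data
rule .*xfile*.data BROADCAST /dev/shm
rule .* DEFAULT
"""


def parse_cdm(text):
    """Parse CDM policy text into (properties, rules)."""
    properties = {}
    rules = []
    for raw in text.splitlines():
        # Assumption: '#' starts a comment anywhere on the line.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank lines and pure comment lines are ignored
        tokens = line.split()
        directive = tokens[0]
        if directive == "property" and len(tokens) == 3:
            properties[tokens[1]] = tokens[2]
        elif directive == "rule" and len(tokens) >= 3:
            # (compiled regex, policy token, policy-specific arguments)
            rules.append((re.compile(tokens[1]), tokens[2], tokens[3:]))
        else:
            raise ValueError(f"unrecognized CDM line: {raw!r}")
    return properties, rules


def lookup(rules, filename):
    """Return (policy, arguments) for filename, or DEFAULT if nothing matches."""
    for pattern, policy, args in rules:
        # Assumption: first matching rule wins, matched against the whole name.
        if pattern.fullmatch(filename):
            return policy, args
    return "DEFAULT", []


properties, rules = parse_cdm(EXAMPLE_CDM)
print(lookup(rules, "input.txt"))
```

Keeping the rules in an ordered list rather than a dictionary preserves the order in which they appear in the file, which matters if a catch-all rule such as .* is placed last.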
Policy descriptions
DEFAULT
-
Uses file staging as provided by Swift/CoG. This is identical to the behavior when no CDM file is given.
BROADCAST
rule .*xfile*.data BROADCAST /dev/shm
-
The input files matching the pattern will be stored in the given directory, a local file system (LFS) location, with links in the job directory.
-
On the BG/P, this will make use of the f2cn tool.
-
On machines not implementing an efficient broadcast, we will just ensure correctness. For example, on a workstation, the local location could be in a /tmp RAM FS, and we would just execute a shell function to get the data there via dd.
DIRECT
rule .*input.txt DIRECT /gpfs/scratch/wozniak
-
Allows for direct I/O to the parallel FS without staging.
-
The input files matching the pattern must already exist in the given directory, a global file system (GFS) location. Links will be placed in the job directory.
-
The output files matching the pattern will be stored in the given directory, with links in the job directory.
-
Example: Given a rule whose directory is /gpfs/homes/wozniak/data (as in the policy file example), the Swift-generated file name ./data/input.txt would be accessed by the user job as /gpfs/homes/wozniak/data/input.txt.
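The name mapping and linking that DIRECT performs for inputs can be sketched as follows. This is an illustrative Python sketch rather than the wrapper's actual shell code: direct_target and link_direct_input are hypothetical names, and the sketch assumes that only the base name of the Swift-generated file name is appended to the rule's directory, which is consistent with the example in this section.

```python
import os


def direct_target(swift_name, gfs_dir):
    """Map a Swift-generated file name to its location in the rule's directory."""
    # Assumption: only the base name is appended to the rule's directory.
    return os.path.join(gfs_dir, os.path.basename(swift_name))


def link_direct_input(job_dir, swift_name, gfs_dir):
    """Place a symbolic link in the job directory instead of staging the file."""
    target = direct_target(swift_name, gfs_dir)
    if not os.path.exists(target):
        # DIRECT requires the input to already exist in the given directory.
        raise FileNotFoundError(target)
    link = os.path.normpath(os.path.join(job_dir, swift_name))
    os.makedirs(os.path.dirname(link), exist_ok=True)
    os.symlink(target, link)
    return link
```

The user job then opens the file under its Swift-generated name in the job directory, and the link redirects the I/O to the parallel file system.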
LOCAL
rule .*input.txt LOCAL dd /gpfs/homes/wozniak/data obs=64K
-
Allows for client-directed input copy to the compute node.
-
The user may specify cp or dd as the input transfer program.
-
The input files matching the pattern must already exist in the given directory, a GFS location. Copies will be placed in the job directory.
-
Argument list: [tool] [GFS directory] [tool arguments]*
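To make the argument list concrete, the sketch below turns the arguments of a LOCAL rule into a copy command. It is a hedged illustration only: local_copy_command is a hypothetical helper, and the way the tool arguments are combined with the source and destination (if= and of= for dd, trailing operands for cp) is an assumption, not taken from the Swift wrapper.

```python
import posixpath

ALLOWED_TOOLS = ("cp", "dd")


def local_copy_command(rule_args, swift_name, job_dir):
    """Build an argv list from [tool] [GFS directory] [tool arguments]*."""
    tool, gfs_dir, *tool_args = rule_args
    if tool not in ALLOWED_TOOLS:
        raise ValueError(f"LOCAL supports only cp or dd, not {tool!r}")
    # Assumption: the input lives under its base name in the GFS directory.
    source = posixpath.join(gfs_dir, posixpath.basename(swift_name))
    dest = posixpath.normpath(posixpath.join(job_dir, swift_name))
    if tool == "dd":
        # Assumption: dd receives if=/of= followed by the user's arguments.
        return ["dd", f"if={source}", f"of={dest}", *tool_args]
    return ["cp", *tool_args, source, dest]


print(local_copy_command(["dd", "/gpfs/homes/wozniak/data", "obs=64K"],
                         "input.txt", "jobs/t/example"))
```

Returning an argv list instead of a single string avoids shell quoting problems if the command is later passed to subprocess.run.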
GATHER
property GATHER_LIMIT 500000000 # 500 MB
property GATHER_DIR /dev/shm/gather
property GATHER_TARGET /gpfs/wozniak/data/gather_target
rule .*.output.txt GATHER
-
The output files matching the pattern will be visible to tasks in the job directory as usual, but are also recorded in the _swiftwrap shell array GATHER_OUTPUT.
-
The GATHER_OUTPUT files will be cached in the GATHER_DIR, an LFS location.
-
The cache will be flushed when a job ends if a du on GATHER_DIR exceeds GATHER_LIMIT.
-
As the cache fills or on stage out, the files will be bundled into randomly named tarballs in GATHER_TARGET, a GFS location.
-
If the compute node is an SMP, GATHER_DIR is a shared resource. It is protected by the link file GATHER_DIR/.cdm.lock.
-
Unpacking the tarballs in GATHER_TARGET will produce the user-specified filenames.
Internal mechanism notes
Summary:
-
Files are created by application
-
Acquire lock
-
Move files to cache
-
Check cache size
-
If limit exceeded, move all cache files to outbox
-
Release lock
-
If limit was exceeded, stream outbox as tarball to target
-
GATHER required quite a bit of shell functionality to manage the lock, etc. This is placed in cdm_lib.sh.
-
vdl_int.k needed an additional task submission (cdm_cleanup.sh) to perform the final flush at workflow completion time. This task also uses cdm_lib.sh.
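The sequence in the summary above can be sketched in Python. This is not the code in cdm_lib.sh: the function names, the outbox location inside the cache directory, and the tarball naming are hypothetical, the lock is taken by exclusive file creation as a stand-in for whatever the shell library does, and the sketch flattens outputs to their base names for brevity.

```python
import os
import shutil
import tarfile
import time
import uuid

LOCK_NAME = ".cdm.lock"


def acquire_lock(lock_path, poll_seconds=0.1):
    """Spin until the lock file can be created exclusively."""
    while True:
        try:
            os.close(os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY))
            return
        except FileExistsError:
            time.sleep(poll_seconds)


def cache_files(gather_dir):
    """List regular files in the cache, excluding the lock file."""
    return [
        name for name in sorted(os.listdir(gather_dir))
        if name != LOCK_NAME and os.path.isfile(os.path.join(gather_dir, name))
    ]


def gather(outputs, gather_dir, gather_target, limit):
    """Cache outputs; return the tarball path if a flush happened, else None."""
    os.makedirs(gather_dir, exist_ok=True)
    os.makedirs(gather_target, exist_ok=True)
    lock_path = os.path.join(gather_dir, LOCK_NAME)
    outbox = None

    acquire_lock(lock_path)
    try:
        # Move the files created by the application into the cache.
        for path in outputs:
            shutil.move(path, os.path.join(gather_dir, os.path.basename(path)))
        # Check the cache size against the limit.
        names = cache_files(gather_dir)
        size = sum(os.path.getsize(os.path.join(gather_dir, n)) for n in names)
        if size > limit:
            # Limit exceeded: move all cache files to a private outbox.
            outbox = os.path.join(gather_dir, "outbox-" + uuid.uuid4().hex)
            os.mkdir(outbox)
            for name in names:
                shutil.move(os.path.join(gather_dir, name),
                            os.path.join(outbox, name))
    finally:
        os.remove(lock_path)  # release the lock

    if outbox is None:
        return None
    # Outside the lock: bundle the outbox into a randomly named tarball.
    tarball = os.path.join(gather_target, "gather-" + uuid.uuid4().hex + ".tar")
    with tarfile.open(tarball, "w") as tar:
        for name in sorted(os.listdir(outbox)):
            tar.add(os.path.join(outbox, name), arcname=name)
    shutil.rmtree(outbox)
    return tarball
```

Moving files to the outbox while holding the lock, and writing the tarball only after releasing it, keeps the critical section short so that other jobs on the same node are not blocked by the slower transfer to the target.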
CDM Internals
VDL/Karajan processing
-
CDM functions are available in Karajan via the cdm namespace.
-
These functions are defined in org.globus.swift.data.Query.
-
If CDM is enabled, VDL skips file staging for files unless the policy is DEFAULT.
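The staging decision in the last point can be expressed as a small predicate. The sketch below is illustrative only; plan_staging is a hypothetical helper and does not correspond to a function in org.globus.swift.data.Query.

```python
def plan_staging(policies, cdm_enabled=True):
    """Split files into (staged, skipped) given a filename -> policy mapping."""
    staged, skipped = [], []
    for name, policy in policies.items():
        # Without CDM, or under the DEFAULT policy, normal file staging applies.
        if not cdm_enabled or policy == "DEFAULT":
            staged.append(name)
        else:
            skipped.append(name)
    return staged, skipped


print(plan_staging({"input.txt": "DIRECT", "notes.txt": "DEFAULT"}))
```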
Swift wrapper CDM routines
-
The cdm.pl script is shipped to the compute node if CDM is enabled.
-
When linking in inputs, CDM is consulted by _swiftwrap:cdm_lookup().
-
The cdm_action() shell function handles CDM methods.
Test cases
(See About.txt for more information.)
-
Simple test cases are in:
-
Do a
mkdir cdm
cd cdm
svn co https://svn.mcs.anl.gov/repos/wozniak/collab/cdm/scripts
-
In cdm-direct/, run
source ./setup.sh local local local
-
Run workflow:
swift -sites.file sites.xml -tc.file tc.data -cdm.file fs.data stream.swift
-
Note in the log that staging is skipped for input.txt:
policy: file://localhost/input.txt : DIRECT
FILE_STAGE_IN_START file=input.txt ...
FILE_STAGE_IN_SKIP file=input.txt policy=DIRECT
FILE_STAGE_IN_END file=input.txt ...
-
In the wrapper output, the input file is handled by CDM functionality:
Progress 2010-01-21 13:50:32.466572727-0600 LINK_INPUTS
CDM_POLICY: DIRECT /homes/wozniak/cdm/scripts/cdm-direct
CDM: jobs/t/cp_sh-tkul4nmj input.txt DIRECT /homes/wozniak/cdm/scripts/cdm-direct
CDM[DIRECT]: Linking jobs/t/cp_sh-tkul4nmj/input.txt to /homes/wozniak/cdm/scripts/cdm-direct/input.txt
Progress 2010-01-21 13:50:32.486016708-0600 EXECUTE
-
all-pairs is quite similar but uses more policies.
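When checking a longer Swift log for skipped stage-ins, the FILE_STAGE_IN_SKIP entries quoted in the test steps above can be extracted with a short script. This is a sketch under the assumption that the file= and policy= fields appear adjacent to each other, as in that excerpt; skipped_stage_ins is a hypothetical helper name.

```python
import re

# Matches entries such as: FILE_STAGE_IN_SKIP file=input.txt policy=DIRECT
SKIP_RE = re.compile(r"FILE_STAGE_IN_SKIP file=(\S+) policy=(\S+)")


def skipped_stage_ins(log_text):
    """Return a list of (file, policy) pairs for every skipped stage-in."""
    return SKIP_RE.findall(log_text)


sample = (
    "policy: file://localhost/input.txt : DIRECT\n"
    "FILE_STAGE_IN_START file=input.txt ...\n"
    "FILE_STAGE_IN_SKIP file=input.txt policy=DIRECT\n"
    "FILE_STAGE_IN_END file=input.txt ...\n"
)
print(skipped_stage_ins(sample))
```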
PTMap case
-
Start with vanilla PTMap:
cd cdm
mkdir apps
cd apps
svn co https://svn.mcs.anl.gov/repos/wozniak/collab/cdm/apps/ptmap
-
Source setup.sh.
-
Run start.sh.
Appendix A: Index of Swift input and output files
This appendix documents all files read and written by Swift and its underlying components.
Inputs
-
The Swift script
-
sites.file
-
tc.file
-
log4j.properties
-
swift.properties
-
Provider properties files
Outputs
Standard output
-
This contains the WARN-level messages from Swift's log4j-based logger.
-
Contains output from HangChecker, …
Swift logfile
-
(*.log)
-
This contains the remaining Swift logger messages.
swift.log
Wrapper logs
-
*-info - in the work directory, also called the sitedir.
-
This contains output from _swiftwrap. CDM problems can often be diagnosed here.
Coasters worker logs
-
*.log