
HCP Public Pages

Resources for the Human Connectome Projects

The goal of this tutorial is for the reader to gain experience with running the HCP pipelines in the “Amazon Cloud”.

Table of Contents

Terms and Acronyms

AWS - Amazon Web Services
EC2 – Elastic Compute Cloud
S3 – Simple Storage Service
S3 Bucket
AMI – Amazon Machine Image – The Software
Amazon EC2 Instance Types – The available hardware
Amazon EBS – Elastic Block Storage    
NITRC

Step 1: Getting Credentials to access HCP S3 Data

Step 2: Getting Started with AWS

Step 2a: Login to AWS
Step 2b: Create an Instance
Step 2c: Configure Your Machine Instance
Step 2d: Connect to Your Running Machine Instance
Step 2e: Make a Terminal Connection using SSH 

Step 3: Take Note of the Pre-installed Software

Step 3a: Note FSL Installation
Step 3b: Note FreeSurfer Installation
Step 3c: Note Connectome Workbench Installation
Step 3d: Note the HCP Pipelines Installation
Step 3e: Note All Available Pre-installed Software

Step 4: Take Note of Available HCP data

Step 5: Create directory structure on which HCP Pipelines can be run

Step 6: Editing files to run a pipeline stage

Step 7: Starting up a set of PreFreeSurfer Pipeline jobs

Step 8: Shutdown and Restart of an instance

Step 8a: Shutdown of a running machine instance
Step 8b: Restart of a machine instance
Important Notes about Stopping and Restarting machine instances:

Step 9: Installing StarCluster

Step 10: Create an AWS Access Key Pair

Step 11: Setup a cluster for running HCP Pipelines

Step 11a: Supply StarCluster with your AWS credentials
Step 11b: Creating an Amazon EC2 key pair
Step 11c: Start an example cluster
Step 11d: Navigate your example cluster
    Example StarCluster Commands
Step 11e: Terminate your small cluster
Step 11f: Create an instance to use as a model for your pipeline cluster nodes
Step 11g: Further prepare your new instance for StarCluster use
    Turn off the software firewall on your PipelineNodeTemplate instance
    Delete gridengine software from your PipelineNodeTemplate instance
    Delete the sgeadmin account and group.
    Remove the SGE_ROOT setting in the /etc/profile file
Step 11h: Install SGE files
    Create a compressed tar file containing what StarCluster needs
    Copy the compressed tar file you just made to your local machine
    Copy the compressed tar file from your local machine to your PipelineNodeTemplate instance
    Unpack the compressed tar file and copy its contents to where StarCluster expects it
    Terminate the instance you just created based on the StarCluster AMI
Step 11i: Create an EBS volume to hold data to be shared across your cluster
Step 11j: Create an AMI for cluster nodes
Step 11k: Configure and Start a Pipeline Cluster 

Step 12: Getting the HCP OpenAccess data available to your cluster

Step 12a: Setting up s3cmd on your master node
Step 12b: Retrieving data to process from the HCP OpenAccess S3 Bucket

Step 13: Editing files to run a pipeline stage

Step 14: Starting up a set of PreFreeSurfer Pipeline jobs

Step 15: Using the StarCluster load balancer

Step 16: Using spot instances as worker nodes

Links and references

Terms and Acronyms

The goal of this tutorial is for the reader to gain experience with running the HCP pipelines in the “Amazon Cloud”. In order for this to make sense, it is important that you start out with a basic understanding of the following terms.

AWS - Amazon Web Services

A collection of remote computing services that make up a cloud computing platform. Two of the central services are Amazon EC2 (the service that provides compute power, i.e. “machines” that are remotely available) and Amazon S3 (the service that provides storage space for your data).

EC2 – Elastic Compute Cloud

Amazon service that allows users to rent virtual machines (VMs) on which to run their applications. Users can create, launch, and terminate VMs as needed, paying an hourly fee only for VMs that are currently active (this is the “elastic” nature of the service).

S3 – Simple Storage Service

Amazon online data storage service. Not a traditional file system. Stores large “objects” instead of files. These objects are accessible virtually anywhere on the web. Multiple running EC2 instances can access an S3 object simultaneously. Intended for large, shared pools of data. Conceptually similar to a shared, web-accessible drive.

S3 Bucket

Data in S3 is stored in buckets. For our purposes, a bucket is simply a named container for the files that we store and share via Amazon S3. HCP’s data is made available publicly in a bucket named hcp-openaccess.
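For example, once you have obtained credentials (Step 1) and configured a command line tool such as s3cmd (this is done on the cluster’s master node in Step 12a), the contents of the bucket can be listed from a terminal. A small illustration, assuming s3cmd has already been configured:

$ s3cmd ls s3://hcp-openaccess/HCP/
                       DIR s3://hcp-openaccess/HCP/100307/
                       DIR s3://hcp-openaccess/HCP/100408/
. . .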

AMI – Amazon Machine Image – The Software

A read-only image of a file system that includes an Operating System (OS) and additional installed software. Conceptually, this is comparable to a CD/DVD that contains an OS and other software that is installed on a “machine” for you. The creator of the AMI chooses which OS to include and then installs and configures other software. For example, an AMI creator might choose to start with CentOS Linux or Ubuntu Linux and then pre-install a set of tools that are useful for a particular purpose.

An AMI might be created for Photo Editing which would contain a pre-installed suite of software that the AMI creator deems is useful for Photo Editing.

An AMI might be created for Neuroimaging with a chosen OS (e.g. Ubuntu 12.04.1 LTS) and a pre-installed suite of software for Neuroimaging (e.g. FSL, AFNI, FreeSurfer, the HCP Pipelines, Workbench, etc.)

The AMI is the software distribution that will be installed and run on your virtual machine instance (see below.)

Amazon EC2 Instance Types – The available hardware

An EC2 Instance Type is a particular combination of CPU, memory (RAM), storage, and networking capacity optimized for a particular purpose. There are instance types defined for different kinds of use, for example general purpose, compute optimized, memory optimized, storage optimized, and GPU instances.

An Instance Type is a virtual hardware configuration.

Amazon EBS – Elastic Block Storage

Online data storage service that provides a more traditional file system. An EBS volume is attached to a running EC2 instance. From the EC2 instance’s point of view, an EBS volume is a “local drive”.

EBS volumes can be configured such that the data continues to exist after the EC2 instance is shut down. By default, however, they are configured such that the volume is deleted upon instance shut down.
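From inside a running Linux instance, an attached EBS volume simply appears as another block device. Two standard Linux commands (not specific to the HCP_NITRC image) show what is attached and mounted:

$ lsblk    # list block devices and where they are mounted
$ df -h    # show mounted file systems and their free space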

NITRC

Neuroimaging Informatics Tools and Resources Clearinghouse

 


Return to Table of Contents


Step 1: Getting Credentials to access HCP S3 Data

 

 

 

 


Return to Table of Contents


Step 2: Getting Started with AWS

Step 2a: Login to AWS

 


Return to Table of Contents


Step 2b: Create an Instance

 

 

 

 

Important: 

  1. Different browsers may display different responses and messages when they cannot make a valid secure connection.
  2. Choose whichever option allows you to continue to connect to your instance (e.g. Advanced, Proceed, Continue, Connect, Accept, etc.)

 

 

 


Return to Table of Contents


Step 2c: Configure Your Machine Instance

Important:

  1. The following is the FreeSurfer license that we are using for this course. It is only intended for your use during this course. If you want to continue using FreeSurfer after the course, please get your own FreeSurfer license and install it on any machine instances you use.
  2. In the FreeSurfer license information below, please carefully note that there are single space characters before lines 3 and 4.

 

Since the course is over, this content has been redacted. Contact FreeSurfer to obtain a license for your use.


Return to Table of Contents


Step 2d: Connect to Your Running Machine Instance


Return to Table of Contents


Step 2e: Make a Terminal Connection using SSH

Important:

  1. If your private key file was downloaded to somewhere different than your ~/Downloads directory, you will need to substitute the location of your private key file for ~/Downloads in the below commands.
  2. If your private key file was named something other than MyHcpKeyPair.pem, you will need to substitute the name of your private key file for MyHcpKeyPair.pem in the below commands.
$ cd ~/Downloads
$ chmod 400 MyHcpKeyPair.pem

 

Important:

  1. You will need to substitute your machine instance’s public DNS for ec2-52-4-26-132.compute-1.amazonaws.com in the below commands.

 

$ ssh -X hcpuser@ec2-52-4-26-132.compute-1.amazonaws.com
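If you will be connecting often, an entry in the SSH client configuration file on your local machine can save retyping the user name and DNS each time. A minimal sketch (the alias myhcp is just an example; note, as discussed in Step 8, that the Public DNS changes every time the instance is stopped and restarted, so the HostName line must be kept up to date):

$ cat >> ~/.ssh/config <<'EOF'
Host myhcp
    HostName ec2-52-4-26-132.compute-1.amazonaws.com
    User hcpuser
    ForwardX11 yes
EOF
$ ssh myhcp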


Return to Table of Contents


Step 3: Take Note of the Pre-installed Software

Step 3a: Note FSL Installation

 

$ which fslview
/usr/share/fsl/5.0/bin/fslview
$ fslmerge

Usage: fslmerge <-x/y/z/t/a/tr> <output> <file1 file2 .......> 
 -t : concatenate images in time
 -x : concatenate images in the x direction
 -y : concatenate images in the y direction
 ...
 
$ flirt -version
FLIRT version 6.0

$ fsl

 


Return to Table of Contents


Step 3b: Note FreeSurfer Installation

$ which freesurfer
/usr/local/freesurfer/bin/freesurfer
$ freesurfer

FreeSurfer is a set of tools for analysis and visualization
of structural and functional brain imaging data. FreeSurfer
also refers to the structural imaging stream within the 
FreeSurfer suite.

Users should consult ...

 


Return to Table of Contents


Step 3c: Note Connectome Workbench Installation

$ wb_command -version
Connectome Workbench
Version: 1.0
Qt Compiled Version: 4.8.1
Qt Runtime Version: 4.8.1
commit: unknown (NeuroDebian build from source)
commit date: unknown
Compiler: c++ (/usr/bin)
Compiler Version:
Compiled Debug: NO
Operating System: Linux

$ wb_view


Return to Table of Contents


Step 3d: Note the HCP Pipelines Installation

 

$ cd ~/tools/Pipelines
$ ls
DiffusionPreprocessing  fMRISurface  FreeSurfer  LICENSE.md ..
...
$ more version.txt
V3.6.0-RCd

Return to Table of Contents


Step 3e: Note All Available Pre-installed Software


Return to Table of Contents


Step 4: Take Note of Available HCP data

$ cd /s3/hcp
$ ls

Return to Table of Contents


Step 5: Create directory structure on which HCP Pipelines can be run

 

$ cd ~/tools
$ wget https://github.com/Washington-University/access_hcp_data/archive/v3.0.0.tar.gz
...(output from wget)...
$ tar xvf v3.0.0.tar.gz
...(output from tar)...
$ ln -s access_hcp_data-3.0.0 access_hcp_data
$ cd

Important:

  1. The last command in the code block below should all be entered on one line (or wrapped only by the width of the terminal). Do not press Enter until you’ve typed the entire command.
$ cd 
$ ./tools/access_hcp_data/link_hcp_data --source=/s3/hcp --dest=${HOME}/data --subjlist=${HOME}/tools/access_hcp_data/example_subject_list.txt --stage=unproc
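After link_hcp_data finishes, it is worth a quick check that the expected directory structure was created. For example, for subject 100307 (one of the subjects in the example subject list) you should see an unprocessed directory populated with symbolic links into /s3/hcp:

$ ls ${HOME}/data
$ ls ${HOME}/data/100307/unprocessed/3T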

Return to Table of Contents


Step 6: Editing files to run a pipeline stage

This step should be familiar to you as it is very similar to the modifications you made to the PreFreeSurferPipelineBatch.sh script and the SetUpHCPPipeline.sh script in a previous practical. The point is to make similar modifications to adapt these scripts to the configuration of your running EC2 instance.

Important:

  1. Be sure to take note of and include the .mine part in the file names for the files that are being created by the cp commands below.
$ cd ~/tools/Pipelines/Examples/Scripts
$ cp PreFreeSurferPipelineBatch.sh PreFreeSurferPipelineBatch.mine.sh
$ cp SetUpHCPPipeline.sh SetUpHCPPipeline.mine.sh

 

Important:

  1. Make sure that the setting of the EnvironmentScript variable includes .mine in the name of the setup file.
StudyFolder="${HOME}/data"
Subjlist="100307 111413"
EnvironmentScript="${HOME}/tools/Pipelines/Examples/Scripts/SetUpHCPPipeline.mine.sh"

 

Important:

  1. Commented-out commands (beginning with #) that are very similar to the following are already in the setup script file. If you want to modify those commands instead of entering new ones yourself, you will need to add the keyword export in the appropriate places, remove the comment marker (#) from the beginning of the appropriate lines, and carefully check that the values set for the FSLDIR, FREESURFER_HOME, HCPPIPEDIR, and CARET7DIR variables are as shown below.
  2. Be careful to note and check your placement of double quote characters (“) so they match what is shown below.
  3. Note that FREESURFER_HOME should not have bin in its set value. Before you edit this line, it may have bin in the variable setting. Be sure to remove this.
  4. Note that HCPPIPEDIR should not have projects in its set value. Before you edit this line, it may have projects in the variable setting. Be sure to replace this with tools.
  5. The setting for CARET7DIR is completely different from the value in the file before you edit it.
# Set up FSL (if not already done so in the running environment)
export FSLDIR="/usr/share/fsl/5.0"
. ${FSLDIR}/etc/fslconf/fsl.sh

# Set up FreeSurfer (if not already done so in the running environment)
export FREESURFER_HOME="/usr/local/freesurfer"
. ${FREESURFER_HOME}/SetUpFreeSurfer.sh > /dev/null 2>&1

# Set up specific environment variables for the HCP Pipeline
export HCPPIPEDIR="${HOME}/tools/Pipelines"
export CARET7DIR="/usr/bin"
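Before moving on, a quick, optional sanity check is to source your edited setup script in the current shell and confirm that the variables point where you expect (the paths here match the values shown above):

$ source ~/tools/Pipelines/Examples/Scripts/SetUpHCPPipeline.mine.sh
$ echo ${HCPPIPEDIR}
/home/hcpuser/tools/Pipelines
$ ${CARET7DIR}/wb_command -version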

Return to Table of Contents


Step 7: Starting up a set of PreFreeSurfer Pipeline jobs

Again, this step should be familiar as it is essentially the same as the test run you did of the PreFreeSurferPipelineBatch.mine.sh script in a previous practical.

Important:

  1. There are 2 (two) hyphens in front of runlocal.
$ cd ~/tools/Pipelines/Examples/Scripts
$ ./PreFreeSurferPipelineBatch.mine.sh --runlocal
This script must be SOURCED to correctly setup the environment prior to running any of the other HCP scripts contained here
--runlocal
100307
START: ACPCAlignment
Final FOV is:
0.000000 ...

 

$ cd ~/data/100307
$ ls
MNINonLinear  release-notes  T1w  T2w  unprocessed
$ cd T1w
$ ls -l
total 62808
drwxrwxr-x 2 hcpuser hcpuser 4096     May 8 17:08 ACPCAlignment
-r-xr-xr-x 1 hcpuser hcpuser 32150559 May 8 17:08 T1w1_gdc.nii.gz
-r-xr-xr-x 1 hcpuser hcpuser 32150559 May 8 17:08 T1w.nii.gz
drwxrwxr-x 2 hcpuser hcpuser 4096     May 8 17:08 xfms
 
$ cd ~/data/100307/unprocessed/3T/T1w_MPR1
$ ls -l
total 20
lrwxrwxrwx 1 hcpuser hcpuser 59 May 8 16:51 100307_3T_AFI.nii.gz -> /s3/hcp/100307/unprocessed/3T/T1w_MPR1/100307_3T_AFI.nii.gz
lrwxrwxrwx 1 hcpuser hcpuser 65 May 8 16:51 100307_3T_BIAS_32CH.nii.gz -> /s3/hcp/100307/unprocessed/3T/T1w_MPR1/100307_3T_BIAS_32CH.nii.gz
lrwxrwxrwx 1 hcpuser hcpuser 63 May 8 16:51 100307_3T_BIAS_BC.nii.gz -> /s3/hcp/100307/unprocessed/3T/T1w_MPR1/100307_3T_BIAS_BC.nii.gz
lrwxrwxrwx 1 hcpuser hcpuser 74 May 8 16:51 100307_3T_FieldMap_Magnitude.nii.gz -> /s3/hcp/100307/unprocessed/3T/T1w_MPR1/100307_3T_FieldMap_Magnitude.nii.gz
lrwxrwxrwx 1 hcpuser hcpuser 70 May 8 16:51 100307_3T_FieldMap_Phase.nii.gz -> /s3/hcp/100307/unprocessed/3T/T1w_MPR1/100307_3T_FieldMap_Phase.nii.gz
lrwxrwxrwx 1 hcpuser hcpuser 64 May 8 16:51 100307_3T_T1w_MPR1.nii.gz -> /s3/hcp/100307/unprocessed/3T/T1w_MPR1/100307_3T_T1w_MPR1.nii.gz

Return to Table of Contents


Step 8: Shutdown and Restart of an instance

Step 8a: Shutdown of a running machine instance

Important:

  1. The Terminate option is equivalent to deleting the machine instance for good. Only use this option if you really want the machine instance to be deleted, not just stopped.
  2. The data on the “local” EBS drive connected to an instance generated from the HCP_NITRC AMI is not “ephemeral storage”. It will persist while the machine instance is stopped. It will not persist if the machine is terminated.

 


Return to Table of Contents


Step 8b: Restart of a machine instance


Return to Table of Contents


Important Notes about Stopping and Restarting machine instances:

It is important to stop your machine instance when it is not in use. Amazon charges you for the instance while it is active/running (whether you are actually using it or not). You are not charged for the instance during the time that it is stopped, although you are still charged a monthly fee for the provisioned EBS storage.

When you do restart an instance that has been stopped, you’ll find that it has a new Public IP address and a new Public DNS entry. You will have to modify the commands you use to connect to the running instance to take into account this new DNS entry.

Similarly, each time you shut down your running instance, the VNC Server session will be shut down. After you start up the instance again, if you want to run another VNC Server session, you will need to visit your HCP_NITRC control panel at http://…, login, and then press the Start Session button to get a VNC Server session restarted.

After restarting your instance, you may find that you no longer have access to HCP S3 bucket at the /s3/hcp mount point. If you try to use a command like:

$ cd /s3/hcp

and receive an error message similar to:

-bash: cd: /s3/hcp: Transport endpoint is not connected

then you will need to remount the S3 bucket. This can be done using the following steps:
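A rough sketch of a remount, assuming the HCP_NITRC image uses the FUSE-based s3fs tool for the /s3/hcp mount (an assumption not confirmed here) and assuming a hypothetical credentials file at ~/.passwd-s3fs containing the Step 1 keys in ACCESSKEY:SECRETKEY form:

$ sudo fusermount -u /s3/hcp                # detach the stale FUSE mount
$ sudo s3fs hcp-openaccess:/HCP /s3/hcp -o passwd_file=${HOME}/.passwd-s3fs -o ro -o allow_other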

 


Return to Table of Contents


For the Exploring the Human Connectome Course (Summer 2015), the following steps are optional.

Step 9: Installing StarCluster

At this point you have created an example Amazon EC2 instance that you can use to run the HCP Pipelines. However, it does not have adequate disk space to store the output from many pipeline runs, so it is just an example. To actually run the HCP Pipelines at scale, you would need to create an instance with significantly more disk space (EBS volume space).

If we were to allow the PreFreeSurfer pipeline processing that we started a couple steps back to continue, it would run the PreFreeSurfer processing to completion for subject 100307 before moving on to running the PreFreeSurfer processing for the next subject in our list, 111413. To do this for very many subjects would be very time consuming as the processing would be happening serially (one subject, then the next, then the next, etc.) on this single machine instance.

To make this processing less time consuming and more cost efficient, we can, instead of just running the pipelines on this one Amazon EC2 instance, distribute the jobs across a cluster of EC2 instances.

StarCluster (http://star.mit.edu/cluster) is available from the STAR (Software Tools for Academics and Researchers) program at MIT. StarCluster is a cluster-computing toolkit specifically designed for Amazon’s EC2. Installation documentation for StarCluster can be found at http://star.mit.edu/cluster/docs/latest/installation.html.

StarCluster is written in the Python programming language, and your HCP_NITRC instance already has a Python installation tool (easy_install) that allows for easy installation of Python packages. Therefore, installing StarCluster is as simple as entering the following command in a terminal connected to your running HCP_NITRC instance (followed by the password for your hcpuser account when prompted).

Important:

  1. If you have stopped and restarted your MyHCP_NITRC instance, you will need to reconnect to that instance either via a Guacamole-based GUI in a web browser interface or via SSH in a terminal.
  2. Keep in mind that, upon restarting, your MyHCP_NITRC instance will have a different Public DNS.
  3. The following commands are to be entered within a terminal connected to your MyHCP_NITRC instance (in the Guacamole-based GUI or connected via SSH).
$ sudo easy_install StarCluster
(enter your password e.g. hcppassword when prompted)

The installation process will display a number of messages about installing prerequisite software and should end by returning you to the $ prompt. Note that whenever the $ prompt is used in the remainder of this document, your actual prompt will likely not be just a $. For example, it may include a user name (e.g. hcpuser), a node name (e.g. nitrcce), and your current working directory before the $. So your actual $ prompt might look like: hcpuser@nitrcce:~$

You can verify that the installation was successful by asking for the StarCluster version number with a command like:

$ starcluster --version
StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
 
0.95.6
$

 


Return to Table of Contents


Step 10: Create an AWS Access Key Pair

In order to configure and use StarCluster, you will need an AWS Access Key ID and AWS Secret Access Key for your AWS account. This is a different AWS access key pair from the one you created for accessing the HCP S3 data. That previous access key pair is associated with your HCP ConnectomeDB account. The pair that you create as part of this step is for access to your Amazon AWS account, which the StarCluster software will need.

To create the necessary AWS key pair, do the following:

 


Return to Table of Contents


 

Step 11: Setup a cluster for running HCP Pipelines

Step 11a: Supply StarCluster with your AWS credentials

Next, you will need to begin the process of creating and editing a StarCluster configuration file.

$ starcluster help
StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu
 
!!! ERROR - config file /home/hcpuser/.starcluster/config does not exist

Options:
--------
[1] Show the StarCluster config template
[2] Write config template to /home/hcpuser/.starcluster/config
[q] Quit
 
Please enter your selection:
Please enter your selection: 2
 
>>> Config template written to /home/hcpuser/.starcluster/config
>>> Please customize the config template

 

$ cd ~/.starcluster
$ gedit config

 

[aws info]
# This is the AWS credentials section (required).
# These settings apply to all clusters
# replace these with your AWS keys
AWS_ACCESS_KEY_ID = #your_aws_access_key_id
AWS_SECRET_ACCESS_KEY = #your_secret_access_key
# replace this with your account number
AWS_USER_ID= #your userid

 


Return to Table of Contents


Step 11b: Creating an Amazon EC2 key pair

StarCluster will be creating and configuring a number of machine instances for you. To do this, in addition to needing access to your account, StarCluster will also need an EC2 key pair to use to connect to and configure EC2 instances on your behalf. Therefore, you must create at least one EC2 key pair to supply to StarCluster via its configuration file.

You can have multiple EC2 key pairs. Each cluster that you create will be associated with one of your key pairs. For now, we will just create a single key pair.

StarCluster itself has a convenient mechanism built in (once it has your AWS account credentials) for creating an EC2 key pair.

$ cd
$ mkdir .ssh
$ starcluster createkey mykey -o ~/.ssh/mykey.rsa
StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu
 
>>> Successfully created keypair: mykey
>>> fingerprint: e9:9a:a8:f6:7f:63:cb:87:40:2e:14:6d:1a:3e:14:e4:9f:9b:f4:43
>>> keypair written to /home/hcpuser/.ssh/mykey.rsa
$
[key mykey]
key_location = ~/.ssh/mykey.rsa
...
[cluster smallcluster]
keyname = mykey

 


Return to Table of Contents


Step 11c: Start an example cluster

Important:

  1. The output supplied while creating the cluster is somewhat long and is not all included below. To confirm the success of this operation, look for text that reads “The cluster is now ready to use” in the output.
$ starcluster start mysmallcluster
StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

>>> Using default cluster template: smallcluster
>>> Validating cluster template settings...
>>> Cluster template settings are valid
>>> Starting cluster...
>>> Launching a 2-node cluster...
>>> Creating security group @sc-mysmallcluster...
>>> Waiting for security group @sc-mysmallcluster...
Reservation:r-77fcd49b
>>> Waiting for instances to propagate...
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for cluster to come up... (updating every 30s)
>>> Waiting for all nodes to be in a 'running' state...
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for SSH to come up on all nodes...
.
.
.
>>> Configuring cluster took 1.801 mins
>>> Starting cluster took 3.970 mins

The cluster is now ready to use. To login to the master node
as root, run:
.
.
You can activate a 'stopped' cluster by passing the -x
option to the 'start' command:

    $ starcluster start -x mysmallcluster

This will start all 'stopped' nodes and reconfigure the
cluster.
$

 


Return to Table of Contents


Step 11d: Navigate your example cluster

Important:

  1. Instances which are StarCluster cluster nodes, such as the instances named master and node001, should be left under the control of StarCluster. You should not start, stop, terminate, or reboot such nodes using the Actions button at the top of your instance table. Doing so can potentially make your cluster unusable.
  2. For example, if you want to stop the nodes master and node001 in the cluster you’ve just created (named mysmallcluster), you should do so by logging on to your MyHCP_NITRC instance and issuing the appropriate StarCluster command as shown in the example StarCluster Commands below (e.g. $ starcluster stop mysmallcluster).

Example StarCluster Commands

Now is a good time to become familiar with some basic StarCluster commands. You will issue such StarCluster commands on a terminal connected to your HCP_NITRC instance. In the below examples, text enclosed in angle brackets, < >, should be replaced by names that you provide.

Important:

  1. There is no need to enter the example commands below now. These examples are provided here just to familiarize you with some of the available StarCluster commands and concepts. After the example commands, we will return to steps that you should carry out.

    • To see what clusters you have currently in existence:
starcluster listclusters

 

    • To start a new cluster based on a named cluster template:
starcluster start -c <template-name> <new-cluster-name>

 

    • To restart or stop a running cluster:
starcluster restart <running-cluster-name>
starcluster stop <running-cluster-name>

 

    • To terminate a cluster (delete its instances entirely):
starcluster terminate <cluster-name>

NOTE: Stopping a cluster is analogous to turning off the machines. Terminating a cluster is analogous to throwing away the machines. When you terminate, the instances go away and cannot be restarted; they are gone.

 

    • To start up a previously stopped cluster without creating new instances:
starcluster start -x <cluster-name>

 

    • To login to the master node of a cluster:
starcluster sshmaster <cluster-name>

 

    • To login to a particular node of a cluster:
starcluster sshnode <cluster-name> <node-name>

 

Important:

  1. The following steps are those you should start carrying out again.
    • From your MyHCP_NITRC instance, use the starcluster sshmaster command to login to the master node of your cluster named mysmallcluster and place a file in the /home directory. Then use starcluster sshnode to login to node001 and confirm that the same file is visible there (the /home directory is NFS-shared across the cluster nodes).
$ starcluster sshmaster mysmallcluster
# cd /home
# ls
sgeadmin  ubuntu
# echo "hello there" > hello.txt
# ls
hello.txt  sgeadmin  ubuntu
# more hello.txt
hello there
# exit
$ starcluster sshnode mysmallcluster node001
# cd /home
# ls
hello.txt  sgeadmin  ubuntu
# cat hello.txt
hello there
# exit

 


Return to Table of Contents


Step 11e: Terminate your small cluster

$ starcluster terminate mysmallcluster
StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu
 
Terminate EBS cluster mysmallcluster (y/n)? y
>>> Running plugin starcluster.plugins.sge.SGEPlugin
>>> Running plugin starcluster.clustersetup.DefaultClusterSetup
>>> Terminating node: master (i-5bd7a38d)
>>> Terminating node: node001 (i-5ad7a38c)
>>> Waiting for cluster to terminate... 
>>> Removing security group: @sc-mycluster 
$

 


Return to Table of Contents


Step 11f: Create an instance to use as a model for your pipeline cluster nodes

We now need to create an Amazon EC2 instance that will be used as a “template” for creating the nodes in a cluster that can run HCP pipelines. We’ll start by creating another instance that is based on the HCP_NITRC AMI.

Note that the process of creating an instance that can be used as a model for pipeline cluster nodes is currently somewhat complicated. This is due to expectations that StarCluster has for AMIs that it can use as StarCluster nodes. There are ongoing efforts between HCP and NITRC to simplify this process.

$ cd <directory-containing-your-PipelineNodeTemplate.pem-file>
$ mkdir -p ~/.ssh
$ cp PipelineNodeTemplate.pem ~/.ssh
$ chmod 400 ~/.ssh/PipelineNodeTemplate.pem
$ sftp hcpuser@<your-HCP_NITRC-instance-public-dns>.compute-1.amazonaws.com
Enter your password (e.g. hcppassword) when prompted
sftp> cd .ssh
sftp> pwd
Remote working directory: /home/hcpuser/.ssh
sftp> put PipelineNodeTemplate.pem
Uploading PipelineNodeTemplate.pem to /home/user/.ssh/PipelineNodeTemplate.pem
PipelineNodeTemplate.pem                                                        100% 1692 1.7KB/s 00.00
sftp> exit
$
# You can of course have multiple key sections
# [key myotherkey]
# KEY_LOCATION=~/.ssh/myotherkey.rsa

[key PipelineNodeTemplate]
KEY_LOCATION=~/.ssh/PipelineNodeTemplate.pem

Return to Table of Contents


Step 11g: Further prepare your new instance for StarCluster use

(The FreeSurfer license information used during the course appeared here; it has been redacted. As noted in Step 2c, obtain your own FreeSurfer license and install it on any machine instances you use, including this PipelineNodeTemplate instance.)

Turn off the software firewall on your PipelineNodeTemplate instance

$ sudo ufw disable
Enter your password (e.g. hcppassword) when prompted
Firewall stopped and disabled on system startup
$

Delete gridengine software from your PipelineNodeTemplate instance

$ sudo apt-get remove gridengine-client gridengine-common gridengine-master
Enter your password (e.g. hcppassword) if/when prompted
Enter Y when asked if you want to continue

Delete the sgeadmin account and group.

Important:

  1. Double-check that the sudo rm -rf command you enter matches exactly what is written below before you press Enter.
$ sudo userdel sgeadmin
Enter your password (e.g. hcppassword) when/if prompted
$ sudo rm -rf /var/lib/gridengine
$ sudo delgroup sgeadmin

Remove the SGE_ROOT setting in the /etc/profile file

$ sudo sed -i 's/export SGE_ROOT/#export SGE_ROOT/g' /etc/profile
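To confirm the change, you can check that the SGE_ROOT line in /etc/profile now begins with a comment marker:

$ grep SGE_ROOT /etc/profile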

Return to Table of Contents


Step 11h: Install SGE files

We need to create another running instance. This instance needs to be based on an officially released StarCluster AMI. We’ll need to copy some files from that running instance to our PipelineNodeTemplate instance.

Important:

  1. In the following commands, <new-instance-DNS> = the Public DNS for the instance you just created (the t1.micro instance based on the StarCluster AMI)
  2. In the following commands, <PipelineNodeTemplate-DNS> = the Public DNS for your PipelineNodeTemplate instance

Create a compressed tar file containing what StarCluster needs

$ ssh -i ~/.ssh/PipelineNodeTemplate.pem root@<new-instance-DNS>
# cd /
# tar cavf opt_starcluster.tar.gz ./opt
...(see a log of files placed in the opt_starcluster.tar.gz file)...
# exit

Copy the compressed tar file you just made to your local machine

$ scp -i ~/.ssh/PipelineNodeTemplate.pem root@<new-instance-DNS>:/opt_starcluster.tar.gz ./
...(see scp output showing percent copied and ETA until done, etc.)...

Copy the compressed tar file from your local machine to your PipelineNodeTemplate instance

Important:

  1. Note that the root@<PipelineNodeTemplate-DNS> near the end of the following command specifies that the file is copied to your PipelineNodeTemplate instance. This is not the DNS for the new instance (the t1.micro) that you used in the previous command.
  2. Don’t forget the :. at the end of the command.
$ scp -i ~/.ssh/PipelineNodeTemplate.pem opt_starcluster.tar.gz root@<PipelineNodeTemplate-DNS>:.
...(see scp output showing percent copied and ETA until done, etc.)...

Unpack the compressed tar file and copy its contents to where StarCluster expects it

$ ssh -i ~/.ssh/PipelineNodeTemplate.pem root@<PipelineNodeTemplate-DNS>
# tar xvf opt_starcluster.tar.gz
...(see tar output showing files being unpacked from opt_starcluster.tar.gz)...
# mv opt/sge6-fresh /opt
# exit
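Before terminating the temporary instance, a quick, optional check that the SGE files are now in place on your PipelineNodeTemplate instance:

$ ssh -i ~/.ssh/PipelineNodeTemplate.pem root@<PipelineNodeTemplate-DNS> ls -d /opt/sge6-fresh
/opt/sge6-fresh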

Terminate the instance you just created based on the StarCluster AMI

Important:

  1. When you terminate the instance you just created, it is important that you verify that only the instance you just created is selected. Make sure no other instances are selected!
    • Visit your Instance table, make sure only the instance that you just created (it will be of type t1.micro) is selected.
    • Select Actions → Instance State → Terminate followed by Yes, Terminate.

Return to Table of Contents


Step 11i: Create an EBS volume to hold data to be shared across your cluster

You now need an EBS volume (think of it as a simple Hard Disk Drive) that will contain your data for processing. It would be best if this volume were independent of any particular EC2 instance (machine), whether or not that instance is part of a cluster. That way, if you terminate the instances, your data will persist. We’ll create such a volume, and then set up StarCluster so that the created volume gets mounted on all the nodes in the cluster that we create for running pipelines.

$ starcluster createvolume --name=mydata 200 us-east-1a --shutdown-volume-host
.
.
>>> Checking for required remote commands...
>>> Creating 200GB volume in zone us-east-1a
>>> New volume id: vol-4b5b480c
>>> Waiting for vol-4b5b480c to become 'available'...
.
>>> Your new 200GB volume vol-4b5b480c has been created successfully
.
.
#############################
## Configuring EBS Volumes ##
#############################
# StarCluster can attach one or more EBS volumes to the master and then
# NFS_share these volumes to all of the worker nodes. ...
[volume mydata]
VOLUME_ID = vol-4b5b480c
MOUNT_PATH = /mydata

Return to Table of Contents


Step 11j: Create an AMI for cluster nodes

Use the starcluster ebsimage command to create a new AMI from your prepared PipelineNodeTemplate instance, supplying that instance’s ID (yours will not be i-12e5e73d) and a name for the new AMI:

$ starcluster ebsimage i-12e5e73d pipelineclusterami
.
.
>>> New EBS AMI created: ami-feb7aa96
>>> Waiting for snapshot to complete: snap-d6a61fa0
Snap-d6a61fa0: | 100% ETA: --:--:-- 
 >>> Waiting for ami-feb7aa96 to become available...
>>> create_image took 7.253 mins
>>> Your new AMI id is: ami-feb7aa96
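You can confirm that the new AMI is now registered to your AWS account using StarCluster’s listimages command (the AMI ID listed will be your own, not ami-feb7aa96):

$ starcluster listimages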

 


Return to Table of Contents


Step 11k: Configure and Start a Pipeline Cluster

Next we’ll modify the StarCluster configuration file to create a template for a cluster that is appropriate for running HCP Pipelines. It will use the AMI that we just created as the starting point image (e.g. ami-feb7aa96 above, but yours will be different) for both the master and the worker nodes.

Important:

  1. Your entry for NODE_IMAGE_ID will be your AMI ID from the previous substep, not ami-feb7aa96.
  2. The NODE_INSTANCE_TYPE value is where you specify the instance type for nodes in your cluster. When running pipelines “for real”, this is where you might find it necessary to choose a different instance type to provide your cluster nodes with more processing power, more RAM, or GPU access depending upon what stage of the HCP Pipelines you are running. You can always change this value and start a new cluster using a different instance type for running different stages of the pipelines.
[cluster pipelinecluster]
KEYNAME = mykey
CLUSTER_SIZE = 5
NODE_IMAGE_ID = ami-feb7aa96
NODE_INSTANCE_TYPE = m3.medium
VOLUMES = mydata

Once this cluster template has been added to your StarCluster configuration file, start the pipeline cluster:

$ starcluster start -c pipelinecluster mypipelinecluster

Important:

  1. This is a reiteration of a previous point. The master and nodeXXX instances should be left under the control of StarCluster. Do not try to start, stop, terminate, or reboot them using the Actions button at the top of the Instance table.
  2. Control these instances by logging in to your MyHCP_NITRC instance, and issuing StarCluster commands like starcluster stop mypipelinecluster.
    • Once your cluster is configured and running and you receive the “The cluster is now ready to use.” message, use the starcluster sshmaster and starcluster sshnode commands to login to your cluster nodes and verify that the /mydata and the /home directories are shared between the cluster nodes.
$ starcluster sshmaster mypipelinecluster
# cd /home
# ls
hcpuser  sgeadmin  ubuntu
# touch afileinhome.txt
# ls
afileinhome.txt  hcpuser  sgeadmin  ubuntu
# cd /mydata
# ls
lost+found
# touch afileinmydata.txt
# ls
afileinmydata.txt  lost+found
# exit
$ starcluster sshnode mypipelinecluster node001
# cd /home
# ls
afileinhome.txt  hcpuser  sgeadmin  ubuntu
# cd /mydata
# ls
afileinmydata.txt  lost+found

 


Return to Table of Contents


Step 12: Getting the HCP OpenAccess data available to your cluster

You now have a running cluster that has the necessary software installed for running the HCP Pipelines. However, none of the nodes in the cluster (master or workers) have direct access to the HCP OpenAccess S3 data. For this exercise, we will see how to easily copy the data you would like to use for pipeline processing from the HCP OpenAccess S3 bucket to the /mydata directory that is shared between your cluster nodes.

Step 12a: Setting up s3cmd on your master node

S3cmd (http://s3tools.org/s3cmd) is a free command line tool for uploading, retrieving, and managing data in an Amazon S3 bucket. S3cmd is pre-installed in the HCP_NITRC AMI, so it is available on your cluster nodes; in particular, it is available on the master node of your cluster, which is where we will use it now.

To configure s3cmd so that it can access the HCP OpenAccess bucket, you will need the AWS Access Key ID and AWS Secret Access Key that you obtained back in Step 1 for accessing the HCP S3 data; these are associated with your ConnectomeDB account. They are not the AWS Access Key ID and Secret Access Key that you created in Step 10 for your own AWS account.

Important:

  1. Substitute the Public DNS for your HCP_NITRC instance for <your-HCP_NITRC-Public-DNS> in the command below.
  2. Enter the password for the hcpuser account (e.g. hcppassword) when prompted.
$ ssh -X hcpuser@<your-HCP_NITRC-Public-DNS>

 

$ starcluster sshmaster mypipelinecluster
#

Important:

  1. The Access Key and Secret Key you enter here are those that you got way back in step 1. They are the keys necessary for you to access the HCP OpenAccess S3 bucket.
root@master:~# s3cmd --configure

Enter new values or accept defaults in brackets with Enter.
Refer to user manual for detailed description of all options.
 
Access key and Secret key are your identifiers for Amazon S3
Access Key: <your-access-key>
Secret Key: <your-secret-key>
 
Encryption password is used to protect your files from reading
by unauthorized persons while in transfer to S3
Encryption password: <just-press-enter>
Path to GPG program [/usr/bin/gpg]: <just-press-enter>
 
When using secure HTTPS protocol all communication with Amazon S3
servers is protected from 3rd party eavesdropping. This method is
slower than plain HTTP and can't be used if you're behind a proxy
Use HTTPS protocol [No]: <just-press-enter>
 
On some networks all internet access must go through a HTTP proxy.
Try setting it here if you can't connect to S3 directly
HTTP Proxy server name: <just-press-enter>
 
New settings:
  Access Key: <your-access-key>
  Secret Key: <your-secret-key>
  Encryption password: 
  Path to GPG program: /usr/bin/gpg
  Use HTTPS protocol: False
  HTTP Proxy server name: 
  HTTP Proxy server port: 0

Test access with supplied credentials? [Y/n] y
Please wait...
 
Success. Your access key and secret key worked fine :-)
 
Now verifying that encryption works…
Not configured. Never mind.
 
Save settings? [y/N] y
Configuration saved to '/root/.s3cfg'
root@master:~#
root@master:~# s3cmd ls
2014-05-15 18:56  s3://hcp-openaccess
2014-05-15 18:57  s3://hcp-openaccess-logs
root@master:~# s3cmd ls s3://hcp-openaccess
                       DIR s3://hcp-openaccess/HCP/
root@master:~# s3cmd ls s3://hcp-openaccess/HCP
ERROR: Access to bucket 'hcp-openaccess' was denied
root@master:~# s3cmd ls s3://hcp-openaccess/HCP/
                       DIR s3://hcp-openaccess/HCP/100307/
                       DIR s3://hcp-openaccess/HCP/100408/
                       DIR s3://hcp-openaccess/HCP/101006/
                       DIR s3://hcp-openaccess/HCP/101107/
                       DIR s3://hcp-openaccess/HCP/101309/
. . .
2015-01-24 21:34         0 s3://hcp-openaccess/HCP/
2015-05-08 08:26      3577 s3://hcp-openaccess/HCP/S500.txt
2015-01-28 08:22       700 s3://hcp-openaccess/HCP/UR100.txt
root@master:~#

Notice that the ls subcommand of s3cmd (s3cmd ls) is a bit picky with regard to whether you include the final / in the name of a directory. Without the /, you get an access denied error. With the /, you can see the subdirectories containing subject data.

 


Return to Table of Contents


Step 12b: Retrieving data to process from the HCP OpenAccess S3 Bucket

# cd /mydata
# wget https://github.com/Washington-University/access_hcp_data/archive/v3.0.0.tar.gz
# tar xvf v3.0.0.tar.gz
# /mydata/access_hcp_data-3.0.0/sync_hcp_data --subjlist=/mydata/access_hcp_data-3.0.0/example_subject_list.txt --dest=/mydata --stage=unproc
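When the sync finishes, the unprocessed data for each subject in the example subject list should be present under /mydata on the shared volume. A quick check for one subject:

# ls /mydata/100307/unprocessed/3T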

At https://sagebionetworks.jira.com/wiki/display/SCICOMP/Configuration+of+Cluster+for+Scientific+Computing you will find a simple diagram that illustrates our AWS and StarCluster configuration. With a few minor adjustments, the illustration in the Overview section of that page shows our configuration:

    • The node that is labeled admin in the illustration is equivalent to the HCP_NITRC instance we created in the early parts of this practical (a.k.a. MyHCP_NITRC).
    • The disk icon in the illustration shows that the NFS-mounted EBS volume is available at a mount point called /shared. Our NFS-mounted EBS volume is available at a mount point called /mydata instead.
    • Not shown is that the /home directory is also shared between the master and the worker nodes.
    • Our current cluster has only 4 worker nodes (node001 … node004) instead of the 999 nodes shown in the diagram.

 


Return to Table of Contents


Step 13: Editing files to run a pipeline stage

Once again, this step should be familiar as you are editing the PreFreeSurferPipelineBatch.sh script and the SetUpHCPPipeline.sh script to match your cluster configuration.

$ starcluster sshmaster -X mypipelinecluster

nano is a relatively user-friendly editor that, like vi, doesn’t need to open a separate window on your screen in which to edit files. Instead it uses your terminal window. To invoke nano, just use a command like nano PreFreeSurferPipelineBatch.mine.sh. Navigation and editing of text is straightforward. Use the arrow keys to move around in the file; use the Delete or Backspace keys for deleting text; and add new text by simply typing. Once you have made the necessary changes, press Ctrl-X to exit the editor, answer Y when prompted to save the buffer, and press Enter when asked for the name of the file to write before exiting.

# cd /home/hcpuser/tools/Pipelines/Examples/Scripts
# cp PreFreeSurferPipelineBatch.sh PreFreeSurferPipelineBatch.mine.sh
# cp SetUpHCPPipeline.sh SetUpHCPPipeline.mine.sh

In PreFreeSurferPipelineBatch.mine.sh, point StudyFolder at the shared /mydata directory and EnvironmentScript at your .mine setup script:

StudyFolder=/mydata
Subjlist="100307 111413"
EnvironmentScript="/home/hcpuser/tools/Pipelines/Examples/Scripts/SetUpHCPPipeline.mine.sh"

In the same batch script, the QUEUE setting will look similar to the following:

#if [ X$SGE_ROOT != X ] ; then
#    QUEUE="-q long.q"
    QUEUE="-q hcp_priority.q"
#fi

Change it so that jobs are submitted to the all.q queue provided by the cluster's SGE installation:

#if [ X$SGE_ROOT != X ] ; then
#    QUEUE="-q long.q"
    QUEUE="-q all.q"
#fi

Also in the batch script, find the block that uses fsl_sub to queue each subject's job:

if [ -n "${command_line_specified_run_local}" ] ; then
      echo "About to run ${HCPPIPEDIR}/PreFreeSurfer/PreFreeSurferPipeline.sh"
      queuing_command=""
  else
      echo "About to use fsl_sub to queue or run ${HCPPIPEDIR}/PreFreeSurfer/PreFreeSurferPipeline.sh"
      queuing_command="${FSLDIR}/bin/fsl_sub ${QUEUE}"
  fi

and change it to submit jobs with qsub instead, writing each subject's standard output and standard error to log files:

if [ -n "${command_line_specified_run_local}" ] ; then
      echo "About to run ${HCPPIPEDIR}/PreFreeSurfer/PreFreeSurferPipeline.sh"
      queuing_command=""
  else
      echo "About to use qsub to queue ${HCPPIPEDIR}/PreFreeSurfer/PreFreeSurferPipeline.sh"
      queuing_command="qsub ${QUEUE}"
      queuing_command+=" -o ${HCPPIPEDIR}/Examples/Scripts/${Subject}.PreFreeSurfer.stdout.log"
      queuing_command+=" -e ${HCPPIPEDIR}/Examples/Scripts/${Subject}.PreFreeSurfer.stderr.log"
  fi

In SetUpHCPPipeline.mine.sh, use the same settings as in Step 6; here HCPPIPEDIR is simply written out as an absolute path:

# Set up FSL (if not already done so in the running environment)
export FSLDIR="/usr/share/fsl/5.0"
. ${FSLDIR}/etc/fslconf/fsl.sh

# Set up FreeSurfer (if not already done so in the running environment)
export FREESURFER_HOME="/usr/local/freesurfer"
. ${FREESURFER_HOME}/SetUpFreeSurfer.sh > /dev/null 2>&1

# Set up specific environment variables for the HCP Pipeline
export HCPPIPEDIR="/home/hcpuser/tools/Pipelines"
export CARET7DIR="/usr/bin"

Finally, PreFreeSurferPipeline.sh itself needs to source the environment setup script, so that the environment is configured when a queued job starts on a worker node. Near the top of that script, the relevant lines look like:

set -e
EnvironmentScript="/home/hcpuser/tools/Pipelines/Examples/Scripts/SetUpHCPPipeline.mine.sh"
. ${EnvironmentScript}

Note that we are making these edits in order to run the PreFreeSurfer portion of Structural Preprocessing. Similar edits to the example batch files (e.g. FreeSurferPipelineBatch.sh, GenericfMRISurfaceProcessingPipelineBatch.sh, DiffusionPreprocessingBatch.sh, etc.) would be necessary in order to run those pipelines on your cluster. Edits similar to the one made to PreFreeSurferPipeline.sh would also be necessary in files like FreeSurferPipeline.sh, DiffPreprocPipeline.sh, etc. to run those pipelines on your cluster.

(If you don’t want to lose your edits to the Pipeline script files when your cluster is terminated, you should consider moving the entire /home/hcpuser/tools directory over to somewhere in the /mydata directory. This will put the scripts and your changes to them on the shared volume that persists beyond the life of any given instance. You will need to modify the paths specified in your script files accordingly.)
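A minimal sketch of such a move (the destination path is just an example):

# cp -r /home/hcpuser/tools /mydata/tools

After the copy, EnvironmentScript, HCPPIPEDIR, and any other paths in your .mine scripts that refer to /home/hcpuser/tools would need to be changed to refer to /mydata/tools instead.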

 


Return to Table of Contents


Step 14: Starting up a set of PreFreeSurfer Pipeline jobs

# cd /home/hcpuser/tools/Pipelines/Examples/Scripts
# ./PreFreeSurferPipelineBatch.mine.sh
...
Your job n ("PreFreeSurferPipeline.sh") has been submitted
...
Your job n+1 ("PreFreeSurferPipeline.sh") has been submitted
...
# ls *.log
100307.PreFreeSurfer.stderr.log  100307.PreFreeSurfer.stdout.log
111413.PreFreeSurfer.stderr.log  111413.PreFreeSurfer.stdout.log
#
# qstat
job-ID  prior   name       user         state submit/start at     queue                         slots
-------------------------------------------------------------------------------------------------------
     20 0.55500 PreFreeSur root         r     05/20/2015 16:00:44 all.q@master                      1
     21 0.55500 PreFreeSur root         r     05/20/2015 16:00:44 all.q@node004                     1
# qstat -j 20
# qdel 20
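While the jobs run, the per-subject log files named in the qsub options are a convenient way to follow progress (press Ctrl-C to stop following):

# tail -f 100307.PreFreeSurfer.stdout.log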

 


Return to Table of Contents


Step 15: Using the StarCluster load balancer

As you might imagine, there can be disadvantages to keeping the worker nodes of your cluster running even when they are not being used. In our example so far, we have created a cluster that contains one master node and 4 worker nodes, but we only have 2 jobs running, so at most we really need 2 worker nodes right now.

To lower costs, we can take advantage of the StarCluster load balancer. The StarCluster load balancer can observe the job queue for a cluster and start new worker nodes or remove worker nodes from the cluster based on demand.

The load balancer is an experimental feature of StarCluster. To allow the use of an experimental feature, you must edit the .starcluster/config file (on your HCP_NITRC instance, the one on which you have StarCluster installed and from which you started the cluster, not the master node of the cluster on which you were editing scripts in the previous step.)

In the [global] section of your .starcluster/config file include the following line

ENABLE_EXPERIMENTAL=True

You should be able to do this by simply removing the comment marker (#) from a line in the config file that already looks like:

#ENABLE_EXPERIMENTAL=True

Once you have enabled experimental features and have a cluster up and running (e.g. mypipelinecluster), you can start the load balancer for the cluster by issuing the following command:

$ nohup starcluster loadbalance -m 20 -n 3 mypipelinecluster &

 

The -m option specifies the maximum number of nodes in your cluster and the -n option specifies the minimum number of nodes in your cluster. You will need to press enter twice to return to the system prompt.

To find out the process ID of your load balancer issue a command like the following

$ ps -ef | grep loadbalance
hcpuser  24161 20520  1 18:16 pts/1    00:00:03 /usr/bin/python /usr/local/bin/starcluster loadbalance mypipelinecluster
hcpuser  24243 20520  0 18:21 pts/1    00:00:00 grep --color=auto loadbalance

The first numeric entry after the hcpuser text in the output line that ends with mypipelinecluster (in the example above, the number 24161) is the process ID of your load balancer process. To stop the load balancer, issue a command like:

$ kill -9 24161

Of course, you will need to substitute your load balancer's process ID for 24161 in the above.
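Because the load balancer was started with nohup, its console output is appended to a file named nohup.out in the directory from which you launched it. Tailing that file is a simple way to see what the load balancer is doing:

$ tail -f nohup.out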

If you allow the load balancer to continue to run and you have only the PreFreeSurfer jobs for two subjects running (as started in the previous steps), then when you visit your Instance Table in a browser you will likely see that the worker nodes not being used by your running jobs have been terminated. It can take in the neighborhood of 30 minutes before nodes are terminated. If you have more jobs queued than there are nodes available to run them (and this situation lasts for a while), the load balancer will eventually add new nodes to your cluster.

If your cluster is using spot instances for worker nodes (see the next step), the load balancer will also use spot instances for worker nodes that it adds to your cluster.


Return to Table of Contents


Step 16: Using spot instances as worker nodes

To lower costs even further, we can take advantage of the spot instance mechanism of Amazon AWS. The spot instance mechanism is a way for you to bid on Amazon EC2 instances such that instances are run only when your bid exceeds the current Spot Price for the instance type that you want to use.

Amazon’s documentation at http://aws.amazon.com/ec2/purchasing-options/spot-instances/ describes spot instances as follows:

Spot Instances are spare Amazon EC2 instances for which you can name your own price. The Spot Price is set by Amazon EC2, which fluctuates in real-time according to Spot Instances supply and demand. When your bid exceeds the Spot Price, your Spot instance is launched and your instance will run until the Spot Price exceeds your bid (a Spot interruption) or you choose to terminate them. …

To use Spot Instances, you place a Spot Instance request that specifies the instance type, the Availability Zone desired, the number of Spot Instances desired, and the maximum price you are willing to pay per instance hour (your bid).

To determine how that maximum price compares to past Spot Prices, the Spot Price history for the past 90 days is available via the Amazon EC2 API and the AWS Management Console. …

Starting a StarCluster cluster using spot instances as your worker nodes is as simple as using the -b option when starting your cluster from your HCP_NITRC instance. For example:

$ starcluster start -c pipelinecluster -b 0.50 myspotpipelinecluster

The above command would start a cluster named myspotpipelinecluster with a bid of $0.50 per hour for each worker node. (By default, StarCluster will not use spot instances for the master node of a cluster. It is unlikely that you would want your master node to be stopped if the current price exceeds your bid.) But how do you decide what to bid for your worker nodes? As is noted in the quote above, Amazon makes available spot bid history for instance types. StarCluster provides an easy command for viewing that history.

For example:

$ starcluster spothistory m3.medium
. . .
>>> Fetching spot history for m3.medium (VPC)
>>> Current price: $0.1131
>>> Max price: $0.7000
>>> Average price: $0.1570

Adding the -p option (e.g. starcluster spothistory -p m3.medium) will launch a web browser tab and supply you with a graph of the spot price over the last 30 days. (It may take a while to generate the graph and open the browser, so you may not want to do this during class.)

It is very useful to take note of the warning message that you get when starting a cluster using spot instances.

$ starcluster start -c pipelinecluster -b 0.50 myspotpipelinecluster
 .
 .
 .
 *** WARNING - ************************************************************
 *** WARNING - SPOT INSTANCES ARE NOT GUARANTEED TO COME UP
 *** WARNING - 
 *** WARNING - Spot instances can take a long time to come up and may not
 *** WARNING - come up at all depending on the current AWS load and your
 *** WARNING - max spot bid price.
 *** WARNING - 
 *** WARNING - StarCluster will wait indefinitely until all instances (5)
 *** WARNING - come up. If this takes too long, you can cancel the start
 *** WARNING - command using CTRL-C. You can then resume the start command
 *** WARNING - later on using the --no-create (-x) option:
 *** WARNING - 
 *** WARNING - $ starcluster start -x myspotpipelinecluster
 *** WARNING - 
 *** WARNING - This will use the existing spot instances launched
 *** WARNING - previously and continue starting the cluster. If you don't
 *** WARNING - wish to wait on the cluster any longer after pressing CTRL-C
 *** WARNING - simply terminate the cluster using the 'terminate' command.
 *** WARNING - ************************************************************

 


Return to Table of Contents


Links and references

Browse Amazon S3 buckets with Ubuntu Linux: http://makandracards.com/makandra/31999-browse-amzon-s3-buckets-with-ubuntu-linux

Expanding the Storage Space of an EBS Volume on Linux: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-expand-volume.html

What Is Amazon EC2?: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html

StarCluster Quick Start: http://star.mit.edu/cluster/docs/latest/quickstart.html

StarCluster Configuration File information: http://star.mit.edu/cluster/docs/latest/manual/configuration.html

Defining StarCluster Templates: http://star.mit.edu/cluster/docs/latest/manual/configuration.html#defining-cluster-templates

Configuration of a Cluster for Scientific Computing: https://sagebionetworks.jira.com/wiki/display/SCICOMP/Configuration+of+Cluster+for+Scientific+Computing

StarCluster Elastic Load Balancer: http://star.mit.edu/cluster/docs/0.93.3/manual/load_balancer.html

 


Return to Table of Contents

