Command Line & Shell for HPC

Syntax, scripting, and automation patterns used frequently in high performance computing environments for reducing experimentation time, workflow repetition, and data processing overhead

Common Research Use Cases

‧ automation ‧ monitoring ‧ benchmarking ‧ optimization ‧ standardization ‧ data processing

Shell Essentials

Output Redirection

Linux shell environments provide a set of operators that can be used to change where command output is sent. By default, commands send output to the screen.

More on Descriptors

Most commands, but not all, will print the result of internal processes that complete succesfully as formatted text.

These messages are sent to a special file called a descriptor, which can be referenced with a reserved file system path or numeric identifier.

Unless explicitly specified commands will default to stdout for results and information or stderr for error messages.

stdin : user or text input : 0 : /dev/stdin stdout : error output : 2 : /dev/stdout stderr : error output : 1 : /dev/stderr

Because each descriptor has its own associated path, they independently send and receive messages. As a result, it is important to note that stderr messages are typically not interpreted as input to redirect operators by default.


To change the behavior of a particular descriptor, the following shorthand patterns can be used:

ls 2>&1             # redirect all potential command output
ls >&1              # same as above, but a little shorter
ls >&2              # redirect all output as an error
ls 2>/dev/null      # silence error messages
ls 2>&1 >/dev/null  # silence all messages, no output
ls 1>/dev/null      # only print errors

The following operators can be used to redirect output from stderr andor stdout for capturing output to file(s).

This is an essential pattern for HPC research, because it ensures that data persisted, and enables control over where results from experimental runs are stored in the filesystem.

#append output to existing file (will create a file if it does not exist)

ls >> ~/files.txt

#write\create file (destructive)

ls > ~/files.txt

Pipes

pipes send command output as input to another command:

whoami | id

Here we run the whoami command, which by itself prints your username to the screen. But, using a pipe | we feed that output into the id command which prints full account information. You can chain pipes together to send output through many different commands.

Variables

variables allow you to store and reference values in a shell session or script:

greeting="hello"
subject="world"
echo "${greeting}, ${subject}!"

Inline Commands

Multiple commands can be run on the same shell prompt or script line by separating them with a semicolon, or you can break long command syntax into a more readable form with

whoami; id

echo “this prints a really long message to the screen, but might be easier to read if we break it into multiple lines.”

While not most useful example, here we use the echo command to demonstrate splitting string input (in a shell strings are anythig enclosed in ” ” or ‘ ‘) into multiple lines.

A more useful example might be splitting a command that contains many parameters that are more easily parsed by breaking them into multiple lines:

command --with "many" \
--parameters "that contain" \
--values "that are" --more "readable" \
--when "broken up" \
--into "multiple lines"

Loops

Loops are a very useful construct that allow an operation to be performed on lists or arrays of data. There are numerous use cases for loops in HPC, especially for file and data processing. A simple example to demonstrate this:

for number in "1 2 3 4 5"; do
  echo "line ${number}"
done

Subshells

The general syntax is: $(<command> <required_parameters> [optional_parameters])

clustername=$(hostname)
files=$(ls -1 ${HOME})

Arrays

Placeholder

System Metadata

Placeholder

Time

man date
man -k date
man Date::Format
date +'%m%d%Y'
date +'%H%M%S'

Downloading

Use a public database or API to collect data directly from the cluster command line

# The Consumer Complaint Database is a public source for interesting data in various formats'
# It allows parameters to be passed with simple HTTP GET methods

# https://www.consumerfinance.gov/data-research/consumer-complaints/search/api/v1/?format=csv&date_received_max=2023-04-01&date_received_min=2023-01-01
# https://www.consumerfinance.gov/data-research/consumer-complaints/search/api/v1/?limit=1000&format=csv&date_received_min=2023-03-01

datestamp=$(date +'%m%d%Y')
timestamp=$(date +'%H%M%S')

curl -o ./data/raw/complaints.${datestamp}.${timestamp}.csv https://www.consumerfinance.gov/data-research/consumer-complaints/search/api/v1/?format=csv&date_received_max=2023-01-01&date_received_min=2023-01-01

ls -al data/raw
wc -l ./data/raw/complaints.04062023.081145.csv

tail -1 ./data/raw/complaints.04062023.081145.csv
tail -10  ./data/raw/complaints.04062023.081145.csv | awk -F',' '{print $1}'
tail -10  data/raw/complaints.04062023.081145.csv | cut -d',' -f1
tail -10  data/raw/complaints.04062023.081145.csv | grep -ve "^[A-Za-z]" | cut -d',' -f1

awk -v FPAT='(".+")||([^,]+)||(^[ ]*$)'