Python Fundamentals Training

Presentation Slides
GitHub Repository


Introduction

When you first log in to Easley, the system Python is available by default. However, this may not be the ideal version of Python for your workload. The following commands report the version of Python currently in use and the path where it is installed.

The first command below reports the Python version:

python -V
Python 2.7.5

The second command will give you the path.

which python
/usr/bin/python

By default, Python 2.7.5 is available for use. However, if you wish to choose another version of Python or prefer to use Anaconda, run the following command to see the available modules:

module av
------------- Languages & Environments --------
python/anaconda/2.7.14     python/anaconda/3.6.3    python/anaconda/3.7.0    python/intel/3.7.9
python/anaconda/2.7.15     python/anaconda/3.6.4    python/anaconda/3.7.4    python/3.8.6
python/anaconda/3.5.2-0    python/anaconda/3.6.5    python/anaconda/3.8.6    python/3.9.2 (D)

Under Languages & Environments, you will see a variety of Python modules to choose from. The default module, marked with (D) in the listing, is python/3.9.2:

module load python
python -V
Python 3.9.2
which python
/tools/python-3.9.2/bin/python

Listing Available Packages

The two most popular package managers for installing and listing packages are pip and conda. Several system-wide packages are installed and available for use; they vary depending on the version of Anaconda or Python you load. To see the list of packages available, use the following commands.

For the python modules, use the following:

module load python
pip list

For the python/anaconda modules, use the following commands:

module unload python
module load python/anaconda/3.8.6
conda list
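
To check whether a particular package is already provided by the loaded module, you can filter the output. The package name numpy below is only an example; substitute whatever package you need …

pip list | grep -i numpy
conda list numpy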

Python or Anaconda?

Anaconda is an open-source Python distribution that bundles hundreds of data science libraries. Users who specialize in data science may find it beneficial to use the python/anaconda modules.

Virtual Environments and installing packages locally

If you load a version of Python that does not include a package necessary for your workload, you can submit a request with our office. However, you can also install the package(s) locally. To work around the limited privileges on a shared system, you can use your home directory (/home/<userid>), where you have full rights, to install individual packages or even entire instances of Python.

pip install --user package_name
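
For example, to add a package that the loaded module does not provide and confirm that Python can find it afterwards (requests is used purely as an illustration) …

pip install --user requests
python -c "import requests; print(requests.__file__)"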

However, if you are planning on using multiple versions of the same package for different projects, you may run into dependency issues. Virtual environments solve this issue. They are recommended for Python-based projects, and it is a good practice to create a new environment for every project.

The Python virtualenv module allows us to create an encapsulated Python instance where we can perform module installations without administrative privileges and avoid software conflicts. virtualenv creates a copy of the core binaries and libraries needed to run a Python program, including an isolated site-packages location in your home directory (or another specified location to which you have write permission) where modules are installed. This is very helpful on shared systems like HPC clusters, because it lets you customize Python for your research workflow without administrative privileges.

Python 3 with pip

This section will demonstrate how to set up a virtual environment using python3 with the pip package manager.

Step 1: Get Home

First change into your home directory by typing the following command:

cd

You can always use the following command to check the current directory you are in.

pwd

Step 2: Set the Python Environment

Before creating the virtual environment, check which modules are loaded using module list. Load any additional modules needed for your workflow with module load, and unload any unnecessary modules with module unload.

For this example, we use the default (latest) version of Python …

module load python

Most versions of Python available on the HPC systems provide the virtualenv feature globally, but it's still a good idea to install our own copy in our home directory. The following pip command uses the --user option, which installs virtualenv to a location in our home directory.

pip3 install --user virtualenv

If the command is successful, you should be able to see what files were installed by looking into a special hidden path (“.local”) in your home directory. The pip command, when used with --user, creates this path to hold user-specific Python files and configuration items …

ls -al ~/.local/lib/python<version>/site-packages

Caution

pip3 install --user <package_name> is similar to python virtual environments, in that it creates a location in your home directory where you can install modules. However, it’s important to note that simply using pip --user does not generate all of the necessary changes needed to work in a fully isolated virtual environment.

Step 3: Create the Virtual Environment

With virtualenv installed, we need to create a new location in our home directory to hold the virtual environment files. You may end up with multiple virtual environments for various projects, so it's a good idea to give the new path a descriptive name (e.g. pytorch_project1). Here we just call our environment “env1” as a generic identifier, but you can name the path whatever you like.

mkdir ~/env1
cd ~/env1/

From within the new directory, we can now create a new virtual environment by telling python to copy all of the essential files to a location within our new project folder (“env1”). We use “env1_python” here, but you can name it whatever you want …

python3 -m virtualenv env1_python

Because Python has to perform a number of file copies in order to create the virtual environment, this command may take a little while to complete. We can see what has been copied by looking into the newly created “env1_python” directory …

ls -al env1_python
ls -al env1_python/bin
ls -al env1_python/lib/python<version>
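
As an aside, recent Python 3 releases also ship a built-in venv module that can create a similar isolated environment without installing virtualenv first. A roughly equivalent command would be …

python3 -m venv env1_python

The remaining steps (activation, installing packages, deactivation) are the same either way.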

Step 4: Activate the Virtual Environment

You may notice that several “activate” scripts have been created for us in env1_python/bin. These files perform all of the necessary tasks to ensure that your Linux shell is configured with all of the appropriate paths, libraries, etc. needed to run in the isolated virtual environment. In order to actually switch to this specific execution context, we use the source command along with the location of the activate script …

source env1_python/bin/activate

With the environment now set after calling source, you should see that your prompt has changed to indicate the name of the virtual environment …

(env1_python) hpcuser@easley01:env1_python >
(env1_python) hpcuser@easley01:env1_python > which python
/home/hpcuser/env1/env1_python/bin/python

Step 5: Install Packages

From within this new Python execution context, we can now perform operations like we normally would for a locally installed python. Most importantly, you can now install packages using pip …

(env1_python) hpcuser@easley01:env1_python > pip3 install <package_name>
(env1_python) hpcuser@easley01:env1_python > python3

To confirm, we can take a look in the virtual environment's dedicated site-packages directory, where we should see that any modules installed while the environment is active are placed in that location …

(env1_python) hpcuser@easley01:env1_python > ls ~/env1/env1_python/lib/python3.9/site-packages/

And, we can run interactively in our customized shell to test some code …

(env1_python) hpcuser@easley01:env1_python > python3
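
For example, a quick interactive session can confirm that the interpreter really is the one inside the virtual environment (illustrative output, assuming the paths used above) …

>>> import sys
>>> sys.prefix
'/home/hpcuser/env1/env1_python'
>>> sys.executable
'/home/hpcuser/env1/env1_python/bin/python3'
>>> exit()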

If you plan on using conda as your package manager instead, skip ahead to the next section.

Step 6: Deactivate Virtual Environment

The final step is to deactivate the virtual environment with the deactivate command:

(env1_python) hpcuser@easley01:env1_python > deactivate

Anaconda Python 3 with conda

Anaconda is a special distribution of Python aimed at scientific workloads. Anaconda claims to provide everything you need for data science development and experimentation, including its own specialized package manager (conda). Some researchers prefer to use Anaconda for their Python workloads. The following steps describe the recommended method for creating an Anaconda (conda) virtual environment …

Step 1: Housekeeping

First, ensure that you are in your home directory and have a clean environment, with no other Python-specific modules loaded…

cd
module list

If you see any Python modules loaded, it might be a good idea to log out, then back in to reset your environment.

Step 2: Load Anaconda

Once you have confirmed that your environment is clean, you can load one of the Anaconda Python modules from which you can begin configuring your virtual environment. Here, we load the default (latest) version…

module av python

- - - - - - - - - - - - Programming Languages & Environments - - - - - - - - - - - -
python/3.8.6 python/anaconda/2.7.14 python/anaconda/3.7.0 …
python/3.9.2 python/anaconda/2.7.15 python/anaconda/3.7.4 python/intel/3.7.9

module load python/anaconda

Step 3: Create a New Virtual Environment

With Anaconda Python loaded, your environment should now be set to use the Python binaries. Because the Python installation is purposed for a multi-user environment, the path for the default package location is not writable by normal users. In order to use our own location for packages, we need to set the CONDA_PKGS_DIRS environment variable so that conda does not attempt to write to the shared location…

export CONDA_PKGS_DIRS=~/.conda/pkgs
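
If you would rather not export this variable every time you log in, the same setting can go in a ~/.condarc file, which conda reads automatically. A minimal sketch …

pkgs_dirs:
  - ~/.conda/pkgs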

Now we can create a new virtual environment with conda create, using the -n argument to give it a specific name. For this example, the virtual environment name will be env1 …

conda create -n env1

We can see from the output that conda wants to create the environment in a location in our home directory ~/.conda/envs/env1. If conda create was successful, we should see some new files and directories in that location …

$ ls ~/.conda/
environments.txt envs pkgs

$ ls ~/.conda/envs/env1/
conda-meta
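
Note that if your project needs a particular Python version inside the environment, you can request it when the environment is created; the version below is only an example …

conda create -n env1 python=3.9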

Step 4: Activate the Environment

Also from the conda create output, we can see that Anaconda Python has advised us to use conda activate env1 to use the environment. However, if you have not yet used Anaconda Python, you might see an error when running that command …

$ conda activate env1

CommandNotFoundError: Your shell has not been properly configured to use ‘conda activate’.
To initialize your shell, run

conda init <SHELL_NAME>

Warning

We could certainly run conda init, but doing this will alter our ~/.bashrc file with settings specific to the loaded version of Python. Because .bashrc is sourced every time we log on to the cluster, this could potentially cause a conflict if we ever want to change our Python environment.

To avoid this, our recommended method for activating the environment is to use source activate …

source activate env1

If the activation is successful, your command prompt should change to indicate the environment in which you are running …

hpcuser@easley01:~ > source activate env1
(env1) hpcuser@easley01:~ >

Step 5: Install Packages

Now, we should be able to install packages into our virtual environment using conda install <package_name>. Let’s see if we can install the curl module …

conda install curl
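
To confirm that the package landed in the environment, conda list can be filtered by name …

conda list curl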

Step 6: Exiting

To exit the execution context of the virtual environment and get back to our regular shell, use conda deactivate …

(env1) hpcuser@easley01:hpcuser > conda deactivate

Python Concurrency and Parallelism

Sample Code

Begin by copying the sample code from /tools/docs/tutorials/python/multi …

cd ~
mkdir hpc_pylab
cd hpc_pylab/
cp /tools/docs/tutorials/python/multi/* .

Or, clone the public repository, which is the authoritative source for any updated code …

cd ~
git clone https://github.com/auburn-research-computing/python_multiprocessing.git
mv python_multiprocessing hpc_pylab
cd hpc_pylab

You should now have two files: threads.py and procs.py …

ls -al

-rwxr-x---  1 hpcuser hpcuser 1004 Sep  2 13:48 procs.py
-rw-r--r--  1 hpcuser hpcuser 1614 Sep  2 13:03 threads.py

The procs.py file contains a code sample that employs process parallelism for a prime number calculation. The threads.py file demonstrates the use of threads for I/O-bound workloads.

Multithreading with Python

Let’s start with a basic Python program to experiment with threading.

Remember, threading is recommended for programs that spend much of their time waiting for input or output operations. Because of Python's Global Interpreter Lock (GIL), only one thread executes Python code at a time, so threads help most when the program is waiting on I/O rather than computing.

The threads.py code issues a number of HTTP (web) requests, which provides a good simulation of (relatively) slow I/O.

The syntax for running the sample code looks something like …

python <threads|procs>.py [number_of_threads]

If the optional parameter number_of_threads is not provided, a single thread will be requested.
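
To give a feel for what such a program looks like, here is a minimal sketch of a thread-based I/O workload. This is an illustration only, not the actual contents of threads.py; the URL, request count, and argument handling are assumptions …

# illustrative sketch only -- see the repository above for the real threads.py
import sys
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "https://www.example.com"   # placeholder URL
REQUESTS = 16                     # total number of HTTP requests to issue

def fetch(url):
    # each request spends most of its time waiting on the network (I/O bound)
    with urllib.request.urlopen(url) as response:
        return len(response.read())

if __name__ == "__main__":
    # optional argument: number of threads (defaults to 1, like the sample code)
    threads = int(sys.argv[1]) if len(sys.argv) > 1 else 1
    start = time.time()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        sizes = list(pool.map(fetch, [URL] * REQUESTS))
    print(f"fetched {len(sizes)} pages with {threads} thread(s) in {time.time() - start:.2f} seconds")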

Now let’s submit an interactive job so that we can experiment with the code without worrying about overloading the login node …

srun -N1 -n1 --pty /bin/bash

node001>

Here, we request a single core from one available compute node, and once the scheduler has allocated our resources, we should be dropped onto a compute node where we can run commands interactively.

First, let’s make sure we are in the location where we copied our sample code, and set our environment to use a recent version of Python …

cd ~/hpc_pylab
module load python

Now, let’s do some experimentation to see if we can see any benefit from using threads. We’ll run the sample code with a single thread first, then increase it slightly to see if we see any performance benefit …

python threads.py 1
...take note of the total execution time...
python threads.py 4
...take note of the total execution time...
python threads.py 8
...take note of the total execution time...

Multiprocessing with Python

For CPU-bound workloads, like math operations, we can use parallel processes instead of threads. Each process runs in its own Python interpreter, so the work can spread across multiple cores in parallel.
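
As a rough illustration of this pattern, the sketch below splits a prime-counting task across a pool of worker processes. Again, this is not the actual procs.py; the chunk sizes and argument handling are made up for the example …

# illustrative sketch only -- see the repository above for the real procs.py
import sys
import time
from multiprocessing import Pool

def is_prime(n):
    # CPU-bound trial division; keeps a core busy doing arithmetic
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

def count_primes(limit):
    return sum(1 for n in range(limit) if is_prime(n))

if __name__ == "__main__":
    # optional argument: number of worker processes (defaults to 1)
    procs = int(sys.argv[1]) if len(sys.argv) > 1 else 1
    work = [200_000] * 8          # eight equally sized chunks of work (arbitrary values)
    start = time.time()
    with Pool(processes=procs) as pool:
        results = pool.map(count_primes, work)
    print(f"counted {sum(results)} primes with {procs} process(es) in {time.time() - start:.2f} seconds")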

First, let's be sure to exit our current interactive job and resubmit using --ntasks and --cpus-per-task …

srun --ntasks=1 --cpus-per-task=8 --pty /bin/bash
cd ~/hpc_pylab

Run the procs.py sample code with varying numbers of cores and observe the performance impact …

python procs.py 1
... take note of the total execution time ...
python procs.py 4
... take note of the total execution time ...
python procs.py 8
... take note of the total execution time ...

Job Submission

To demonstrate the importance of job submission parameters, let’s try running the procs.py (parallel process) program using the more standard node and core allocation …

First exit any existing interactive job if you haven’t done so already. Then issue another job submission with …

srun -N1 -n1 --pty /bin/bash
cd ~/hpc_pylab
module load python
python procs.py 1
...take note of execution time...
python procs.py 8
...take note of execution time...

You should notice that the execution time remains very similar regardless of the number of processes you request, because this job submission only allocated a single core.

It’s important to make sure that your job submission parameters are set according to the Python parallel model you want to use.

As a general guideline, we recommend always using --ntasks and --cpus-per-task for Python programs that use multiprocessing or threading functions.
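
The same guideline applies to batch jobs. Below is a minimal sketch of a Slurm batch script for the procs.py example; the job name, time limit, and core count are placeholders you would adjust for your own work …

#!/bin/bash
#SBATCH --job-name=pylab
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8      # cores available to the single Python process
#SBATCH --time=00:10:00        # placeholder time limit

module load python
cd ~/hpc_pylab
python procs.py $SLURM_CPUS_PER_TASK

Save it to a file (the name is arbitrary) and submit it with sbatch, e.g. sbatch pylab.sh.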

Additional Services

Ralph Brown Draughon Library Research Data Services now offers computational support. Researchers can meet one-on-one with an expert in Python, R, and many other data science/programming languages. More information can be found on their website: https://libguides.auburn.edu/researchdata