Python Fundamentals Training¶
Introduction¶
When you first log in to Easley, the system Python is available by default. However, this may not be the ideal version of Python for your workload. The following commands return the version of Python in use and the path where it is installed.
The first command below will give you details about the version of Python:
python -V
Python 2.7.5
The second command will give you the path.
which python
/usr/bin/python
By default, Python 2.7.5 is available for use. However, if you wish to choose another version of Python or prefer to use Anaconda, run the following command:
module av
------------- Languages & Environments --------
python/anaconda/2.7.14 python/anaconda/3.6.3 python/anaconda/3.7.0 python/intel/3.7.9
python/anaconda/2.7.15 python/anaconda/3.6.4 python/anaconda/3.7.4 python/3.8.6
python/anaconda/3.5.2-0 python/anaconda/3.6.5 python/anaconda/3.8.6 python/3.9.2 (D)
Under Languages & Environments, you will see a variety of Python modules to choose from. The default Python module is python/3.9.2:
module load python
python -V
Python 3.9.2
which python
/tools/python-3.9.2/bin/python
Listing Available Packages¶
The two most popular package managers for installing and listing packages are pip and conda. There are several system-wide packages installed and available for use; they vary depending on the version of Anaconda or Python you load. To check the list of available packages, use the following commands.
For python modules use the following:
module load python
pip list
For anaconda/python modules use the following command:
module unload python
module load python/anaconda/3.8.6
conda list
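If you prefer to check from inside the interpreter, Python's standard library can report the same information. The sketch below uses importlib.metadata (available in Python 3.8+), so it works the same way whether the environment was built with pip or conda:

```python
# List installed distributions from within Python (3.8+), independent of
# whether they were installed with pip or conda.
from importlib import metadata

packages = sorted(
    (dist.metadata["Name"], dist.version)
    for dist in metadata.distributions()
    if dist.metadata["Name"] is not None
)

for name, version in packages:
    print(f"{name}=={version}")
```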
Python or Anaconda?¶
Anaconda is an open-source Python distribution that contains hundreds of data science libraries. It may be beneficial for users who specialize in data science to use the python/anaconda modules.
Virtual Environments and installing packages locally¶
If you load a version of Python that does not include a package necessary for your workload, you can submit a request to our office. Alternatively, you can install the package(s) locally.
To work around the limited privileges on shared systems, you can use your home directory /home/<userid>, where you have full rights, to install packages or even entire instances of Python.
pip install --user package_name
However, if you plan to use multiple versions of the same package for different projects, you may run into dependency issues. Virtual environments solve this problem. They are recommended for Python-based projects, and we suggest creating a new environment for every project.
The Python virtualenv module allows us to create an encapsulated Python instance where we can perform module installations without administrative privileges and avoid software conflicts. The virtualenv feature creates a copy of the core binaries and libraries needed to run a Python program, including an isolated site-packages location in your home directory (or another specified location to which you have write permissions) where modules are typically installed. This feature is very helpful on shared systems like HPC clusters, because it lets you customize Python for your research workflow without administrative privileges.
Python 3 with pip¶
This section will demonstrate how to set up a virtual environment using python3 with the pip package manager.
Step 1: Get Home¶
First change into your home directory by typing the following command:
cd
You can always use the following command to check the current directory you are in.
pwd
Step 2: Set the Python Environment¶
Before creating the virtual environment, check which modules are loaded using module list. Load any additional modules needed for your workflow with module load, and unload any unnecessary modules with module unload.
For this example, we use the default (latest) version of Python …
module load python
Most versions of Python available on HPC will have the virtualenv feature available for use globally, but it's probably still a good idea to create an instance of the feature in our home directory. The following pip command uses the --user option, which will install virtualenv to a location in our home directory.
pip3 install --user virtualenv
If the command is successful, you should be able to see what files were installed by looking into a special hidden path (".local") in your home directory. The pip command, when used with --user, creates this path to hold user-specific Python files and configuration items …
ls -al ~/.local/lib/python<version>/site-packages
Caution
pip3 install --user <package_name> is similar to a Python virtual environment, in that it creates a location in your home directory where you can install modules. However, it's important to note that simply using pip --user does not generate all of the necessary changes needed to work in a fully isolated virtual environment.
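To see exactly which per-user path the currently loaded interpreter targets with --user, you can ask Python's standard site module. This is a quick illustrative snippet, not part of the tutorial's sample code:

```python
# Print the per-user site-packages path that "pip install --user" targets.
# The exact directory depends on the Python version currently loaded.
import site
import sys

user_site = site.getusersitepackages()
print(f"Python {sys.version_info.major}.{sys.version_info.minor} "
      f"installs --user packages into: {user_site}")
```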
Step 3: Create the Virtual Environment¶
With virtualenv installed, we need to create a new location in our home directory to hold the virtual environment files. You may end up with multiple virtual environments for various projects, so it's a good idea to give the new path a descriptive name (e.g. pytorch_project1). Here we just call our environment "env1" as a generic identifier, but you can name the path whatever you like.
mkdir ~/env1
cd ~/env1/
From within the new directory, we can now create a new virtual environment by telling python to copy all of the essential files to a location within our new project folder (“env1”). We use “env1_python” here, but you can name it whatever you want …
python3 -m virtualenv env1_python
Because Python has to perform a number of file copies in order to create the virtual environment, this command may take a little while to complete. We can see what has been copied by looking into the newly created “env1_python” directory …
ls -al env1_python
ls -al env1_python/bin
ls -al env1_python/lib/python<version>
Step 4: Activate the Virtual Environment¶
You may notice that several "activate" scripts have been created for us in env1_python/bin. These files perform all of the necessary tasks to ensure that your Linux shell is configured with all of the appropriate paths, libraries, etc. needed to run in the isolated virtual environment. In order to actually switch to this specific execution context, we use the source command along with the location of the activate script …
source env1_python/bin/activate
With the environment now set after calling source, you should see that your prompt has changed to indicate the name of the virtual environment …
(env1_python) hpcuser@easley01:env1_python >
(env1_python) hpcuser@easley01:env1_python > which python
/home/hpcuser/env1/env1_python/bin/python
Step 5: Install Packages¶
From within this new Python execution context, we can now perform operations like we normally would for a locally installed python. Most importantly, you can now install packages using pip …
(env1_python) hpcuser@easley01:env1_python > pip3 install <package_name>
(env1_python) hpcuser@easley01:env1_python > python3
To confirm, we can take a look in the virtual environment's dedicated site-packages directory, and we should see that any modules we install while activated are placed in that location …
(env1_python) hpcuser@easley01:env1_python > ls env1_python/lib/python3.9/site-packages/
And, we can run interactively in our customized shell to test some code …
(env1_python) hpcuser@easley01:env1_python > python3
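As a quick sanity check inside the interactive interpreter, you can compare sys.prefix and sys.base_prefix; inside an activated virtual environment the two differ. This is an illustrative snippet, not part of the tutorial's sample code:

```python
# Report whether the current interpreter is running inside a virtual
# environment: sys.prefix points at the environment, while
# sys.base_prefix points at the interpreter it was created from.
# Outside any environment, the two are identical.
import sys

in_venv = sys.prefix != sys.base_prefix
print(f"prefix:      {sys.prefix}")
print(f"base_prefix: {sys.base_prefix}")
print(f"virtual environment active: {in_venv}")
```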
If you plan on using conda as the package manager instead of pip, skip ahead to the Anaconda section below.
Step 6: Deactivate Virtual Environment¶
The final step is to deactivate the virtual environment with the following command:
(env1_python) hpcuser@easley01:env1_python > deactivate
Anaconda Python 3 with conda¶
Anaconda is a special distribution of Python that is purposed for scientific workloads. Anaconda claims to provide everything you need for data science development and experimentation, including its own specialized package manager. Some researchers prefer to use Anaconda for their Python workloads. The following steps describe the recommended method for creating an Anaconda (conda) virtual environment …
Step 1: Housekeeping¶
First, ensure that you are in your home directory and have a clean environment, with no other Python-specific modules loaded…
cd
module list
If you see any Python modules loaded, it might be a good idea to log out, then back in to reset your environment.
Step 2: Load Anaconda¶
Once you have confirmed that your environment is clean, you can load one of the Anaconda Python modules from which you can begin configuring your virtual environment. Here, we load the default (latest) version…
module av python
module load python/anaconda
Step 3: Create a New Virtual Environment¶
With Anaconda Python loaded, your environment should now be set to use the Anaconda Python binaries. Because the Python installation is purposed for a multi-user environment, the path for the default package location is not writable by normal users. In order to use our own location for packages, we need to set the CONDA_PKGS_DIRS environment variable so that conda does not attempt to write to the shared location…
export CONDA_PKGS_DIRS=~/.conda/pkgs
Now we can create a new virtual environment with conda create, using the -n argument to give it a specific name. For this example, the virtual environment name will be env1 …
conda create -n env1
We can see from the output that conda wants to create the environment in a location in our home directory, ~/.conda/envs/env1. If conda create was successful, we should see some new files and directories in that location …
ls -al ~/.conda/envs/env1
Step 4: Activate the Environment¶
Also from the conda create output, we can see that Anaconda Python has advised us to use conda activate env1 to use the environment. However, if you have not yet used Anaconda Python, you might see an error when running that command …
Warning
We could certainly run conda init, but doing so will alter our ~/.bashrc file with settings specific to the loaded version of Python. Because .bashrc is sourced every time we log on to the cluster, this could potentially cause a conflict if we ever want to change our Python environment.
To avoid this, our recommended method for activating the environment is to use source activate …
source activate env1
If the activation is successful, your command prompt should change to indicate the environment in which you are running …
(env1) hpcuser@easley01:~ >
Step 5: Install Packages¶
Now, we should be able to install packages into our virtual environment using conda install <package_name>. Let's see if we can install the curl package …
conda install curl
Step 6: Exiting¶
To exit the execution context of the virtual environment and get back to our regular shell, use conda deactivate …
conda deactivate
Python Concurrency and Parallelism¶
Sample Code¶
Begin by copying the sample code from /tools/docs/tutorials/python/multi …
cd ~
mkdir hpc_pylab
cd hpc_pylab/
cp /tools/docs/tutorials/python/multi/* .
Or, clone the public repository, which is the authoritative source for any updated code …
cd ~
git clone https://github.com/auburn-research-computing/python_multiprocessing.git
mv python_multiprocessing hpc_pylab
cd hpc_pylab
You should now have two files, threads.py and procs.py …
ls -al
-rwxr-x--- 1 hpcuser hpcuser 1004 Sep 2 13:48 procs.py
-rw-r--r-- 1 hpcuser hpcuser 1614 Sep 2 13:03 threads.py
The procs.py file contains a code sample that employs process parallelism for a prime number calculation. The threads.py file demonstrates the use of threads for I/O-bound workloads.
Multithreading with Python¶
Let’s start with a basic Python program to experiment with threading.
Remember, threading is recommended for programs that spend much of their time waiting for input or output operations.
The code threads.py will issue a number of http (web) requests, which provides a good simulation of (relatively) slow I/O.
The syntax for running the sample code looks something like …
python <threads|procs>.py [number_of_threads]
If the optional parameter number_of_threads is not provided, a single thread will be requested.
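The repository is the authoritative source for threads.py, so the actual file may differ, but the I/O-bound threading pattern it demonstrates can be sketched roughly as follows. The fetch helper and URL list below are illustrative stand-ins, and time.sleep() is used in place of real HTTP requests so the sketch runs without network access:

```python
# Illustrative sketch of an I/O-bound threaded workload (not the actual
# threads.py). time.sleep() stands in for a slow web request.
import sys
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Simulate one slow I/O operation (~0.2 s of waiting)."""
    time.sleep(0.2)
    return f"fetched {url}"

def main(num_threads=1):
    urls = [f"https://example.com/page{i}" for i in range(8)]
    start = time.perf_counter()
    # Threads overlap their waiting time, so more workers should finish
    # the same list of requests faster.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(fetch, urls))
    elapsed = time.perf_counter() - start
    print(f"{len(results)} requests, {num_threads} thread(s): {elapsed:.2f}s")
    return elapsed

if __name__ == "__main__":
    main(int(sys.argv[1]) if len(sys.argv) > 1 else 1)
```

Running this sketch with 1, 4, and 8 threads should show the total time shrinking, mirroring the timing experiment in this section.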
Now let’s submit an interactive job so that we can experiment with the code without worrying about overloading the login node …
srun -N1 -n1 --pty /bin/bash
node001>
Here, we request a single core from one available compute node, and once the scheduler has allocated our resources, we should be dropped onto a compute node where we can run commands interactively.
First, let’s make sure we are in the location where we copied our sample code, and set our environment to use a recent version of Python …
cd ~/hpc_pylab
module load python
Now, let’s do some experimentation to see if we can see any benefit from using threads. We’ll run the sample code with a single thread first, then increase it slightly to see if we see any performance benefit …
python threads.py 1
...take note of the total execution time...
python threads.py 4
...take note of the total execution time...
python threads.py 8
...take note of the total execution time...
Multiprocessing with Python¶
For CPU-bound workloads, like math operations, we can use parallel processes instead of threads.
First, let’s be sure to exit our current interactive job and resubmit using --ntasks and --cpus-per-task …
srun --ntasks=1 --cpus-per-task=8 --pty /bin/bash
cd ~/hpc_pylab
Run the procs.py sample code with varying numbers of cores and observe the performance impact …
python procs.py 1
... take note of the total execution time ...
python procs.py 4
... take note of the total execution time ...
python procs.py 8
... take note of the total execution time ...
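Again, the repository holds the authoritative procs.py, but the CPU-bound multiprocessing pattern it demonstrates can be sketched along these lines; the chunking scheme and prime limit below are illustrative assumptions, not the tutorial's actual code:

```python
# Illustrative sketch of a CPU-bound multiprocessing workload (not the
# actual procs.py): counting primes below a limit across worker processes.
import sys
import time
from multiprocessing import Pool

def is_prime(n):
    """Trial-division primality test (deliberately CPU-heavy)."""
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

def count_primes(bounds):
    """Count primes in the half-open range [lo, hi)."""
    lo, hi = bounds
    return sum(1 for n in range(lo, hi) if is_prime(n))

def main(num_procs=1, limit=50_000):
    # Split [0, limit) into one contiguous chunk per worker process.
    step = limit // num_procs
    chunks = [(i * step, limit if i == num_procs - 1 else (i + 1) * step)
              for i in range(num_procs)]
    start = time.perf_counter()
    with Pool(processes=num_procs) as pool:
        total = sum(pool.map(count_primes, chunks))
    elapsed = time.perf_counter() - start
    print(f"{total} primes below {limit}, {num_procs} process(es): {elapsed:.2f}s")
    return total

if __name__ == "__main__":
    main(int(sys.argv[1]) if len(sys.argv) > 1 else 1)
```

Unlike the threaded example, each worker here burns CPU rather than waiting, so extra processes only help when the scheduler has actually allocated extra cores to your job.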
Job Submission¶
To demonstrate the importance of job submission parameters, let's try running the procs.py (parallel process) program using a standard single-node, single-core allocation …
First, exit any existing interactive job if you haven't done so already. Then submit another job with …
srun -N1 -n1 --pty /bin/bash
cd ~/hpc_pylab
module load python
python procs.py 1
...take note of execution time...
python procs.py 8
...take note of execution time...
You should notice that the execution time remains very similar regardless of the number of processes requested, because the job was only allocated a single core.
It's important to make sure that your job submission parameters are set according to the Python parallel model you want to use.
As a general guideline, we recommend always using --ntasks and --cpus-per-task for Python programs that use multiprocessing or threading functions.
Additional Services¶
Ralph Brown Draughon Library Research Data Services now offers computational support. Researchers can meet one-on-one with an expert in Python, R, and many other data science/programming languages. More information can be found on their website: https://libguides.auburn.edu/researchdata