Important:¶
This material is no longer maintained and may be out of date.¶
Please go to this link below for a current Introduction to Unix¶
Introduction to Unix¶
A hands-on workshop covering the basics of the Unix/Linux command line interface.
Overview¶
Knowledge of the Unix operating system is fundamental to being productive on HPC systems. This workshop will introduce you to the fundamental Unix concepts by way of a series of hands-on exercises.
The workshop is facilitated by experienced Unix users who will be able to guide you through the exercises and offer assistance where needed.
Learning Objectives¶
At the end of the course, you will be able to:
- Log into a Unix machine remotely
- Organise your files into directories
- Change file permissions to improve security and safety
- Create and edit files with a text editor
- Copy files between directories
- Use command line programs to manipulate files
- Automate your workflow using shell scripts
Requirements¶
- The workshop is intended for beginners with no prior experience in Unix.
- Attendees are required to bring their own laptop computers.
Introduction¶
Before we commence the hands-on part of this workshop we will first give a short 30 minute talk to introduce the Unix concepts. The slides are available if you would like. Additionally the following reference material is available for later use.
Reference Material
Topic 1: Remote log in¶
In this topic we will learn how to connect to a Unix computer via a program called ssh and run a few basic commands.
Connecting to a Unix computer¶
To begin this workshop you will need to connect to an HPC. Today we will use barcoo. The computer called barcoo.vlsci.org.au is the one that coordinates all the HPC’s tasks.
Server details:
- host: barcoo.vlsci.org.au
- port: 22
- username: (provided at workshop)
- password: (provided at workshop)
Mac OS X / Linux
Both Mac OS X and Linux come with a version of ssh (called OpenSSH) that can be used from the command line. To use OpenSSH you must first start a terminal program on your computer. On OS X the standard terminal is called Terminal, and it is installed by default. On Linux there are many popular terminal programs including: xterm, gnome-terminal, konsole (if you aren't sure, then xterm is a good default). When you've started the terminal you should see a command prompt. To log into *barcoo*, for example, type this command at the prompt and press return (where the word *username* is replaced with your *barcoo* username):
$ ssh username@barcoo.vlsci.org.au
The same procedure works for any other machine where you have an account, except that if your Unix computer uses a port other than 22 you will need to specify the port by adding the option *-p PORT*, with PORT substituted with the port number.
You may be presented with a message along the lines of:
The authenticity of host 'barcoo.vlsci.org.au (131.172.36.150)' can't be established.
...
Are you sure you want to continue connecting (yes/no)?
Windows
On Microsoft Windows (Vista, 7, 8) we recommend that you use the PuTTY ssh client. PuTTY (putty.exe) can be downloaded from this web page: [http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html](http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html) Documentation for using PuTTY is here: [http://www.chiark.greenend.org.uk/~sgtatham/putty/docs.html](http://www.chiark.greenend.org.uk/~sgtatham/putty/docs.html) When you start PuTTY you should see a window which looks something like this:

Note: for security reasons ssh will not display any characters when you enter your password. This can be confusing because it appears as if your typing is not recognised by the computer. Don’t be alarmed; type your password in and press return at the end.
barcoo is a high performance computer for Melbourne Bioinformatics users. Logging in connects your local computer (e.g. laptop) to barcoo, and allows you to type commands into the Unix prompt which are run on the HPC, and have the results displayed on your local screen.
You will be allocated a training account on barcoo for the duration of the workshop. Your username and password will be supplied at the start of the workshop.
Log out of barcoo, and log back in again (to make sure you can repeat the process).
All the remaining parts assume that you are logged into barcoo over ssh.
Exercises¶
1.1) When you’ve logged into the Unix server, run the following commands and see what they do:¶
- who
- whoami
- date
- cal
- hostname
- /vlsci/TRAINING/shared/Intro_to_Unix/hi
Answer
- **who**: displays a list of the users who are currently using this Unix computer.
- **whoami**: displays your username (i.e. the person currently logged in).
- **date**: displays the current date and time.
- **cal**: displays a calendar on the terminal. It can be configured to display more than just the current month.
- **hostname**: displays the name of the computer we are logged in to.
- **/vlsci/TRAINING/shared/Intro_to_Unix/hi**: displays the text "Hello World"
Topic 2: Exploring your home directory¶
In this topic we will learn how to “look” at the filesystem and further expand our repertoire of Unix commands.
Duration: 20 minutes.
Relevant commands: ls, pwd, echo, man
Your home directory contains your own private working space. Your current working directory is automatically set to your home directory when you log into a Unix computer.
2.1) Use the ls command to list the files in your home directory. How many files are there?¶
Hint
Literally, type *ls* and press the *ENTER* key.
Answer
$ ls
exp01 file01 muscle.fq
The above answer is not quite correct. There are a number of hidden files in your home directory as well.
2.2) What flag might you use to display all files with the ls command? How many files are really there?¶
Hint
Take the *all* quite literally.
Additional Hint
Type *ls --all* and press the *ENTER* key.
Answer
**Answer 1**: the *--all* (or *-a*) flag. Now you should see several files in your home directory whose names all begin with a dot. All these files are created automatically for your user account. They mostly hold configuration options for various programs, including the shell. It is safe to ignore them for the moment.
$ ls --all
. .bash_logout exp01 .lesshst
.. .bash_profile file01 muscle.fq
.bash_history .bashrc .kshrc .viminfo
2.3) What is the full path name of your home directory?¶
Hint
Remember your *Current Working Directory* starts in your *home* directory.
Additional Hint
Try a shortened version of *print working directory*.
Answer
You can find out the full path name of the current working directory with the *pwd* command. Your home directory will look something like this:
$ pwd
/home/trainingXX
Alternatively, print the value of the *HOME* environment variable, which holds the path of your home directory:
$ echo $HOME
2.4) Run ls using the long flag (-l). How did the output change?¶
Hint
Run *ls -l*
Answer
**Answer**: it changed the output to place 1 file/directory per line. It also added some extra information about each.
$ ls -l
total 32
drwxr-x--- 2 training01 training 2048 Jun 14 11:28 exp01
-rw-r----- 1 training01 training 97 Jun 14 11:28 file01
-rw-r----- 1 training01 training 2461 Jun 14 11:28 muscle.fq
Each line has the following fields, shown here for the exp01 entry:
- permissions: drwxr-x---
- link count: 2
- username: training01
- group: training
- size: 2048
- date: Jun 14 11:28
- name: exp01
2.5) What types of file are exp01 and muscle.fq?¶
Hint
Check the output from *ls -l*.
Answer
**Answer**:
- *exp01*: Directory (given the 'd' as the first letter of its permissions)
- *muscle.fq*: Regular File (given the '-')
2.6) Who has permission to read, write and execute your home directory?¶
Hint
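You can also give *ls* a filename as the first option.
Additional Hint
*ls -l* will show you the contents of the *CWD*; how might you see the contents of the *parent* directory? (remember the slides)
Answer
If you pass the *-l* flag to ls it will display a "long" listing of file information including file permissions. There are various ways you could find out the permissions on your home directory.
**Method 1**: given we know the *CWD* is our home directory.
$ ls -l ..
...
drwxr-x--- 4 trainingXY training 512 Feb 9 14:18 trainingXY
...
**Method 2**: using the *$HOME* variable.
$ ls -l $HOME/..
...
drwxr-x--- 4 trainingXY training 512 Feb 9 14:18 trainingXY
...
**Method 3**: using the *~* shorthand for your home directory.
$ ls -l ~/..
...
drwxr-x--- 4 trainingXY training 512 Feb 9 14:18 trainingXY
...
**Method 4**: listing all files in the home directory and reading the entry for *.* (the directory itself).
$ ls -la
...
drwxr-x--- 4 trainingXY training 512 Feb 9 14:18 .
...
In each case the permissions *drwxr-x---* show that only the owner (you) can read, write and execute the directory. Members of the group can read and execute it, and other users have no access.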
man is for manual: and it will be your best friend!
Manual pages include a lot of detail about a command and its available flags/options. It should be your first (or second) port of call when you are trying to work out what a command or option does.
You can scroll up and down in the man page using the arrow keys.
You can search in the man page by typing a forward slash followed by the search text and then pressing the ENTER key, e.g. type /hello and press ENTER to search for the word hello. Press the n key to find the next occurrence of hello, and so on.
You can quit the man page by pressing q.
2.7) Use the man command to find out what the -h flag does for ls¶
Hint
Give *ls* as an option to the *man* command.
Additional Hint
*man ls*
Answer
Use the following command to view the *man* page for *ls*:
$ man ls
-h, --human-readable
with -l, print sizes in human readable format (e.g., 1K 234M 2G)
2.8) Use the -h flag. How did the output for muscle.fq change?¶
Hint
Don't forget the *-l* option too.
Additional Hint
Run *ls -lh*
Answer
$ ls -lh
...
-rw-r----- 1 training01 training 2.5K Jun 14 11:28 muscle.fq
Topic 3: Exploring the file system¶
In this topic we will learn how to move around the filesystem and see what is there.
Duration: 30 minutes.
Relevant commands: pwd, cd, ls, file
3.1) Print the value of your current working directory.¶
Answer
The *pwd* command prints the value of your current working directory.
$ pwd
/home/training01
3.2) List the contents of the root directory, called ‘/‘ (forward slash).¶
Hint
*ls* accepts one or more positional arguments (options given without a flag name), which are the files/directories to list.
Answer
$ ls /
applications-merged etc media root tmp
bin home mnt sbin usr
boot lib oldhome selinux var
data lib64 opt srv
dev lost+found proc sys
3.3) Use the cd command to change your working directory to the root directory. Did your prompt change?¶
Hint
*cd* expects a single option which is the directory to change to.
Answer
The *cd* command changes the value of your current working directory. To change to the root directory use the following command:
$ cd /
3.4) List the contents of the CWD and verify it matches the list in 3.2¶
Hint
*ls*
Answer
Assuming you have changed to the root directory then this can be achieved with *ls*, or *ls -a* (for all files) or *ls -la* for a long listing of all files. If you are not currently in the root directory then you can list its contents by passing it as an argument to ls (*ls /*, as in exercise 3.2). From the root directory:
$ ls
applications-merged etc media root tmp
bin home mnt sbin usr
boot lib oldhome selinux var
data lib64 opt srv
dev lost+found proc sys
3.5) Change your current working directory back to your home directory. What is the simplest Unix command that will get you back to your home directory from anywhere else in the file system?¶
Hint
The answer to exercise 2.6 might give some hints on how to get back to the home directory.
Additional Hint
*$HOME*, *~*, */vlsci/TRAINING/trainXX* are all methods to name your home directory. Yet there is a simpler method; the answer is buried in *man cd*, however *cd* doesn't have its own manpage so you will need to search for it.
Answer
Use the *cd* command to change your working directory to your home directory. There are a number of ways to refer to your home directory; the simplest is *cd* with no arguments at all:
$ cd $HOME
$ cd ~
$ cd
3.6) Change your working directory to the following directory:¶
/vlsci/TRAINING/shared/Intro_to_Unix
Answer
**Answer**: *cd /vlsci/TRAINING/shared/Intro_to_Unix*
3.7) List the contents of that directory. How many files does it contain?¶
Hint
*ls*
Answer
You can do this with *ls*:
$ ls
expectations.txt hello.c hi jude.txt moby.txt sample_1.fastq sleepy
3.8) What kind of file is /vlsci/TRAINING/shared/Intro_to_Unix/sleepy?¶
Hint
Take the word *file* quite literally.
Additional Hint
*file sleepy*
Answer
Use the *file* command to get extra information about the contents of a file.
Assuming your current working directory is */vlsci/TRAINING/shared/Intro_to_Unix*:
$ file sleepy
Bourne-Again shell script text executable
Otherwise, give the full path to the file:
$ file /vlsci/TRAINING/shared/Intro_to_Unix/sleepy
Bourne-Again shell script text executable
The manual page has more detail on the *file* command:
$ man file
3.9) What kind of file is /vlsci/TRAINING/shared/Intro_to_Unix/hi?¶
Hint
Take the word *file* quite literally.
Answer
Use the file command again. If you are in the same directory as *hi* then:
$ file hi
ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.9, not stripped
3.10) What are the file permissions of the following file and what do they mean?¶
/vlsci/TRAINING/shared/Intro_to_Unix/sleepy
Hint
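Remember the *ls* command, and don't forget the *-l* flag.
Answer
You can find the permissions of *sleepy* using the *ls* command with the *-l* flag. The permissions *-rw-r--r--* mean that the owner (arobinson) can read and write the file, while members of the group (common) and all other users can only read it. Nobody has execute permission. If you are in the same directory as *sleepy* then:
$ ls -l sleepy
-rw-r--r-- 1 arobinson common 183 Feb 9 16:36 sleepy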
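The following is a minimal sketch of how the permission string changes when execute permission is added with *chmod* (the command mentioned in the hint for exercise 5.14). It runs in a temporary directory, and the file name *demo.sh* is made up for this illustration.

```shell
# Work in a throwaway directory so nothing in your home directory is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Create a tiny script; demo.sh is a hypothetical name used only here.
printf '#!/bin/bash\necho "hello"\n' > demo.sh

# Start from the same permissions as sleepy: owner read/write, everyone else read.
chmod 644 demo.sh
perms_before=$(ls -l demo.sh | cut -c 1-10)
echo "before: $perms_before"

# Add execute permission for the owner (u) only.
chmod u+x demo.sh
perms_after=$(ls -l demo.sh | cut -c 1-10)
echo "after:  $perms_after"
```

The first ten characters of each *ls -l* line are the file type followed by three groups of *rwx* flags for the owner, the group and everyone else.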
3.11) Change your working directory back to your home directory ready for the next topic.¶
Hint
*cd*
Answer
You should know how to do this with the cd command:
$ cd
Topic 4: Working with files and directories¶
In this topic we will start to read, create, edit and delete files and directories.
Duration: 50 minutes.
Relevant commands: mkdir, cp, ls, diff, wc, nano, mv, rm, rmdir, head, tail, grep, gzip, gunzip
4.1) In your home directory make a sub-directory called test.¶
Hint
You are trying to *make a directory*; which of the above commands looks like a shortened version of this?
Additional Hint
*mkdir*
Answer
Make sure you are in your home directory first. If not, *cd* to your home directory. Use the *mkdir* command to make new directories:
$ mkdir test
$ ls
exp01 file01 muscle.fq test
4.2) Copy all the files from the following directory into the newly created test directory:¶
/vlsci/TRAINING/shared/Intro_to_Unix
Hint
You are trying to *copy*; which of the above commands looks like a shortened version of this?
Additional Hint
$ man cp
...
SYNOPSIS
cp [OPTION]... [-T] SOURCE DEST
...
DESCRIPTION
Copy SOURCE to DEST, or multiple SOURCE(s) to DIRECTORY.
Answer
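Use the *cp* command to copy files. There are several ways to do this, depending on your working directory.
From your home directory:
$ cp /vlsci/TRAINING/shared/Intro_to_Unix/* test
From inside the test directory:
$ cd test
$ cp /vlsci/TRAINING/shared/Intro_to_Unix/* .
From the source directory:
$ cd /vlsci/TRAINING/shared/Intro_to_Unix/
$ cp * ~/test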
Note: This exercise assumes that the copy command from the previous exercise was successful.
4.3) Check that the file size of expectations.txt is the same in both the directory that you copied it from and the directory that you copied it to.¶
Hint
Remember *ls* can show you the file size (with one of its flags).
Additional Hint
*ls -l*
Answer
Use *ls -l* to check the size of files. You could do this in many ways depending on the value of your working directory. We just show one possible way for each file:
$ ls -l /vlsci/TRAINING/shared/Intro_to_Unix/expectations.txt
$ ls -l ~/test/expectations.txt
Adding the *-h* flag prints the size in human-readable units:
$ ls -lh /vlsci/TRAINING/shared/Intro_to_Unix/expectations.txt
-rw-r--r-- 1 arobinson common 1010K Mar 26 2012 /vlsci/TRAINING/shared/Intro_to_Unix/expectations.txt
Note: this exercise assumes your working directory is ~/test; if not run cd ~/test
4.4) Check that the contents of expectations.txt are the same in both the directory that you copied it from and the directory that you copied it to.¶
Hint
What is the opposite of *same*?
Additional Hint
*diff*erence
Answer
Use the *diff* command to compare the contents of two files. If the files are identical, *diff* prints nothing.
$ diff /vlsci/TRAINING/shared/Intro_to_Unix/expectations.txt expectations.txt
4.5) How many lines, words and characters are in expectations.txt?¶
Hint
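Initialisms are key.
Additional Hint
*w*ord *c*ount
Answer
Use the *wc* (for "word count") command to count the number of lines, words and characters in a file. With no flags it prints all three counts, in that order:
$ wc expectations.txt
20415 187465 1033773 expectations.txt
You can also ask for each count on its own with the *-l* (lines), *-w* (words) and *-c* (characters, counted as bytes) flags:
$ wc -l expectations.txt
20415 expectations.txt
$ wc -w expectations.txt
187465 expectations.txt
$ wc -c expectations.txt
1033773 expectations.txt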
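To see how the three counts relate to a file's contents, here is a small sketch using a two-line file created on the spot (the name *tiny.txt* is made up for this example). Reading the file from standard input with *<* makes *wc* print only the number, without the file name.

```shell
# Create a two-line file in a temporary directory.
workdir=$(mktemp -d)
printf 'one two\nthree\n' > "$workdir/tiny.txt"

# All three counts at once: lines, words, bytes.
wc "$workdir/tiny.txt"

# Each count on its own. tr strips the padding some versions of wc add.
lines=$(wc -l < "$workdir/tiny.txt" | tr -d ' ')
words=$(wc -w < "$workdir/tiny.txt" | tr -d ' ')
bytes=$(wc -c < "$workdir/tiny.txt" | tr -d ' ')
echo "lines=$lines words=$words bytes=$bytes"
```

The byte count includes the newline character at the end of each line, which is why it is larger than the number of visible letters and spaces.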
4.6) Open ~/test/expectations.txt in the nano text editor, delete the first line of text, and save your changes to the file. Exit nano.¶
Hint
*nano FILENAME*. Once *nano* is open it displays some command hints along the bottom of the screen.
Additional Hint
*^O* means hold the *Control* (or CTRL) key while pressing the *o* key. Despite what it displays, you need to type the lower-case letter that follows the *^* character. WriteOut is another name for Save.
Answer
Take some time to play around with the *nano* text editor. *Nano* is a very simple text editor which is easy to use but limited in features. More powerful editors exist such as *vim* and *emacs*, however they take a substantial amount of time to learn.
4.7) Did the changes you made to ~/test/expectations.txt have any effect on /vlsci/TRAINING/shared/Intro_to_Unix?¶
How can you tell if two files are the same or different in their contents?
Hint
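Remember exercise 4.4.
Additional Hint
Use *diff*.
Answer
Use *diff* to check that the two files are different after you have made the change to the copy of *expectations.txt* in your *~/test* directory. You edited only your copy, so the original in */vlsci/TRAINING/shared/Intro_to_Unix* is unchanged and *diff* now reports the line you deleted.
$ diff ~/test/expectations.txt \
/vlsci/TRAINING/shared/Intro_to_Unix/expectations.txt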
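As a sketch of how *diff* behaves in exercises 4.4 and 4.7, the example below compares a small file with an identical copy, then with a copy that has lost its first line. The file names are invented for the illustration. *diff* exits with status 0 when the files match and 1 when they differ, so it can be used directly in an *if* statement.

```shell
# Create a small file and an identical copy in a temporary directory.
workdir=$(mktemp -d)
printf 'line one\nline two\n' > "$workdir/original.txt"
cp "$workdir/original.txt" "$workdir/copy.txt"

# Identical files: diff prints nothing and succeeds.
if diff "$workdir/original.txt" "$workdir/copy.txt" > /dev/null; then
    same_before=yes
else
    same_before=no
fi

# Drop the first line of the copy (tail -n +2 prints from line 2 onwards).
tail -n +2 "$workdir/original.txt" > "$workdir/copy.txt"

# Different files: diff reports the missing line and fails.
if diff "$workdir/original.txt" "$workdir/copy.txt" > /dev/null; then
    same_after=yes
else
    same_after=no
fi

echo "before edit: same=$same_before, after edit: same=$same_after"
```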
4.8) In your test subdirectory, rename expectations.txt to foo.txt.¶
Hint
Another way to think of it is *moving* it from *expectations.txt* to *foo.txt*.
Additional Hint
*mv*. Use *man mv* if you need to work out how to use it.
Answer
Use the *mv* command to rename the file:
$ mv expectations.txt foo.txt
$ ls
foo.txt hello.c hi jude.txt moby.txt sample_1.fastq sleepy
4.9) Rename foo.txt back to expectations.txt.¶
Answer
Use the *mv* command to rename the file:
$ mv foo.txt expectations.txt
$ ls
expectations.txt hello.c hi jude.txt moby.txt sample_1.fastq sleepy
4.10) Remove the file expectations.txt from your test directory.¶
Hint
We are trying to *remove* a file; check the commands at the top of this topic.
Additional Hint
*rm*
Answer
Use the *rm* command to remove files (carefully):
$ rm expectations.txt
$ ls
hello.c hi jude.txt moby.txt sample_1.fastq sleepy
4.11) Remove the entire test directory and all the files within it.¶
Hint
We are trying to *remove a directory*.
Additional Hint
You could use *rmdir* but there is an easier way using just *rm* and a flag.
Answer
You could use the *rm* command to remove each file individually, and then use the *rmdir* command to remove the directory. Note that *rmdir* will only remove directories that are empty (i.e. do not contain files or subdirectories).
**Logical Answer**:
$ cd ~
$ rm test/*
$ rmdir test
A faster way is to pass the *-r* (for recursive) flag to *rm* to remove all the files and the directory in one go:
$ cd ~
$ rm -r test
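The difference between *rmdir* and *rm -r* can be tried safely in a temporary directory. This is a small sketch; the directory and file names are made up, and nothing outside the temporary directory is touched.

```shell
# Build a directory containing two files inside a throwaway location.
workdir=$(mktemp -d)
cd "$workdir"
mkdir test
touch test/a.txt test/b.txt

# rmdir refuses to remove a directory that still has files in it.
if rmdir test 2> /dev/null; then
    rmdir_result=removed
else
    rmdir_result=refused
fi
echo "rmdir on a non-empty directory: $rmdir_result"

# rm -r removes the directory and everything inside it, without asking.
rm -r test
if [ -d test ]; then
    after_rm=present
else
    after_rm=gone
fi
echo "after rm -r: $after_rm"
```

Because *rm -r* does not ask for confirmation and there is no undo, check the directory name carefully before pressing ENTER.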
4.12) Recreate the test directory in your home directory and copy all the files from /vlsci/TRAINING/shared/Intro_to_Unix back into the test directory.¶
Hint
See exercises 4.1 and 4.2.
Answer
Repeat exercises 4.1 and 4.2.
$ cd ~
$ mkdir test
$ cp /vlsci/TRAINING/shared/Intro_to_Unix/* test
4.13) Change directories to ~/test and use the cat command to display the entire contents of the file hello.c¶
Hint
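Use *man* if you can't guess how it might work.
Answer
$ cd ~/test
$ cat hello.c
#include <stdio.h>
int main(void) {
    printf ("Hello World\n");
    return 0;
}
*hello.c* is C source code for a program that prints "Hello World". Running the *hi* program in the same directory prints the same text:
$ ./hi
Hello World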
4.14) Use the head command to view the first 20 lines of the file sample_1.fastq¶
Hint
Remember your *best* friend!
Additional Hint
Use *man* to find out what option you need to add to display a given number of *lines*.
Answer
$ head -20 sample_1.fastq
@IRIS:7:1:17:394#0/1
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG
+IRIS:7:1:17:394#0/1
aaabaa`]baaaaa_aab]D^^`b`aYDW]abaa`^
@IRIS:7:1:17:800#0/1
GGAAACACTACTTAGGCTTATAAGATCNGGTTGCGG
+IRIS:7:1:17:800#0/1
ababbaaabaaaaa`]`ba`]`aaaaYD\\_a``XT
@IRIS:7:1:17:1757#0/1
TTTTCTCGACGATTTCCACTCCTGGTCNACGAATCC
+IRIS:7:1:17:1757#0/1
aaaaaa``aaa`aaaa_^a```]][Z[DY^XYV^_Y
@IRIS:7:1:17:1479#0/1
CATATTGTAGGGTGGATCTCGAAAGATATGAAAGAT
+IRIS:7:1:17:1479#0/1
abaaaaa`a```^aaaaa`_]aaa`aaa__a_X]``
@IRIS:7:1:17:150#0/1
TGATGTACTATGCATATGAACTTGTATGCAAAGTGG
+IRIS:7:1:17:150#0/1
abaabaa`aaaaaaa^ba_]]aaa^aaaaa_^][aa
4.15) Use the tail command to view the last 8 lines of the file sample_1.fastq¶
Hint
It's very much like *head*.
Answer
$ tail -8 sample_1.fastq
@IRIS:7:32:731:717#0/1
TAATAATTGGAGCCAAATCATGAATCAAAGGACATA
+IRIS:7:32:731:717#0/1
ababbababbab]abbaa`babaaabbb`bbbabbb
@IRIS:7:32:731:1228#0/1
CTGATGCCGAGGCACGCCGTTAGGCGCGTGCTGCAG
+IRIS:7:32:731:1228#0/1
`aaaaa``aaa`a``a`^a`a`a_[a_a`a`aa`__
4.16) Use the grep command to find out all the lines in moby.txt that contain the word “Ahab”¶
Hint
One might say we are 'looking for the *pattern* "Ahab"'.
Additional Hint
$ man grep
...
SYNOPSIS
grep [OPTIONS] PATTERN [FILE...]
...
Answer
$ grep Ahab moby.txt
"Want to see what whaling is, eh? Have ye clapped eye on Captain Ahab?"
"Who is Captain Ahab, sir?"
"Aye, aye, I thought so. Captain Ahab is the Captain of this ship."
... AND MUCH MUCH MORE ...
$ grep Ahab moby.txt | wc -l
491
4.17) Use the grep command to find out all the lines in expectations.txt that contain the word “the” with a case insensitive search (it should count “the” “The” “THE” “tHe” etcetera).¶
Hint
One might say we are *ignoring case*.
Additional Hint
$ man grep
...
-i, --ignore-case
Ignore case distinctions in both the PATTERN and the input files. (-i is specified by POSIX.)
...
Answer
Use the *-i* flag to *grep* to make it perform a case insensitive search:
$ grep -i the expectations.txt
The Project Gutenberg EBook of Great Expectations, by Charles Dickens
This eBook is for the use of anyone anywhere at no cost and with
re-use it under the terms of the Project Gutenberg License included
[Project Gutenberg Editor's Note: There is also another version of
... AND MUCH MUCH MORE ...
$ grep -i the expectations.txt | wc -l
8165
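Note that *grep* matches the pattern anywhere in a line, including inside longer words. The sketch below, on a made-up four-line file, shows the effect of *-i* and counts matching lines with the *-c* flag, which gives the same number as piping to *wc -l*. It also uses *-w* (match whole words only), a flag available in GNU and BSD *grep* that is not covered in the exercise above.

```shell
# A small test file; the contents are invented for this example.
workdir=$(mktemp -d)
printf 'The whale\nthe sea\nAhab spoke\nOTHER things\n' > "$workdir/sample.txt"

# Case-sensitive: only the line containing lower-case "the".
exact=$(grep -c the "$workdir/sample.txt")

# Case-insensitive: also matches "The" and the "THE" inside "OTHER".
any_case=$(grep -ci the "$workdir/sample.txt")

# Case-insensitive, whole words only: "OTHER" no longer counts.
whole_word=$(grep -ciw the "$workdir/sample.txt")

echo "exact=$exact any_case=$any_case whole_word=$whole_word"
```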
4.18) Use the gzip command to compress the file sample_1.fastq. Use gunzip to decompress it back to the original contents.¶
Hint
Use the above commands along with *man* and *ls* to see what happens to the file.
Answer
Check the file size of sample_1.fastq before compressing it:
# check filesize
$ ls -l sample_1.fastq
-rw-r--r-- 1 training01 training 90849644 Jun 14 20:03 sample_1.fastq
# compress it (takes a few seconds)
$ gzip sample_1.fastq
# check filesize (Note: its name changed)
$ ls -l sample_1.fastq.gz
-rw-r--r-- 1 training01 training 26997595 Jun 14 20:03 sample_1.fastq.gz
# decompress it
$ gunzip sample_1.fastq.gz
$ ls -l sample_1.fastq
-rw-r--r-- 1 training01 training 90849644 Jun 14 20:03 sample_1.fastq
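The round trip can be checked on a small generated file. This sketch (the file name *reads.txt* and its contents are invented) confirms that *gzip* replaces the file with a smaller *.gz* version and that *gunzip* restores exactly the original contents, using *cksum* to compare checksums.

```shell
# Generate a repetitive text file, which compresses well.
workdir=$(mktemp -d)
cd "$workdir"
awk 'BEGIN { for (i = 0; i < 10000; i++) print "ACGTACGTACGT" }' > reads.txt

# Record the size and checksum before compressing.
size_before=$(wc -c < reads.txt | tr -d ' ')
sum_before=$(cksum < reads.txt)

# Compress: reads.txt is replaced by reads.txt.gz.
gzip reads.txt
size_gz=$(wc -c < reads.txt.gz | tr -d ' ')
echo "original: $size_before bytes, compressed: $size_gz bytes"

# Decompress: reads.txt.gz is replaced by reads.txt again.
gunzip reads.txt.gz
sum_after=$(cksum < reads.txt)

if [ "$sum_before" = "$sum_after" ]; then
    echo "contents restored exactly"
fi
```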
Topic 5: Pipes, output redirection and shell scripts¶
In this section we will cover a lot of the more advanced Unix concepts; it is here where you will start to see the power of Unix. I say start because this is only the “tip of the iceberg”.
Duration: 50 minutes.
Relevant commands: wc, paste, grep, sort, uniq, nano, cut
5.1) How many reads are contained in the file sample_1.fastq?¶
Hint
Examine some of the file to work out how many lines each *read* takes up.
Additional Hint
Count the number of lines.
Answer
We can answer this question by counting the number of lines in the file and dividing by 4:
$ wc -l sample_1.fastq
3000000
$ echo "3000000 / 4" | bc
750000
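The same calculation can be done without *bc* by using the shell's built-in integer arithmetic, *$(( ... ))*. This is a sketch on a made-up two-read FASTQ file, not the workshop data.

```shell
# A tiny FASTQ file with two reads (four lines per read).
workdir=$(mktemp -d)
printf '@read1\nACGT\n+read1\naaaa\n@read2\nGGCC\n+read2\nbbbb\n' > "$workdir/tiny.fastq"

# Reading from standard input makes wc print only the number.
lines=$(wc -l < "$workdir/tiny.fastq")

# Integer division inside $(( )).
reads=$((lines / 4))
echo "$reads reads"
```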
5.2) How many reads in sample_1.fastq contain the sequence GATTACA?¶
Hint
Check out exercise 4.16.
Answer
Use *grep* to find all the lines that contain *GATTACA* and "pipe" the output to *wc -l* to count them:
$ grep GATTACA sample_1.fastq | wc -l
1119
5.3) On what line numbers do the sequences containing GATTACA occur?¶
Hint
We are looking for the *line numbers*.
Additional Hint
Check out the manpage for *grep* and/or *nl*.
Answer
You can use the *-n* flag to grep to make it prefix each line with a line number:
**Answer 1**:
$ grep -n GATTACA sample_1.fastq
5078:AGGAAGATTACAACTCCAAGACACCAAACAAATTCC
7170:AACTACAAAGGTCAGGATTACAAGCTCTTGCCCTTC
8238:ATAGTTTTTTCGATTACATGGATTATATCTGTTTGC
... AND MUCH MUCH MORE ...
**Answer 2**: alternatively, number the lines with *nl* first and then search:
$ nl sample_1.fastq | grep GATTACA
5078 AGGAAGATTACAACTCCAAGACACCAAACAAATTCC
7170 AACTACAAAGGTCAGGATTACAAGCTCTTGCCCTTC
8238 ATAGTTTTTTCGATTACATGGATTATATCTGTTTGC
... AND MUCH MUCH MORE ...
To print only the line numbers, keep just the first field with *cut*:
**Answer 3**:
$ nl sample_1.fastq | grep GATTACA | cut -f 1
5078
7170
8238
... AND MUCH MUCH MORE ...
**Answer 4**:
$ grep -n GATTACA sample_1.fastq | cut -d: -f 1
5078
7170
8238
... AND MUCH MUCH MORE ...
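The *grep -n* and *cut* combination can be tried on a small file where the answer is easy to check by eye. In this sketch the file contents are invented, and the pattern appears on lines 2 and 6.

```shell
# Six short lines; GATTACA occurs on lines 2 and 6.
workdir=$(mktemp -d)
printf 'AAAA\nCGATTACAT\nCCCC\nGGGG\nTTTT\nGATTACA\n' > "$workdir/seqs.txt"

# grep -n prefixes each match with "LINENUMBER:".
grep -n GATTACA "$workdir/seqs.txt"

# cut keeps the first colon-separated field; paste -s joins the results onto one line.
matches=$(grep -n GATTACA "$workdir/seqs.txt" | cut -d: -f 1 | paste -s -d ' ' -)
echo "matching line numbers: $matches"
```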
5.4) Use the nl command to print each line of sample_1.fastq with its corresponding line number at the beginning.¶
Hint
Check the answer to 5.3.
Answer
$ nl sample_1.fastq
1 @IRIS:7:1:17:394#0/1
2 GTCAGGACAAGAAAGACAANTCCAATTNACATTATG
3 +IRIS:7:1:17:394#0/1
4 aaabaa`]baaaaa_aab]D^^`b`aYDW]abaa`^
5 @IRIS:7:1:17:800#0/1
6 GGAAACACTACTTAGGCTTATAAGATCNGGTTGCGG
7 +IRIS:7:1:17:800#0/1
8 ababbaaabaaaaa`]`ba`]`aaaaYD\\_a``XT
... AND MUCH MUCH MORE ...
5.5) Redirect the output of the previous command to a file called sample_1.fastq.nl.¶
Check the first 20 lines of sample_1.fastq.nl with the head command. Use the less command to interactively view the contents of sample_1.fastq.nl (use the arrow keys to navigate up and down, q to quit and ‘/‘ to search). Use the search facility in less to find occurrences of GATTACA.
Hint
OK, that one was tough: *> FILENAME* is how you do it, if you didn't break out an internet search for "redirect the output in Unix".
Answer
$ nl sample_1.fastq > sample_1.fastq.nl
$ head -20 sample_1.fastq.nl
1 @IRIS:7:1:17:394#0/1
2 GTCAGGACAAGAAAGACAANTCCAATTNACATTATG
3 +IRIS:7:1:17:394#0/1
4 aaabaa`]baaaaa_aab]D^^`b`aYDW]abaa`^
5 @IRIS:7:1:17:800#0/1
6 GGAAACACTACTTAGGCTTATAAGATCNGGTTGCGG
7 +IRIS:7:1:17:800#0/1
8 ababbaaabaaaaa`]`ba`]`aaaaYD\\_a``XT
...
$ less sample_1.fastq.nl
5.6) The four-lines-per-read format of FASTQ is cumbersome to deal with. Often it would be preferable if we could convert it to tab-separated-value (TSV) format, such that each read appears on a single line with each of its fields separated by tabs. Use the following command to convert sample_1.fastq into TSV format:¶
$ cat sample_1.fastq | paste - - - - > sample_1.tsv
Answer
The *'-'* (dash) character has a special meaning when used in place of a file; it means use the standard input instead of a real file. Note: while it is fairly common in most Unix programs, not all will support it.
The *paste* command is useful for merging multiple files together line-by-line, such that the *Nth* line from each file is joined together into one line in the output, separated by default with a *tab* character. In the above example we give *paste* the standard input four times. Because all four refer to the same stream of lines from *sample_1.fastq*, *paste* takes them in turn, which causes it to join consecutive groups of 4 lines from the file into one line of output.
5.7) Do you expect the following command to produce the same output as above, and why?¶
$ paste sample_1.fastq sample_1.fastq sample_1.fastq sample_1.fastq > sample_1b.tsv
Try it, see what ends up in sample_1b.tsv (maybe use less)
Hint
Use *less* to examine it.
Answer
**Answer**: No, in the second instance we get 4 copies of each line.
**Why**: In the first command all four *-* arguments refer to the same standard input, and *cat* supplies only one copy of the file, so *paste* uses up four consecutive lines for each line of output. In the second command *paste* opens the file 4 times and reads each copy independently, so each output line is one line of the file repeated 4 times. Note: this is quite confusing and is not necessary to remember; it's just an interesting side point.
5.8) Check that sample_1.tsv has the correct number of lines. Use the head command to view the first 20 lines of the file.¶
Hint
Remember the *wc* command.
Answer
We can count the number of lines in *sample_1.tsv* using *wc*. There should be one line per read, i.e. the 750000 reads found in exercise 5.1:
$ wc -l sample_1.tsv
$ head -20 sample_1.tsv
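The behaviour described in exercises 5.6 to 5.8 can be seen on a tiny example. This sketch uses a made-up two-read FASTQ file and compares *paste* reading standard input four times with *paste* opening the same file four times.

```shell
# A tiny FASTQ file with two reads (eight lines).
workdir=$(mktemp -d)
cd "$workdir"
printf '@read1\nACGT\n+read1\naaaa\n@read2\nGGCC\n+read2\nbbbb\n' > tiny.fastq

# Four dashes share one input stream: every 4 lines become 1 tab-separated line.
paste - - - - < tiny.fastq > tiny.tsv
cat tiny.tsv

# Four separate opens of the file: each line is simply repeated 4 times.
paste tiny.fastq tiny.fastq tiny.fastq tiny.fastq > tiny_b.tsv

tsv_lines=$(wc -l < tiny.tsv | tr -d ' ')
tsv_b_lines=$(wc -l < tiny_b.tsv | tr -d ' ')
echo "tiny.tsv: $tsv_lines lines, tiny_b.tsv: $tsv_b_lines lines"

# Column 2 of the TSV holds the DNA sequences.
second_column=$(cut -f 2 tiny.tsv | paste -s -d ' ' -)
echo "sequences: $second_column"
```

Here the redirection *< tiny.fastq* feeds the file to *paste* directly, which has the same effect as the *cat tiny.fastq |* form used in exercise 5.6.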
5.9) Use the cut command to print out the second column of sample_1.tsv. Redirect the output to a file called sample_1.dna.txt.¶
Hint
See exercise 5.3 (for cut) and 5.5 (redirection).
Answer
The file sample_1.tsv is in column format. The cut command can be used to select certain columns from the file. The DNA sequences appear in column 2; we select that column using the -f 2 flag (the f stands for "field").
$ cut -f 2 sample_1.tsv > sample_1.dna.txt
5.10) Use the sort command to sort the lines of sample_1.dna.txt and redirect the output to sample_1.dna.sorted.txt. Use head to look at the first few lines of the output file. You should see a lot of repeated sequences of As.¶
Hint
Use *man* (sort) and see exercise 5.5 (redirection).
Answer
$ sort sample_1.dna.txt > sample_1.dna.sorted.txt
5.11) Use the uniq command to remove duplicate consecutive lines from sample_1.dna.sorted.txt, redirect the result to sample_1.dna.uniq.txt. Compare the number of lines in sample_1.dna.txt to the number of lines in sample_1.dna.uniq.txt.¶
Hint
I am pretty sure you have already used *man* (or just guessed how to use *uniq*). You're also a gun at redirection now.
Answer
$ uniq sample_1.dna.sorted.txt > sample_1.dna.uniq.txt
$ wc -l sample_1.dna.sorted.txt
750000
$ wc -l sample_1.dna.uniq.txt
614490
5.12) Can you modify the command from above to produce only those sequences of DNA which were duplicated in sample_1.dna.sorted.txt?¶
Hint
Check out the *uniq* manpage.
Additional Hint
Look at the man page for uniq.
Answer
Use the *-d* flag to *uniq* to print out only the duplicated lines from the file:
$ uniq -d sample_1.dna.sorted.txt > sample_1.dna.dup.txt
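Since *uniq* only compares each line with the one before it, the input needs to be sorted first for all duplicates to be found. The following sketch, on six made-up sequences, shows the difference.

```shell
# Six sequences with repeats that are not next to each other.
workdir=$(mktemp -d)
printf 'AAA\nCCC\nAAA\nGGG\nCCC\nAAA\n' > "$workdir/seqs.txt"

# Without sorting, no two neighbouring lines match, so nothing is removed.
unsorted_unique=$(uniq "$workdir/seqs.txt" | wc -l | tr -d ' ')

# After sorting, repeats sit together and uniq collapses them.
sorted_unique=$(sort "$workdir/seqs.txt" | uniq | wc -l | tr -d ' ')

# uniq -d prints one copy of each line that occurred more than once.
duplicated=$(sort "$workdir/seqs.txt" | uniq -d | paste -s -d ' ' -)

echo "uniq alone: $unsorted_unique lines"
echo "sort then uniq: $sorted_unique lines"
echo "duplicated sequences: $duplicated"
```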
5.13) Write a shell pipeline which will print the number of duplicated DNA sequences in sample_1.fastq.¶
Hint
That is, *piping* most of the commands you used above instead of redirecting to a file.
Additional Hint
i.e. 6 commands (*cat*, *paste*, *cut*, *sort*, *uniq*, *wc*)
Answer
Finally we can 'pipe' all the pieces together into a sophisticated pipeline which starts with a FASTQ file and ends with a count of the duplicated DNA sequences:
**Answer**:
$ cat sample_1.fastq | paste - - - - | cut -f 2 | sort | uniq -d | wc -l
56079
5.14) (Advanced) Write a shell script which will print the number of duplicated DNA sequences in sample_1.fastq.¶
Hint
Check out the *sleepy* file (with *cat* or *nano*); there is a bit of magic on the first line that you will need. You also need to tell bash that this file can be executed (check out the *chmod* command).
Answer
Put the answer to *5.13* into a file called *sample_1_dups.sh* (or whatever you want). Use *nano* to create the file.
**Answer**: the contents of the file will look like this:
#!/bin/bash
cat sample_1.fastq | paste - - - - | cut -f 2 | sort | uniq -d | wc -l
Make the script executable with *chmod*, then run it:
$ chmod +x sample_1_dups.sh
$ ./sample_1_dups.sh
5.15) (Advanced) Modify your shell script so that it accepts the name of the input FASTQ file as a command line parameter.¶
Hint
Shell scripts can refer to command line arguments by their position using special variables called *$0*, *$1*, *$2* and so on.
Additional Hint
*$0* refers to the name of the script as it was called on the command line. *$1* refers to the first command line argument, and so on.
Answer
Copy the shell script from *5.14* into a new file:
$ cp sample_1_dups.sh fastq_dups.sh
Edit the new file so that it reads the file named by *$1* instead of *sample_1.fastq*:
#!/bin/bash
cat "$1" | paste - - - - | cut -f 2 | sort | uniq -d | wc -l
Run it with the FASTQ file name as its argument:
$ ./fastq_dups.sh sample_1.fastq
A more robust version checks that exactly one argument was given and prints a usage message otherwise:
#!/bin/bash
if [ $# -eq 1 ]; then
    cat "$1" | paste - - - - | cut -f 2 | sort | uniq -d | wc -l
else
    echo "Usage: $0 <fastq_filename>"
    exit 1
fi
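To see the argument check at work without needing a FASTQ file, here is a sketch that writes a small stand-in script (the name *args_demo.sh* is made up) and runs it with and without an argument. The stand-in only echoes its argument; the *$#*, *$0* and *$1* handling follows the same pattern as the script above.

```shell
# Write a small script into a temporary directory using a here-document.
workdir=$(mktemp -d)
cat > "$workdir/args_demo.sh" <<'EOF'
#!/bin/bash
# $# is the number of arguments, $1 the first argument, $0 the script name.
if [ $# -eq 1 ]; then
    echo "got 1 argument: $1"
else
    echo "Usage: $0 <fastq_filename>"
    exit 1
fi
EOF

# Run it with one argument: it succeeds and echoes the argument.
good_output=$(bash "$workdir/args_demo.sh" sample_1.fastq)
echo "$good_output"

# Run it with no arguments: it prints the usage message and exits with status 1.
if bash "$workdir/args_demo.sh" > /dev/null; then
    bad_status=0
else
    bad_status=$?
fi
echo "exit status with no arguments: $bad_status"
```

A non-zero exit status is the conventional way for a command to signal failure, which lets other scripts detect that something went wrong.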
5.16) (Advanced) Modify your shell script so that it accepts zero or more FASTQ files as command line arguments and outputs the number of duplicated DNA sequences in each file.¶
Answer
We can add a loop to our script to accept multiple input FASTQ files:
#!/bin/bash
for file in "$@"; do
    dups=$(cat "$file" | paste - - - - | cut -f 2 | sort | uniq -d | wc -l)
    echo "$file $dups"
done
Running it on several files prints one line per file:
$ ./fastq_dups.sh sample_1.fastq sample_2.fastq sample_3.fastq
sample_1.fastq 56079
sample_2.fastq XXXXX
sample_3.fastq YYYYY
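As a sketch of the same loop on data small enough to check by hand, the example below wraps the pipeline in a shell function and runs it over two made-up FASTQ files. The function name *count_dups* and the file names are hypothetical, introduced only for this illustration.

```shell
# count_dups (a hypothetical helper) prints the number of duplicated
# sequences in the FASTQ file given as its first argument.
count_dups() {
    paste - - - - < "$1" | cut -f 2 | sort | uniq -d | wc -l | tr -d ' '
}

workdir=$(mktemp -d)

# a.fastq: three reads, two of which share the sequence ACGT.
printf '@r1\nACGT\n+r1\naaaa\n@r2\nGGCC\n+r2\nbbbb\n@r3\nACGT\n+r3\ncccc\n' > "$workdir/a.fastq"

# b.fastq: a single read, so nothing can be duplicated.
printf '@r1\nTTTT\n+r1\naaaa\n' > "$workdir/b.fastq"

# Loop over the files, quoting each name so that spaces would not break it.
for file in "$workdir/a.fastq" "$workdir/b.fastq"; do
    echo "$(basename "$file") $(count_dups "$file")"
done

dups_a=$(count_dups "$workdir/a.fastq")
dups_b=$(count_dups "$workdir/b.fastq")
```

Quoting *"$file"* and *"$@"* keeps each file name as a single word even if it contains spaces.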
Finished¶
Well done, you learnt a lot over the last 5 topics and you should be proud of your achievement; it was a lot to take in.
From here you should be comfortable around the Unix command line and ready to take on the HPC Workshop.
You will no doubt forget a lot of what you learnt here, so I encourage you to save a link to this workshop for later reference.
Thank you for your attendance, please don’t forget to complete the training survey and give it back to the workshop facilitators.