Advanced Command Line Tutorial

In the last section, we practiced using a few tools and introduced the idea that the command line is just a way of talking to your operating system using text commands rather than by clicking on icons. In this section, we’ll introduce some more advanced tools, and discuss general principles that will help you during your data science career.

Like everything in this course, this section will focus on the tools that are most relevant for an applied data scientist. We will not try to cover advanced bash programming (loops, function definitions, etc.), because anything you can do that way you will also be able to do in Python, for which you are receiving lots of additional training. If you want to learn the advanced bash skills, there are lots of great tutorials out there (e.g., the full DataCamp tutorial). Instead, the focus here is on skills you’re likely to need when using Git, managing packages in Python, or getting stuff set up on remote servers so you can run your R or Python scripts.

In the examples below, we’ll be working with the example data in the Example_Data/command_line folder in this repository, which you can download if you wish to follow along.

Command Line Syntax: General Principles

You may have noticed that there are some patterns to how the command line tools we’ve covered so far operate. In this section we’ll introduce some general principles that are used by most command line programs (like git, python, julia, conda, zip, ssh, etc.):

1: The first thing you type into the shell is actually just the name of a program. This may not be obvious, but when you type a command like ls, you’re actually asking your operating system to find and execute a program with that name. If you wanted to, you could find the individual file called ls that the operating system runs when you use that command. (A few commands, like cd, are instead built into the shell itself, but you use them the same way.) And later on, you’ll spend a lot of time using the commands python or git, which are just ways of asking the operating system to execute those programs.

2: The things that come after the program being called are called “arguments,” and they are passed to the program being called. For example, if you were to run python my_file.py, you are calling the program python and passing it the name of a file as an argument (which it will then execute). What arguments a program accepts or requires depends on the program.

3: The shell is very sensitive to spaces. If you have file names with spaces, you’ll need to use quotes or escape the spaces in the file names by preceding them with a \ (e.g., less this\ is\ my\ file.txt).

4: Many programs have options that are activated with “flags.” A flag is usually a single dash followed by a single letter. For example, you can ask the ls command to display the contents of a directory in a list format using the flag -l.

# Normal `ls` display:
cd ~/github/practicaldatascience/Example_Data/command_line
ls
# With the `-l` flag, it also shows file sizes, when last modified, and all sorts of operating
# system information that you don't need to worry about. 
ls -l
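
Principles 2 and 3 above (arguments and spaces) can be sketched with a small experiment. This is only an illustration: it works in a throwaway folder created with mktemp, and the file name is made up for the demo.

```shell
# Work in a throwaway folder so nothing real is touched.
demo_dir="$(mktemp -d)"
cd "$demo_dir"

# Create a file whose name contains spaces (a made-up name for this demo).
echo "hello" > "this is my file.txt"

# Two equivalent ways to pass that name as ONE argument:
quoted="$(cat "this is my file.txt")"      # wrap the name in quotes
escaped="$(cat this\ is\ my\ file.txt)"    # or escape each space with a backslash
echo "$quoted"
echo "$escaped"

# Without quotes or escapes, the shell splits the name into four arguments,
# so cat looks for files called this, is, my, and file.txt, and fails.
cat this is my file.txt 2>/dev/null || echo "cat could not find those four files"
```

Quoting is usually easier to read than escaping, which is why most people reach for quotes when a file name has several spaces.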

One Dash Versus Two Dashes

Many flags also have a longer (easier to read) version that you call with two dashes. Basically, if a command sees one dash, it knows that each letter immediately afterwards is a different flag. If it sees two dashes, it knows that everything between the dashes and the next space is a single flag name.

(Two-dash options are common in modern commands, but they aren’t always available in older commands like cd and ls. In the early days of programming, people didn’t see being “user friendly” as a priority.)

To illustrate, consider the less command. If you want it to report the currently installed version, you can either type less -V (single dash followed by a single letter) or less --version (double dash followed by a full word).

Note that because a single dash signals that what follows is a single-letter flag, you can actually pile up flags after a single dash. For example, we already know that -l tells ls to show files in a list. -h says to show file sizes in a human-readable form. Since each flag after a single dash is only one letter, if we squish them together, the command knows that it’s a series of one-letter flags (since lh itself is two letters, it wouldn’t be a valid single-letter flag).

# You can use these separately 
ls -l -h
# Or together!
ls -lh

Getting Help

Now that you know that many commands have options, the obvious next question is: how do I learn what options are available?

The answer is that most commands have help files you can get either by typing NAMEOFCOMMAND -h or man NAMEOFCOMMAND.

-h

For most commands, NAMEOFCOMMAND -h or NAMEOFCOMMAND --help will bring up a small guide to command options. For example, python -h or python --help bring up:

usage: python [option] ... [-c cmd | -m mod | file | -] [arg] ...
Options and arguments (and corresponding environment variables):
-b     : issue warnings about str(bytes_instance), str(bytearray_instance)
         and comparing bytes/bytearray with str. (-bb: issue errors)
-B     : don't write .pyc files on import; also PYTHONDONTWRITEBYTECODE=x
-c cmd : program passed in as string (terminates option list)
-d     : debug output from parser; also PYTHONDEBUG=x
-E     : ignore PYTHON* environment variables (such as PYTHONPATH)
-h     : print this help message and exit (also --help)
-i     : inspect interactively after running script; forces a prompt even
         if stdin does not appear to be a terminal; also PYTHONINSPECT=x

man

While NAMEOFCOMMAND -h works for most modern commands, for very old commands (those that have been around since the early days of computing like ls or cd), you often need to use man NAMEOFCOMMAND (man is short for manual). To illustrate, man ls brings up:

LS(1)                     BSD General Commands Manual                    LS(1)

NAME
     ls -- list directory contents

SYNOPSIS
     ls [-ABCFGHLOPRSTUW@abcdefghiklmnopqrstuwx1] [file ...]

DESCRIPTION
     For each operand that names a file of a type other than directory, ls
     displays its name as well as any requested, associated information.  For
     each operand that names a file of type directory, ls displays the names
     of files contained within that directory, as well as any requested, asso-
     ciated information.
...

NOTE: Windows bash clients (like Cmder and Git Bash) often don’t support man. To get help for old commands, try typing what you would type if man actually worked, but into Google (e.g., Google man rmdir).

The “Recursive” Flag

Now that you’re familiar with the idea of using flags to modify the behavior of commands, there’s one kinda weird flag that’s worth discussing in detail: -r, or occasionally -R.

Many command line tools are designed to operate on files, and by default they won’t work if you try to use them on folders (directories). For example, if you try to copy a folder with cp, you’ll get the following error:

➜  cp a_folder ~/desktop
cp: a_folder is a directory (not copied).

To get tools that only work on files to work on folders, we use the -r flag. r stands for “recursive,” and basically it says “do what I’m asking you to do to this directory to every file and subfolder inside this directory as well.”

Places this comes up a lot:

  • Deleting folders requires rm -r

  • Copying folders requires cp -r

  • Compressing a folder with zip requires zip -r
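
The first two cases above can be tried safely in a scratch folder. This is a minimal sketch, assuming a throwaway directory from mktemp; the folder and file names are invented for the demo.

```shell
# Build a small folder tree inside a throwaway directory.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/a_folder/nested"
echo "data" > "$demo_dir/a_folder/nested/file.txt"

# Without -r, cp refuses to copy a directory (its error message is hidden here).
cp "$demo_dir/a_folder" "$demo_dir/copy_of_folder" 2>/dev/null \
    || echo "plain cp failed on a folder"

# With -r, cp copies the folder and everything inside it.
cp -r "$demo_dir/a_folder" "$demo_dir/copy_of_folder"
ls "$demo_dir/copy_of_folder/nested"

# Likewise, rm needs -r to delete a folder and its contents.
rm -r "$demo_dir/a_folder"
```

Be careful with rm -r outside of experiments like this one: it deletes everything under the folder you name, and files removed at the command line do not go to a trash folder.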

Invisible Files

Now that you’re comfortable with options, it’s time to introduce you to a dark secret of modern operating systems: there are invisible files everywhere. When a programmer needs to have a file or folder, but doesn’t want to show it to the user, (s)he prefixes the file name with a single period (.). The operating system then hides these files from the user.

But now you can see these invisible files using the command line. Just use the -a flag (short for “all”) for the ls command to have it show you all the files that are there:

# You thought you knew what was in this folder:
ls -l
# But there was another file hiding! Notice that `.this_file_is_invisible.txt` and `.DS_Store` were hidden before? 
ls -la

Yup! .this_file_is_invisible.txt and .DS_Store were there all along! These are normal files — you can move them, rename them, or open them like any other — they are just hidden by default. However, be careful about modifying these so-called “dotfiles” — they are often hidden for a reason. Dotfiles are used by programs to store configuration or settings data, and they’re usually hidden because casual users can easily screw them up.

In this case, .this_file_is_invisible.txt is just a plain text document I created for this exercise. .DS_Store, by contrast, is a file created by the macOS operating system to store information, like how this folder should be displayed when opened. This is sufficiently unimportant that playing with it won’t ruin your computer, but there’s not really anything in there you’re meant to change.

This trick is useful to know, because some programs (like Git) rely on settings hidden in dotfiles. In fact, you should try to memorize this command (ls -la) — many people use it more than plain old ls.
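
If you are not following along with the example data, the same effect can be reproduced in a scratch folder. This is a sketch under assumptions: the two file names are made up, and the folder is a throwaway one created with mktemp.

```shell
# Create one ordinary file and one dotfile in a throwaway folder.
demo_dir="$(mktemp -d)"
touch "$demo_dir/visible.txt" "$demo_dir/.hidden_settings"

# Plain ls skips names that start with a period...
visible_only="$(ls "$demo_dir")"
# ...while ls -a lists everything, including the special entries . and ..
everything="$(ls -a "$demo_dir")"

echo "ls shows:"
echo "$visible_only"
echo "ls -a shows:"
echo "$everything"
```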

How common are dotfiles? Extremely. See for yourself: if you go to your home directory, you’ll find that all sorts of programs have been storing their settings and installed packages in dotfiles. Just run cd ~ (remember that ~ is just shorthand for your home folder, which is /Users/YOURUSERNAME on macOS and usually /home/YOURUSERNAME on Linux), then ls -la.

Feel free to explore these files and folders if you want, but I would strongly suggest against editing anything unless you know what you’re doing — unlike .DS_Store files, changing some of these can really screw up some applications.

Wildcards

As we saw in the last set of exercises, one of the most powerful command line tricks (and one of the places where using the command line can be much easier than trying to do things with your mouse) is the use of wildcards. Any time you are listing files, you can use an asterisk (*) to allow any pattern to appear in part of a filename. For example, to list all the CSV files in a folder (but only the CSVs), you can type:

cd example_csvs
ls *.csv

Or, if you only wanted to see the CSVs that have data from the month of February (in this case, the files with 2018_2_ in the middle of the file name), you could type:

ls *2018_2*

This is an extremely powerful tool, and one you’ll use a lot. Just be careful — wildcards can also get you in trouble. For example, suppose you wanted to erase all the CSVs from January. You might be inclined to type rm *2018_1*. But that pattern will catch much more than just January…

ls *2018_1*

It will also catch (and if you were to use rm, delete) November (2018_11_) and December (2018_12_)! To only catch January, you’d have to be more specific and use rm *2018_1_* (with the underscore after the 1).
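
A safe way to see the January pitfall for yourself is to test patterns with ls on dummy files before ever using rm. The sketch below assumes made-up file names that follow the 2018_MONTH_ naming described above; they are not the actual files in the example data.

```shell
# Create empty stand-in files in a throwaway folder.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
touch data_2018_1_sales.csv data_2018_2_sales.csv \
      data_2018_11_sales.csv data_2018_12_sales.csv

# Too loose: matches January, November, and December.
loose="$(ls *2018_1*)"
echo "Loose pattern matched:"
echo "$loose"

# Tighter: the underscore after the 1 rules out 11 and 12.
strict="$(ls *2018_1_*)"
echo "Strict pattern matched:"
echo "$strict"
```

Running ls with a pattern first, and only then swapping ls for rm, is a cheap habit that prevents most wildcard accidents.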

Using the Outputs of Commands

We’ve noted that there are several commands that will print information to the terminal for you to see. But sometimes we want to do something with the information that programs return. For example, it’s nice that ls shows us the contents of a folder, but what if we wanted to save that to disk so we could open it and use it in a different program?

Saving to Disk

You can redirect the output of any program that prints something to the screen to a file with the > or >> operators. For example, to save the output of the ls command to a file in your current folder, you would type ls > ls_output.txt.

Note that this will only work for commands that print something directly to the screen (like ls, or cat). It won’t work for programs that just open up an interactive session (like less).

WARNING: A single > will overwrite the old file and create a new one. To append to an existing file instead of overwriting it, use >>.
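
The difference between the two operators is easiest to see side by side. This is a minimal sketch that writes to a throwaway folder; the file name simply mirrors the ls_output.txt example above.

```shell
demo_dir="$(mktemp -d)"
out_file="$demo_dir/ls_output.txt"

echo "first"  > "$out_file"    # creates the file
echo "second" > "$out_file"    # overwrites it: "first" is gone
echo "third" >> "$out_file"    # appends: the file now has two lines

cat "$out_file"
```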

Piping

Sometimes, instead of saving the output of a program to disk, you want to pass it to another program to analyze. This practice — using the output of one program as input to another — is called “piping,” and it can be very powerful (and is actually used in many programming languages, not just bash).

For example, suppose we wanted to count the number of .csv files in a folder. One way to do this would be to save the names of all the .csv files in the directory to disk, then use the wc command (short for “word count”) to count the number of lines in that file. To do so, we save the output of ls -1 *.csv to disk (the -1 option forces ls to put one file name on each line), then point wc at the file using the -l option, which counts lines rather than words (a file name containing a space would otherwise be counted as multiple words; see man wc for more information on how wc works):

ls -1 *.csv
ls -1 *.csv > ~/files_in_folder.txt
wc -l ~/files_in_folder.txt

But obviously that seems wasteful. Why do we have to save to disk just to move the data from one program to another?

The answer is we don’t! Instead we can use the pipe operator: |. The pipe operator says “just pass the output of the first command to the second command as its input.” And now we can do:

ls -1 *.csv | wc -l
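
To check that the piped version gives the same answer as the save-to-disk version, both can be run on a folder whose contents are known. This sketch assumes a throwaway folder with three made-up CSV files and one other file.

```shell
# Set up a folder with three CSVs and one file that should not be counted.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
touch a.csv b.csv c.csv notes.txt

# Two-step version: write the listing to disk, then count its lines.
ls -1 *.csv > files_in_folder.txt
wc -l < files_in_folder.txt

# One-step version: pipe the listing straight into wc.
# The extra tr step strips the padding that some versions of wc add,
# and shows that pipes can be chained.
csv_count="$(ls -1 *.csv | wc -l | tr -d ' ')"
echo "CSV files: $csv_count"
```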

The nano Editor

It is often the case when working at the command line that one wants to actually edit a file, not just look at it or move it around. For small, quick edits, most systems come with an extremely useful tool for this purpose: nano. Just type nano FILENAME on almost any system, and you can edit your file without opening or installing additional programs.

(Note for MIDS Students: you can also use Emacs for the same purpose, since you’ve already gone through the pain of learning it!)

The PATH Variable

The last feature of the command line that is important to understand is the PATH variable. We won’t get into all the intricacies of the PATH variable here, but having a basic understanding of its purpose and function will likely prove useful to you if you ever have to troubleshoot problems in the future.

Have you ever wondered how the command line knows what to do when you type a command like python or ls? How does it know what program to run, especially on a computer that might have multiple installations of a program like Python?

The answer is that your system has a list of folders stored in an “environment variable” called PATH, and when you run a command (like python), it goes through those folders in order until it finds an executable file with the name of the command you typed. Then, when it finds that file, it executes that program and stops looking.

You can see the value of the PATH variable on your computer by typing echo $PATH (echo says “please evaluate and print what follows,” and the dollar sign in $PATH says “please fill in the value of the environment variable named PATH”). On my system, the PATH variable looks like this:

echo $PATH

That means that when I type python, my computer will first look in the folder /Users/Nick/opt/miniconda/bin to see if there’s a file named python it can run. If it can’t find one there, it moves on to /Users/Nick/opt/miniconda/condabin, etc.

(You’ll see that /Users/Nick/opt/miniconda/bin appears twice in my PATH. That’s because the program I’m working with adds /Users/Nick/opt/miniconda/bin to my PATH when it starts up, leading to duplication. Thankfully, duplication doesn’t really matter: the time it takes the computer to check that folder twice is minuscule.)

Why This Is Useful to Know

In a perfect world, you’ll never have to worry about your PATH variable, but there are a couple situations where knowing about your PATH variable can be helpful. In particular:

  • If you downloaded a program, but you can’t run it from the command line, that probably means that its location isn’t in the PATH variable.

  • If you find that when you type a command like python, the command line isn’t running the version of Python you want it to run, that’s probably because a different version of Python appears earlier in the PATH variable (since the command line will stop looking through these folders as soon as it finds a match). You can diagnose this problem by typing which COMMANDNAME, which will print the full path of the file that runs when you type COMMANDNAME.
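
Both diagnoses above start with looking at what is on the PATH and which file a command resolves to. The sketch below is one way to do that; it uses command -v, a shell built-in that does a similar job to which, and it checks ls only because that command should exist on any system.

```shell
# Print each PATH folder on its own line, in the order they are searched.
echo "$PATH" | tr ':' '\n'

# Ask the shell which file it would run for a given command name.
ls_location="$(command -v ls)"
echo "ls runs from: $ls_location"
```

Swapping ls for python (or any other command) shows which installation the shell picks first.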

Modifying Your PATH Variable

How you modify your PATH variable depends a little on your operating system.

Configuration File on macOS and Linux

In Linux or macOS, the easiest way to modify your PATH variable is through your command line configuration file. This is a small script that runs automatically whenever you open a new command line window. If you add a modification to your PATH variable here, that modification will always be loaded when you open a new command line session.

The exact name of your configuration file will depend a little on what shell you’re using. If you’re using bash (i.e., haven’t installed oh-my-zsh, as suggested in the last tutorial), your configuration file will be located in your home directory (cd ~) and is named either .bash_profile or .bashrc. Note that the name starts with a . so it’s invisible by default! You’ll have to use your ls -la trick to see and open it.

If you installed oh-my-zsh (or your system already uses zsh, which is the default shell on recent versions of macOS), then the file will still be located in your home directory but will be called .zshrc.

Configuration File on Windows

If you’re using Windows, you should have created a file called .bash_profile when you installed Cmder (instructions here). That’s where you will change your PATH variable for using Bash in Cmder.

Actually Changing Your PATH Variable

When modifying your PATH variable, you really don’t want to remove anything already in your PATH variable (because who knows what program may need one of those obscure directories). Instead, the best practice is to just prefix the folders you want searched first. If your program isn’t on your PATH, this will add the program; if the wrong version of a program is being used, because you’re adding to the front of the PATH variable, the folder you add will have higher priority.

So to add a folder to the front of your PATH variable while keeping the old folders at the back, we type:

export PATH="/NEW/FOLDER/ON/PATH:$PATH"
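
To see the effect of that line without touching your real configuration file, you can try it in a single session. This is a sketch under assumptions: the folder is a throwaway one from mktemp, and my_tool is a made-up script name, not a real program.

```shell
demo_dir="$(mktemp -d)"

# Create a tiny executable script (my_tool is a hypothetical name).
printf '#!/bin/sh\necho "hello from my_tool"\n' > "$demo_dir/my_tool"
chmod +x "$demo_dir/my_tool"

# Before: the shell cannot find it, because its folder is not on the PATH.
command -v my_tool || echo "my_tool is not on the PATH yet"

# Prepend the folder, keeping everything that was already on the PATH.
export PATH="$demo_dir:$PATH"

# After: the shell finds the script by name and runs it.
tool_output="$(my_tool)"
echo "$tool_output"
```

A change made this way lasts only until you close the window; putting the export line in your configuration file is what makes it apply to every new session.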

Command Line Exercises

Let’s do some exercises! Unless you’re in my Duke class, in which case please do not do these before class as we’ll be working on them together.

Advanced Command Line Exercises