How Linux Works


Goal: The objective is to get a good start in the Linux world. If you want to learn for and pass the LPI exams, you should start here before using the LPI books. This material comes from the book How Linux Works, 2nd Edition from No Starch Press; please buy the book if you learned from these notes, to support the writer and editor of the book.

The Big Picture

Thebigpicture.jpg


Levels and layers (classification or grouping)

A layer or level is a classification of a component according to where that component sits between the user and the hardware. Web browsers, games, and the like sit at the top layer; at the bottom layer is the memory in the computer hardware: the 0s and 1s.

A Linux system has three main levels:

 * User processes: GUI, Shell
 * Linux kernel: system calls, process management, memory management and device drivers.
 * Hardware: CPU, RAM, Disks, network ports.
Figure3.jpg

The hardware is at the base. Hardware includes the memory as well as one or more central processing units (CPUs) to perform computation and to read from and write to memory. The next level up is the kernel, which is the core of the operating system (OS). The kernel is software residing in memory that tells the CPU what to do. The kernel manages the hardware and acts primarily as an interface between the hardware and any running program. Processes (the running programs that the kernel manages) collectively make up the system's upper level, called user space (or user processes).

There is a critical difference between the way the kernel and user processes run: the kernel runs in kernel mode, and user processes run in user mode. Code running in kernel mode has unrestricted access to the processor and main memory (this is called kernel space). User mode, in comparison, restricts access to a (usually quite small) subset of memory and to safe CPU operations. User space refers to the part of main memory that user processes can access. If a process makes a mistake and crashes (for ex.: Firefox), the consequences are limited and can be cleaned up by the kernel.

Hardware: understanding main memory

Of all the hardware on a computer system, main memory is perhaps the most important. In its rawest form, main memory is just a big storage area for a bunch of 0s and 1s. Each 0 or 1 is called a bit. This is where the running kernel and processes reside. A CPU is just an operator on memory; it reads its instructions and data from the memory and writes data back out to the memory.

The kernel

One of the kernel's tasks is to split memory into many subdivisions, and it must maintain certain state information about those subdivisions at all times. Each process gets its own share of memory, and the kernel must ensure that each process keeps to its share. The kernel is in charge of managing tasks in four general system areas:

 * Processes: The kernel is responsible for determining which processes are allowed to use the CPU.
 * Memory: The kernel needs to keep track of all memory (allocated, shared and free).
 * Device drivers: The kernel acts as an interface between hardware (disk) and processes.
 * System calls and support: Processes normally use system calls to communicate with the kernel.

Process management

Process management describes the starting, pausing, resuming, and terminating of processes. On any modern OS, many processes run "simultaneously". For example, you might have a web browser and a spreadsheet open on a desktop at the same time. However, things are not as they appear: the processes behind these applications typically do not run at exactly the same time. On a system with one CPU, many processes may be able to use the CPU, but only one process may actually use the CPU at any given time. In practice, each process uses the CPU for a small fraction of a second, then pauses; then another process takes a turn, and so on. The act of one process giving up control of the CPU to another process is called a context switch. Each piece of time (called a time slice) gives a process enough time for significant computation (and indeed, a process often finishes its current task during a single slice). The kernel is responsible for context switching. To understand how this works, let's think about a situation in which a process is running in user mode but its time slice is up. Here's what happens:

 * 1. The CPU interrupts the current process based on an internal timer, switches into kernel mode, and hands control back to the kernel. 
 * 2. The kernel records the current state of the CPU and memory, which will be essential to resuming the process that was just interrupted. 
 * 3. The kernel performs any tasks that might have come up during the preceding time slice (such as collecting data from input and output, or I/O, operations).
 * 4. The kernel is now ready to let another process run. The kernel analyzes the list of processes that are ready to run and chooses one. 
 * 5. The kernel prepares the memory for this new process, and then prepares the CPU. 
 * 6. The kernel tells the CPU how long the time slice for the new process will last. 
 * 7. The kernel switches the CPU into user mode and hands control of the CPU to the process. 

Context switching answers the important question of when the kernel runs. The answer is that it runs between process time slices, during a context switch.

Memory Management

Because the kernel must manage memory during a context switch, it has a complex job of memory management. The kernel's job is complicated because the following conditions must hold:

 * The kernel must have its own private area in memory that user processes can't access. 
 * Each user process needs its own section of memory. 
 * One user process may not access the private memory of another process.
 * User processes can share memory.
 * Some memory in user processes can be read only. 
 * The system can use more memory than is physically present by using disk space as auxiliary (swap). 

Device Drivers and Management

The kernel's role with devices is pretty simple. A device is typically accessible only in kernel mode because improper access (such as a user process asking to turn off the power) could crash the machine. Another problem is that different devices rarely have the same programming interface, even if the devices do the same thing, such as two different network cards. Therefore, device drivers have traditionally been part of the kernel, and they strive to present a uniform interface to user processes in order to simplify the software developer's job.

System calls and support

There are several other kinds of kernel features available to user processes. For ex.: system calls (or syscalls) perform specific tasks that a user process alone cannot do well or at all. The acts of opening, reading, and writing files all involve system calls.

Two system calls, fork() and exec(), are important to understanding how processes start up:

 * fork(): when a process calls fork(), the kernel creates a nearly identical copy of the process. 
 * exec(): when a process calls exec(program), the kernel starts program, replacing the current process. 

Example for ls:

 shell -- fork() --+--> shell (the parent, which waits for ls to finish)
                   +--> copy of shell -- exec(ls) --> ls
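
You can watch the shell perform this fork-and-exec sequence with strace, a tool that traces system calls (this assumes strace is installed; it is not part of the book's example):

 $ strace -f -e trace=execve sh -c ls   # -f follows the forked child; look for the execve of ls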

The kernel also supports user processes with features other than traditional system calls, the most common of which are pseudodevices. Pseudodevices look like devices to user processes, but they're implemented purely in software. As such, they don't technically need to be in the kernel, but they are usually there for practical reasons. For ex.: the kernel random number generator device (/dev/random) would be difficult to implement securely as a user process.

User space

The main memory that the kernel allocates for user processes is called user space. Because a process is simply a state (or image) in memory, user space also refers to the memory for the entire collection of running processes (userland). Most of the real action on a Linux system happens in user space. Although all processes are essentially equal from the kernel's point of view, they perform different tasks for users. There is a rudimentary service level (or layer) structure to the kinds of system components that user processes represent.

Basic services are at the bottom level (closest to the kernel), utility services are in the middle, and applications that users touch are at the top.

Simplified diagram:

User-space-vs-kernel-space-basic-system-calls.png

Users

The Linux kernel supports the traditional concept of a Unix user. A user is an entity that can run processes and own files. A user is associated with a username. For ex.: a system could have a user named jamesbond. However, the kernel does not manage the usernames; instead, it identifies users by numeric identifiers called user IDs.

A user may terminate or modify the behaviour of its own processes (within certain limits), but it cannot interfere with other users' processes. In addition, users may own files and choose whether to share them with other users.

Another user to know is root. The root user is an exception because root may terminate and alter other users' processes and read any file on the local system. For this reason, root is known as the superuser.


Basic commands and directory hierarchy

Standard input (STDIN) and standard output (STDOUT)

Unix processes use I/O streams to read and write data. Processes read data from input streams and write data to output streams. Streams are very flexible. For ex.: the source of an input stream can be a file, a device, a terminal, or even the output stream from another process.

To see an input stream at work, enter cat (with no file) and press ENTER.

 $ cat  # press ctrl-D when you have finished testing STDIN. 

This time, you won't get your shell prompt back because cat is still running. Now type anything and press ENTER at the end of each line. The cat command repeats any line that you type.

Note: Pressing ctrl-D on an empty line stops the current STDIN entry from the terminal (and often terminates a program). Don't confuse this with ctrl-C, which terminates a program regardless of its STDIN or STDOUT.

STDOUT is similar. The kernel gives each process a STDOUT stream where it can write its output. The cat command always writes its output to STDOUT. When you ran cat in the terminal, its STDOUT was connected to that terminal, so that's where you saw the output. Many commands operate as cat does: if you don't specify an input file, the command reads from STDIN. Output is a little different: some commands (like cat) send output only to STDOUT, but others have the option to send output directly to files.

Index001.png

Basic commands

 ls    - lists the contents of a directory.
 cp    - copies files.
 mv    - renames and moves files. 
 touch - creates a file.
 rm    - removes (deletes) files.
 echo  - prints its arguments to STDOUT.
 cd    - changes the current directory. 
 mkdir - creates a new directory.
 rmdir - removes a directory. 

Shell globbing (wildcards)

 *  tells the shell to match any number of arbitrary characters. 
 ?  tells the shell to match exactly one arbitrary character. 

Note: If you don't want the shell to expand a glob in a command, enclose the glob in single quotes ('').

Note: It is important to remember that the shell performs expansions before running commands, and only then. Therefore, if a * makes it to a command without expanding, the shell will do nothing more with it; it's up to the command to decide what it wants to do.
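
For ex.: here is a quick way to see the difference between an expanded and a quoted glob (file1 and file2 are just throwaway names):

 $ touch file1 file2
 $ echo f*     # the shell expands the glob before echo runs: file1 file2
 $ echo 'f*'   # quoted: echo receives the literal string f*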

Intermediate commands

 grep     - prints the lines from a file or STDIN that match an expression. 
 grep -i  # case-insensitive matching.
 grep -v  # inverts the search (prints the lines that do NOT match).

There is also a more powerful variant called egrep (which is just a synonym for grep -E).

Grep understands patterns known as regular expressions that are grounded in computer science theory and are very common in Unix utilities. Regular expressions are more powerful than wildcard-style patterns, and they have a different syntax. There are two important things to remember about regular expressions:

 .* matches any number of characters (like * in wildcards).
 .  matches one arbitrary character (like ? in wildcards).
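
For ex.: two quick illustrations of those patterns against /etc/passwd (the matches shown assume typical entries such as root and /bin/bash):

 $ grep 'r..t' /etc/passwd    # r, two arbitrary characters, then t (matches root)
 $ grep 'b.*sh' /etc/passwd   # b, any number of characters, then sh (matches /bin/bash)
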
 less - displays the contents of a file one screenful at a time.

Press the spacebar to go forward one screenful, b to skip back one screenful, and q to quit.

Note: The less command is an enhanced version of an older program named more. Most Linux desktops and servers have less, but it's not standard on many embedded systems and other Unix systems. So if you ever run into a situation where you can't use less, try more.

You can also search for text inside less. For ex.: to search forward for a word, type /word, and to search backward, use ?word. When you find a match, press n to continue searching.

Here is an example of sending the STDOUT of the grep command to the less command:

 grep 'ie' /usr/share/dict/words | less
 pwd    - prints the working directory.
 pwd -P # avoids all symlinks.
 diff   - shows the differences between two text files.
 file   - determines a file's type.
 find   - finds a file.
 locate - locates a file # like find, but it consults a prebuilt database, so it's faster but may be out of date.

The find command accepts special pattern-matching characters such as *, but you must enclose them in single quotes ('*') to protect the special characters from the shell's own globbing features.
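
For ex.: a minimal sketch (the *.conf pattern is just an illustration):

 $ find /etc -name '*.conf'   # the quotes keep the shell from expanding * before find sees it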

 head   - output the first part of files.
 tail   - output the last part of files.

Note: To change the number of lines to display, use the -n option, where n is the number of lines you want (for ex.: head -5).

  $ head -5 /etc/passwd

To print lines starting at line n, use tail -n +n (some systems also accept the older tail +n).

 sort   - sorts lines of text files (in alphanumeric order). 

Note: If the file's lines start with numbers and you want to sort numerically, use the -n option. The -r option reverses the order.

Dot files

Files whose names begin with a dot (.) are called dot files; configuration files are commonly stored this way. By default, ls and shell globs do not match them.

Note: You can run into problems with globs because .* matches . and .. (the current and parent directories). You may wish to use a pattern such as .[^.]* or .??* to get all dot files except the current and parent directories.

Environment and shell variables

The shell can store temporary variables, called shell variables, that hold the values of text strings. Shell variables are very useful for keeping track of values in scripts, and some shell variables control the way the shell behaves. To assign a value to a shell variable, use the equal sign (=). Here's a simple example:

 TEST=/usr/bin/sh

To access this variable, use the $ sign:

 echo $TEST

An environment variable is like a shell variable, but it's not specific to the shell. All processes on Unix systems have environment variable storage. The main difference between environment variables (env variables) and shell variables is that the OS passes all of your shell's env variables to programs that the shell runs, whereas shell variables cannot be accessed by the commands that you run.

Assign an environment variable with the shell's export command. For ex.: if you'd like to make the TEST shell variable into an environment variable, use the following:

 export TEST
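
A quick way to see the difference (a minimal sketch; the sh -c child process stands in for any program the shell runs):

 $ TEST=/usr/bin/sh
 $ sh -c 'echo $TEST'   # prints an empty line: a plain shell variable is not inherited
 $ export TEST
 $ sh -c 'echo $TEST'   # prints /usr/bin/sh: the environment variable is passed along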

Environment variables are useful because many programs read them for configuration and options. For ex.: you can put your favorite less options in the LESS env variable, and less will use those options when you run it.
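
For ex.: a minimal sketch (the options shown are arbitrary; see man less for what they do):

 $ export LESS='-M -R'   # less reads the LESS variable at startup
 $ less /etc/profile     # runs as if you had typed less -M -R /etc/profile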

PATH is a special env variable that contains the command path (or path for short). A command path is a list of system directories that the shell searches when trying to locate a command. For ex.: with the following command, you can add a directory dir to the beginning of the path so that the shell looks in dir before looking in any of the other PATH directories:

 PATH=dir:$PATH

Or you can append a dir name to the end of the PATH variable, causing the shell to look in dir last:

 PATH=$PATH:dir

Special characters

 *  asterisk, star   Regular expression, glob character.
 .  dot             Current directory, file/hostname delimiter. 
 !  bang            Negation, command history. 
 |  pipe            Command pipe. 
 /  forward slash   Directory delimiter, search command. 
 \  backslash       Literals, macros (never directories).
 $  dollar          Variable denotation, end of line. 
 '  tick            Literal strings. 
 `  backtick        Command substitution. 
 "  double quote    Semi-literal strings. 
 ^  caret           Negation, beginning of line. 
 ~  tilde           Negation, directory shortcut. 
 #  hash            Comments, preprocessor, substitutions. 
 [] square          Ranges. 
 {} brace           Statement blocks, ranges.
 _  underscore      Cheap substitute for space.  

Note: You will often see control characters marked with a caret; for ex.: ^C for ctrl-C.

Command line editing

 ctrl-B  Move the cursor left.
 ctrl-F  Move the cursor right.
 ctrl-P  View the previous command.
 ctrl-N  View the next command.
 ctrl-A  Move the cursor to the beginning of the line.
 ctrl-E  Move the cursor to the end of the line.
 ctrl-W  Erase the preceding word. 
 ctrl-U  Erase from the cursor to the beginning of the line.
 ctrl-K  Erase from the cursor to the end of the line.
 ctrl-Y  Paste erased text (for ex.: text erased with ctrl-U).
 ctrl-R  Search your command history. 

Shell input and output

To send the output of a command to a file instead of the terminal, use the > redirection character:

 echo 'hello world' > STDOUT.txt

The shell creates the file if it does not already exist. If the file exists, the shell erases (clobbers) the original file first. You can append the output to the file instead of overwriting it with the >> redirection syntax:

 echo 'hello world' >> STDOUT.txt

To send the STDOUT of a command to the STDIN of another command, use the pipe character |:

 $ head /proc/cpuinfo | tr a-z A-Z

Standard error

Occasionally, you may redirect STDOUT but find that the program still prints something to the terminal. This is called standard error (STDERR); it's an additional output stream for diagnostics and debugging. For ex.: this command produces an error:

 $ ls /fffff > STDOUT.txt

You can redirect STDERR if you like. For ex.: use the 2> syntax, like this:

 $ ls /fffff > STDOUT.txt 2> STDERR.txt

The number 2 specifies the stream ID that the shell modifies. Stream ID 1 is STDOUT (default) and 2 is STDERR. You can also send the STDERR to the same place as STDOUT with the >& notation. For ex.:

 $ ls /fffff > STDOUT.txt 2>&1

Standard input redirection

To channel a file to a program's STDIN, use the < operator. For ex.:

 # prints out lines in myfile containing the word foo:
 $ grep foo < myfile 

Listing and manipulating processes

Each process on the system has a numeric process ID (PID). For a quick listing of running processes, just run ps on the command line. The fields are as follows:

 $ ps
    PID TTY          TIME CMD
   5112 pts/0    00:00:00 bash
   5120 pts/0    00:00:00 ps
 #PID  - Process ID.
 #TTY  - Terminal device where the process is running.
 #STAT - Process status (for ex.: S = sleeping, R = running); shown with some options. 
 #TIME - The amount of CPU time in minutes and seconds that the process has used so far. 


Command options

The ps command has many options. To make things more confusing, you can specify options in three different styles: Unix, BSD, and GNU. We'll use the BSD style. Here are some of the most useful option combinations:

 ps x  - Show all of your running processes. 
 ps ax - Show all processes on the system. 
 ps u  - Include more detailed info on processes. 
 ps w  - Show full command names, not just what fits on one line. 

As with other programs, you can combine options, as in ps aux and ps auxw. To check on a specific process, add its PID to the argument list of the ps command. For ex.: to inspect the current shell process, you could use ps u $$, because $$ is a shell variable that evaluates to the current shell's PID.

Killing processes


To terminate a process, send it a signal with the kill command. A signal is a message to a process from the kernel. When you run kill, you are asking the kernel to send a signal to another process.

 $ kill PID

There are many types of signals. The default is TERM, or terminate. You can send a different signal by adding an extra option to kill. For ex.: to freeze a process instead of terminating it, use the STOP signal:

 $ kill -STOP PID

A stopped process is still in memory, ready to pick up where it left off. Use the -CONT signal to continue running the process again:

 $ kill -CONT PID 

Note: Using ctrl-C to terminate a process that is running in the current terminal is the same as using kill to end the process with the INT (interrupt) signal. The most brutal way to terminate a process is with the KILL signal. Other signals give the process a chance to clean up after itself, but KILL does not. The OS terminates the process and forcibly removes it from memory. Use this as a last resort.

Job Control

Shells also support job control, which is a way to send TSTP (similar to STOP) and CONT signals to programs by using various keystrokes and commands. For ex.: press ctrl-Z to suspend a process, then resume it by entering bg (background) or fg (foreground).

Hint: To see whether you've accidentally suspended any processes on your current terminal, run the jobs command.
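
A short session showing the round trip (sleep 100 is just a convenient long-running stand-in):

 $ sleep 100
 ^Z                      # ctrl-Z sends TSTP and suspends the job
 [1]+  Stopped    sleep 100
 $ jobs                  # list the jobs on this terminal
 [1]+  Stopped    sleep 100
 $ bg                    # resume the job in the background
 $ fg                    # bring it back to the foreground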

File modes and permissions

Every Unix file has a set of permissions that determine whether you can read, write, or run the file. Running ls -l displays the permissions. Here is an example of such a display:

 $ ls -l
 -rwxr-xr-x 4 user group        45 Oct  9 17:03  foo.txt

The file's mode represents the file's permissions and some extra information. There are four parts to the mode, as broken down below. The first character of the mode is the file type. A dash (-) in this position, as in the example, denotes a regular file, meaning that there's nothing special about the file. This is by far the most common kind of file. Directories are also common and are indicated by a d in the file type slot.

 -   Type of file. 
 rwx user permissions.
 r-x group permissions.
 r-x other permissions. 

The rest of a file's mode contains the permissions, which break down into three sets: user, group, and other, in that order:

 r means that the file is readable. 
 w means that the file is writable. 
 x means that the file is executable. 
 - means nothing. 

The user permissions pertain to the user who owns the file. The group permissions are for the file's group. Any user in that group can take advantage of these permissions. Everyone else on the system has access according to the third set, the other permissions, which are sometimes called world permissions.

Note: Each read, write and execute permission slot is sometimes called a permission bit. Therefore, you may hear people refer to parts of the permissions as "the read bits".

Some executable files have an s in the user permissions listing instead of an x. This indicates that the executable is setuid, meaning that when you execute the program, it runs as though the file owner is the user instead of you. Many programs use this setuid bit to run as root in order to get the privileges they need to change system files. One example is the passwd program, which needs to change the /etc/passwd file.

For ex.:

 $ locate passwd | less
 /bin/passwd
 $ ls -l /bin/passwd
 -rws--x--x 1 root root 52224 Sep  7 08:04 /bin/passwd

Modifying permissions

To change permissions, use the chmod command. First, pick the set of permissions that you want to change, and then pick the bit to change.

For ex.: to add group (g) and world (o) read (r) permissions to foo.txt, you could run these commands:

 $ chmod g+r foo.txt
 $ chmod o+r foo.txt
 #or in one shot:
 $ chmod go+r foo.txt

To remove these permissions, use go-r instead of go+r.

You may sometimes see people changing permissions with numbers, for ex.:

 $ chmod 644 foo.txt

This is called an absolute change because it sets all permission bits at once. To understand how it works, you need to know how to represent the permission bits in octal form.
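
Each digit stands for one permission set (user, group, other) and is the sum of r=4, w=2, and x=1. For ex. (script.sh is just an illustrative name):

 $ chmod 644 foo.txt    # 6 = 4+2 = rw- (user), 4 = r-- (group), 4 = r-- (other)
 $ chmod 755 script.sh  # 7 = 4+2+1 = rwx (user), 5 = 4+1 = r-x (group and other)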

Chmod-infographic-001.png

Directories also have permissions. You can list the contents of a directory if it's readable, but you can only access a file in a directory if the directory is executable. Finally, you can specify a set of default permissions with the umask shell command, which applies a predefined set of permissions to any new file you create.

 $ umask 022 # if you want everyone to be able to read the files and directories that you create. 
 $ umask 077 # if you don't want anyone else to be able to read them. 

The effective permissions are calculated by ANDing the default permissions with the bitwise complement (NOT) of the umask; in effect, each umask digit removes permissions from the corresponding set. The octal digits translate as follows:

 umask digit : Resulting permission
 0           : read, write and execute.
 1           : read and write.
 2           : read and execute.
 3           : read only.
 4           : write and execute.
 5           : write only.
 6           : execute only.
 7           : no permissions.

Now you can use the above table to calculate file permissions. For example, if the umask is set to 077, the permissions can be calculated as follows:

 umask digit   Applies to   Resulting permission
 0             Owner        read, write and execute.
 7             Group        no permissions.
 7             Others       no permissions.

The umask command can be used for setting different security levels as follows:

 umask value   Security level   Effective permission (directory)
 022           Permissive       755
 026           Moderate         751
 027           Moderate         750
 077           Severe           700
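
You can verify the effect yourself; a minimal sketch (newfile and newdir are throwaway names):

 $ umask 022
 $ touch newfile && mkdir newdir
 $ ls -ld newfile newdir
 -rw-r--r-- ... newfile   # files start from 666, so 666 AND NOT 022 = 644
 drwxr-xr-x ... newdir    # directories start from 777, so 777 AND NOT 022 = 755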

For more information about umask, read the man page of the bash, ksh, or tcsh shell:

 $ man bash
 $ help umask

Symbolic links

A symbolic link is a file that points to another file or directory, effectively creating an alias (like a shortcut in Windows). Symbolic links offer quick access to obscure directory paths. In a long directory listing, symbolic links look like this (notice the l as the file type in the file mode):

   lrwxrwxrwx  1 root root   20 Oct 27 12:28 linux -> linux-4.12.12-gentoo

If you try to access linux in this directory, the system gives you linux-4.12.12-gentoo instead. Symbolic links are simply names that point to other names. Their names and the paths to which they point don't have to mean anything; linux-4.12.12-gentoo does not even need to exist. In fact, if linux-4.12.12-gentoo does not exist, any program that accesses linux reports that linux does not exist (except for ls linux, a command that stupidly informs you that linux is linux). This can be baffling because you can see something named linux right in front of your eyes. This is not the only way that symbolic links can be confusing. Another problem is that you can't identify the characteristics of a link target just by looking at the name of the link; you must follow the link to see if it goes to a file or a directory. Your system may also have links that point to other links, which are called chained symbolic links.

Creating symbolic links

To create a symbolic link named linkname that points to target, use ln -s target linkname:

 $ ln -s /etc/ssh/sshd_config sshdconfig

The linkname argument is the name of the symbolic link (sshdconfig in the example), and the target argument is the path of the file or directory that the link points to (/etc/ssh/sshd_config in the example); the -s option specifies a symbolic link. When making a symbolic link, check the command twice before you run it because several things can go wrong. If something goes wrong when you create a symbolic link to a directory, check that directory for errant symbolic links and remove them. Symbolic links can also cause headaches when you don't know that they exist. For ex.: you can easily edit what you think is a copy of a file but is actually a symbolic link to the original.

Warning: Don't forget the -s option when creating a symbolic link. Without it, ln creates a hard link, giving an additional real filename to a single file. The new filename has the status of the old one; it points (links) directly to the file data instead of to another filename, as a symbolic link does. Hard links can be even more confusing than symbolic links.

Fig15-07.jpg

With all of these warnings regarding symbolic links, why would anyone bother to use them? Because they offer a convenient way to organize and share files, as well as patch up small problems.

Archiving and compressing files

gzip

The program gzip (GNU Zip) is one of the current standard Unix compression programs. A file that ends with .gz is a GNU Zip archive. Use gunzip file.gz to uncompress file.gz and remove the suffix; to compress the file again, use gzip file. For ex.: to compress and then uncompress the files located in /home/user/Downloads:

 $ gzip -r /home/user/Downloads/ # the -r option is for recursive mode (see man gzip). 
 $ ls -a Downloads/ | grep gz
 firmware-iwlwifi_0.43_all.deb.gz
 François Pirette - Cabinet du Ministre Daerden.mp4.gz
 gparted-live-0.29.0-1-amd64.iso.gz
 Pirette en ministre flamand.mp4.gz
 profanity-master.zip.gz
 $ gunzip -r /home/user/Downloads/

tar

Unlike the ZIP programs for other operating systems, gzip does not create archives of files. To create an archive, use tar instead:

 $ tar cvf archive.tar file1 file2 ...

Archives created by tar usually have a .tar suffix. For ex.: in the command above, file1, file2, and so on are the names of the files and directories that you wish to archive in archive.tar.

 -c creates an archive. 
 -v verbose output (want more? Use -vv).
 -f file option. You must use this option followed by the archive's filename at all times, except with tape drives. 

Unpacking tar files:

To unpack a .tar file with tar, use the x option:

 $ tar xvf archive.tar # x option puts tar into extract mode. 
  • You can extract parts of the archive by entering the names of the parts at the end of the command line, but you must know their exact names (see the example below).

For ex.: create a tar file, then extract only one of the files in the archive, as follows:

 # create tar file.
 $  tar cvfp test.tar Downloads/
 Downloads/
 Downloads/torrent/
 Downloads/profanity-master/
 Downloads/profanity-master/apidocs/
 Downloads/profanity-master/apidocs/c/
 Downloads/profanity-master/apidocs/c/c-prof.conf
 Downloads/lib/
 Downloads/lib/firmware/
 Downloads/lib/firmware/iwlwifi-1000-5.ucode
 Downloads/lib/firmware/iwlwifi-6050-5.ucode
 Downloads/lib/firmware/intel/
 Downloads/lib/firmware/intel/ibt-hw-37.7.10-fw-1.80.2.3.d.bseq
 Downloads/lib/firmware/intel/ibt-hw-37.7.bseq
 Downloads/usr/
 Downloads/usr/share/
 Downloads/usr/share/doc/
 Downloads/usr/share/doc/firmware-iwlwifi/
 Downloads/usr/share/doc/firmware-iwlwifi/changelog
 Downloads/usr/share/doc/firmware-iwlwifi/copyright
 Downloads/usr/share/bug/
 Downloads/usr/share/bug/firmware-iwlwifi/
 Downloads/usr/share/bug/firmware-iwlwifi/presubj
 Downloads/gparted-live-0.29.0-1-amd64.iso
 Downloads/François Pirette - Cabinet du Ministre Daerden.mp4
 Downloads/Pirette en ministre flamand.mp4
 Downloads/profanity-master.zip
 Downloads/firmware-iwlwifi_0.43_all.deb
 # unpack to /home/user/Downloads the file profanity-master.zip
 $  tar xvfp test.tar Downloads/profanity-master.zip 
 Downloads/profanity-master.zip

Note: When using extract mode, remember that tar does not remove the archived .tar file after extracting its contents.

With tar alone, without using another command, you can also compress with bzip2, xz, and gzip via the following three options: j, J, and z:

 -j, --bzip2 - Filter the archive through bzip2(1).
 -J, --xz    - Filter the archive through xz(1).
 -z, --gzip, - Filter the archive through gzip(1).

For ex.:

 $ tar cvfj test.tar.bz2 Downloads/
 or
 $ tar cvfJ test.tar.xz Downloads/
 or
 $ tar cvfz test.tar.gz Downloads/  

Linux directory hierarchy essentials

The details of the Linux directory structure are outlined in the Filesystem Hierarchy Standard (FHS), but a brief walkthrough should suffice for now.

Flowchart-LinuxDirectoryStructure.png

Here are the most important sub-directories in root:

 /bin  Contains ready-to-run programs (also known as executables), including most of the basic commands such as ls and cp. Most of the programs in /bin are in binary format, having been created by              
       a C compiler, but some are shell scripts on modern systems. 
 /dev  Contains device files. 
 /etc  This core system configuration directory (pronounced EHT-see) contains the user password, boot, device, networking, and other setup files. Many items in /etc are specific to the machine's hardware. 
       For ex.: the /etc/X11 directory contains graphics card and window system configurations. 
 /home Holds personal directories for regular users. Most Unix installations conform to this standard. 
 /lib  An abbreviation for library, this directory holds library files containing code that executables can use. There are two types of libraries: static and shared. The /lib directory should contain only 
       shared libraries, but other lib directories, such as /usr/lib, contain both varieties as well as other auxiliary files.   
 /proc Provides system statistics through a browsable directory-and-file interface. Much of the /proc sub-directory structure on Linux is unique, but many other Unix variants have similar features. 
       The /proc directory contains information about currently running processes as well as some kernel parameters. 
 /sys  This directory is similar to /proc in that it provides a device and system interface. 
 /sbin The place for system executables. Programs in /sbin directories relate to system management, so regular users usually do not have /sbin components in their command path. Many of the utilities found 
       here will not work if you're not running them as root (for ex.: the mkfs.ext4 program).
 /tmp  A storage area for smaller, temporary files that you don't care much about. Any user may read from and write to /tmp, but a user may not have permission to access another user's files there. 
 /usr  Although pronounced "user", this sub-directory has no user files. Instead, it contains a large directory hierarchy, including the bulk of the Linux system. Many of the directory names in /usr 
       are the same as those in the root directory (like /usr/bin and /usr/lib), and they hold the same types of files.
 /var  The variable sub-directory, where programs record run-time information. System logging, user tracking, caches, and other files that system programs create and manage are here.   

Other root sub-directories

There are a few other interesting sub-directories in the root directory:

 /boot  Contains kernel boot loader files. These files pertain only to the very first stage of the Linux startup procedure; you won't find information about how Linux starts up its 
        services in this directory. 
 /media A base attachment point for removable media such as flash drives.
 /opt   This may contain additional third-party software. Many systems don't use /opt.

The /usr directory

The /usr directory may look relatively clean at first glance, but a quick look at /usr/bin and /usr/lib reveals that there's a lot here; /usr is where most of the user-space programs and data reside. In addition to /usr/bin, /usr/sbin, and /usr/lib, /usr contains the following:

 /include Holds header files used by the C compiler. 
 /info    Contains GNU info manuals. 
 /local   Is where administrators can install their own software. Its structure should look like that of / and /usr.
 /man     Contains manual pages. 
 /share   Contains files that should work on other kinds of Unix machines with no loss of functionality. 

Kernel location

On Linux systems, the kernel is normally in /vmlinuz or /boot/vmlinuz. A boot loader loads this file into memory and sets it in motion when a system boots. Once the boot loader runs and sets the kernel in motion, the main kernel file is no longer used by the running system. However, you'll find many modules that the kernel can load and unload on demand during the course of normal system operation. Called loadable kernel modules, they are located under /lib/modules.
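
A few quick commands for poking at this (a minimal sketch; the paths assume a conventional distribution layout):

 $ uname -r                      # version of the running kernel
 $ ls /lib/modules/$(uname -r)   # the modules that belong to that kernel
 $ lsmod | head                  # the modules currently loaded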

Running commands as the superuser

Before going any further, you should learn how to run commands as the superuser. You probably know that you can run the su command and enter the root password to start a root shell. This practice works, but it has certain disadvantages:

 * You have no record of system-altering commands. 
 * You have no record of the users who performed system-altering commands. 
 * You don't have access to your normal shell environment. 
 * You have to enter the root password. 

sudo

Most larger distributions use a package called sudo to allow administrators to run commands as root when they are logged in as themselves. For ex.: to edit /etc/passwd with the vipw command:

 $ sudo vipw

When you run this command, sudo logs the action with the syslog service under the local2 facility.

/etc/sudoers

Of course, the system doesn't let just any user run commands as the superuser; you must configure the privileged users in your /etc/sudoers file. The sudo package has many options, which makes the syntax in /etc/sudoers somewhat complicated. For ex.: this file gives user1 and user2 the power to run any command as root without having to enter a password:

 $ sudo visudo
 User_Alias ADMINS = user1, user2
 ADMINS ALL = NOPASSWD: ALL
 root ALL=(ALL) ALL

The first line defines an ADMINS user alias with the two users, and the second line grants the privileges. The ALL = NOPASSWD: ALL part means that the users in the ADMINS alias can use sudo to execute commands as root. The second ALL means "any command." The first ALL means "any host." (If you have more than one machine, you can set different kinds of access for each machine or group of machines.) The root ALL=(ALL) ALL line simply means that the superuser may also use sudo to run any command on any host. The extra (ALL) means that the superuser may also run commands as any other user. You can extend this privilege to the ADMINS users by adding (ALL) to the /etc/sudoers line, as shown here:

 ADMINS ALL = (ALL) NOPASSWD: ALL
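
To check what sudo will let you do on the current host, you can ask it directly (this assumes a standard sudo installation):

 $ sudo -l   # list the commands the invoking user is allowed (and forbidden) to run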

Note: Use the visudo command to edit /etc/sudoers. This command checks for file syntax errors after you save the file.

If you need more information about sudo, see the man pages:

 $ man sudo
 $ man sudoers

Devices

Device files

It is easy to manipulate most devices on a Unix system because the kernel presents many of the device I/O interfaces to user processes as files. These device files are sometimes called device nodes. Not only can a programmer use regular file operations to work with a device, but some devices are also accessible to standard programs like cat, so you don't have to be a programmer to use a device. However, there is a limit to what you can do with a file interface, so not all devices or device capabilities are accessible with standard file I/O. Linux uses the same design for device files as do other Unix flavors. Device files are in the /dev directory, and running ls /dev reveals more than a few files in /dev. So how do you work with devices?

To get started, consider this command:

 $ echo 'hello world' > /dev/null

As does any command with redirected output, this sends hello world from STDOUT to a file. However, the file is /dev/null, a device, and the kernel decides what to do with any data written to this device. In the case of /dev/null, the kernel simply ignores the input and throws away the data. To identify a device and view its permissions, use:

 $ ls -l /dev | grep sda
 brw-rw---- 1 root  disk    8,   0 Nov  1 11:58 sda
 brw-rw---- 1 root  disk    8,   1 Nov  1 11:58 sda1
 brw-rw---- 1 root  disk    8,   2 Nov  1 11:58 sda2
 brw-rw---- 1 root  disk    8,   3 Nov  1 11:58 sda3

Note the first character of each line (the first character of the file's mode). If this character is b, c, p, or s, the file is a device. These letters stand for block, character, pipe, and socket, respectively, as described in more detail below:

 Block device:
   Programs access data from a block device in fixed chunks. The sda1 in the preceding example is a disk device, a type of block device. Disks can be easily split up into blocks of data. Because a block device's total size 
   is fixed and easy to index, processes have random access to any block in the device with the help of the kernel. 
 Character device:
   Character devices work with data streams. You can only read characters from or write characters to character devices, as previously demonstrated with /dev/null. Character devices don't have a size; when you read from or write 
   to one, the kernel usually performs a read or write operation on the device. Printers directly attached to your computer are represented by character devices. It's important to note that during character device interaction, the 
   kernel cannot back up and reexamine the data stream after it has passed data to a device or process. 
 Pipe device:
   Named pipes are like character devices, with another process at the other end of the I/O stream instead of a kernel driver. 
 Socket device:
   Sockets are special-purpose interfaces that are frequently used for inter-process communication. They're often found outside of the /dev directory. Socket files represent Unix domain sockets. 

The numbers before the dates in the first four lines are the major and minor device numbers that help the kernel identify the device. Similar devices usually have the same major number, such as sda1 and sdb2 (both of which are hard disk partitions).

Note: Not all devices have device files because the block and character device I/O interfaces are not appropriate in all cases. For ex.: network interfaces don't have device files. It is theoretically possible to interact with a network interface using a single character device, but because it would be exceptionally difficult, the kernel uses other I/O interfaces.

The sysfs device path

The traditional Unix /dev directory is a convenient way for user processes to reference and interface with devices supported by the kernel, but it's also a very simplistic scheme: the name of a device file tells you little about the actual hardware behind it. Another problem is that the kernel assigns devices in the order in which they are found, so a device may have a different name between reboots.

To provide a uniform view for attached devices based on their actual hardware attributes, the Linux kernel offers the sysfs interface through a system of files and directories. The base path for devices is /sys/devices. For ex.: the SATA hard disk at /dev/sda might have the following path in sysfs:

 /sys/devices/pci0000:00/0000:00:17.0/ata3/host2/target2:0:0/2:0:0:0/block/sda

As you can see, this path is quite long compared with the /dev/sda filename; it's also a directory. But you can't really compare the two paths because they have different purposes. The /dev file is there so that user processes can use the device, whereas the /sys/devices path is used to view information about and manage the device. If you list the contents of a device path such as the preceding one, you'll see something like the following:

 $ cd /sys/devices/pci0000:00/0000:00:17.0/ata3/host2/target2:0:0/2:0:0:0/block/sda/
 $ ls -l
 alignment_offset  discard_alignment  holders    range      sda3       trace
 bdi               events             inflight   removable  size       uevent
 capability        events_async       integrity  ro         slaves
 dev               events_poll_msecs  power      sda1       stat
 device            ext_range          queue      sda2       subsystem

The files and sub-directories here are meant to be read primarily by programs rather than humans, but you can get an idea of what they contain and represent by looking at an example such as the dev file. Running cat dev in this directory displays the numbers 8:0, which happen to be the major and minor device numbers of /dev/sda. There are a few shortcuts in the /sys directory. For ex.: /sys/block should contain all of the block devices available on a system. However, those are just symbolic links; run ls -l /sys/block to reveal the true sysfs paths. It can be difficult to find the sysfs location of a device in /dev. Use the udevadm command to show the path and other attributes:

 $ udevadm info --query=all --name=/dev/sda | less

Note: The udevadm program is in /sbin; you can put this directory at the end of your path if it's not already there.

dd and devices

The program dd is extremely useful when working with block and character devices. This program's sole function is to read from an input file or stream and write to an output file or stream, possibly doing some encoding conversion on the way. dd copies data in blocks of a fixed size. Here's how to use dd with a character device and some common options:

 # dd if=/dev/zero of=/new_file bs=1024 count=1

As you can see, the dd option format differs from the option formats of most other Unix commands; it's based on an old IBM Job Control Language (JCL) style. Rather than use the dash (-) character to signal an option, you name an option and set its value to something with the equals (=) sign. The preceding example copies a single 1024-byte block from /dev/zero (a continuous stream of zero bytes) to new_file.

These are the important dd options:

 if=file              The input file. The default is STDIN. 
 of=file              The output file. The default is STDOUT. 
 bs=size              The block size. dd reads and writes this many bytes of data at a time. To abbreviate large chunks of data, you can use b and k to signify 512 and 1024 bytes, respectively. 
                      Therefore, the example above could read bs=1k instead of bs=1024.
 ibs=size, obs=size   The input and output block sizes. If you can use the same block size for both input and output, use the bs option; if not, use ibs and obs for input and output, respectively. 
 count=num            The total number of blocks to copy. When working with a huge file, or with a device that supplies an endless stream of data (such as /dev/zero), you want dd to stop at a fixed 
                      point or you could waste a lot of disk space, CPU time, or both. Use count with the skip parameter to copy a small piece from a large file or device. 

Warning: dd is very powerful, so make sure you know what you're doing when you run it. It's very easy to corrupt files and data on devices by making a careless mistake. It often helps to write the output to a new file if you're not sure what it will do.

Device Name Summary

It can sometimes be difficult to find the name of a device (for ex.: when partitioning a disk). Here are a few ways to find out what it is:

 * Query udevd using udevadm.
 * Look for the device in the /sys directory. 
 * Guess the name from the output of the dmesg command.
 * For a disk device that is already visible to the system, you can check the output of the mount command.
 * Run cat /proc/devices to see the block and character devices for which your system currently has drivers. Each line consists of a number and a name. The number is the major number of the device. Look in 
   /dev for the character or block devices with the corresponding major number, and you've found the device files.    

Among these methods, only the first is reliable, but it does require udev. If you get into a situation where udev is not available, try the other methods, but keep in mind that the kernel might not have a device file for your hardware. The following sections list the most common Linux devices and their naming conventions.

Hard Disks:/dev/sd*

Most hard disks attached to current Linux systems correspond to device names with an sd prefix, such as /dev/sda, /dev/sdb, and so on. These devices represent entire disks; the kernel makes separate device files, such as /dev/sda1 and /dev/sda2, for the partitions on a disk. The naming convention requires a little explanation. The sd portion of the name stands for SCSI disk. Small Computer System Interface (SCSI) was originally developed as a hardware and protocol standard for communication between devices such as disks and other peripherals. Although traditional SCSI hardware isn't used in most modern machines, the SCSI protocol is everywhere due to its adaptability. For ex.: USB storage devices use it to communicate. The story on SATA disks is a little more complicated, but the Linux kernel still uses SCSI commands at a certain point when talking to them. To list the SCSI devices on your system, use a utility that walks the device paths provided by sysfs. One of the most succinct tools is lsscsi. Here is what you can expect when you run it:

 $ lsscsi
 [2:0:0:0]    disk    ATA      SanDisk SDSSDA24 80RL  /dev/sda 
 [3:0:0:0]    disk    ATA      ST1000DM003-1SB1 CC43  /dev/sdb 

The first column identifies the address of the device on the system, the second describes what kind of device it is, and the last indicates where to find the device file. Everything else is vendor information. Linux assigns devices to device files in the order in which its drivers encounter them. So in the previous example, the kernel found the SanDisk disk (/dev/sda) first and the ST1000DM003 disk (/dev/sdb) second. Unfortunately, this device assignment scheme has traditionally caused problems when re-configuring hardware. Say, for ex.: that you have a system with three disks: /dev/sda, /dev/sdb, and /dev/sdc. If /dev/sdb explodes and you must remove the disk so that the machine can work again, the former /dev/sdc moves to /dev/sdb, and there is no longer a /dev/sdc. If you were referring to the device names directly in the fstab file, you'd have to make some changes to that file in order to get things (mostly) back to normal. To solve this problem, most modern Linux systems use the Universally Unique Identifier (UUID) for persistent disk device access.

To find the UUID of your hard disk, type:

 $ ls -l /dev/disk/by-uuid
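
For ex.: an /etc/fstab entry can refer to a partition by UUID instead of by device name, so it keeps working even if the disk order changes (the UUID below is a made-up placeholder):

 UUID=0b3aa301-3f05-4b0a-9f3c-3c2f0c7b2a11  /  ext4  defaults  0  1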

This discussion has barely scratched the surface of how to use disks and other storage devices on Linux systems.

CD and DVD drives:/dev/sr*

Linux recognizes most optical storage drives as the SCSI devices /dev/sr0, /dev/sr1, and so on. However, if the drive uses an older interface, it might show up as a PATA device, as discussed below. The /dev/sr* devices are read only, and they are used only for reading from discs. For the write and rewrite capabilities of optical devices, you'll use the "generic" SCSI devices such as /dev/sg0.

PATA Hard Disks (or Parallel ATA):/dev/hd*

The Linux block devices /dev/hda, /dev/hdb, /dev/hdc, and so on, are common on older versions of the Linux kernel and with older hardware. These are fixed assignments based on the master and slave devices on interfaces 0 and 1. At times, you might find a SATA drive recognized as one of these disks. This means that the SATA drive is running in a compatibility mode, which hinders performance. Check your BIOS settings to see if you can switch the SATA controller to its native mode.

Terminals:/dev/tty*, /dev/pts/*, and /dev/tty

Terminals are devices for moving characters between a user process and an I/O device, usually for text output to a terminal screen. The terminal device interface goes back a long way, to the days when terminals were typewriter-based devices. Pseudo-terminal devices are emulated terminals that understand the I/O features of real terminals. But rather than talk to a real piece of hardware, the kernel presents the I/O interface to a piece of software, such as the shell terminal window that you probably type most of your commands into. Two common terminal devices are /dev/tty1 (the first virtual console) and /dev/pts/0 (the first pseudo-terminal device). The /dev/pts directory itself is a dedicated filesystem. The /dev/tty device is the controlling terminal of the current process. If a program is currently reading from and writing to a terminal, this device is a synonym for that terminal. A process does not need to be attached to a terminal.

Display modes and Virtual consoles

Linux has two primary display modes: text mode and an X Window System server (graphics mode, usually via a display manager). Although Linux systems traditionally booted in text mode, most distributions now use kernel parameters and interim graphical display mechanisms (boot splashes such as plymouth) to completely hide text mode as the system is booting. In such cases, the system switches over to full graphical mode near the end of the boot process.

Linux supports virtual consoles to multiplex the display. Each virtual console may run in graphics or text mode. When in text mode, you can switch between consoles with an ALT-Function key combination; for ex.: ALT-F1 takes you to /dev/tty1, ALT-F2 to /dev/tty2, and so on. Many of these may be occupied by a getty process running a login prompt. A virtual console used by the X server in graphics mode is slightly different. Rather than getting a virtual console assignment from the init configuration, an X server takes over a free virtual console unless directed to use a specific one. For ex.: if you have getty processes running on tty1 and tty2, a new X server takes over tty3. In addition, after the X server puts a virtual console into graphics mode, you must normally press a CTRL-ALT-Function key combination to switch to another virtual console instead of the simpler ALT-Function key combination. The upshot of all of this is that if you want to see your text console after your system boots, press CTRL-ALT-F1. To return to the X11 session, press ALT-F2, ALT-F3, and so on, until you get to the X session. If you run into trouble switching consoles due to a malfunctioning input mechanism or some other circumstance, you can try to force the system to change consoles with the chvt command. For ex.: to switch to tty1, run the following as root:

 $ sudo chvt 1

Serial ports:/dev/ttyS*

Older RS-232 type and similar serial ports are special terminal devices. You can't do much on the command line with serial port devices because there are too many settings to worry about, such as baud rate and flow control. The port known as COM1 on Windows is /dev/ttyS0; COM2 is /dev/ttyS1; and so on. Plug-in USB serial adapters show up as USB and ACM devices with names such as /dev/ttyUSB0, /dev/ttyACM0, and so on.

Parallel ports:/dev/lp0 and /dev/lp1

Representing an interface type that has largely been replaced by USB, the unidirectional parallel port devices /dev/lp0 and /dev/lp1 correspond to LPT1: and LPT2: in Windows. You can send files (such as a file to be printed) directly to a parallel port with the cat command, but you might need to give the printer an extra form feed or reset afterward. A print server such as CUPS is much better at handling interaction with a printer. The bidirectional parallel ports are /dev/parport0 and /dev/parport1.

Audio devices:/dev/snd/*, /dev/dsp, /dev/audio, and more

Linux has two sets of audio devices. There are separate devices for the Advanced Linux Sound Architecture (ALSA) system interface and the older Open Sound System (OSS). The ALSA devices are in the /dev/snd directory, but it's difficult to work with them directly. Linux systems that use ALSA support OSS backward-compatible devices if the OSS kernel support is currently loaded. Some rudimentary operations are possible with the OSS dsp and audio devices. For ex.: the computer plays any WAV file that you send to /dev/dsp. However, the hardware may not do what you expect due to frequency mismatches. Furthermore, on most systems, the device is often busy as soon as you log in.

Note: Linux sound is a messy subject due to the many layers involved. We've just talked about the kernel-level devices, but typically there are user-space servers such as pulseaudio that manage audio from different sources and act as intermediaries between the sound devices and other user-space processes.

Creating Device Files

In modern Linux systems, you do not create your own device files; this is done by devtmpfs and udev. However, it is instructive to see how it was once done, and on a rare occasion you might need to create a named pipe. The mknod command creates one device file. You must know the device name as well as its major and minor numbers. For ex.: creating /dev/sda1 is a matter of using the following command:

 # mknod /dev/sda1 b 8 1

The b 8 1 specifies a block device with a major number of 8 and a minor number of 1 (matching the listing of /dev/sda1 shown earlier). For character or named pipe devices, use c or p instead of b (omit the major and minor numbers for named pipes). As mentioned earlier, the mknod command is useful only for creating the occasional named pipe. At one time, it was also sometimes useful for creating missing devices in single-user mode during system recovery. In older versions of Unix and Linux, maintaining the /dev directory was a challenge. With every significant kernel upgrade or driver addition, the kernel could support more kinds of devices, meaning that there would be a new set of major and minor numbers to be assigned to device filenames. Maintaining this was difficult, so each system had a MAKEDEV program in /dev to create groups of devices. When you upgraded your system, you would try to find an update to MAKEDEV and then run it in order to create new devices. This static system became ungainly, so a replacement was in order. The first attempt to fix it was devfs, a kernel-space implementation of /dev that contained all of the devices that the current kernel supported. However, there were a number of limitations, which led to the development of udev and devtmpfs.

Udev

We've already talked about how unnecessary complexity in the kernel is dangerous because you can easily introduce system instability. Device file management is an example: you can create device files in user space, so why would you do this in the kernel? The Linux kernel can send notifications to a user-space process (called udevd) upon detecting a new device on the system (for ex.: when someone attaches a USB flash drive). The user-space process on the other end examines the new device's characteristics, creates a device file, and then performs any device initialization. That was the theory. Unfortunately, in practice, there is a problem with this approach: device files are necessary early in the boot procedure, so udevd must start early. To create device files, udevd could not depend on any devices that it was supposed to create, and it would need to perform its initial startup very quickly so that the rest of the system wouldn't get held up waiting for udevd to start.

devtmpfs

The devtmpfs filesystem was developed in response to the problem of device availability during boot. This filesystem is similar to the older devfs support, but it's simplified. The kernel creates device files as necessary, but it also notifies udevd that a new device is available. Upon receiving this signal, udevd does not create the device files, but it does perform device initialization and process notification. Additionally, it creates a number of symbolic links in /dev to further identify devices. You can find examples in the directory /dev/disk/by-id, where each attached disk has one or more entries. For ex.: consider this typical disk:

 $ ls -l /dev/disk/by-id/
 lrwxrwxrwx 1 root root  9 Nov  4 10:10 ata-SanDisk_SDSSDA240G_164260468212 -> ../../sda
 lrwxrwxrwx 1 root root 10 Nov  4 10:10 ata-SanDisk_SDSSDA240G_164260468212-part1 -> ../../sda1
 lrwxrwxrwx 1 root root 10 Nov  4 10:10 ata-SanDisk_SDSSDA240G_164260468212-part2 -> ../../sda2
 lrwxrwxrwx 1 root root 10 Nov  4 10:10 ata-SanDisk_SDSSDA240G_164260468212-part3 -> ../../sda3

udevd names the links by interface type, and then by manufacturer and model information, serial number, and partition (if applicable). But how does udevd know which symbolic links to create, and how does it create them? The next section describes how udevd does its work.

udevd Operation and Configuration

The udevd daemon operates as follows:

 1. The kernel sends udevd a notification event, called a uevent, through an internal network link. 
 2. udevd loads all of the attributes in the uevent. 
 3. udevd parses its rules, and it takes actions or sets more attributes based on those rules. 

An incoming uevent that udevd receives from the kernel might look like this:

 ACTION=change
 DEVNAME=sde
 DEVPATH=/devices/pci0000:00/0000:00:1a.0/usb1/1-1/1-1.2/1-1.2:1.0/host4/target4:0:0/4:0:0:3/block/sde
 DEVTYPE=disk
 DISK_MEDIA_CHANGE=1
 MAJOR=8
 MINOR=64
 SEQNUM=2752
 SUBSYSTEM=block
 UDEV_LOG=3   

You can see here that there is a change to a device. After receiving the uevent, udevd knows the sysfs device path and a number of other attributes associated with the properties, and it is now ready to start processing rules. The rules files are in the /lib/udev/rules.d and /etc/udev/rules.d directories. The rules in /lib are the defaults, and the rules in /etc are overrides. A full explanation of the rules would be tedious, and you can learn much more from the udev(7) manual page, but let's look at the symbolic links from the /dev/sda example. Those links were defined by rules in /lib/udev/rules.d/60-persistent-storage.rules. Inside, you'll see the following lines:

 # ATA
 KERNEL=="sd*[!0-9]|sr*", ENV{ID_SERIAL}!="?*", SUBSYSTEMS=="scsi", ATTRS{vendor}=="ATA", IMPORT{program}="ata_id --export $devnode"
 # ATAPI devices (SPC-3 or later)
 KERNEL=="sd*[!0-9]|sr*", ENV{ID_SERIAL}!="?*", SUBSYSTEMS=="scsi", ATTRS{type}=="5", ATTRS{scsi_level}=="[6-9]*", IMPORT{program}="ata_id --export $devnode"

These rules match ATA disks presented through the kernel's SCSI subsystem. You can see that there are a few rules to catch the different ways that the devices may be represented, but the idea is that udevd will try to match a device starting with sd or sr, but without a number (with the KERNEL=="sd*[!0-9]|sr*" expression), as well as a subsystem (SUBSYSTEMS=="scsi"), and finally, some other attributes. If all of those conditional expressions are true, udevd moves to the next expression:

 IMPORT{program}="ata_id --export $devnode"
 

This is not a conditional, but rather, a directive to import variables from the /lib/udev/ata_id command. If you have such a disk, try it yourself on the command line:

 $ sudo /lib/udev/ata_id --export /dev/sda
 ID_ATA=1
 ID_TYPE=disk
 ID_BUS=ata
 ID_MODEL=SanDisk_SDSSDA240G
 ID_MODEL_ENC=SanDisk\x20SDSSDA240G\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20
 ID_REVISION=Z32080RL
 ID_SERIAL=SanDisk_SDSSDA240G_164260468212
 ID_SERIAL_SHORT=164260468212
 ID_ATA_WRITE_CACHE=1
 ID_ATA_WRITE_CACHE_ENABLED=1
 ID_ATA_FEATURE_SET_HPA=1
 ID_ATA_FEATURE_SET_HPA_ENABLED=1
 ID_ATA_FEATURE_SET_PM=1
 ID_ATA_FEATURE_SET_PM_ENABLED=1
 ID_ATA_FEATURE_SET_SECURITY=1
 ID_ATA_FEATURE_SET_SECURITY_ENABLED=0
 ID_ATA_FEATURE_SET_SECURITY_ERASE_UNIT_MIN=2
 ID_ATA_FEATURE_SET_SECURITY_ENHANCED_ERASE_UNIT_MIN=2
 ID_ATA_FEATURE_SET_SECURITY_FROZEN=1
 ID_ATA_FEATURE_SET_SMART=1
 ID_ATA_FEATURE_SET_SMART_ENABLED=1
 ID_ATA_FEATURE_SET_APM=1
 ID_ATA_FEATURE_SET_APM_ENABLED=1
 ID_ATA_FEATURE_SET_APM_CURRENT_VALUE=128
 ID_ATA_DOWNLOAD_MICROCODE=1
 ID_ATA_SATA=1
 ID_ATA_SATA_SIGNAL_RATE_GEN2=1
 ID_ATA_SATA_SIGNAL_RATE_GEN1=1
 ID_ATA_ROTATION_RATE_RPM=0
 ID_WWN=0x5001b444a69bee5b
 ID_WWN_WITH_EXTENSION=0x5001b444a69bee5b

The import now sets the environment so that all of the variable names in this output are set to the values shown. For ex.: any rule that follows will now recognize ENV{ID_TYPE} as disk. Of particular note is ID_SERIAL. In each of the rules, this conditional appears second:

 ENV{ID_SERIAL}!="?*"

This means that the expression is true only if ID_SERIAL is not set. Therefore, if ID_SERIAL is set, the conditional is false, the entire current rule does not apply, and udevd moves to the next rule. So what's the point? The objective of these two rules (and many others around them in the file) is to find the serial number of the disk device. With ENV{ID_SERIAL} set, udevd can now evaluate this rule:

 KERNEL=="sd*|sr*|cciss*", ENV{DEVTYPE}=="disk", ENV{ID_SERIAL}=="?*", SYMLINK+="disk/by-id/$env{ID_BUS}-$env{ID_SERIAL}"

You can see this rule requires ENV{ID_SERIAL} to be set, and it has one directive:

 SYMLINK+="disk/by-id/$env{ID_BUS}-$env{ID_SERIAL}"

Upon encountering this directive, udevd adds a symbolic link for the incoming device. So now you know where the device symbolic links came from! You may be wondering how to tell a conditional expression from a directive. Conditionals are denoted by two equal signs (==) or a bang equal (!=), and directives by a single equal sign (=), a plus equal (+=), or a colon equal (:=).
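To make the distinction concrete, here's a hypothetical rule you might place in a file such as /etc/udev/rules.d/99-local.rules (the filename and the my_disk link name are made up; the serial number is the one reported by ata_id above). The two == expressions are conditionals; the += at the end is a directive:

 KERNEL=="sd?1", ENV{ID_SERIAL_SHORT}=="164260468212", SYMLINK+="my_disk"

If the rule matches, udevd adds /dev/my_disk as an extra symbolic link to the first partition of that particular disk, no matter which sd name the kernel assigned it.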

udevadm

The udevadm program is an administration tool for udevd. You can reload udevd rules and trigger events, but perhaps the most powerful features of udevadm are the ability to search for and explore system devices and the ability to monitor uevents as udevd receives them from the kernel. The only trick is that the command syntax can get a bit involved. Let's try examining a system device. To look at all of the udevd attributes used and generated in conjunction with the rules for a device such as /dev/sda, run the following command:

 $ udevadm info --query=all --name=/dev/sda

The output looks like this:

 P: /devices/pci0000:00/0000:00:17.0/ata3/host2/target2:0:0/2:0:0:0/block/sda
 N: sda
 S: disk/by-id/ata-SanDisk_SDSSDA240G_164260468212
 S: disk/by-id/wwn-0x5001b444a69bee5b
 S: disk/by-path/pci-0000:00:17.0-ata-3
 E: DEVLINKS=/dev/disk/by-id/ata-SanDisk_SDSSDA240G_164260468212 /dev/disk/by-id/wwn-0x5001b444a69bee5b /dev/disk/by-path/pci-0000:00:17.0-ata-3
 E: DEVNAME=/dev/sda
 E: DEVPATH=/devices/pci0000:00/0000:00:17.0/ata3/host2/target2:0:0/2:0:0:0/block/sda
 E: DEVTYPE=disk
 E: ID_ATA=1
 E: ID_ATA_DOWNLOAD_MICROCODE=1
 E: ID_ATA_FEATURE_SET_APM=1
 E: ID_ATA_FEATURE_SET_APM_CURRENT_VALUE=128
 E: ID_ATA_FEATURE_SET_APM_ENABLED=1
 E: ID_ATA_FEATURE_SET_HPA=1
 E: ID_ATA_FEATURE_SET_HPA_ENABLED=1
 E: ID_ATA_FEATURE_SET_PM=1
 E: ID_ATA_FEATURE_SET_PM_ENABLED=1
 E: ID_ATA_FEATURE_SET_SECURITY=1
 E: ID_ATA_FEATURE_SET_SECURITY_ENABLED=0
 E: ID_ATA_FEATURE_SET_SECURITY_ENHANCED_ERASE_UNIT_MIN=2
 E: ID_ATA_FEATURE_SET_SECURITY_ERASE_UNIT_MIN=2
 E: ID_ATA_FEATURE_SET_SECURITY_FROZEN=1
 E: ID_ATA_FEATURE_SET_SMART=1
 E: ID_ATA_FEATURE_SET_SMART_ENABLED=1
 E: ID_ATA_ROTATION_RATE_RPM=0
 E: ID_ATA_SATA=1
 E: ID_ATA_SATA_SIGNAL_RATE_GEN1=1
 E: ID_ATA_SATA_SIGNAL_RATE_GEN2=1
 E: ID_ATA_WRITE_CACHE=1
 E: ID_ATA_WRITE_CACHE_ENABLED=1
 E: ID_BUS=ata
 E: ID_MODEL=SanDisk_SDSSDA240G
 E: ID_MODEL_ENC=SanDisk\x20SDSSDA240G\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20
 E: ID_PART_TABLE_TYPE=dos
 E: ID_PART_TABLE_UUID=ee4f2d27
 E: ID_PATH=pci-0000:00:17.0-ata-3
 E: ID_PATH_TAG=pci-0000_00_17_0-ata-3
 E: ID_REVISION=Z32080RL
 E: ID_SERIAL=SanDisk_SDSSDA240G_164260468212
 E: ID_SERIAL_SHORT=164260468212
 E: ID_TYPE=disk
 E: ID_WWN=0x5001b444a69bee5b
 E: ID_WWN_WITH_EXTENSION=0x5001b444a69bee5b
 E: MAJOR=8
 E: MINOR=0
 E: SUBSYSTEM=block
 E: TAGS=:systemd:
 E: USEC_INITIALIZED=12578709

The prefix in each line indicates an attribute or other characteristic of the device. In this case:

 P: at the top is the sysfs device path.
 N: is the device node (that is, the name given to the /dev file).
 S: indicates a symbolic link to the device node that udevd placed in /dev according to its rules.
 E: is additional device information extracted in the udevd rules. 

Note: There was far more output in this example than was necessary to show here; try the command for yourself to get a feel for what it does.
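If you want just one kind of information, --query accepts other values besides all. For ex.: to print only the symbolic links that udevd created for the disk (the output below is what the example disk above would produce):

 $ udevadm info --query=symlink --name=/dev/sda
 disk/by-id/ata-SanDisk_SDSSDA240G_164260468212 disk/by-id/wwn-0x5001b444a69bee5b disk/by-path/pci-0000:00:17.0-ata-3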

Monitoring devices

To monitor uevents with udevadm, use the monitor command:

 $ udevadm monitor
 monitor will print the received events for:
 UDEV - the event which udev sends out after rule processing
 KERNEL - the kernel uevent

Output (for ex.: when you insert a flash media device) looks like this:

 KERNEL[6723.363078] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8 (usb)
 KERNEL[6723.363418] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0 (usb)
 UDEV  [6723.372499] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8 (usb)
 KERNEL[6723.387398] add      /module/usb_storage (module)
 KERNEL[6723.387728] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6 (scsi)
 KERNEL[6723.387739] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6/scsi_host/host6 (scsi_host)
 KERNEL[6723.387753] add      /bus/usb/drivers/usb-storage (drivers)
 UDEV  [6723.388147] add      /module/usb_storage (module)
 UDEV  [6723.388476] add      /bus/usb/drivers/usb-storage (drivers)
 KERNEL[6723.388637] add      /module/uas (module)
 KERNEL[6723.388657] add      /bus/usb/drivers/uas (drivers)
 UDEV  [6723.388891] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0 (usb)
 UDEV  [6723.388902] add      /module/uas (module)
 UDEV  [6723.388910] add      /bus/usb/drivers/uas (drivers)
 UDEV  [6723.389101] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6 (scsi)
 UDEV  [6723.389380] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6/scsi_host/host6 (scsi_host)
 KERNEL[6724.401391] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6/target6:0:0 (scsi)
 KERNEL[6724.401480] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6/target6:0:0/6:0:0:0 (scsi)
 KERNEL[6724.401587] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6/target6:0:0/6:0:0:0/scsi_disk/6:0:0:0 (scsi_disk)
 KERNEL[6724.401648] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6/target6:0:0/6:0:0:0/scsi_device/6:0:0:0 (scsi_device)
 KERNEL[6724.401851] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6/target6:0:0/6:0:0:0/scsi_generic/sg2 (scsi_generic)
 KERNEL[6724.401943] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6/target6:0:0/6:0:0:0/bsg/6:0:0:0 (bsg)
 KERNEL[6724.402790] add      /devices/virtual/bdi/8:32 (bdi)
 UDEV  [6724.404244] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6/target6:0:0 (scsi)
 UDEV  [6724.404731] add      /devices/virtual/bdi/8:32 (bdi)
 UDEV  [6724.406139] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6/target6:0:0/6:0:0:0 (scsi)
 UDEV  [6724.407840] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6/target6:0:0/6:0:0:0/scsi_disk/6:0:0:0 (scsi_disk)
 UDEV  [6724.409885] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6/target6:0:0/6:0:0:0/scsi_device/6:0:0:0 (scsi_device)
 UDEV  [6724.413293] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6/target6:0:0/6:0:0:0/bsg/6:0:0:0 (bsg)
 UDEV  [6724.417796] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6/target6:0:0/6:0:0:0/scsi_generic/sg2 (scsi_generic)
 KERNEL[6724.635852] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6/target6:0:0/6:0:0:0/block/sdc (block)
 KERNEL[6724.635951] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6/target6:0:0/6:0:0:0/block/sdc/sdc1 (block)
 KERNEL[6724.636022] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6/target6:0:0/6:0:0:0/block/sdc/sdc2 (block)
 UDEV  [6724.922275] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6/target6:0:0/6:0:0:0/block/sdc (block)
 UDEV  [6725.029344] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6/target6:0:0/6:0:0:0/block/sdc/sdc2 (block)
 UDEV  [6725.029685] add      /devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host6/target6:0:0/6:0:0:0/block/sdc/sdc1 (block)
 KERNEL[6725.313437] add      /module/isofs (module)
 KERNEL[6725.313525] add      /kernel/slab/isofs_inode_cache (slab)
 UDEV  [6725.315504] add      /module/isofs (module)
 UDEV  [6725.316272] add      /kernel/slab/isofs_inode_cache (slab)

There are two copies of each message in this output because the default behaviour is to print both the incoming message from the kernel (marked with KERNEL) and the message that udevd sends out to other programs when it's finished processing and filtering the event. To see only kernel events, add the --kernel option, and to see only outgoing events, use --udev. To see the whole incoming uevent, including the attributes, use the --property option.

You can also filter events by subsystem. For ex.: to see only kernel messages pertaining to changes in the SCSI subsystem, use this command:

 $ udevadm monitor --kernel --subsystem-match=scsi

For more on udevadm, see the udevadm(8) manual page.
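As mentioned earlier, udevadm can also reload the rules and replay events, which is handy after editing a rules file. A minimal sketch (these are standard udevadm subcommands, but use care on a live system):

 # udevadm control --reload-rules   # tell the running udevd to reread its rules files
 # udevadm trigger                  # ask the kernel to resend uevents for existing devices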

In-Depth: SCSI (Small Computer System Interface) and the Linux Kernel

In this section, we'll take a look at the SCSI support in the Linux kernel as a way to explore part of the Linux kernel architecture. You don't need to know any of this information in order to use disks, so if you're in a hurry to use one, move on to Chapter 4. In addition, the material here is more advanced and theoretical in nature than what you've seen so far.

Let's begin with a little background. The traditional SCSI hardware setup is a host adapter linked with a chain of devices over an SCSI bus, as you can see here:

Images.duckduckgo.png

The host adapter is attached to a computer. The host adapter and devices each have an SCSI ID, and there can be 8 or 16 IDs per bus, depending on the SCSI version. You might hear the term SCSI target used to refer to a device and its SCSI ID. The host adapter communicates with the devices through the SCSI command set in a peer-to-peer relationship; the devices send responses back to the host adapter. The computer is not directly attached to the device chain, so it must go through the host adapter in order to communicate with disks and other devices. Typically, the computer sends SCSI commands to the host adapter to relay to the devices, and the devices relay responses back through the host adapter.

Newer versions of SCSI, such as Serial Attached SCSI (SAS), offer exceptional performance, but you probably won't find true SCSI devices in most machines. You'll more often encounter USB storage devices that use SCSI commands. In addition, devices supporting ATAPI (such as CD/DVD-ROM drives) use a version of the SCSI command set. SATA disks also appear on your system as SCSI devices by means of a translation layer in libata. Some SATA controllers (especially high-performance RAID controllers) perform this translation in hardware. How does this all fit together? Consider the devices shown on the following system:

 $ lsscsi
 [2:0:0:0]    disk    ATA      SanDisk SDSSDA24 80RL  /dev/sda 
 [3:0:0:0]    disk    ATA      ST1000DM003-1SB1 CC43  /dev/sdb 
 [6:0:0:0]    disk    Kingston DataTraveler 2.0 1.00  /dev/sdc 
 [4:0:0:0]    disk    USB2.0   CardREader SD          /dev/sdd 
 [5:0:0:0]    disk    FLASH    Drive UT_USB20         /dev/sdf 

The numbers in brackets are, from left to right:

 SCSI host adapter number
 SCSI bus number
 Device SCSI ID
 LUN (logical unit number)

In this example, there are five attached adapters (scsi2, scsi3, scsi4, scsi5, and scsi6), each of which has a single bus (all with bus number 0) and just one device on each bus (all with target 0).

Here you can see the driver and interface hierarchy inside the kernel for this particular system configuration, from the individual device drivers up to the block drivers. It does not include the SCSI generic (sg) drivers.

20171104 164842.jpg

Although this is a large structure and may look overwhelming at first, the data flow in the figure is very linear. Let's begin dissecting it by looking at the SCSI subsystem and its three layers of drivers:

 * The top layer handles operations for a class of device. For ex.: the sd (SCSI disk) driver is at this layer; it knows how to translate requests from the kernel block device interface into disk-specific commands in the SCSI protocol, and vice versa.
 * The middle layer moderates and routes the SCSI messages between the top and bottom layers, and keeps track of all the SCSI buses and devices attached to the system. 
 * The bottom layer handles hardware-specific actions. The drivers here send outgoing SCSI protocol messages to specific host adapters or hardware, and they extract incoming messages from the hardware. The reason for this separation from the top layer is that although SCSI messages are uniform for a device class (such as the disk class), different kinds of host adapters have varying procedures for sending the same messages.

The top and bottom layers contain many different drivers, but it's important to remember that, for any given device file on your system, the kernel uses one top-layer driver and one lower-layer driver. For the disk at /dev/sda in our example, the kernel uses the sd top-layer driver and the ATA bridge lower-layer driver. There are times when you might use more than one upper-layer driver for one hardware device. For true hardware SCSI devices, such as disks attached to an SCSI host adapter or a hardware RAID controller, the lower-layer drivers talk directly to the hardware below. However, for most hardware that you find attached to the SCSI subsystem, it's a different story.

USB storage and SCSI

In order for the SCSI subsystem to talk to common USB storage hardware, the kernel needs more than just a lower-layer SCSI driver. The USB flash drive represented by /dev/sdf understands SCSI commands, but to actually communicate with the drive, the kernel needs to know how to talk through the USB system. In the abstract, USB is quite similar to SCSI: it has device classes, buses, and host controllers. Therefore, it should be no surprise that the Linux kernel includes a three-layer USB subsystem that closely resembles the SCSI subsystem, with device-class drivers at the top, a bus management core in the middle, and host controller drivers at the bottom. Much as the SCSI subsystem passes SCSI commands between its components, the USB subsystem passes USB messages between its components. There's even an lsusb command that is similar to lsscsi. The part we're really interested in here is the USB storage driver at the top. This driver acts as a translator: on one side, the driver speaks SCSI, and on the other, it speaks USB. Because the storage hardware includes SCSI commands inside its USB messages, the driver has a relatively easy job: it mostly repackages data.

With both the SCSI and USB subsystems in place, you have almost everything you need to talk to the flash drive. The final missing link is the lower-layer driver in the SCSI subsystem because the USB storage driver is a part of the USB subsystem, not the SCSI subsystem. (For organizational reasons, the two subsystems should not share a driver.) To get the subsystems to talk to one another, a simple, lower-layer SCSI bridge driver connects to the USB subsystem's storage driver.

SCSI and ATA

The SATA hard disk and the optical drive both use the SATA interface. To connect the SATA-specific drivers of the kernel to the SCSI subsystem, the kernel employs a bridge driver, as with the USB drives, but with a different mechanism and additional complications. The optical drive speaks ATAPI, a version of SCSI commands encoded in the ATA protocol. However, the hard disk does not use ATAPI and does not encode any SCSI commands! The Linux kernel uses part of a library called libata to reconcile SATA (and ATA) drives with the SCSI subsystem. For the ATAPI-speaking optical drives, this is a relatively simple task of packaging and extracting SCSI commands into and from the ATA protocol. But for the hard disk, the task is much more complicated because the library must do a full command translation.

The job of the optical drive is similar to typing an English book into a computer. You don't need to understand what the book is about in order to do this job, nor do you even need to understand English. But the task for the hard disk is more like reading a German book and typing it into the computer as an English translation. In this case, you need to understand both languages as well as the book's content. Despite this difficulty, libata performs this task and makes it possible to attach the SCSI subsystem to ATA/SATA interfaces and devices.

Generic SCSI devices

When a user-space process communicates with the SCSI subsystem, it normally does so through the block device layer and/or another kernel service that sits on top of an SCSI device class driver (like sd or sr). In other words, most user processes never need to know anything about SCSI devices or their commands. However, user processes can bypass device class drivers and give SCSI protocol commands directly to devices through their generic devices. For ex.: consider the lsscsi command, but this time add the -g option in order to show the generic devices:

 $ lsscsi -g
 [2:0:0:0]    disk    ATA      SanDisk SDSSDA24 80RL  /dev/sda   /dev/sg0 
 [3:0:0:0]    disk    ATA      ST1000DM003-1SB1 CC43  /dev/sdb   /dev/sg1 
 [6:0:0:0]    disk    Kingston DataTraveler 2.0 1.00  /dev/sdc   /dev/sg2
 [1:0:0:0]    cd/dvd  Slimtype DVD A DS8A5SH          /dev/sr0   /dev/sg3

In addition to the usual block device file, each entry lists an SCSI generic device file in the last column. For ex.: the generic device for the optical drive /dev/sr0 is /dev/sg3.

Why would you want to use an SCSI generic device? The answer has to do with the complexity of code in the kernel. As tasks get more complicated, it's better to leave them out of the kernel. Consider CD/DVD writing and reading. Not only is writing significantly more difficult than reading, but no critical system services depend on the action of writing. A user-space program might do the writing a little less efficiently than a kernel service, but that program will be far easier to build and maintain than a kernel service, and bugs will not threaten kernel space. Therefore, to write to an optical disc in Linux, you run a program that talks to a generic SCSI device, such as /dev/sg3. Due to the relative simplicity of reading compared to writing, however, you still read from the device using the specialized sr optical device driver in the kernel.

Multiple Access Methods for a Single Device

The two points of access (sr and sg) for an optical drive from user space are illustrated below for the Linux SCSI subsystem (any drivers below the SCSI lower layer have been omitted). Process A reads from the drive using the sr driver, and process B writes to the drive with the sg driver. However, processes such as these two would not normally run simultaneously to access the same device.

20171104 180222.jpg

Process A reads from the block device. But do user processes really read data this way? Normally, the answer is no, not directly. There are more layers on top of the block devices and even more points of access for hard disks, as you'll learn in the next chapter.

Disks And Filesystems

Linux disk shematic.png

Partitions are subdivisions of the whole disk. On Linux, they're denoted with a number after the whole block device, and therefore have device names such as /dev/sda1 and /dev/sdb3. The kernel presents each partition as a block device, just as it would an entire disk. Partitions are defined in a small area of the disk called a partition table.

Note: Multiple data partitions were once common on systems with large disks because older PCs could boot only from certain parts of the disk. Also, administrators used partitions to reserve a certain amount of space for OS areas; for ex.: they didn't want users to be able to fill up the entire system and prevent critical services from working. This practice is not unique to Unix; you'll still find many new Windows systems with several partitions on a single disk. In addition, most systems have a separate swap partition.

Although the kernel makes it possible for you to access both an entire disk and one of its partitions at the same time, you would not normally do so unless you were copying the entire disk. The next layer after the partition is the filesystem, the database of files and directories that you're accustomed to interacting with in user space. As you can see, if you want to access the data in a file, you need to get the appropriate partition location from the partition table and then search the filesystem database on that partition for the desired file data. To access data on a disk, the Linux kernel uses the system of layers shown below; the SCSI subsystem and everything else described in Chapter 3.6 are represented by a single box. (Notice that you can work with the disk through the filesystem as well as directly through the disk device. You'll do both in this chapter.) To get a handle on how everything fits together, let's start at the bottom with partitions.

Kernel shematic for disk access.png

Partitioning Disk Devices

There are many kinds of partition tables. The traditional table is the one found inside the Master Boot Record (MBR). A newer standard starting to gain traction is the Globally Unique Identifier Partition Table (GPT). Here is an overview of the many Linux partitioning tools available:

 parted : A text-based tool that supports both MBR and GPT. 
 gparted: A GUI version of parted. 
 fdisk  : The traditional text-based Linux disk partitioning tool. fdisk does not support GPT.
 gdisk  : A version of fdisk that supports GPT but not MBR. 

Because it supports both MBR and GPT, we'll use parted, and fdisk as a second example. Many people prefer the fdisk interface, and there's nothing wrong with that.

Note: Although parted can create and resize filesystems, you shouldn't use it for filesystem manipulation because you can easily get confused. There is a critical difference between partitioning and filesystem manipulation. The partition table defines simple boundaries on the disk, whereas a filesystem is a much more involved data system. For this reason, we'll use parted for partitioning but use separate utilities for creating filesystems. Even the parted documentation encourages you to create filesystems separately.

Viewing a Partition Table

You can view your system's partition table with parted -l or fdisk -l. Here is sample output from three disk devices with two different kinds of partition tables (MBR and GPT):

 # parted -l
 Model: ATA SanDisk SDSSDA24 (scsi)
 Disk /dev/sda: 240GB
 Sector size (logical/physical): 512B/512B
 Partition Table: msdos
 Disk Flags: 
 
 Number  Start   End     Size    Type     File system     Flags
  2      1049kB  75,7GB  75,7GB  primary  xfs
  3      75,7GB  84,3GB  8566MB  primary  linux-swap(v1)
  1      84,3GB  240GB   156GB   primary  xfs
 
 
 Model: ATA ST1000DM003-1SB1 (scsi)
 Disk /dev/sdb: 1000GB
 Sector size (logical/physical): 512B/4096B
 Partition Table: msdos
 Disk Flags: 
 
 Number  Start   End     Size    Type     File system  Flags
  1      1049kB  1000GB  1000GB  primary  ext4
   
   
 Model: Kingston DataTraveler 2.0 (scsi)
 Disk /dev/sdc: 16,1GB
 Sector size (logical/physical): 512B/512B
 Partition Table: gpt
 Disk Flags: 
 
 Number  Start   End     Size    Type     File system  Name      Flags
  2      84,0kB  67,2MB  67,1MB  primary  fat16        myusb     esp
  

or if you use fdisk -l:

 # fdisk -l 
 Disk /dev/sdb: 931,5 GiB, 1000204886016 bytes, 1953525168 sectors
 Units: sectors of 1 * 512 = 512 bytes
 Sector size (logical/physical): 512 bytes / 4096 bytes
 I/O size (minimum/optimal): 4096 bytes / 4096 bytes
 Disklabel type: dos
 Disk identifier: 0x00040d20
 
 Device     Boot Start        End    Sectors   Size Id Type
 /dev/sdb1        2048 1953523711 1953521664 931,5G 83 Linux
 
 
 Disk /dev/sda: 223,6 GiB, 240057409536 bytes, 468862128 sectors
 Units: sectors of 1 * 512 = 512 bytes
 Sector size (logical/physical): 512 bytes / 512 bytes
 I/O size (minimum/optimal): 512 bytes / 512 bytes
 Disklabel type: dos
 Disk identifier: 0xee4f2d27
 
 Device     Boot     Start       End   Sectors   Size Id Type
 /dev/sda1       164657152 468857024 304199873 145,1G 83 Linux
 /dev/sda2            2048 147927039 147924992  70,5G 83 Linux
 /dev/sda3       147927040 164657151  16730112     8G 82 Linux swap / Solaris
 
 Partition table entries are not in disk order.
 
 
 Disk /dev/sdc: 15 GiB, 16131293184 bytes, 31506432 sectors
 Units: sectors of 1 * 512 = 512 bytes
 Sector size (logical/physical): 512 bytes / 512 bytes
 I/O size (minimum/optimal): 512 bytes / 512 bytes
 Disklabel type: gpt
 Disk identifier: 0x36c94279
 
 Device     Boot Start     End Sectors  Size Id Type
 /dev/sdc1  *        0 1884159 1884160  920M  0 Empty
 /dev/sdc2         164  131235  131072   64M ef EFI (FAT-12/16/32)

The first device, /dev/sda, and the second device, /dev/sdb, use the traditional MBR partition table (called msdos by parted and dos by fdisk), and the third device, /dev/sdc, contains a GPT table. Notice that there are different parameters for each partition table, because the tables themselves are different. In particular, there is no Name column for the MBR table because names don't exist under that scheme.

The MBR table in this example contains only primary partitions (but it could have extended and/or logical partitions). A primary partition is a normal subdivision of the disk; partition 1 is such a partition. The basic MBR has a limit of four partitions, so if you want more than four, you designate one partition as an extended partition. Next, you subdivide the extended partition into logical partitions that the OS can use as it would any other partition. For ex.: you can have partition 2 as an extended partition that contains logical partition 5.

Note: The filesystem that parted lists is not necessarily the system ID field defined in most MBR entries. The MBR system ID is just a number; for example, 83 is a Linux partition and 82 is Linux swap. Therefore, parted attempts to determine a filesystem on its own. If you absolutely must know the system ID for an MBR, use fdisk -l.

Initial Kernel Read

When initially reading the MBR table, the Linux kernel produces the following debugging output (remember that you can view this with dmesg). For ex.:

 $ dmesg 
 sda: sda1 sda2 < sda5 > 

The sda2 < sda5 > output indicates that /dev/sda2 is an extended partition containing one logical partition, /dev/sda5. You'll normally ignore extended partitions because you'll typically want to access only the logical partitions inside.

Changing Partition Tables

Viewing partition tables is a relatively simple and harmless operation. Altering partition tables is also relatively easy, but there are risks involved in making this kind of change to the disk. Keep the following in mind:

 * Changing the partition table makes it quite difficult to recover any data on partitions that you delete because it changes the initial point of reference for a filesystem. 
   Make sure that you have a backup if the disk you're partitioning contains critical data. 
 * Ensure that no partitions on your target disk are currently in use. This is a concern because most Linux distributions automatically mount any detected filesystem. 

When you are ready, choose your partitioning program. If you'd like to use parted, you can use the command-line parted utility or a GUI such as gparted; for an fdisk-style interface, use gdisk if you're using GPT partitioning. These utilities all have online help and are easy to learn. (Try using them on a flash device or something similar if you don't have any spare disks.)

That said, there is a major difference in the way that fdisk and parted work. With fdisk, you design your new partition table before making the actual changes to the disk; fdisk only makes the changes as you exit the program. But with parted, partitions are created, modified, and removed as you issue the commands. You don't get the chance to review the partition table before you change it.

These differences are also important to understanding how these two utilities interact with the kernel. Both fdisk and parted modify the partitions entirely in user space; there is no need to provide kernel support for rewriting a partition table because user space can read and modify all of a block device. Eventually, though, the kernel must read the partition table in order to present the partitions as block devices. The fdisk utility uses a relatively simple method: after modifying the partition table, fdisk issues a single system call on the disk to tell the kernel that it should reread the partition table. The kernel then generates debugging output that you can view with dmesg. For ex.: if you create two partitions on /dev/sdf, you'll see this:

 sdf: sdf1 sdf2 

In comparison, the parted tools do not use this disk-wide system call. Instead, they signal the kernel when individual partitions are altered. After processing a single partition change, the kernel does not produce the preceding debugging output.

There are a few ways to see the partition changes (see the example commands after this list):

 * Use udevadm to watch the kernel event changes. 
 * Check /proc/partitions for full partition information. 
 * Check /sys/block/device for altered partition system interfaces or /dev for altered partition devices.
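For ex.: one command for each of the approaches above (a sketch that reuses the /dev/sdf disk from the earlier example):

 $ udevadm monitor --udev --subsystem-match=block   # watch udevd's outgoing events for block devices
 $ cat /proc/partitions                             # major and minor numbers, sizes, and names of all partitions
 $ ls /sys/block/sdf                                # a subdirectory appears for each partition (sdf1, sdf2, ...)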

If you absolutely must be sure that you have modified a partition table, you can perform the old-style system call that fdisk uses by using the blockdev command. For ex.: to force the kernel to reload the partition table on /dev/sdf, run this:

 # blockdev --rereadpt /dev/sdf

At this point, you have all you need to know about partitioning disks.

For ex.: here I will partition my USB drive /dev/sdc with two partitions.

 $ sudo fdisk -l /dev/sdc
 Device     Boot Start      End  Sectors Size Id Type
 /dev/sdc1        2048 31422463 31420416  15G  b W95 FAT32
  
 $ sudo sgdisk --zap-all /dev/sdc
 $ sudo fdisk /dev/sdc
  
 Welcome to fdisk (util-linux 2.31).
 Changes will remain in memory only, until you decide to write them.
 Be careful before using the write command.
 
 Command (m for help): o # with o I create a new empty DOS partition table.
 Created a new DOS disklabel with disk identifier 0x37c569e9.
 
 Command (m for help): n
 Partition type
    p   primary (0 primary, 0 extended, 4 free)
    e   extended (container for logical partitions)
 Select (default p): p
 Partition number (1-4, default 1): 1
 First sector (2048-31422463, default 2048): 2048
 Last sector, +sectors or +size{K,M,G,T,P} (2048-31422463, default 31422463): +1G
 
 Created a new partition 1 of type 'Linux' and of size 1 GiB.
 Partition #1 contains a vfat signature.
 
 Do you want to remove the signature? [Y]es/[N]o: Yes
   
 The signature will be removed by a write command.
   
 Command (m for help): n
 Partition type
    p   primary (1 primary, 0 extended, 3 free)
    e   extended (container for logical partitions)
 Select (default p): e
 Partition number (2-4, default 2): 2
 First sector (2099200-31422463, default 2099200): 
 Last sector, +sectors or +size{K,M,G,T,P} (2099200-31422463, default 31422463): 
  
 Created a new partition 2 of type 'Extended' and of size 14 GiB.
  
 Command (m for help): p
 Disk /dev/sdc: 15 GiB, 16088301568 bytes, 31422464 sectors
 Units: sectors of 1 * 512 = 512 bytes
 Sector size (logical/physical): 512 bytes / 512 bytes
 I/O size (minimum/optimal): 512 bytes / 512 bytes
 Disklabel type: dos
 Disk identifier: 0x37c569e9
 
 Device     Boot   Start      End  Sectors Size Id Type
 /dev/sdc1          2048  2099199  2097152   1G 83 Linux
 /dev/sdc2       2099200 31422463 29323264  14G  5 Extended
 
 Filesystem/RAID signature on partition 1 will be wiped.
   
 Command (m for help): w
 The partition table has been altered.
 Calling ioctl() to re-read partition table.
 Syncing disks.
 

Disk and Partition Geometry

Any device with moving parts introduces complexity into a software system because there are physical elements that resist abstraction. A hard disk is no exception; even though you can think of a hard disk as a block device with random access to any block, there are serious performance consequences if you aren't careful about how you lay out data on the disk. Consider the physical properties of the simple single-platter disk as illustrated here.

Harddisk.png

The disk consists of a spinning platter on a spindle, with a head attached to a moving arm that can sweep across the radius of the disk. As the disk spins underneath the head, the head reads data. When the arm is in one position, the head can only read data from a fixed circle. This circle is called a cylinder because larger disks have more than one platter, all stacked and spinning around the same spindle. Each platter can have one or two heads, for the top and/or bottom of the platter, and all heads are attached to the same arm and move in concert. Because the arm moves, there are many cylinders on the disk, from small ones around the center to large ones around the periphery of the disk. Finally, you can divide a cylinder into slices called sectors. This way of thinking about the disk geometry is called CHS, for cylinder-head-sector.

Note: A track is a part of a cylinder that a single head accesses, so, a cylinder is also a track. You probably don't need to worry about tracks.

The kernel and the various partitioning programs can tell you what a disk reports as its number of cylinders (and sectors, which are slices of cylinders). However, on a modern hard disk, the reported values are fiction! The traditional addressing scheme that uses CHS doesn't scale with modern disk hardware, nor does it account for the fact that you can put more data into outer cylinders than inner cylinders. Disk hardware supports Logical Block Addressing (LBA) to simply address a location on the disk by a block number, but remnants of CHS remain. For ex.: the MBR partition table contains CHS information as well as LBA equivalents, and some boot loaders are still dumb enough to believe the CHS values (don't worry: most Linux boot loaders use the LBA values).

Nevertheless, the idea of cylinders has been important to partitioning because cylinders are ideal boundaries for partitions. Reading a data stream from a cylinder is very fast because the head can continuously pick up data as the disk spins. A partition arranged as a set of adjacent cylinders also allows for fast continuous data access because the head doesn't need to move very far between cylinders. Some partitioning programs complain if you don't place your partitions precisely on cylinder boundaries. Ignore this; there's little you can do because the reported CHS values of modern disks simply aren't true. The disk's LBA scheme ensures that your partitions are where they're supposed to be.

Solid-State Disks (SSDs)

Storage devices with no moving parts, such as solid-state disks (SSDs), are radically different from spinning disks in terms of their access characteristics. For these, random access is not a problem because there's no head to sweep across a platter, but certain factors do affect performance. One of the most significant factors affecting the performance of SSDs is partition alignment. When you read data from an SSD, you read it in chunks, typically 4096 bytes at a time, and the read must begin at a multiple of that same size. So if your partition and its data do not lie on a 4096-byte boundary, you may have to do two reads instead of one for small, common operations, such as reading the contents of a directory.

Many partitioning utilities (parted and gparted, for example) include functionality to put newly created partitions at the proper offsets from the beginning of the disk, so you may never need to worry about improper partition alignment. However, if you're curious about where your partitions begin and just want to make sure that they begin on a boundary, you can easily find this information by looking in /sys/block. Here's an example for a partition /dev/sda1:

 $ cat /sys/block/sda/sda1/start 
 164657152

The value here is in 512-byte sectors, not bytes: this partition starts 164,657,152 sectors from the beginning of the disk. Because 4096 bytes is eight 512-byte sectors, a partition is aligned when its starting sector number is divisible by 8; 164,657,152 is divisible by 8, so this partition lies on a 4096-byte boundary.
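For ex.: a quick shell check of the arithmetic above (0 means the starting sector is a multiple of 8 and the partition is therefore 4096-byte aligned):

 $ expr $(cat /sys/block/sda/sda1/start) % 8
 0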

Filesystems

The last link between the kernel and user space for disks is typically the filesystem; this is what you're accustomed to interacting with when you run commands such as ls and cd. As previously mentioned, the filesystem is a form of database; it supplies the structure to transform a simple block device into the sophisticated hierarchy of files and subdirectories that users can understand.

At one time, filesystems resided on disks and other physical media used exclusively for data storage. However, the tree-like directory structure and I/O interface of filesystems are quite versatile, so filesystems now perform a variety of tasks, such as the system interfaces that you see in /sys and /proc. Filesystems are also traditionally implemented in the kernel, but the innovation of 9P from Plan 9 has inspired the development of user-space filesystems. The File System in User Space (FUSE) feature allows user-space filesystems in Linux.

The Virtual File System (VFS) abstraction layer completes the filesystem implementation. Much as the SCSI subsystem standardizes communication between different device types and kernel control commands, VFS ensures that all filesystem implementations support a standard interface so that user-space applications access files and directories in the same manner. VFS support has enabled Linux to support an extraordinarily large number of filesystems.

Filesystem Types

Linux filesystem support includes native designs optimized for Linux, foreign types such as the Windows FAT family, universal filesystems like ISO 9660, and many others. The following list includes the most common types of filesystems for data storage. The type names as recognized by Linux are in parentheses next to the filesystem names.

 * ext4 (ext2, ext3, ext4): The current iteration of a line of filesystems native to Linux. ext2 was a longtime default for Linux systems, inspired by traditional Unix filesystems such as the Unix File System (UFS). ext3 added a journal feature (a small cache outside the normal filesystem data structure) to enhance data integrity and hasten booting. The ext4 filesystem is an incremental improvement with support for larger files than ext2 and ext3 support and a greater number of subdirectories.
 * ISO 9660 (iso9660): A CD-ROM standard. Most CD-ROMs use some variety of the ISO 9660 standard. 
 * FAT filesystems (msdos, vfat, umsdos): Pertain to Microsoft systems. The simple msdos type supports the very primitive monocase variety in MS-DOS systems. For most modern Windows filesystems, you should use the vfat filesystem in order to get full access from Linux. The rarely used umsdos filesystem is peculiar to Linux. It supports Unix features such as symbolic links on top of an MS-DOS filesystem.
 * HFS+ (hfsplus): An Apple standard used on most Macintosh systems. 

Although the Extended filesystem series has been perfectly acceptable to most casual users, many advances have been made in filesystem technology that even ext4 cannot utilize due to backward compatibility requirements. The advances are primarily scalability enhancements pertaining to very large numbers of files, large files, and similar scenarios. New Linux filesystems, such as xfs or Btrfs, are under development and may be poised to replace the Extended series.

Creating a filesystem

Once you're done with the partitioning process, you're ready to create filesystems. As with partitioning, you'll do this in user space because a user-space process can directly access and manipulate a block device. The mkfs utility can create many kinds of filesystems. For ex.: you can create an ext4 partition on /dev/sdc1 with this command:

 # mkfs -t ext4 /dev/sdc1

The mkfs program automatically determines the number of blocks in a device and sets some reasonable defaults. Unless you really know what you're doing and feel like reading the documentation in detail, don't change these. When you create a filesystem, mkfs prints diagnostic output as it works, including output pertaining to the superblock. The superblock is a key component at the top level of the filesystem database, and it's so important that mkfs creates a number of backups in case the original is destroyed. Consider recording a few of the superblock backup numbers when mkfs runs, in case you need to recover the superblock in the event of a disk failure.
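If you forget to record them, the ext family offers a way to reprint the backup superblock locations later: mke2fs with the -n option performs a dry run that displays the output without writing anything to the disk. (This is ext-specific; it doesn't apply to other filesystem types.)

 # mke2fs -n /dev/sdc1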

Warning: Filesystem creation is a task that you should only need to perform after adding a new disk or repartitioning an old one. You should create a filesystem just once for each new partition that has no preexisting data (or that has data that you want to remove). Creating a new filesystem on top of an existing filesystem will effectively destroy the old data.

It turns out that mkfs is only a frontend for a series of filesystem creation programs, mkfs.fs, where fs is a filesystem type. So when you run mkfs -t ext4, mkfs in turn runs mkfs.ext4. Inspect the mkfs.* files behind the commands and you'll see the filesystem types mkfs can create:

 $ ls -l /sbin/mkfs.* 
 -rwxr-xr-x 1 root root  26552 19 oct 14:29 /sbin/mkfs.bfs
 -rwxr-xr-x 1 root root 445368  9 sep 18:20 /sbin/mkfs.btrfs
 -rwxr-xr-x 1 root root  34680 19 oct 14:29 /sbin/mkfs.cramfs
 lrwxrwxrwx 1 root root      9  9 jun 15:34 /sbin/mkfs.exfat -> mkexfatfs
 -rwxr-xr-x 4 root root 125000 18 oct 20:14 /sbin/mkfs.ext2
 -rwxr-xr-x 4 root root 125000 18 oct 20:14 /sbin/mkfs.ext3
 -rwxr-xr-x 4 root root 125000 18 oct 20:14 /sbin/mkfs.ext4
 -rwxr-xr-x 1 root root  30712 22 sep 01:49 /sbin/mkfs.f2fs
 -rwxr-xr-x 1 root root  31680  9 mar  2017 /sbin/mkfs.fat
 -rwxr-xr-x 2 root root  55936 16 mai  2013 /sbin/mkfs.jfs
 -rwxr-xr-x 1 root root  79816 19 oct 14:29 /sbin/mkfs.minix
 lrwxrwxrwx 1 root root      8  9 mar  2017 /sbin/mkfs.msdos -> mkfs.fat
 -rwxr-xr-x 1 root root  27344 21 oct  2016 /sbin/mkfs.nilfs2
 lrwxrwxrwx 1 root root     15 29 mar  2017 /sbin/mkfs.ntfs -> /usr/bin/mkntfs
 lrwxrwxrwx 1 root root     10  9 jun  2016 /sbin/mkfs.reiserfs -> mkreiserfs
 lrwxrwxrwx 1 root root      8  9 mar  2017 /sbin/mkfs.vfat -> mkfs.fat
 -rwxr-xr-x 1 root root 478608 20 jui 23:42 /sbin/mkfs.xfs

Mounting a Filesystem

On Unix, the process of attaching a filesystem is called mounting. When the system boots, the kernel reads some configuration data and mounts root (/) based on that data. In order to mount a filesystem, you must know the following:

 * The filesystem's device (such as a disk partition). 
 * The filesystem type. 
 * The mount point: that is, the place in the current system directory hierarchy where the filesystem will be attached. The mount point is always a normal directory. 

When mounting a filesystem, the common terminology is to mount a device on a mount point. To learn the current filesystem status of your system, run mount. The output should look like this:

 $ mount  
 proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
 sys on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
 dev on /dev type devtmpfs (rw,nosuid,relatime,size=8176964k,nr_inodes=2044241,mode=755)
 run on /run type tmpfs (rw,nosuid,nodev,relatime,mode=755)
 /dev/sda2 on / type xfs (rw,noatime,attr2,inode64,noquota)
 securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
 tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
 devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
 tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
 cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
 pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
 cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
 cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
 cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
 cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
 cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
 cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
 cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
 cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
 systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=28,pgrp=1,timeout=0,minproto=5,maxproto=5,direct)
 debugfs on /sys/kernel/debug type debugfs (rw,relatime)
 hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
 mqueue on /dev/mqueue type mqueue (rw,relatime)
 configfs on /sys/kernel/config type configfs (rw,relatime)
 tmpfs on /tmp type tmpfs (rw,noatime)
 /dev/sda1 on /home type xfs (rw,noatime,attr2,discard,inode64,noquota)
 fusectl on /sys/fs/fuse/connections type fusectl (rw,relatime)
 tmpfs on /run/user/1000 type tmpfs (rw,nosuid,nodev,relatime,size=1636372k,mode=700,uid=1000,gid=1000)
 gvfsd-fuse on /run/user/1000/gvfs type fuse.gvfsd-fuse (rw,nosuid,nodev,relatime,user_id=1000,group_id=1000)
 /dev/sdb1 on /home/alice/hdd type ext4 (rw,relatime,data=ordered)

Each line corresponds to one currently mounted filesystem, with items in this order:

 * the device, such as /dev/sda1. Notice that some of these aren't real devices (proc, for example) but are stand-ins for real device names because these special-purpose filesystems do not need devices. 
 * the word on. 
 * the mount point. 
 * the word type.
 * the filesystem type, usually in the form of a short identifier. 
 * mount options (in parentheses). 

To mount a filesystem, use the mount command as follows, with the filesystem type, device, and desired mount point:

 # mount -t type device mountpoint

For ex.: to mount the xfs filesystem on /dev/sdb1 at /mnt/hdd, use this command:

 # mount -t xfs /dev/sdb1 /mnt/hdd

You normally don't need to supply the -t type option because mount can usually figure it out for you. However, sometimes it's necessary to distinguish between two similar types, such as the various FAT-style filesystems. To unmount (detach) a filesystem, use the umount command:

 # umount mountpoint

You can also unmount a filesystem with its device instead of its mount point:

 # umount /dev/sdb1

Filesystem UUID

The method of mounting filesystems discussed in the preceding section depends on device names. However, device names can change because they depend on the order in which the kernel finds the devices. To solve this problem, you can identify and mount filesystems by their Universally Unique Identifier (UUID), a software standard. The UUID is a type of serial number, and each one should be different. Filesystem creation programs like mke2fs generate a UUID when initializing the filesystem data structure.

To view a list of devices and the corresponding filesystems and UUIDs on your system, use the blkid (block ID) program:

 $ blkid
 /dev/sdc1: UUID="ECB3-9428" TYPE="vfat" PARTUUID="d6ff5981-01"
 /dev/sdb1: LABEL="my_hdd_1_to" UUID="41fe4ca1-4873-4b5a-bc81-bbae4546cfd1" TYPE="ext4" PARTUUID="00040d20-01"
 /dev/sda1: UUID="297ec2d8-8e55-4371-817a-e5a7d45d39ab" TYPE="xfs" PARTUUID="ee4f2d27-01"
 /dev/sda2: UUID="0614a041-d4b2-4ca6-945e-bd5850381661" TYPE="xfs" PARTUUID="ee4f2d27-02"
 /dev/sda3: UUID="843b282d-fb0c-44c6-a972-58db7b7f6ca7" TYPE="swap" PARTUUID="ee4f2d27-03"

In this example, blkid found five partitions with data: two with xfs filesystems, one with an ext4 filesystem, one with a swap signature, and one with a FAT-based filesystem. The Linux native partitions all have standard UUIDs, but the FAT partition doesn't have one. You can reference the FAT partition with its FAT volume serial number (in this case, d6ff5981-01).

To mount a filesystem by its UUID, use the UUID= syntax. For ex.: to mount the first filesystem from the preceding list on /mnt/hdd, enter:

 # mount UUID=d6ff5981-01 /mnt/hdd

You'll typically not mount filesystems by UUID manually as above, because you'll probably know the device, and it's much easier to mount a device by its name than by its crazy UUID number. Still, it's important to understand UUIDs. For one thing, they're the preferred way to automatically mount filesystems in /etc/fstab at boot time. In addition, many distributions use the UUID as a mount point when you insert removable media. In the preceding example, the FAT filesystem is on a flash media card. An Ubuntu system with someone logged in will mount this partition at /media/d6ff5981-01 upon insertion. The udevd daemon handles the initial event for the device insertion.

You can change the UUID of a filesystem if necessary (for ex.: if you copied the complete filesystem from somewhere else and now need to distinguish it from the original). See the tune2fs(8) manual page for how to do this on an ext2, ext3, or ext4 filesystem.
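For ex.: here's a sketch of what UUID-based /etc/fstab entries could look like for the example system above (the fields are device, mount point, filesystem type, mount options, dump flag, and fsck pass number; the option choices here are illustrative):

 UUID=0614a041-d4b2-4ca6-945e-bd5850381661  /      xfs  rw,noatime  0 1
 UUID=297ec2d8-8e55-4371-817a-e5a7d45d39ab  /home  xfs  rw,noatime  0 2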

Disk Buffering, Caching, and Filesystems

Linux, like other versions of Unix, buffers writes to the disk. This means that the kernel usually doesn't immediately write changes to filesystems when processes request them. Instead, it stores the changes in RAM until the kernel can conveniently make the actual change to the disk. This buffering system is transparent to the user and improves performance.

When you unmount a filesystem with umount, the kernel automatically synchronizes with the disk. At any other time, you can force the kernel to write changes in its buffer to the disk by running the sync command. If for some reason you can't unmount a filesystem before you turn off the system, be sure to run sync first. In addition, the kernel has a series of mechanisms that use RAM to automatically cache blocks read from disk. Therefore, if one or more processes repeatedly access a file, the kernel doesn't have to go to the disk again and again; it can simply read from the cache and save time and resources.

Filesystem Mount Options

There are many ways to change the behaviour of the mount command, as is often necessary when dealing with removable media or performing system maintenance. In fact, the total number of mount options is staggering. The extensive mount(8) manual page is a good reference, but it's hard to know where to start and what you can safely ignore. You'll see the most useful options in this section.

Options fall into two rough categories: general and filesystem-specific ones. General options include -t for specifying the filesystem type (as mentioned earlier). In contrast, a filesystem-specific option pertains only to certain filesystem types. To activate a filesystem option, use the -o switch followed by the option. For ex.: -o norock turns off Rock Ridge extensions on an ISO 9660 filesystem, but it has no meaning for any other kind of filesystem.

Short Options: The most important general options are these:

 -r The -r option mounts the filesystem in read-only mode. You don't need to specify this option when accessing a read-only device such as a CD-ROM. 
 -n The -n option ensures that mount does not try to update the system runtime mount database, /etc/mtab. The mount operation fails when it cannot write to this file, which is important at boot time
    because the root partition (and, therefore, the system mount database) is read-only at first. You'll also find this option handy when trying to fix a system problem in single-user mode, because the 
    system mount database may not be available at the time. 
 -t The -t type option specifies the filesystem type. 

Long Options: Short options like -r are too limited for the ever-increasing number of mount options; there are too few letters in the alphabet to accommodate all possible options. Short options are also troublesome because it is difficult to determine an option's meaning based on a single letter. Many general options and all filesystem-specific options use a longer, more flexible option format. To use long options with mount on the command line, start with -o and supply some keywords. Here's a complete example, with the long options following -o:

 # mount -t vfat /dev/hda1 /dos -o ro,conv=auto

The two long options here are ro and conv=auto. The ro option specifies read-only mode and is the same as the -r short option. The conv=auto option tells the kernel to automatically convert certain text files from the DOS newline format to the Unix style (you'll see more shortly).

The most useful long options are these:

 exec, noexec Enables or disables execution of programs on the filesystem. 
 suid, nosuid Enables or disables setuid programs. 
 ro           Mounts the filesystem in read-only mode. 
 rw           Mounts the filesystem in read-write mode. 
 conv=rule (FAT) Converts the newline characters in files based on rule, which can be binary, text, or auto. The default is binary, which disables any character translation. To treat all files
                 as text, use text. The auto setting converts files based on their extension. For ex.: a .jpg file gets no special treatment, but a .txt file does. Be careful with this option
                 because it can damage files. Consider using it in read-only mode. 
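
As an illustration, here's a hedged example combining several of these options to mount the flash card from the earlier blkid listing defensively (the mount point is an assumption):

 # mount -t vfat /dev/sdc1 /mnt/usb -o ro,noexec,nosuid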

Remounting a Filesystem

There will be times when you may need to reattach a currently mounted filesystem at the same mount point in order to change its mount options. The most common such situation is when you need to make a read-only filesystem writable during crash recovery. The following command remounts the root in read-write mode (you need the -n option because the mount command can't write to the system mount database when the root is read-only):

 # mount -n -o remount /

This command assumes that the correct device listing for / is in the /etc/fstab file. If it is not, you must specify the device.

The /etc/fstab Filesystem Table

To mount filesystems at boot time and take the drudgery out of the mount command, Linux systems keep a permanent list of filesystems and options in /etc/fstab. This is a plaintext file in a very simple format:

 $ cat /etc/fstab 
 # /etc/fstab: static file system information.  
 #
 # Use 'blkid' to print the universally unique identifier for a device; this may
 # be used with UUID= as a more robust way to name devices that works even if
 # disks are added and removed. See fstab(5).
 #
 # <file system>                          <mount point>  <type>  <options>                <dump>  <pass>
 UUID=0614a041-d4b2-4ca6-945e-bd5850381661 /              xfs     defaults,noatime,discard 0       1
 UUID=843b282d-fb0c-44c6-a972-58db7b7f6ca7 swap           swap    defaults,noatime,discard 0       2
 UUID=297ec2d8-8e55-4371-817a-e5a7d45d39ab /home          xfs     defaults,noatime,discard 0       2
 tmpfs                                     /tmp           tmpfs   defaults,noatime,mode=1777 0     0

Each line corresponds to one filesystem, each of which is broken into six fields. These fields are as follows:

 The device or UUID                             : Most current Linux systems no longer use the device in /etc/fstab, preferring the UUID. 
 The mount point                                : Indicates where to attach the filesystem. 
 The filesystem type                            : You may recognize xfs in this list as a filesystem type. 
 Options                                        : Use long options separated by commas. 
 Backup information for use by the dump command : You should always use a 0 in this field. 
 The filesystem integrity test order            : To ensure that fsck always runs on the root first, set this to 1 for the root filesystem and 2 for any other filesystems 
                                                  on a hard disk. Use 0 to disable the bootup check for everything else.  

When using mount, you can take some shortcuts if the filesystem you want to work with is in /etc/fstab. For ex.: if you want to mount /home, you would simply run sudo mount /home. You can also try to mount all entries at once in /etc/fstab that do not contain the noauto option with this command:

 # mount -a

Here are the meanings of some of the options you'll encounter in /etc/fstab:

 defaults: This uses the mount defaults: read-write mode, enable device files, executables, the setuid bit, and so on. Use this when you don't want to give the filesystem any special options 
           but you do want to fill all fields in /etc/fstab.
 errors:   This ext2-specific parameter sets the kernel behaviour when the system has trouble mounting a filesystem. The default is normally errors=continue, meaning that the kernel should 
           return an error code and keep running. To have the kernel try the mount again in read-only mode, use errors=remount-ro. The errors=panic setting tells the kernel (and 
           your system) to halt when there is a problem with the mount. 
 noauto:   This option tells a mount -a command to ignore the entry. Use this to prevent a boot-time mount of a removable-media device, such as a CD-ROM or floppy drive. 
 user:     This option allows unprivileged users to run mount on a particular entry, which can be handy for allowing access to CD-ROM drives. Because users can put a setuid-root file on removable 
           media with another system, this option also sets nosuid, noexec, and nodev (to bar special device files); see the example entry after this list.
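
For instance, a hypothetical /etc/fstab entry for a CD-ROM drive combining these options might look like this (the device and mount point are assumptions):

 /dev/sr0 /media/cdrom iso9660 ro,user,noauto 0 0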

Alternatives to /etc/fstab

Although the /etc/fstab file has been the traditional way to represent filesystems and their mount points, two new alternatives have appeared. The first is an /etc/fstab.d directory that contains individual filesystem configuration files (one for each filesystem). The idea is very similar to many other configuration directories that you'll see throughout this article. A second alternative is to configure systemd units for the filesystems. You'll learn more about systemd and its units in Chapter 6. However, the systemd unit configuration is often generated from (or based on) the /etc/fstab file, so you may find some overlap on your system.

Filesystem Capacity

To view the size and utilization of your currently mounted filesystems, use the df command. The output should look like this:

 $ df
 Filesystem                   1K-blocks      Used Available Use% Mounted on
 dev                            8176964         0   8176964   0% /dev
 run                            8181872      1160   8180712   1% /run
 /dev/sda2                     73926384   7003272  66923112  10% /
 tmpfs                          8181872     43372   8138500   1% /dev/shm
 tmpfs                          8181872         0   8181872   0% /sys/fs/cgroup
 tmpfs                          8181872      6284   8175588   1% /tmp
 /dev/sda1                    152025672  31249100 120776572  21% /home
 /dev/sdb1                    961301832 910509872   5829812 100% /home/alice/hdd
 tmpfs                          1636372        16   1636356   1% /run/user/1000
 media:/home/oswin/Downloads/ 471280128 439905280   7365120  99% /mnt/shared
 media:/home/oswin/hdd/       488147456 484377088   3770368 100% /mnt/hdd
 /dev/sdc1                     15694864    510992  15183872   4% /run/media/alice/ECB3-9428

Here's a brief description of the fields in the df output:

 Filesystem: The filesystem device. 
 1K-blocks:  The total capacity of the filesystem in blocks of 1024 bytes. 
 Used:       The number of occupied blocks. 
 Available:  The number of free blocks.
 Use%:       The percentage of blocks in use. 
 Mounted on: The mount point. 

It should be easy to see that the two filesystems /dev/sda1 and /dev/sda2 here are roughly 145GB and 71GB in size. However, if you add the used and available columns, you won't get the total capacity. The space is there, but it is hidden in reserved blocks. Therefore, only the superuser can use the full filesystem space if the rest of the partition fills up. This feature keeps system servers from immediately failing when they run out of disk space. If your disk fills up and you need to know where all of those space-hogging media files are, use the du command. With no arguments, du prints the disk usage of every directory in the directory hierarchy, starting at the current working directory. (That's kind of a mouthful, so just run cd /; du to get the idea. Press ctrl-C when you get bored.) The du -s command turns on summary mode to print only the grand total. To evaluate a particular directory, change to that directory and run du -s *.
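
For ex.: a quick, hedged way to rank the contents of a directory by size, here /var (the directory choice is arbitrary; run it as root to avoid permission errors on restricted subdirectories):

 # du -s /var/* | sort -n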

Note: The POSIX standard defines a block size of 512 bytes. However, this size is harder to read, so by default, the df and du output in most Linux distributions is in 1024-byte blocks. If you insist on displaying the numbers in 512-byte blocks, set the POSIXLY_CORRECT environment variable. To explicitly specify 1024-byte blocks, use the -k option (both utilities support this). The df program also has a -m option to list capacities in 1MB blocks and a -h option to take a best guess at what a person can read.

Checking and Repairing Filesystems

The optimizations that Unix filesystems offer are made possible by a sophisticated database mechanism. For filesystems to work seamlessly, the kernel has to trust that there are no errors in a mounted filesystem. If errors exist, data loss and system crashes may result. Filesystem errors are usually due to a user shutting down the system in a rude way (for ex.: by pulling out the power cord). In such cases, the filesystem cache in memory may not match the data on the disk, and the system also may be in the process of altering the filesystem when you happen to give the computer a kick. Although a new generation of filesystems supports journals to make filesystem corruption far less common, you should always shut the system down properly. And regardless of the filesystem in use, filesystem checks are still necessary every now and then to maintain sanity.

The tool to check a filesystem is fsck. As with the mkfs program, there is a different version of fsck for each filesystem type that Linux supports. For ex.: when you run fsck on an Extended filesystem series (ext2, ext3, or ext4), fsck recognizes the filesystem type and starts the e2fsck utility. Therefore, you generally don't need to type e2fsck, unless fsck can't figure out the filesystem type or you're looking for the e2fsck manual page. The information presented in this section is specific to the Extended filesystem series and e2fsck. To run fsck in interactive manual mode, give the device or the mount point (as listed in /etc/fstab) as the argument. For ex.:

 # fsck /dev/sdc1

Warning: You should never use fsck on a mounted filesystem because the kernel may alter the disk data as you run the check, causing runtime mismatches that can crash your system and corrupt files. There is only one exception: If you mount the root partition read-only in single-user mode, you may use fsck on it.

In manual mode, fsck prints verbose status reports on its passes, which should look something like this when there are no problems:


 Pass 1: Checking inodes, blocks, and sizes
 Pass 2: Checking directory structure 
 Pass 3: Checking directory connectivity
 Pass 4: Checking reference counts
 Pass 5: Checking group summary information
 /dev/sdc1: 11/1976 files (0.0% non-contiguous), 265/7891 blocks. 


If fsck finds a problem in manual mode, it stops and asks you questions relevant to fixing the problem. These questions deal with the internal structure of the filesystem, such as reconnecting loose inodes and clearing blocks (an inode is a building block of the filesystem; you'll see how inodes work later in this chapter). When fsck asks you about reconnecting an inode, it has found a file that doesn't appear to have a name. When reconnecting such a file, fsck places the file in the lost+found directory in the filesystem, with a number as the filename. If this happens, you need to guess the name based on the content of the file; the original name is probably gone. In general, it's pointless to sit through the fsck repair process if you've just uncleanly shut down the system, because fsck may have a lot of minor errors to fix. Fortunately, e2fsck has a -p option that automatically fixes ordinary problems without asking and aborts when there's a serious error. In fact, Linux distributions run some variant of fsck -p at boot time. (You may also see fsck -a, which does the same thing.)

If you suspect a major disaster on your system, such as a hardware failure or device misconfiguration, you need to decide on a course of action, because fsck can really mess up a filesystem that has larger problems. (One telltale sign that your system has a serious problem is that fsck asks a lot of questions in manual mode.)

If you think that something really bad has happened, try running fsck -n to check the filesystem without modifying anything. If there's a problem with the device configuration that you think you can fix (such as an incorrect number of blocks in the partition table or loose cables), fix it before running fsck for real, or you're likely to lose a lot of data.

If you suspect that only the superblock is corrupt (for ex.: because someone wrote to the beginning of the disk partition), you might be able to recover the filesystem with one of the superblock backups that mkfs creates. Use fsck -b num to replace the corrupted superblock with an alternate at block num and hope for the best.

If you don't know where to find a backup superblock, you may be able to run mkfs -n on the device to view a list of superblock backup numbers without destroying your data. (Again, make sure that you're using the -n option, or you'll really tear up the filesystem.)

 #  mkfs -n /dev/sdc1   
 mke2fs 1.43.7 (16-Oct-2017)
 /dev/sdc1 contains a vfat file system
 Proceed anyway? (y,N) y
 Creating filesystem with 3927552 4k blocks and 983040 inodes
 Filesystem UUID: cce85721-4be8-44a9-9959-6774084015fa
 Superblock backups stored on blocks: 

  32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208
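
Tying this back to the fsck -b discussion above, a hedged rescue attempt using the first backup superblock from this listing would be:

 # fsck -b 32768 /dev/sdc1

If e2fsck can't work out the block size by itself, you can supply it with the -B option (4096 here, per the mkfs output).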

Checking ext3 and ext4 Filesystems

You normally do not need to check ext3 and ext4 filesystems manually because the journal ensures data integrity. However, you may wish to mount a broken ext3 or ext4 filesystem in ext2 mode because the kernel will not mount an ext3 or ext4 filesystem with a nonempty journal. (If you don't shut down your system cleanly, you can expect the journal to contain some data.) To flush the journal in an ext3 or ext4 filesystem to the regular filesystem database, run e2fsck as follows:

 # e2fsck -fy /dev/disk_device

The worst case

Disk problems that are worse in severity leave you with few choices:

 * You can try to extract the entire filesystem image from the disk with dd and transfer it to a partition on another disk of the same size (see the sketch after this list). 
 * You can try to patch the filesystem as much as possible, mount it in read-only mode, and salvage what you can. 
 * You can try debugfs.
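
As a hedged sketch of the first option, assuming the damaged filesystem is on /dev/sdb1 and that /dev/sdc1 is an equally large partition on a healthy disk (both device names are assumptions):

 # dd if=/dev/sdb1 of=/dev/sdc1 bs=64k conv=noerror,sync

The conv=noerror,sync flags keep dd running past read errors and pad unreadable blocks with zeros, so you end up with a complete (if imperfect) image to repair.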

In the first two cases, you still need to repair the filesystem before you mount it, unless you feel like picking through the raw data by hand. If you like, you can choose to answer y to all of the fsck questions by entering fsck -y, but do this as a last resort because issues may come up during the repair process that you would rather handle manually.

The debugfs tool allows you to look through the files on a filesystem and copy them elsewhere. By default, it opens filesystems in read-only mode. If you're recovering data, it's probably a good idea to keep your files intact to avoid messing things up further. Now, if you're really desperate, say with a catastrophic disk failure on your hands and no backups, there isn't a lot you can do other than hope a professional service can "scrape the platters".

Special-Purpose Filesystems

Not all filesystems represent storage on physical media. Specifically, most versions of Unix have filesystems that serve as system interfaces. That is, rather than serving only as a means to store data on a device, a filesystem can represent system information such as process IDs and kernel diagnostics. This idea goes back to the /dev mechanism, which is an early model of using files for I/O interfaces. The /proc idea came from the eighth edition of research Unix, implemented by Tom J. Killian and accelerated when Bell Labs (including many of the original Unix designers) created Plan 9, a research OS that took filesystem abstraction to a whole new level.

The special filesystem types in common use on Linux include the following:

 proc: Mounted on /proc. The name proc is actually an abbreviation for process. Each numbered directory inside /proc is actually the process ID of a current process on the system; 
       the files in those directories represent various aspects of the processes. The file /proc/self represents the current process. The Linux proc filesystem includes a great deal of additional
       kernel and hardware information in files like /proc/cpuinfo. 
 sysfs: Mounted on /sys. 
 tmpfs: Mounted on /run and other locations. With tmpfs, you can use your physical memory and swap space as temporary storage. For ex.: you can mount tmpfs where you like, using the size and nr_blocks 
        long options to control the maximum size (see the example after this list). However, be careful not to constantly pour things into tmpfs, because your system will eventually run out of memory and programs will start to crash. 
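
For ex.: a hedged sketch of such a mount, capped at 512MB (the mount point is an assumption):

 # mount -t tmpfs -o size=512m tmpfs /mnt/scratch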

Swap space

Not every partition on a disk contains a filesystem; it's also possible to augment the RAM on a machine with disk space. If you run out of real memory, the Linux virtual memory system can automatically move pieces of memory to and from disk storage. This is called swapping because pieces of idle programs are swapped to the disk in exchange for active pieces residing on the disk. The disk area used to store memory pages is called swap space (or just swap for short). The free command's output includes the current swap usage in kilobytes as follows:

 $ free
               total        used        free      shared  buff/cache   available
 Mem:       16363748     1470324      166684       69136    14726740    14486600
 Swap:       8365052           0     8365052

Using a Disk Partition as Swap Space

To use an entire disk partition as swap, follow these steps:

 1. Make sure the partition is empty. 
 2. Run mkswap dev, where dev is the partition device (ex.: /dev/sdc1). This command puts a swap signature on the partition. 
 3. Execute swapon dev to register the space with the kernel. 

After creating a swap partition, you can put a new swap entry in your /etc/fstab file to make the system use the swap space as soon as the machine boots. Here is a sample entry that uses /dev/sdc1 as a swap partition:

 /dev/sdc1 none swap sw 0 0

Keep in mind that many systems now use UUIDs instead of raw device names, as shown below.
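
Using the swap UUID from the earlier blkid listing, the equivalent UUID-based entry would look like this:

 UUID=843b282d-fb0c-44c6-a972-58db7b7f6ca7 none swap sw 0 0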

Using a File as Swap Space

You can use a regular file as swap space if you're in a situation where you would otherwise be forced to repartition a disk in order to create a swap partition. You shouldn't notice any problems when doing this. Use these commands to create an empty file, initialize it as swap, and add it to the swap pool:

 # dd if=/dev/zero of=swap_file bs=1024k count=num_mb
 # mkswap swap_file
 # swapon swap_file

Here, swap_file is the name of the new swap file, and num_mb is the desired size in megabytes. To remove a swap partition or file from the kernel's active pool, use the swapoff command:

 # swapoff swap_file

Warning: It's dangerous to do this on a general-purpose machine. If a machine completely runs out of both real memory and swap space, the Linux kernel invokes the out-of-memory (OOM) killer to kill a process in order to free up some memory. You obviously don't want this to happen to your desktop applications. On the other hand, high-performance servers include sophisticated monitoring and load-balancing systems to ensure that they never reach the danger zone.

Inside a traditional Filesystem

A traditional Unix filesystem has two primary components: a pool of data blocks where you can store data and a database system that manages the data pool. The database is centered around the inode data structure. An inode is a set of data that describes a particular file, including its type, permissions, and, perhaps most importantly, where in the data pool the file data resides. Inodes are identified by numbers listed in an inode table. Filenames and directories are also implemented as inodes. A directory inode contains a list of filenames and corresponding links to other inodes. To provide a real-life example, I created a new filesystem, mounted it, and changed the directory to the mount point. Then, I added some files and directories with these commands (feel free to do this yourself with a flash drive):

 $ mkdir dir_1
 $ mkdir dir_2
 $ echo a > dir_1/file_1
 $ echo b > dir_1/file_2
 $ echo c > dir_1/file_3
 $ echo d > dir_2/file_4
 $ ln dir_1/file_3 dir_2/file_5

Note that I created dir_2/file_5 as a hard link to dir_1/file_3, meaning that these two filenames actually represent the same file.

If you were to explore the directories in this filesystem, its contents would appear to the user as shown in image User-level-representation-of-filesystem.png. The actual layout of the filesystem, as shown in image Inode structure of the filesystem.png, doesn't look nearly as clean as the user-level representation.

User-level-representation-of-filesystem.png
Inode structure of the filesystem.png

How do we make sense of this? For any ext2/ext3/ext4 filesystem, you start at inode number 2, the root inode. From the inode table, you can see that this is a directory inode (dir), so you can follow the arrow over to the data pool, where you see the contents of the root directory: two entries named dir_1 and dir_2 corresponding to inodes 12 and 7633, respectively. To explore those entries, go back to the inode table and look at either of those inodes. To examine dir_1/file_2 in this filesystem, the kernel does the following:

 1. Determines the path's components: a directory named dir_1, followed by a component named file_2. 
 2. Follows the root inode to its directory data. 
 3. Finds dir_1 in inode 2's directory data, which points to inode number 12. 
 4. Looks up inode 12 in the inode table and verifies that it is a directory inode. 
 5. Follows inode 12's data link to its directory information (the second box down in the data pool). 
 6. Locates the second component of the path (file_2) in inode 12's directory data. This entry points to inode 14. 
 7. Looks up inode 14 in the inode table and verifies that it is a file inode. 

At this point, the kernel knows the properties of the file and can open it by following inode 14's data link. This system, of inodes pointing to directory data structures and directory data structures pointing to inodes, allows you to create the filesystem hierarchy that you're used to. In addition, notice that the directory inodes contain entries for . (the current directory) and .. (the parent directory, except for the root directory). This makes it easy to get a point of reference and to navigate back down the directory structure.

Viewing Inode Details

To view the inode numbers for a directory, use the ls -i command. Here's what you'd get at the root of this example. (For more detailed inode information, use the stat command.)

 $ ls -i
 12 dir_1 7633 dir_2

Now you're probably wondering about the link count. You've already seen the link count in the output of the common ls -l command, but you likely ignored it. How does the link count relate to the files, in particular the "hard linked" file_5? The link count field is the total number of directory entries (across all directories) that point to an inode. Most of the files have a link count of 1 because they occur only once in the directory entries. This is expected: most of the time when you create a file, you create a new directory entry and a new inode to go with it. However, inode 15 occurs twice: first it's created as dir_1/file_3, and then it's linked to as dir_2/file_5. A hard link is just a manually created entry in a directory to an inode that already exists. The ln command (without the -s option) allows you to manually create new links, as the example below shows.
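
To see this for yourself, look at the second field of ls -l output, which is the link count. Here is a hedged illustration (the owner, date, and time are made up, but the link count of 2 matches the example filesystem):

 $ ls -l dir_1/file_3 dir_2/file_5
 -rw-r--r-- 2 alice alice 2 Jan  1 12:00 dir_1/file_3
 -rw-r--r-- 2 alice alice 2 Jan  1 12:00 dir_2/file_5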

This is also why removing a file is sometimes called unlinking. If you run rm dir_1/file_2, the kernel searches for an entry named file_2 in inode 12's directory entries. Upon finding that file_2 corresponds to inode 14, the kernel removes the directory entry and then subtracts 1 from inode 14's link count. As a result, inode 14's link count will be 0, and the kernel will know that there are no longer any names linking to the inode. Therefore, it can now delete the inode and any data associated with it.

However, if you run rm dir_1/file_3, the end result is that the link count of inode 15 goes from 2 to 1 (because dir_2/file_5 still points there), and the kernel knows not to remove the inode. Link counts work much the same for directories. Observe that inode 12's link count is 2, because there are two inode links there: one for dir_1 in the directory entries for inode 2 and the second a self-reference (.) in its own directory entries. If you create a new directory dir_1/dir_3, the link count for inode 12 would go to 3 because the new directory would include a parent (..) entry that links back to inode 12, much as inode 12's parent link points to inode 2.

There is one small exception. The root inode 2 has a link count of 4. However, the figure shows only three directory links. The fourth link is in the filesystem's superblock, because the superblock tells you where to find the root inode.

Don't be afraid to experiment on your system. Creating a directory structure and then using ls -i or stat to walk through the pieces is harmless. You don't need to be root (unless you mount and create a new filesystem). But there's still one piece missing: when allocating data pool blocks for a new file, how does the filesystem know which blocks are in use and which are available? One of the most basic ways is with an additional management data structure called a block bitmap. In this scheme, the filesystem reserves a series of bytes, with each bit corresponding to one block in the data pool. A value of 0 means that the block is free, and a 1 means that it's in use. Thus, allocating and deallocating blocks is a matter of flipping bits.

Problems in a filesystem arise when the inode table data doesn't match the block allocation data or when the link counts are incorrect; this can happen when you don't cleanly shut down a system. Therefore, when you check a filesystem, the fsck program walks through the inode table and directory structure to generate new link counts and a new block allocation map (such as the block bitmap), and then it compares the newly generated data with the filesystem on the disk. If there are mismatches, fsck must fix the link counts and determine what to do with any inodes and/or data that didn't come up when it traversed the directory structure. Most fsck programs make these "orphans" new files in the filesystem's lost+found directory.

Working with Filesystems in User Space

When working with files and directories in user space, you shouldn't have to worry much about the implementation going on below them. You're expected to access the contents of files and directories of a mounted filesystem through the kernel system calls. Curiously, though, you do have access to certain filesystem information that doesn't seem to fit in user space; in particular, the stat() system call returns inode numbers and link counts. When you're maintaining a filesystem, do you have to worry about inode numbers and link counts? Generally, no. This stuff is accessible to user-mode programs primarily for backward compatibility. Furthermore, not all filesystems available in Linux have these filesystem internals. The Virtual File System (VFS) interface layer ensures that system calls always return inode numbers and link counts, but those numbers may not necessarily mean anything. You may not be able to perform traditional Unix filesystem operations on nontraditional filesystems. For ex.: you can't use ln to create a hard link on a mounted VFAT filesystem because the directory entry structure is entirely different.
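
For ex.: the stat(1) command is a thin wrapper around this system call. On the example filesystem built earlier, it can report the inode number and link count directly (the format string here is GNU coreutils syntax):

 $ stat -c 'inode %i, links %h' dir_1/file_3
 inode 15, links 2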

Fortunately, the system calls available to user space on Unix/Linux systems provide enough abstraction for painless file access; you don't need to know anything about the underlying implementation in order to access files. In addition, filenames are flexible in format and mixed-case names are supported, making it easy to support other hierarchical-style filesystems. Remember, specific filesystem support does not necessarily need to be in the kernel. In user-space filesystems, the kernel only needs to act as a conduit for system calls.

The Evolution of Filesystems

As you can see, even the simple filesystem just described has many different components to maintain. At the same time, the demands placed on filesystems continuously increase with new tasks, technology, and storage capacity. Today's performance, data integrity, and security requirements are beyond the offerings of older filesystem implementations, so filesystem technology is constantly changing. We've already mentioned Btrfs as an example of a next-generation filesystem. One example of how filesystems are changing is that new filesystems use separate data structures to represent directories and filenames, rather than the directory inodes described here. They reference data blocks differently. Also, filesystems that optimize for SSDs are still evolving. Continuous change in the development of filesystems is the norm, but keep in mind that the evolution of filesystems doesn't change their purpose.

How the Linux kernel boots

This chapter will teach you how the kernel starts, or boots. In other words, you'll learn how the kernel moves into memory up to the point where the first user process starts.

A simplified view of the boot process looks like this:

 1. The machine's BIOS or boot firmware loads and runs a boot loader. 
 2. The boot loader finds the kernel image on disk, loads it into memory, and starts it. 
 3. The kernel initializes the devices and their drivers. 
 4. The kernel mounts the root filesystem. 
 5. The kernel starts a program called init with a process ID of 1. This point is the user space start.
 6. init sets the rest of the system processes in motion. 
 7. At some point, init starts a process allowing you to log in, usually at the end or near the end of the boot. 

This chapter covers the first four stages, focusing on the kernel and boot loaders. Chapter 6 continues with the user space start. Your ability to identify each stage of the boot process will prove invaluable in fixing boot problems and understanding the system as a whole. However, the default behaviour of many Linux distributions often makes it difficult, if not impossible, to identify the first few boot stages as they proceed, so you'll probably be able to get a good look only after they've completed and you log in.

Startup Messages

Traditional Unix systems produce many diagnostic messages upon boot that tell you about the boot process. The messages come first from the kernel and then from processes and initialization procedures that init starts. However, these messages aren't pretty or consistent, and in some cases they aren't very informative. Most current Linux distributions do their best to hide them with splash screens, filler, and boot options. In addition, hardware improvements have caused the kernel to start much faster than before; the messages flash by so quickly, it can be difficult to see what is happening. There are two ways to view the kernel's boot and runtime diagnostic messages. You can:

 * Look at the kernel system log file. You'll often find this in /var/log/kern.log, but depending on how your system is configured, it might also be lumped together with a lot of other system logs 
   in /var/log/messages or elsewhere.
 * Use the dmesg command, but be sure to pipe the output to less because there will be much more than a screen's worth. The dmesg command uses the kernel ring buffer, which is of limited size, 
   but most newer kernels have a large enough buffer to hold boot messages for a long time. 

Here's a sample of what you can expect to see from the dmesg command:

 [    0.000000] Linux version 4.9.61-1-lts (builduser@andyrtr) (gcc version 7.2.0 (GCC) ) #1 SMP Wed Nov 8 17:49:38 CET 2017
 [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-linux-lts root=UUID=0614a041-d4b2-4ca6-945e-bd5850381661 rw quiet resume=UUID=843b282d-fb0c-44c6-a972-58db7b7f6ca7 nowatchdog systemd.legacy_systemd_cgroup_controller=true
 [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
 --snip--
 [    1.930066] tsc: Refined TSC clocksource calibration: 3912.021 MHz
 [    1.930080] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x70c76cc3954, max_idle_ns: 881591036823 ns
 [    2.940291] clocksource: Switched to clocksource tsc
 [   12.000112] hid-generic 0003:1B1C:1B40.0002: usb_submit_urb(ctrl) failed: -1
 [   12.002188] hid-generic 0003:1B1C:1B40.0002: timeout initializing reports
 [   12.002948] input: Corsair Corsair Gaming K63 Keyboard as /devices/pci0000:00/0000:00:14.0/usb1/1-9/1-9:1.1/0003:1B1C:1B40.0002/input/input5
 [   12.070342] hid-generic 0003:1B1C:1B40.0002: input,hiddev0,hidraw4: USB HID v1.11 Keyboard [Corsair Corsair Gaming K63 Keyboard] on usb-0000:00:14.0-9/input1
 --snip--


After the kernel has started, the user-space startup procedure often generates messages. These messages will likely be more difficult to view and review because on most systems you won't find them in a single log file. Startup scripts usually print the messages to the console, and they're erased after the boot process finishes. However, this usually isn't a problem because each script typically writes its own log. Some versions of init, such as Upstart and systemd, can capture diagnostic messages from startup and runtime that would normally go to the console.

Kernel Initialization and Boot Options

Upon startup, the Linux kernel initializes in this general order:

 1. CPU inspection
 2. Memory inspection
 3. Device bus discovery
 4. Device discovery
 5. Auxiliary kernel subsystem setup (networking, and so on)
 6. Root filesystem mount
 7. User space start

The first steps aren't too remarkable, but when the kernel gets to devices, a question of dependencies arises. For ex.: the disk device drivers may depend on bus support and SCSI subsystem support. Later in the initialization process, the kernel must mount a root filesystem before starting init. In general, you won't have to worry about any of this, except that some necessary components may be loadable kernel modules rather than part of the main kernel. On some machines, you may need to load these kernel modules before the true root filesystem is mounted. We'll cover this problem and its initial RAM filesystem workaround solution later. As of this writing, the kernel does not emit specific messages when it's about to start its first user process. However, the following memory management messages are a good indication that the user-space handoff is about to happen, because this is where the kernel protects its own memory from user-space processes:


 Freeing unused kernel memory: 100K
 Write protecting the kernel text: 5820k
 Write protecting the kernel read-only data: 10240k
 NX-protecting the kernel data: 4420k


You may also see a message about the root filesystem being mounted at this point.

Note: Feel free to skip to Chapter 6 to learn the specifics of user space start and the init program that the kernel runs as its first process. The remainder of this chapter details how the kernel starts.

Kernel Parameters

When running the Linux kernel, the boot loader passes in a set of text-based kernel parameters that tell the kernel how it should start. The parameters specify many different types of behaviour, such as the amount of diagnostic output the kernel should produce and device driver-specific options. You can view the kernel parameters from your system's boot by looking at the /proc/cmdline file:

 $ cat /proc/cmdline
 BOOT_IMAGE=/boot/vmlinuz-linux-lts root=UUID=0614a041-d4b2-4ca6-945e-bd5850381661 ro quiet resume=UUID=843b282d-fb0c-44c6-a972-58db7b7f6ca7 nowatchdog splash vt.handoff=7
 systemd.legacy_systemd_cgroup_controller=true

The parameters are either simple one-word flags, such as ro and quiet, or key=value pairs, such as vt.handoff=7. Many of the parameters are unimportant, such as the splash flag for displaying a splash screen, but one that is critical is the root parameter. This is the location of the root filesystem; without it, the kernel cannot find init and therefore cannot perform the user space start. The root filesystem can be specified as a device file, such as /dev/sda1. However, on most modern desktop systems, a UUID is more common:


 root=UUID=0614a041-d4b2-4ca6-945e-bd5850381661


The ro parameter is normal; it instructs the kernel to mount the root filesystem in read-only mode upon user space start. (Read-only mode ensures that fsck can check the root filesystem safely; after the check, the bootup process remounts the root filesystem in read-write mode.) Upon encountering a parameter that it does not understand, the Linux kernel saves the parameter. The kernel later passes the parameter to init when performing the user space start. For ex.: if you add -s to the kernel parameters, the kernel passes the -s to the init program to indicate that it should start in single-user mode.

Now let's look at the mechanics of how boot loaders start the kernel.

Boot loaders

At the start of the boot process, before the kernel and init start, a boot loader starts the kernel. The task of a boot loader sounds simple: it loads the kernel into memory and then starts the kernel with a set of kernel parameters. But consider the questions that the boot loader must answer:

 * Where is the kernel? 
 * What kernel parameters should be passed to the kernel when it starts? 

The answers are (typically) that the kernel and its parameters are usually somewhere on the root filesystem. It sounds like the kernel parameters should be easy to find, except that the kernel is not yet running, so it can't traverse a filesystem to find the necessary files. Worse, the kernel device drivers normally used to access the disk are also unavailable. Think of this as a kind of "chicken or egg" problem.

Let's start with the driver concern. On PCs, boot loaders use the Basic Input/Output System (BIOS) or Unified Extensible Firmware Interface (UEFI) to access disks. Nearly all disk hardware has firmware that allows the BIOS to access attached storage hardware with Linear Block Addressing (LBA). Although it exhibits poor performance, this mode of access does allow universal access to disks. Boot loaders are often the only programs to use the BIOS for disk access; the kernel uses its own high-performance drivers. The filesystem question is trickier. Most modern boot loaders can read partition tables and have built-in support for read-only access to filesystems. Thus, they can find and read files. This capability makes it far easier to dynamically configure and enhance the boot loader. Linux boot loaders have not always had this capability; without it, configuring the boot loader was more difficult.

Boot loader Tasks

A Linux boot loader's core functionality includes the ability to do the following:

 * Select among multiple kernels. 
 * Switch between sets of kernel parameters.
 * Allow the user to manually override and edit kernel image names and parameters (for ex.: to enter single-user mode).
 * Provide support for booting other operating systems. 

Boot loaders have become considerably more advanced since the inception of the Linux kernel, with features such as history and menu systems, but the basic need has always been flexibility in kernel image and parameter selection. One interesting phenomenon is that certain needs have diminished. For ex.: because you can now perform an emergency or recovery boot partially or entirely from a USB storage device, you probably won't have to worry about manually entering kernel parameters or going into single-user mode. But modern boot loaders offer more power than ever, which can be particularly handy if you're building custom kernels or just want to tweak parameters.

Boot loader Overview

Here are the main boot loaders that you may encounter, in order of popularity:

 GRUB: A near-universal standard on Linux systems. 
 LILO: One of the first Linux boot loaders. ELILO is a UEFI version. 
 SYSLINUX: Can be configured to run from many different kinds of filesystems. 
 LOADLIN: Boots a kernel from MS-DOS. 
 efilinux: A UEFI boot loader intended to serve as a model and reference for other UEFI boot loaders. 
 coreboot (formerly Linux BIOS): A high-performance replacement for the PC BIOS that can include a kernel. 
 Linux Kernel EFISTUB: A kernel plugin for loading the kernel directly from the EFI/UEFI System Partition (ESP) found on recent systems. 

GRUB Introduction

GRUB stands for Grand Unified Boot Loader. We'll cover GRUB 2; there is also an older version now called GRUB legacy that is slowly falling out of use. One of GRUB's most important capabilities is filesystem navigation that allows for much easier kernel image and configuration selection. One of the best ways to see this in action and to learn about GRUB in general is to look at its menu. The interface is easy to navigate, but there's a good chance that you've never seen it. Linux distributions often do their best to hide the boot loader from you.

To access the GRUB menu, press and hold SHIFT when your BIOS or firmware startup screen first appears. Otherwise, the boot loader configuration may not pause before loading the kernel. Figure 5-1 shows the GRUB menu. Press ESC to temporarily disable the automatic boot timeout after the GRUB menu appears.

Grub figure 5-1.png

Try the following to explore the boot loader:

 1. Reboot or power on your Linux system. 
 2. Hold down SHIFT during the BIOS/firmware self-test and/or splash screen to get the GRUB menu. 
 3. Press e to view the boot loader configuration commands for the default boot option. You should see something like Figure 5-2. 
Grub figure 5-2.jpg

This screen tells us that for this configuration, the root is set with a UUID, the kernel image is /boot/vmlinuz-3.2.0-22-generic-pae, and the initial RAM filesystem is /boot/initrd.img-3.2.0-24-generic-pae. But if you've never seen this sort of configuration before, you may find it somewhat confusing. Why are there multiple references to root, and why are they different? Why is insmod here? Isn't that a Linux kernel feature normally run by udevd?

The double-takes are warranted, because GRUB doesn't really use the Linux kernel; it starts it. The configuration you see consists wholly of GRUB internal commands. GRUB really is an entirely separate world. The confusion stems from the fact that GRUB borrows terminology from many sources. GRUB has its own "kernel" and its own insmod command to dynamically load GRUB modules, completely independent of the Linux kernel. Many GRUB commands are similar to Unix shell commands; there's even an ls command to list files.

But the most confusion comes from the use of the word root. To clear it up, there is one simple rule to follow when you're looking for your system's root filesystem: only the root kernel parameter will be the root filesystem when you boot your system.

In the GRUB configuration, that kernel parameter is somewhere after the image name of the linux command. Every other reference to root in the configuration is to the GRUB root, which exists only inside of GRUB. The GRUB "root" is the filesystem where GRUB searches for kernel and RAM filesystem image files.

In Figure 5-2, the GRUB root is first set to a GRUB-specific device (hd0,gpt6). Then, in the following command, GRUB searches for a particular UUID on a partition. If it finds that UUID, it sets the GRUB root to that partition.

To wrap things up, the linux command's first argument (/boot/vmlinuz-...) is the location of the Linux kernel image file. GRUB loads this file from the GRUB root. The initrd command is similar, specifying the file for the initial RAM filesystem.

You can edit this configuration inside GRUB; doing so is usually the easiest way to temporarily fix an erroneous boot. To permanently fix a boot problem, you'll need to change the configuration, but for now, let's go deeper and examine some GRUB internals with the command-line interface.

Exploring Devices and Partitions with the GRUB command line

As you can see in Figure 5-2, GRUB has its own device-addressing scheme. For ex.: the first hard disk found is hd0, followed by hd1, and so on. But device assignments are subject to change. Fortunately, GRUB can search all partitions for a UUID in order to find the one where the kernel resides, as you just saw with the search command.

Listing Devices To get a feel for how GRUB refers to the devices on your system, access the GRUB command line by pressing C at the boot menu or configuration editor. You should get the GRUB prompt:

 grub>

You can enter any command here that you see in a configuration, but to get started, try a diagnostic command instead: ls. With no arguments, the output is a list of devices known to GRUB (the following output is an example, not taken from Figure 5-2):

 grub> ls
 (hd0) (hd0,msdos5)

In this case, there is one main disk device denoted by (hd0) and one partition, (hd0,msdos5). The msdos prefix on the partition tells you that the disk contains an MBR partition table; the name would begin with gpt for GPT. (You will find even deeper combinations with a third identifier, where a BSD disklabel map resides inside a partition, but you won't normally have to worry about this unless you're running multiple operating systems on one machine.)

To get more detailed information, use ls -l. This command can be particularly useful because it displays any UUIDs of the partitions on the disk. For ex.:

 grub> ls -l
 Device hd0: not a known filesystem - Total size 426743808 sectors
             Partition hd0,msdos1: Filesystem type ext2 - Last modification time
             2015-09-18 20:45:00 Friday, UUID 4898e145-b064-45bd-b7b4-7326b00273b7 -
             Partition start at 2048 - Total size 424644608 sectors
             Partition hd0,msdos5: Not a known filesystem - Partition start at
             424648704 - Total size 2093056 sectors

This particular disk has a Linux ext2/3/4 filesystem on the first MBR partition and a Linux swap signature on partition 5, which is a fairly common configuration. (You can't tell that (hd0,msdos5) is a swap partition from this output, though.)

File Navigation Now let's look at GRUB's filesystem navigation capabilities. Determine the GRUB root with the echo command (recall that this is where GRUB expects to find the kernel):

 grub> echo $root
 hd0,msdos1

To use GRUB's ls command to list the files and directories in that root, you can append a forward slash to the end of the partition:

 grub> ls (hd0,msdos1)/

But it's a pain to remember and type the actual root partition, so use the root variable to save yourself some time:

 grub> ls $root/boot

Note: Use the up and down arrow keys to flip through the GRUB command history and the left and right arrows to edit the current command line. The standard readline keys (ctrl-N, ctrl-P, and so on) also work.

You can view all currently set GRUB variables with the set command:

 grub> set
 ?=0
 color_highlight=black/white
 color_normal=white/black
 --snip--
 prefix=(hd0,msdos1)/boot/grub
 root=hd0,msdos1

One of the most important of these variables is $prefix, the filesystem and directory where GRUB expects to find its configuration and auxiliary support. We'll explore this in the next section. Once you've finished with the GRUB command-line interface, enter the boot command to boot your current configuration or just press ESC to return to the GRUB menu. In any case, boot your system; we're going to explore the GRUB configuration, and that's best done when you have your full system available.

GRUB Configuration

The GRUB configuration directory contains the central configuration file (grub.cfg) and numerous loadable modules with a .mod suffix. (As GRUB versions progress, these modules will move into subdirectories such as i386-pc.) The directory is usually /boot/grub or /boot/grub2. We won't modify grub.cfg directly; instead, we'll use the grub-mkconfig command (or grub2-mkconfig on Fedora).

Reviewing Grub.cfg First, take a look at grub.cfg to see how GRUB initializes its menu and kernel options. You'll see that the grub.cfg file consists of GRUB commands, which usually begin with a number of initialization steps followed by a series of menu entries for different kernel and boot configurations. The initialization isn't complicated; it's a bunch of function definitions and video setup commands like this:


 if loadfont /usr/share/grub/unicode.pf2 ; then
   set gfxmode=auto
   load_video
   insmod gfxterm
 --snip--


Later in this file you should see the available boot configurations, each beginning with the menuentry command. You should be able to read and understand this example based on what you learned in the preceding section:


 menuentry 'ArchLabs Linux' --class archlabs --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-0614a041-d4b2-4ca6-945e-bd5850381661' {
         load_video
         set gfxpayload=keep
         insmod gzio
         insmod part_msdos
         insmod xfs
         set root='hd0,msdos2'
         if [ x$feature_platform_search_hint = xy ]; then
           search --no-floppy --fs-uuid --set=root --hint-bios=hd0,msdos2 --hint-efi=hd0,msdos2 --hint-baremetal=ahci0,msdos2  0614a041-d4b2-4ca6-945e-bd5850381661
         else
           search --no-floppy --fs-uuid --set=root 0614a041-d4b2-4ca6-945e-bd5850381661
         fi
         echo    'Loading Linux linux-lts ...'
         linux   /boot/vmlinuz-linux-lts root=UUID=0614a041-d4b2-4ca6-945e-bd5850381661 rw  quiet resume=UUID=843b282d-fb0c-44c6-a972-58db7b7f6ca7 nowatchdog systemd.legacy_systemd_cgroup_controller=true
         echo    'Loading initial ramdisk ...'
         initrd  /boot/initramfs-linux-lts.img
 }


Watch for submenu commands. If your grub.cfg file contains numerous menuentry commands, most of them are probably wrapped up inside a submenu command for older versions of the kernel so that they don't crowd the GRUB menu.

Generating a New Configuration File If you want to make changes to your GRUB configuration, you won't edit your grub.cfg file directly, because it's automatically generated and the system occasionally overwrites it. You'll add your new configuration elsewhere, then run grub-mkconfig to generate the new configuration.

To see how the configuration generation works, look at the very beginning of grub.cfg. There should be comment lines such as this:


 ### BEGIN /etc/grub.d/00_header ###


Upon further inspection, you'll find that every file in /etc/grub.d is a shell script that produces a piece of the grub.cfg file. The grub-mkconfig command itself is a shell script that runs everything in /etc/grub.d. Try it yourself as root. (Don't worry about overwriting your current configuration; this command by itself simply prints the configuration to the standard output.)

 # grub-mkconfig

What if you want to add menu entries and other commands to the GRUB configuration? The short answer is that you should put your customizations into a new custom.cfg file in your GRUB configuration directory, such as /boot/grub/custom.cfg.

The long answer is a little more complicated. The /etc/grub.d configuration directory gives you two options: 40_custom and 41_custom. The first, 40_custom, is a script that you can edit yourself, but it's probably the least stable; a package upgrade is likely to destroy any changes you make. The 41_custom script is simpler; it's just a series of commands that load custom.cfg when GRUB starts. (Keep in mind that if you choose this second option, your changes won't appear when you generate your configuration file.) The two options for custom configuration files aren't particularly extensive. You'll see additions in your particular distribution's /etc/grub.d directory. For ex.: Ubuntu adds memory tester boot options (memtest86+) to the configuration. To write and install a newly generated GRUB configuration file, you can write the configuration to your GRUB directory with the -o option to grub-mkconfig, like this:

 # grub-mkconfig -o /boot/grub/grub.cfg

Or, if you're an Ubuntu user, just run update-grub. In any case, back up your old configuration, make sure that you're installing to the correct directory, and so on.
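For ex.: a cautious regeneration might look like this (a minimal sketch; the paths assume the standard /boot/grub location):

 # cp /boot/grub/grub.cfg /boot/grub/grub.cfg.bak
 # grub-mkconfig -o /boot/grub/grub.cfg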

Now we're going to get into some of the more technical details of GRUB and boot loaders. Are you tired? Then skip to Chapter 6 :-)

GRUB Installation

Installing GRUB is more involved than configuring it. Fortunately, you won't normally have to worry about installation because your distribution should handle it for you. However, if you're trying to duplicate or restore a bootable disk, or preparing your own boot sequence, you might need to install it on your own.

Build the GRUB software set and determine where your GRUB directory will be; the default is /boot/grub. You may not need to build GRUB if your distribution does it for you, but if you do, see Chapter 16 for how to build software from source code. Make sure that you build the correct target: it's different for MBR and UEFI boot (and there are even differences between 32-bit and 64-bit EFI).

Installing GRUB on your system

Installing the boot loader requires that you or an installer determine the following:

 * The target GRUB directory as seen by your currently running system. That's usually /boot/grub, but it might be different if you're installing GRUB to another disk for use on another system. 
 * The current device of the GRUB target disk. 
 * For UEFI booting, the current mount point of the UEFI boot partition. 

Remember that GRUB is a modular system, but in order to load modules, it must read the filesystem that contains the GRUB directory. Your task is to construct a version of GRUB capable of reading that filesystem so that it can load the rest of its configuration (grub.cfg) and any required modules. On Linux, this usually means building a version of GRUB with its ext2.mod module preloaded. Once you have this version, all you need to do is place it on the bootable part of the disk and place the rest of the required files into /boot/grub. Fortunately, GRUB comes with a utility called grub-install (not to be confused with Ubuntu's update-grub), which performs most of the work of installing the GRUB files and configuration for you. For ex.: if your current disk is at /dev/sda and you want to install GRUB on that disk with the current /boot/grub directory, use this command to install GRUB on the MBR:

 # grub-install /dev/sda

Warning: Incorrectly installing GRUB may break the bootup sequence on your system, so don't take this command lightly. If you're concerned, read up on how to back up your MBR with dd, back up any other currently installed GRUB directory, and make sure that you have an emergency bootup plan.
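For ex.: here is a minimal sketch of such a backup with dd, assuming the disk is /dev/sda (adjust the device to match your system):

 # dd if=/dev/sda of=mbr-backup.img bs=512 count=1
 # dd if=mbr-backup.img of=/dev/sda bs=446 count=1

The first command saves the full 512-byte MBR; the second restores only the 446 bytes of boot code, leaving the partition table untouched.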

Installing GRUB on an External Storage Device

To install GRUB on a storage device outside the current system, you must manually specify the GRUB directory on that device as your current system now sees it. For ex.: say that you have a target device of /dev/sdc and that device's root/boot filesystem (for ex.: /dev/sdc1) is mounted on /mnt of your current system. This implies that when you install GRUB, your current system will see the GRUB files in /mnt/boot/grub. When running grub-install, tell it where those files should go as follows:

 # grub-install --boot-directory=/mnt/boot /dev/sdc

Installing GRUB with UEFI

UEFI installation is supposed to be easier, because all you need to do is copy the boot loader into place. But you also need to "announce" the boot loader to the firmware with the efibootmgr command. The grub-install command runs this if it's available, so in theory all you need to do to install on a UEFI partition is the following:

 # grub-install --efi-directory=efi_dir --bootloader-id=name

Here, efi_dir is where the UEFI partition appears on your current system (usually /boot/efi, because that's where it is typically mounted) and name is an identifier for the boot loader, as described in Section 5.8.2.
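For ex.: on a system with the UEFI partition mounted at /boot/efi, the invocation might look like this (a sketch; the ID grub is an arbitrary name of your choosing):

 # grub-install --efi-directory=/boot/efi --bootloader-id=grub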

Unfortunately, many problems can crop up when installing a UEFI boot loader. For ex.: if you're installing to a disk that will eventually end up in another system, you have to figure out how to announce that boot loader to the new system's firmware. And there are differences in the install procedure for removable media. But one of the biggest problems is UEFI secure boot.

UEFI Secure Boot Problems

Uefi linux.jpeg

One of the newest problems affecting Linux installation is the secure boot feature found on recent PCs. When active, this UEFI mechanism requires boot loaders to be digitally signed by a trusted authority in order to run. Microsoft has required vendors shipping Windows 8 to use secure boot. The result is that if you try to install an unsigned boot loader (which is what most current Linux distributions ship), it will not load. The easiest way around this for anyone with no interest in Windows is to disable secure boot in the UEFI settings. However, this won't work cleanly for dual-boot systems and may not be an option for all users. Therefore, Linux distributions are offering signed boot loaders. Some solutions are just frontends to GRUB, some offer a fully signed loading sequence (from the boot loader to the kernel), and others are entirely new boot loaders (some based on efilinux).

Chainloading Other Operating Systems

UEFI makes it relatively easy to support loading other operating systems because you can install multiple boot loaders in the EFI partition. However, the older MBR style doesn't support it, and even if you do have UEFI, you may still have an individual partition with an MBR-style boot loader that you want to use. You can get GRUB to load and run a different boot loader on a specific partition on your disk by chainloading. To chainload, create a new menu entry in your GRUB configuration (using one of the methods explained in Section 5.5.2). Here's an example for a Windows installation on the third partition of a disk:

——————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 menuentry "Windows" {
         insmod chain
         insmod ntfs
         set root=(hd0,3)
         chainloader +1
 }

——————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

The +1 option to chainloader tells it to load whatever is at the first sector of a partition. You can also get it to directly load a file by using a line like this to load the io.sys MS-DOS loader:

——————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 menuentry "DOS" {
         insmod chain
         insmod fat
         set root=(hd0,3)
         chainloader /io.sys
 }

——————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Boot Loader Details

Now we'll look quickly at some boot loader internals. Feel free to skip to the next chapter if this material doesn't interest you. To understand how boot loaders like GRUB work, let's first survey how a PC boots when you turn it on. Due to the repeated inadequacies of traditional PC boot mechanisms, there are several variations, but there are two main schemes: MBR and UEFI.

MBR boot

In addition to the partition information described in Section 4.1, the Master Boot Record (MBR) includes a small area (441 bytes) that the PC BIOS loads and executes after its Power-On Self-Test (POST). Unfortunately, this is too little storage to house almost any boot loader, so additional space is necessary, resulting in what is sometimes called a multi-stage boot loader. In this case the initial piece of code in the MBR does nothing other than load the rest of the boot loader code. The remaining pieces of the boot loader are usually stuffed into the space between the MBR and the first partition on the disk. Of course, this isn't terribly secure because anything can overwrite the code there, but most boot loaders do it, including most GRUB installations. In addition, this scheme won't work with a GPT-partitioned disk using the BIOS to boot because the GPT table information resides in the area after the MBR. (GPT leaves the traditional MBR alone for backward compatibility.)

The workaround for GPT is to create a small partition called a BIOS boot partition with a special UUID to give the full boot loader code a place to reside. But GPT is normally used with UEFI, not the traditional BIOS, which leads us to the UEFI boot scheme.

UEFI boot

PC manufacturers and software companies realized that the traditional PC BIOS is severely limited, so they decided to develop a replacement called the Extensible Firmware Interface (EFI). EFI took a while to catch on for most PCs, but now it's fairly common. The current standard is the Unified EFI (UEFI), which includes features such as a built-in shell and the ability to read partition tables and navigate filesystems. The GPT partitioning scheme is part of the UEFI standard.

Booting is radically different on UEFI systems and, for the most part, much easier to understand. Rather than executable boot code residing outside of a filesystem, there is always a special filesystem called the EFI System Partition (ESP), which contains a directory named efi. Each boot loader has its own identifier and a corresponding subdirectory, such as efi/microsoft, efi/apple, or efi/grub. A boot loader file has an .efi extension and resides in one of these subdirectories, along with other supporting files.

Note: The ESP differs from the BIOS boot partition described in Section 5.8.1 and has a different UUID.

There's a wrinkle, though: You can't just put old boot loader code into the ESP because that code was written for the BIOS interface. Instead, you must provide a boot loader written for UEFI. For ex.: when using GRUB, you have to install the UEFI version of GRUB rather than the BIOS version. In addition, you must "announce" new boot loaders to the firmware. And, as mentioned in Section 5.6, we have the "secure boot" issue.

How GRUB Works

Let's wrap up our discussion of GRUB by looking at how it does its work:

 1. The PC BIOS or firmware initializes the hardware and searches its boot-order storage devices for boot code. 
 2. Upon finding the boot code, the BIOS/firmware loads and executes it. This is where GRUB begins. 
 3. The GRUB core loads. 
 4. The core initializes. At this point, GRUB can now access disks and filesystems.
 5. GRUB identifies its boot partition and loads a configuration there. 
 6. GRUB gives the user a chance to change the configuration. 
 7. After a timeout or user action, GRUB executes the configuration (the sequence of commands in the grub.cfg file). 
 8. In the course of executing the configuration, GRUB may load additional code (modules) in the boot partition.
 9. GRUB executes a boot command to load and execute the kernel as specified by the configuration's linux command. 

Steps 3 and 4 of the preceding sequence, where the GRUB core loads, can be complicated due to the repeated inadequacies of traditional PC boot mechanisms. The biggest question is "Where is the GRUB core?" There are three basic possibilities:

 * Partially stuffed between the MBR and the beginning of the first partition. 
 * In a regular partition. 
 * In a special boot partition: a GPT boot partition, EFI System Partition (ESP), or elsewhere. 

In all cases except where you have an ESP, the PC BIOS loads 512 bytes from the MBR, and that is where GRUB starts. This little piece (derived from boot.img in the GRUB directory) isn't yet the core, but it contains the start location of the core and loads the core from this point. However, if you have an ESP, the GRUB core goes there as a file. The firmware can navigate the ESP and directly execute the GRUB core or any other OS loader located there.

Still, on most systems, this is not the complete picture. The boot loader might also need to load an initial RAM filesystem image into memory before loading and executing the kernel. That's what the initrd configuration parameter in Section 6.8 specifies. But before you learn about the initial RAM filesystem, you should learn about the user space start; that's where the next chapter begins.

How User Space Starts

The point where the kernel starts its first user-space process, init, is significant, not just because the memory and CPU are finally ready for normal system operation, but because that's where you can see how the rest of the system builds up as a whole. Prior to this point, the kernel executes a well-controlled path of execution defined by a relatively small number of software developers. User space is far more modular: it's much easier to see what goes into user space startup and operation. For the adventurous, it's also relatively easy to change the user space startup, because doing so requires no low-level programming.

User space starts in roughly this order:

 1. init
 2. Essential low-level services such as udevd and syslogd
 3. Network configuration
 4. Mid- and high-level services (cron, printing, and so on)
 5. Login prompts, GUIs, and other high-level applications. 

Introduction to init

The init program is a user-space program like any other program on the Linux system, and you'll find it in /sbin along with many of the other system binaries. Its main purpose is to start and stop the essential service processes on the system, but newer versions have more responsibilities.

There are three major implementations of init in Linux distributions:

 System V init: A traditional sequenced init (Sys V, usually pronounced "sys-five"; not in common use on recent systems). 
 systemd: The standard init for all mainstream Linux distributions.
 Upstart: The init on Ubuntu installations prior to version 15.04. 

There are various other versions of init as well, especially on embedded platforms. For ex.: Android has its own init. The BSDs also have their own versions of init, but you are unlikely to see them on a modern Linux machine. (Some distributions have also modified the System V init configuration to resemble the BSD style.) There are many different implementations of init because System V init and other older versions relied on a sequence that performed only one startup task at a time. Under this scheme, it is relatively easy to resolve dependencies. However, performance isn't terribly good, because two parts of the boot sequence cannot normally run at once. Another limitation is that you can only start a fixed set of services as defined by the boot sequence: when you plug in new hardware or need a service that isn't already running, there is no standardized way to coordinate the new components with init. systemd and Upstart attempt to remedy the performance issue by allowing many services to start in parallel, thereby speeding up the boot process. Their implementations are quite different, though:

 * systemd is goal oriented. You define a target that you want to achieve, along with its dependencies, and when you want to reach the target. systemd satisfies the dependencies and resolves 
   the target. systemd can also defer the start of a service until it is absolutely needed. 
 * Upstart is reactionary. It receives events and, based on those events, runs jobs that can in turn produce more events, causing Upstart to run more jobs, and so on. 

The systemd and Upstart init systems also offer a more advanced way to start and track services. In traditional init systems, service daemons are expected to start themselves from scripts. A script runs a daemon program, which detaches itself from the script and runs autonomously. To find the PID of a service daemon, you need to use ps or some other mechanism specific to the service. In contrast, Upstart and systemd can manage individual service daemons from the beginning, giving the user more power and insight into exactly what is running on the system. Because the new init systems are not script-centric, configuring services for them also tends to be easier. In particular, System V init scripts tend to contain many similar commands designed to start, stop, and restart services. You don't need all of this redundancy with systemd and Upstart, which allow you to concentrate on the services themselves rather than their scripts.

Finally, systemd and Upstart both offer some level of on-demand services. Rather than trying to start all the services that may be necessary at boot time (as System V init would do), they start some services only when needed. This idea is not really new; it was done with the traditional inetd daemon, but the new implementations are more sophisticated. Both systemd and Upstart offer some System V backward compatibility. For ex.: both support the concept of runlevels.

System V Runlevels

At any given time on a Linux system, a certain base set of processes (such as crond and udevd) is running. In System V init, this state of the machine is called its runlevel, which is denoted by a number from 0 to 6. A system spends most of its time in a single runlevel, but when you shut the machine down, init switches to a different runlevel in order to terminate the system services in an orderly fashion and to tell the kernel to stop. You can check your system's runlevel with the who -r command. A system running Upstart responds with something like this:

 $ who -r
 run-level 2 2015-09-06 14:23

This output tells us that the current runlevel is 2, as well as the date and time that the runlevel was established. Runlevels serve various purposes, but the most common one is to distinguish between system startup, shutdown, single-user mode, and console mode states. For ex.: Fedora-based systems traditionally used runlevels 2 through 4 for the text console; runlevel 5 means that the system starts a GUI login.

But runlevels are becoming a thing of the past. Even though all three init versions support them, systemd and Upstart consider runlevels obsolete as end states for the system. To systemd and Upstart, runlevels exist primarily to start services that support only the System V init scripts, and the implementations are so different that even if you're familiar with one type of init, you won't necessarily know what to do with another.

Identifying Your init

Before proceeding, you need to determine your system's version of init. If you're not sure, check your system as follows:

 * If your system has /usr/lib/systemd and /etc/systemd directories, you have systemd. 
 * If you have an /etc/init directory that contains several .conf files, you're probably running Upstart. 
 * If neither of the above is true, but you have an /etc/inittab file, you're probably running System V init. 

If your system has manual pages installed, viewing the init(8) manual page should help identify your version.
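As an additional quick check (a suggestion beyond the book's list), you can ask what process 1 actually is; on a systemd machine this typically prints systemd:

 $ ps -p 1 -o comm=
 systemd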

systemd

The systemd init is one of the newest init implementations on Linux. In addition to handling the regular boot process, systemd aims to incorporate a number of standard Unix services such as cron and inetd. As such, it takes some inspiration from Apple's launchd. One of its most significant features is its ability to defer the start of services and operating system features until they are necessary. systemd has so many features that it can be very difficult to know where to start learning the basics. Let's outline what happens when systemd runs at boot time:

 1. systemd loads its configuration. 
 2. systemd determines its boot goal, which is usually named default.target.
 3. systemd determines all of the dependencies of the default boot goal, dependencies of these dependencies, and so on. 
 4. systemd activates the dependencies and the boot goal. 
 5. After boot, systemd can react to system events (such as uevents) and activate additional components. 

When starting services, systemd does not follow a rigid sequence. As with other modern init systems, there is a considerable amount of flexibility in the systemd bootup process. Most systemd configurations deliberately try to avoid any kind of startup sequence, preferring to use other methods to resolve strict dependencies.

Units and Unit Types

One of the most interesting things about systemd is that it does not just operate processes and services; it can also mount filesystems, monitor network sockets, run timers, and more. Each type of capability is called a unit type, and each specific capability is called a unit. When you turn on a unit, you activate it. Rather than describe all of the unit types (you'll find them in the systemd(1) manual page), here's a look at a few of the unit types that perform the boot-time tasks in any Unix system:

 Service units: Control the traditional service daemons on a Unix system. 
 Mount units:   Control the attachment of filesystems to the system. 
 Target units: Control other units, usually by grouping them. 

The default boot goal is usually a target unit that groups together a number of service and mount units as dependencies. As a result, it's easy to get a partial picture of what's going to happen when you boot, and you can even create a dependency tree diagram with the systemctl dot command. You'll find the tree to be quite large on a typical system, because many units don't run by default. Figure 6-1 shows a part of the dependency tree for the default.target unit found on a Fedora system. When you activate that unit, all of the units below it on the tree also activate.

Figure6-1.png

systemd Dependencies

Boot-time and operational dependencies are more complicated than they may seem at first, because strict dependencies are too inflexible. For ex.: imagine a scenario in which you want to display a login prompt after starting a database server, so you define a dependency from the login prompt to the database server. However, if the database server fails, the login prompt will also fail due to that dependency, and you won't even be able to log in to your machine to fix it. Unix boot-time tasks are fairly fault tolerant and can often fail without causing serious problems for standard services. For ex.: if a data disk for a system was removed but its /etc/fstab entry remained, the initial filesystem mount would fail. However, that failure typically wouldn't seriously affect standard operating system operation. To accommodate the need for flexibility and fault tolerance, systemd offers a myriad of dependency types and styles. We'll label them by their keyword syntax, but we won't go into details about configuration syntax until Section 6.4.3. Let's first look at the basic types:

 Requires : Strict dependencies. When activating a unit with a Requires dependency unit, systemd attempts to activate the dependency unit. If the dependency unit fails, systemd 
 deactivates the dependent unit. 
 Wants    : Dependencies for activation only. Upon activating a unit, systemd activates the unit's Wants dependencies, but it doesn't care if those dependencies fail. 
 Requisite: Units that must already be active. Before activating a unit with a Requisite dependency, systemd first checks the status of the dependency. If the dependency has not 
 been activated, systemd fails on activation of the unit with the dependency. 
 Conflicts: Negative dependencies. When activating a unit with a Conflicts dependency, systemd automatically deactivates the dependency if it is active. Simultaneous activation 
 of two conflicting units fails. 

Note: The Wants dependency type is especially significant because it does not propagate failures to other units. The systemd documentation states that this is the way you should specify dependencies if possible, and it's easy to see why. This behaviour produces a much more robust system, similar to that of a traditional init.

You can also attach dependencies "in reverse". For ex.: in order to add Unit A as a Wants dependency to Unit B, you don't have to add the Wants in Unit B's configuration. Instead, you can install it as a WantedBy in Unit A's configuration. The same is true of the RequiredBy dependency. The configuration for (and result of) a "By" dependency is slightly more involved than just editing a configuration file; see "Enabling Units and the [Install] Section".

You can view a unit's dependencies with the systemctl command, as long as you specify a type of dependency, such as Wants or Requires:

 # systemctl show -p type unit
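For ex.: to see what multi-user.target wants (a sketch; the unit name will vary with what you're inspecting):

 # systemctl show -p Wants multi-user.target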

Ordering

None of the dependency syntax that you've seen so far explicitly specifies order. By default, activating a unit with a Requires or Wants dependency causes systemd to activate all of these dependencies at the same time as the first unit. This is optimal, because you want to start as many services as possible as quickly as possible to reduce boot time. However, there are situations when one unit must start after another. For instance, in the system that Figure 6-1 is based on, the default.target unit is set to start after multi-user.service (this order distinction is not shown in the figure).

To activate units in a particular order, you can use the following dependency modifiers:

 Before: The current unit will activate before the listed unit(s). For ex.: if Before=bar.target appears in foo.target, systemd activates foo.target before bar.target.
 After:  The current unit activates after the listed unit(s).

Conditional Dependencies

Several dependency condition keywords operate on various operating system states rather than systemd units. For ex.:

 ConditionPathExists=p True if the (file) path p exists in the system. 
 ConditionPathIsDirectory=p True if p is a directory. 
 ConditionFileNotEmpty=p True if p is a file and it's not zero-length. 

If a conditional dependency in a unit is false when systemd tries to activate the unit, the unit does not activate, though this applies only to the unit in which it appears. Therefore, if you activate a unit that has a condition dependency as well as some other unit dependencies, systemd attempts to activate those unit dependencies regardless of whether the condition is true or false. Other dependencies are primarily variations on the preceding. For ex.: the RequiresOverridable dependency is just like Requires when running normally, but it acts like a Wants dependency if a unit is manually activated. (For a full list, see the systemd.unit(5) manual page.)
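For ex.: a minimal, hypothetical service unit guarded by a condition might look like this (the name and paths are invented for illustration; the unit activates only if the configuration file exists):

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 [Unit]
 Description=Example service guarded by a condition (hypothetical)
 ConditionPathExists=/etc/myapp/myapp.conf
 
 [Service]
 ExecStart=/usr/bin/myapp

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————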

systemd Configuration

The systemd configuration files are spread among many directories across the system, so you typically won't find the files for all of the units on a system in one place. That said, there are two main directories for systemd configuration: the system unit directory (globally configured, usually /usr/lib/systemd/system) and a system configuration directory (local definitions, usually /etc/systemd/system). To prevent confusion, stick to this rule: avoid making changes to the system unit directory, because your distribution will maintain it for you. Make your local changes to the system configuration directory. So when given the choice between modifying something in /usr and /etc, always change /etc.

Note: You can check the current systemd configuration search path (including precedence) with this command:

 # systemctl -p UnitPath show

However, this particular setting comes from a third source: pkg-config settings. To see the system units and configuration directories on your system, use the following commands:

 $ pkg-config systemd --variable=systemdsystemunitdir
 $ pkg-config systemd --variable=systemdsystemconfdir
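On a typical Fedora-style layout, you might see output like this (your system may differ):

 /usr/lib/systemd/system
 /etc/systemd/system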

Unit Files

Unit files are derived from the XDG Desktop Entry Specification (for .desktop files, which are very similar to .ini files on Microsoft systems), with section names in brackets ([]) and variable and value assignments (options) in each section.

Consider the example unit file media.mount in /usr/lib/systemd/system, which is standard on Fedora installations. This file represents the /media tmpfs filesystem, which is a container directory for mounting removable media.

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 [Unit]
 Description=Media Directory
 Before=local-fs.target
 
 [Mount]
 What=tmpfs
 Where=/media
 Type=tmpfs
 Options=mode=755,nosuid,nodev,noexec

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

There are two sections here. The [Unit] section gives some details about the unit and contains description and dependency information. In particular, this unit is set to activate before the local-fs.target unit. The [Mount] section details the unit as being a mount unit, and it gives the details on the mount point, the type of filesystem, and the mount options. The What= variable identifies the device or UUID of the device to mount. Here, it's set to tmpfs because the filesystem does not have a device. (For a full list of mount options, see the systemd.mount(5) manual page.)

Many other unit configuration files are similarly straightforward. For ex.: the service unit file sshd.service enables secure shell logins:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 [Unit]
 Description=OpenSSH server Daemon
 Wants=sshdgenkeys.service
 After=sshdgenkeys.service
 After=network.target
 
 [Service]
 ExecStart=/usr/bin/sshd -D $OPTIONS
 ExecReload=/bin/kill -HUP $MAINPID
 KillMode=process
 Restart=always
 
 [Install]
 WantedBy=multi-user.target

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Because this is a service unit, you'll find the details about the service in the [Service] section, including how to prepare, start, and reload the service. You'll find a complete listing in the systemd.service(5) manual page (and in systemd.exec(5)), as well as in the discussion of process tracking in Section 6.4.6.

Enabling Units and the [Install] Section

The [Install] section in the sshd.service unit file is important because it helps us to understand how to use systemd's WantedBy and RequiredBy dependency options. It's actually a mechanism for enabling units without modifying any configuration files. During normal operation, systemd ignores the [Install] section. However, consider the case when sshd.service is disabled on your system and you would like to turn it on. When you enable a unit, systemd reads the [Install] section; in this case, enabling the sshd.service unit causes systemd to see the WantedBy dependency for multi-user.target. In response, systemd creates a symbolic link to sshd.service in the system configuration directory as follows:

 # ln -s '/usr/lib/systemd/system/sshd.service'  '/etc/systemd/system/multi-user.target.wants/sshd.service'

Notice that the symbolic link is placed into a subdirectory corresponding to the dependent unit (multi-user.target in this case). The [Install] section is usually responsible for the .wants and .requires directories in the system configuration directory (/etc/systemd/system). However, there are also .wants directories in the unit configuration directory (/usr/lib/systemd/system), and you may also add links that don't correspond to [Install] sections in the unit files. These manual additions are a simple way to add a dependency without modifying a unit file that may be overwritten in the future (by a software upgrade, for instance).

Note: Enabling a unit is different from activating a unit. When you enable a unit, you are installing it into systemd's configuration, making semipermanent changes that will survive a reboot. But you don't always need to explicitly enable a unit. If the unit file has an [Install] section, you must enable it with systemctl enable; otherwise, the existence of the file is enough to enable it. When you activate a unit with systemctl start, you're just turning it on in the current runtime environment. In addition, enabling a unit does not activate it.
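To make the distinction concrete, here's a sketch using sshd.service as the example unit; the first command makes the semipermanent change (creating the symbolic link described above), and the second activates the service in the current runtime only:

 # systemctl enable sshd.service
 # systemctl start sshd.service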

Variables and Specifiers

The sshd.service unit file also shows the use of variables: specifically, the $OPTIONS and $MAINPID environment variables that are passed in by systemd. $OPTIONS are options that you can pass to sshd when you activate the unit with systemctl, and $MAINPID is the tracked process of the service (see Section 6.4.6). A specifier is another variable-like feature often found in unit files. Specifiers start with a percent sign (%). For ex.: the %n specifier is the current unit name, and the %H specifier is the current hostname.

Note: The unit name can contain some interesting specifiers. You can parameterize a single unit file in order to spawn multiple copies of a service, such as getty processes running on tty1, tty2, and so on. To use these specifiers, add the @ symbol to the end of the unit name. For getty, create a unit file named getty@.service, which allows you to refer to units such as getty@tty1 and getty@tty2. Anything after the @ is called the instance, and when processing the unit file, systemd expands the %I specifier to the instance. You can see this in action with the getty@.service unit files that come with most distributions running systemd.
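For ex.: starting an instance is just a matter of naming it (tty3 here is an arbitrary choice); inside getty@.service, %I then expands to tty3:

 # systemctl start getty@tty3.service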

systemd Operation

You'll interact with systemd primarily through the systemctl command, which allows you to activate and deactivate services, list status, reload the configuration, and much more.

The most essential basic commands deal with obtaining unit information. For ex.: to view a list of active units on your system, issue a list-units command. (This is actually the default command for systemctl, so you don't need the list-units part.):

 $ systemctl list-units

The output format is typical of a Unix information-listing command. For ex.: the header and the line for media.mount would look like this:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 UNIT               LOAD     ACTIVE SUB     JOB   DESCRIPTION
 media.mount        loaded   active mounted Media Directory

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

This command produces a lot of output, because a typical system has numerous active units, but it will still be abridged because systemctl truncates any really long unit names. To see the full names of the units, use the --full option, and to see all units (not just those that are active), use the --all option. A particularly useful systemctl operation is getting the status of a unit. For ex.: here is a typical status command and its output:

 $ systemctl status media.mount
 ● media.mount - Media Directory
    Loaded: loaded (/usr/lib/systemd/system/media.mount; static; vendor preset: disabled)
    Active: active (mounted) since Fri 2017-11-17 12:29:26 CET; 6s ago
     Where: /media
      What: tmpfs
   Process: 18209 ExecMount=/usr/bin/mount tmpfs /media -t tmpfs -o mode=755,nosuid,nodev,noexec (code=exited, status=0/SUCCESS)
     Tasks: 0 (limit: 4915)
    CGroup: /system.slice/media.mount
 
 nov 17 12:29:26 game systemd[1]: Mounting Media Directory...
 nov 17 12:29:26 game systemd[1]: Mounted Media Directory.

Notice that there is much more information output here than you would see on any traditional init system. You get not only the state of the unit but also the exact command used to perform the mount, its PID, and its exit status. One of the most interesting pieces of the output is the control group name. In the preceding example, the control group doesn't include any information other than the name /system.slice/media.mount because the unit's processes have already terminated. However, if you get the status of a service unit such as NetworkManager.service, you'll also see the process tree of the control group. You can view control groups without the rest of the unit status with the systemd-cgls command. You'll learn more about control groups in Section 6.4.6. The status command also displays recent information from the unit's journal (a log that records diagnostic information for each unit). You can view a unit's entire journal with this command:

 $ journalctl _SYSTEMD_UNIT=unit

(This syntax is a bit odd because journalctl can access the logs of more than just a systemd unit.)
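Most versions of journalctl also accept the shorter -u option for the same task; for ex.: (sshd.service is just a sample unit name):

 $ journalctl -u sshd.service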

To activate, deactivate, and restart units, use the systemctl start, stop, and restart commands. However, if you've changed a unit configuration file, you can tell systemd to reload the file in one of two ways:

 systemctl reload unit   Reloads just the configuration for unit. 
 systemctl daemon-reload Reloads all unit configurations. 

Requests to activate, reactivate, and restart units are known as jobs in systemd, and they are essentially unit state changes. You can check the current jobs on a system with:

 $ systemctl list-jobs

If a system has been running for some time, you can reasonably expect there to be no active jobs on it because all of the activations should be complete. However, at boot time, you can sometimes log in fast enough to see some units start so slowly that they are not yet fully active. For ex.:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 JOB  UNIT                        TYPE        STATE
  1   graphical.target           start       waiting
  2   multi-user.target          start       waiting
 71   systemd-...nlevel.service  start       waiting
 76   sendmail.service           start       running 

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

In this case, job 76, the sendmail.service unit startup, is taking a really long time. The other listed jobs are in a waiting state, most likely because they're waiting for job 76. When sendmail.service finishes starting and becomes fully active, job 76 will complete, the rest of the jobs will also complete, and the job list will be empty.

Note: The term job can be confusing, especially because Upstart, another init system described in this chapter, uses the word job to (roughly) refer to what systemd calls a unit. It's important to remember that although a systemd job associated with a unit will terminate, the unit itself can be active and running afterwards, especially in the case of service units.

See Section 6.7 for how to shut down and reboot the system.

Adding Units to systemd

Adding units to systemd is primarily a matter of creating, then activating and possibly enabling, unit files. You should normally put your own unit files in the system configuration directory /etc/systemd/system so that you won't confuse them with anything that came with your distribution and so that the distribution won't overwrite them when you upgrade. Because it's easy to create target units that don't do anything and don't interfere with anything, you should try it. Here's how to create two targets, one with a dependency on the other:

1. Create a unit file named test1.target:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 [Unit]
 Description=test1

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

2. Create a test2.target file with a dependency on test1.target:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 [Unit]
 Description=test2
 Wants=test1.target

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

3. Activate the test2.target unit (remember that the dependency in test2.target causes systemd to activate test1.target when you do this):

 # systemctl start test2.target

4. Verify that both units are active:

 # systemctl status test?.target
 ● test1.target - test1
    Loaded: loaded (/etc/systemd/system/test1.target; static; vendor preset: disabled)
    Active: active since Fri 2017-11-17 13:16:17 CET; 3s ago
 
 nov 17 13:16:17 game systemd[1]: Reached target test1.
 
 ● test2.target - test2
    Loaded: loaded (/etc/systemd/system/test2.target; static; vendor preset: disabled)
    Active: active since Fri 2017-11-17 13:09:50 CET; 6min ago
 
 nov 17 13:09:50 game systemd[1]: Reached target test2.

Note: If your unit file has an [Install] section, "enable" the unit before activating it:

 # systemctl enable unit

Try this with the preceding example: remove the dependency from test2.target and add an [Install] section to test1.target containing WantedBy=test2.target.
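A sketch of the reworked test1.target (assuming both files live in /etc/systemd/system):

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 [Unit]
 Description=test1
 
 [Install]
 WantedBy=test2.target

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Then enable test1.target and start test2.target as before:

 # systemctl enable test1.target
 # systemctl start test2.target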

Removing Units

To remove a unit, follow these steps:

1. Deactivate the unit if necessary:

 # systemctl stop unit

2. If the unit has an [Install] section, disable the unit to remove any dependent symbolic links:

 # systemctl disable unit

3. Remove the unit file, if you like.

systemd Process Tracking and Synchronization

systemd wants a reasonable amount of information and control over every process that it starts. The main problem that it faces is that a service can start in different ways; it may fork new instances of itself or even daemonize and detach itself from the original process. To minimize the work that a package developer or administrator needs to do in order to create a working unit file, systemd uses control groups (cgroups), an optional Linux kernel feature that allows for finer tracking of a process hierarchy. If you've worked with Upstart before, you know that you have to do a little extra work to figure out what the main process is for a service. In systemd, you don't have to worry about how many times a process forks, just whether it forks. Use the Type option in your service unit file to indicate its startup behaviour. There are two basic startup styles:

 Type=simple  The service process doesn't fork. 
 Type=forking The service forks, and systemd expects the original service process to terminate. Upon termination, systemd assumes that the service is ready. 

The Type=simple option doesn't account for the fact that a service may take some time to set up, and systemd doesn't know when to start any dependencies that absolutely require such a service to be ready. One way to deal with this is to use delayed startup (see Section 6.4.7). However, some Type startup styles can indicate that the service itself will notify systemd when it is ready:

 Type=notify The service sends a notification specific to systemd (with the sd_notify() function call) when it's ready. 
 Type=dbus   The service registers itself on the D-bus (Desktop Bus) when it's ready. 

Another service startup style is specified with Type=oneshot; here the service process actually terminates completely when it's finished. With such a service, you will almost certainly need to add a RemainAfterExit=yes option so that systemd will still regard the service as active even after its processes terminate.
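For ex.: a hypothetical one-shot unit might look like this (the script path is invented for illustration):

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 [Unit]
 Description=One-time cleanup task (hypothetical)
 
 [Service]
 Type=oneshot
 ExecStart=/usr/local/bin/cleanup.sh
 RemainAfterExit=yes

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————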

Finally, there's one last style: Type=idle. This simply instructs systemd not to start the service until there are no active jobs. The idea here is just to delay a service start until other services have started, to keep the system load down or to keep services from stepping on one another's output. (Remember, once a service has started, the systemd job that started the service terminates.)

systemd On-Demand and Resource-Parallelized Startup

One of systemd's most significant features is its ability to delay a unit startup until it is absolutely needed. The setup typically works like this:

 1. You create a systemd unit (call it Unit A) for the system service that you'd like to provide, as normal. 
 2. You identify a system resource such as a network port/socket, file, or device that Unit A uses to offer its services. 
 3. You create another systemd unit, Unit R, to represent that resource. These units have special types such as socket units, path units, and device units. 

Operationally, it goes like this:

 1. Upon activation of Unit R, systemd monitors the resource. 
 2. When anything tries to access the resource, systemd blocks the resource, and the input to the resource is buffered. 
 3. systemd activates Unit A. 
 4. When the service from Unit A is ready, it takes control of the resource, reads the buffered input, and runs normally. 

There are a few concerns:

 * You must make sure that your resource unit covers every resource that the service provides. This normally isn't a problem, as most services have just one point of access. 
 * You need to make sure your resource unit is tied to the service unit that it represents. This can be implicit or explicit, and in some cases, many options represent different 
   ways for systemd to perform the handoff to the service unit.  
 * Not all servers know how to interface with the units that systemd can provide. 

If you already know what utilities like inetd, xinetd, and automount do, you'll see that there are a lot of similarities. Indeed, the concept is nothing new (and in fact, systemd includes support for automount units). We'll go over an example of a socket unit a bit later. But let's first take a look at how these resource units help you at boot time.

Boot Optimization with Auxiliary Units

A common style of unit activation in systemd attempts to simplify dependency order and speed up boot time. It's similar to on-demand startup in that a service unit and an auxiliary unit represent the service unit's offered resource, except that in this case systemd starts the service unit as soon as it activates the auxiliary unit. The reasoning behind this scheme is that essential boot-time services such as syslogd and dbus take some time to start, and many other units depend on them. However, systemd can offer a unit's essential resource (such as a socket unit) very quickly, and then it can immediately activate not only the essential unit but also any units that depend on the essential resource. Once the essential unit is ready, it takes control of the resource. Figure 6-2 shows how this might work in a traditional system. In this boot timeline, Service E provides an essential Resource R. Services A, B, and C depend on this resource and must wait until Service E has started. When booting, the system takes quite a long time to get around to starting Service C.

Figure 6-2.gif

Figure 6-3 shows an equivalent systemd boot configuration. The services are represented by Units A, B, and C, and a new Unit R represents the resource that Unit E provides. Because systemd can provide an interface for Unit R while Unit E starts, Units A, B, C, and E can all be started at the same time. Unit E takes over for Unit R when ready. (An interesting point here is that Units A, B, and C may not need to explicitly access Unit R before they finish their startup, as Unit B in the figure demonstrates.)

Figure 6-3.gif

Note: When parallelizing startup like this, there is a chance that your system may slow down temporarily due to a large number of units starting at once.

The takeaway is that, although you're not creating an on-demand unit startup in this case, you're using the same features that make on-demand startup possible. For a common real-life example, see the syslog and D-Bus configuration units on a machine running systemd; they're very likely to be parallelized in this way.

An Example Socket Unit and Service

We'll now look at an example: a simple network echo service that uses a socket unit. This is somewhat advanced material, and you may not really understand it until you've read the discussion of TCP, ports, and listening in Chapter 9 and sockets in Chapter 10, so feel free to skip this and come back later.

The idea behind this service is that when a network client connects to the service, the service repeats anything that the client sends back to the client. The unit will listen on TCP port 22222. We'll call it the echo service and start with a socket unit, represented by the following echo.socket unit file:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 [Unit]
 Description=echo socket
 
 [Socket]
 ListenStream=22222
 Accept=yes

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Note that there's no mention of the service unit that this socket supports inside the unit file. So what is the corresponding service unit file? Its name is echo@.service. The link is made by naming convention: if a service unit file has the same prefix as a .socket file (in this case, echo), systemd knows to activate that service unit when there's activity on the socket unit. In this case, systemd creates an instance of echo@.service when there's activity on echo.socket.

Here is the echo@.service unit file:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 [Unit]
 Description=echo service
 
 [Service]
 ExecStart=-/bin/cat
 StandardInput=socket

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Note: If you don't like the implicit activation of units based on the prefixes, or you need to create an activation mechanism between two units with different prefixes, you can use an explicit option in the unit defining your resource. For ex.: use Socket=bar.socket inside foo.service to have bar.socket hand its socket to foo.service.

To get this example service unit running, you need to start the echo.socket unit behind it, like this:

 # systemctl start echo.socket

Now you can test the service by connecting to your local port 22222. When the following telnet command connects, type anything and press ENTER. The service repeats what you typed back to you:

 $ telnet localhost 22222
 Trying 127.0.0.1...
 Connected to localhost. 
 Escape character is '^]'.
 HI There. 
 HI There. 

When you're bored with this, press CTRL-] on a line by itself, and then CTRL-D. To stop the service, stop the socket unit:

 # systemctl stop echo.socket

Instances and Handoff

Because the echo@.service unit supports multiple simultaneous instances, there's an @ in the name. Why would you want multiple instances? The reason is that you may have more than one network client connecting to the service at the same time, and each connection should have its own instance. In this case, the service unit must support multiple instances because of the Accept option in echo.socket. That option instructs systemd not only to listen on the port, but also to accept incoming connections and pass them on to the service unit, with each connection a separate instance. Each instance reads data from the connection as standard input, but it doesn't necessarily need to know that the data is coming from a network connection.

Note: Most network connections require more flexibility than just a simple gateway to standard input and output, so don't expect to be able to create network services with a service unit file like the echo@.service unit file shown here.

Although the service unit could do all of the work of accepting the connection, it wouldn't have the @ in its name if it did. In that case, it would take complete control of the socket, and systemd wouldn't attempt to listen on the network port again until the service unit has finished. The many different resources and options for handoff to service units make it difficult to provide a categorical summary. Also, the documentation for the options is spread out over several manual pages. The ones to check for the resource-oriented units are systemd.socket(5), systemd.path(5), and systemd.device(5). One document that's often overlooked for service units is systemd.exec(5), which contains information about how the service unit can expect to receive a resource upon activation.

systemd System V Compatibility

One feature that sets systemd apart from other newer-generation init systems is that it tries to do a more complete job of tracking services started by System V-compatible init scripts. It works like this:

 1. First, systemd activates runlevel<N>.target, where N is the runlevel. 
 2. For each symbolic link in /etc/rc<N>.d, systemd identifies the script in /etc/init.d.
 3. systemd associates the script name with a service unit (for ex.: /etc/init.d/foo would be foo.service). 
 4. systemd activates the service unit and runs the script with either a start or stop argument, based on its name in rc<N>.d.
 5. systemd attempts to associate any processes from the script with the service unit. 

Because systemd makes the association with a service unit name, you can use systemctl to restart the service or view its status. But don't expect any miracles from System V compatibility mode; it still runs the init scripts serially, for example.
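For ex.: on a hypothetical system with an /etc/init.d/foo script linked into the current runlevel, you could check on it like this (foo is an invented name):

 # systemctl status foo.service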

systemd Auxiliary Programs

When starting out with systemd, you may notice the exceptionally large number of programs in /lib/systemd. These are primarily support programs for units. For ex.: udevd is part of systemd, and you'll find it there as systemd-udevd. Another, the systemd-fsck program, works as a middleman between systemd and fsck. Many of these programs exist because they contain notification mechanisms that the standard system utilities lack. Often, they simply run the standard system utilities and notify systemd of the results. (After all, it would be silly to try to reimplement all of fsck inside systemd.)

Note: One other interesting aspect of these programs is that they are written in C, because one goal of systemd is to reduce the number of shell scripts on a system. There is some debate as to whether it's a good idea to do so (after all, many of these programs could probably be written as shell scripts), but as long as everything works, and does so reliably, securely, and reasonably quickly, there's little reason to bother taking sides.

When you see a program in /lib/systemd that you can't identify, see the manual page. There's a good chance that the manual page will not only describe the utility but also describe the type of unit that it's meant to augment. If you're not running (or interested in) Upstart, skip ahead to Section 6.6 for an overview of the System V init process.

Upstart (skipping)

Upstart Initialization Procedure (skipping)

Upstart Jobs (skipping)

Upstart Configuration (skipping)

Upstart Operation (skipping)

Upstart Runlevels and System V Compatibility (skipping)

System V init

The System V init implementation on Linux dates to the early days of Linux; its core idea is to support an orderly bootup to different runlevels with a carefully sequenced process startup. Though System V is now uncommon on most desktop installations, you may encounter System V init in Red Hat Enterprise Linux, as well as in embedded Linux environments such as routers and phones.

There are two major components to a typical System V init installation: a central configuration file and a large set of boot scripts augmented by a symbolic link farm. The configuration file /etc/inittab is where it all starts. If you have System V init, look for a line like the following in your inittab file:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 id:5:initdefault:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

This indicates that the default runlevel is 5.

All lines in inittab take the following form, with four fields separated by colons in this order:

 * A unique identifier (a short string, such as id in the previous example). 
 * The applicable runlevel number(s). 
 * The action that init should take (initdefault in the previous example, which sets the default runlevel). 
 * A command to execute (optional). 

To see how commands work in an inittab file, consider this line:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 l5:5:wait:/etc/rc.d/rc 5

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

This particular line is important because it triggers most of the system configuration and services. Here, the wait action determines when and how System V init runs the command: run /etc/rc.d/rc 5 once when entering runlevel 5, then wait for this command to finish before doing anything else. The rc 5 command executes everything in /etc/rc5.d that starts with a number, in the order of the numbers. The following are some of the most common inittab actions in addition to initdefault and wait.

respawn

The respawn action tells init to run the command that follows and, if the command finishes executing, to run it again. You're likely to see something like this in an inittab file:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 1:2345:respawn:/sbin/mingetty tty1 

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

The getty programs provide login prompts. The line above is used for the first virtual console (/dev/tty1), the one you see when you press ALT-F1 or CTRL-ALT-F1. The respawn action brings the login prompt back after you log out.

ctrlaltdel

The ctrlaltdel action controls what the system does when you press CTRL-ALT-DEL on a virtual console. On most systems, this is some sort of reboot command, using the shutdown command.

sysinit

The sysinit action is the first thing that init should run when starting, before entering any runlevels.

Note: For more available actions, see the inittab(5) manual page.

System V init: Startup Command Sequence

You are now ready to learn how System V init starts system services, just before it lets you log in. Recall this inittab line from earlier:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 l5:5:wait:/etc/rc.d/rc 5

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

This small line triggers many other programs. In fact, rc stands for run commands, which many people refer to as scripts, programs, or services. But where are these commands?

The 5 in this line tells us that we're talking about runlevel 5. The commands are probably either in /etc/rc.d/rc5.d or /etc/rc5.d. (Runlevel 1 uses rc1.d, runlevel 2 uses rc2.d, and so on.) For ex.: you might find the following items in the rc5.d directory:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 S10sysklogd
 S12kerneld
 S15netstd_init
 S18netbase
 S20acct  
————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

The rc 5 command starts programs in the rc5.d directory by executing the following commands in this sequence:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 S10sysklogd start
 S12kerneld start
 S15netstd_init start
 S18netbase start
 S20acct start

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Notice the start argument in each command. The capital S in a command name means that the command should run in start mode, and the number (00 through 99) determines where in the sequence rc starts the command. The rc*.d commands are usually shell scripts that start programs in /sbin or /usr/sbin. Normally, you can figure out what a particular command does by viewing the script with less or another pager program.

Note: Some rc*.d directories contain commands that start with K (for "kill", or stop mode). In this case, rc runs the command with the stop argument instead of start. You will most likely encounter K commands in runlevels that shut down the system.

You can run these commands by hand. However, you normally want to do so through the init.d directory instead of the rc*.d directories, which we'll now describe.

The System V init Link Farm

The contents of the rc*.d directories are actually symbolic links to files in yet another directory, init.d. If your goal is to interact with, add, delete, or modify services in the rc*.d directories, you need to understand these symbolic links. A long listing of a directory such as rc5.d reveals a structure like this:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 lrwxrwxrwx . . . S10sysklogd -> ../init.d/sysklogd
 lrwxrwxrwx . . . S12kerneld -> ../init.d/kerneld
 lrwxrwxrwx . . . S15netstd_init -> ../init.d/netstd_init
 lrwxrwxrwx . . . S18netbase -> ../init.d/netbase

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

A large number of symbolic links across several subdirectories such as this is called a link farm. Linux distributions contain these links so that they can use the same startup scripts for all runlevels. This convention is not a requirement, but it simplifies organization.

Starting and Stopping Services To start and stop services by hand, use the script in the init.d directory. For ex.: one way to start the httpd web server program manually is to run init.d/httpd start. Similarly, to kill a running service, you can use the stop argument.
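
For ex.: with full paths (the httpd script name is distribution dependent; yours may differ):

 # /etc/init.d/httpd start
 # /etc/init.d/httpd stop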

Modifying the Boot Sequence Changing the boot sequence in System V init is normally done by modifying the link farm. The most common change is to prevent one of the commands in the init.d directory from running in a particular runlevel. You have to be careful about how you do this. For ex.: you might consider removing the symbolic link in the appropriate rc*.d directory. But beware: if you ever need to put the link back, you might have trouble remembering the exact name of the link. One of the best ways to do it is to add an underscore (_) at the beginning of the link name, like this:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 # mv S99httpd _S99httpd

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

This change causes rc to ignore _S99httpd because the filename no longer starts with S or K, but the original name is still obvious. To add a service, create a script like those in the init.d directory and then create a symbolic link in the correct rc*.d directory. The easiest way is to copy and modify one of the scripts already in init.d that you understand. When adding a service, choose an appropriate place in the boot sequence to start it. If the service starts too soon, it may not work, due to a dependency on some other service. For nonessential services, most system administrators prefer numbers in the 90s, which puts the service after most of the services that came with the system.
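
A minimal sketch of adding a hypothetical service named myservice to runlevel 5 (the skeleton template script and the sequence number 99 are assumptions; adapt them to your system):

 # cp /etc/init.d/skeleton /etc/init.d/myservice
 # ln -s /etc/init.d/myservice /etc/rc5.d/S99myservice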

run-parts

The mechanism that System V init uses to run the init.d scripts has found its way into many Linux systems, regardless of whether they use System V init. It's a utility called run-parts, and the only thing it does is run a bunch of executable programs in a given directory, in some kind of predictable order. You can think of it as almost like a person who runs the ls command in some directory and then just runs whatever programs they see in the output. The default behavior is to run all programs in a directory, but you often have the option to select certain programs and ignore others. In some distributions, you don't need much control over the programs that run. For ex.: Fedora ships with a very simple run-parts utility.

Other distributions, such as Debian and Ubuntu, have a more complicated run-parts program. Their features include the ability to run programs based on a regular expression (for ex.: using the S[0-9]{2} expression to run all "start" scripts in an /etc/init.d runlevel directory) and to pass arguments to the programs. These capabilities allow you to start and stop System V runlevels with a single command.

You don't really need to understand the details of how to use run-parts; in fact, most people don't know that run-parts even exists. The main things to remember are that it shows up in scripts from time to time and that it exists solely to run the programs in a given directory.
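
If you have the Debian-style run-parts, you can preview what would run in a directory without executing anything; /etc/cron.daily is just a convenient example:

 $ run-parts --test /etc/cron.daily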

Controlling System V init

Occasionally, you'll need to give init a little kick to tell it to switch runlevels, to reread its configuration, or to shut down the system. To control System V init, use telinit. For ex.: to switch to runlevel 3, enter:

 # telinit 3

When switching runlevels, init tries to kill off any processes not in the inittab for the new runlevel, so be careful when changing runlevels. When you need to add or remove jobs, or make any other change to the inittab file, you must tell init about the change and cause it to reload the file. The telinit command for this is:

 # telinit q

You can also use telinit s to switch to single-user mode (see Section 6.9).

Shutting Down Your System

init controls how the system shuts down and reboots. The commands to shut down the system are the same regardless of which version of init you run. The proper way to shut down a Linux machine is to use the shutdown command.

There are two basic ways to shut down. If you halt the system, it shuts the machine down and keeps it down. To make the machine halt immediately, run this:

 # shutdown -h now

On most machines and versions of Linux, a halt cuts the power to the machine. You can also reboot the machine. For a reboot, use -r instead of -h. The shutdown process takes several seconds. You should never reset or power off a machine during this stage.

In the preceding example, now is the time to shut down. This argument is mandatory, but there are many ways to specify the time. For ex.: if you want the machine to shut down sometime in the future, you can use +n, where n is the number of minutes shutdown should wait before doing its work. To make the system reboot in 10 minutes, enter:

 # shutdown -r +10

On Linux, shutdown notifies anyone logged on that the machine is going down, but it does little real work. If you specify a time other than now, the shutdown command creates a file called /etc/nologin. When this file is present, the system prohibits logins by anyone except the superuser. When system shutdown time finally arrives, shutdown tells init to begin the shutdown process. On systemd, this means activating the shutdown units; on Upstart, it means emitting the shutdown events; and on System V init, it means changing the runlevel to 0 or 6. Regardless of the init implementation or configuration, the procedure generally goes like this:

 1. init asks every process to shut down cleanly. 
 2. If a process doesn't respond after a while, init kills it, first trying a TERM signal. 
 3. If the TERM signal doesn't work, init uses the KILL signal on any stragglers. 
 4. The system locks system files into place and makes other preparations for shutdown. 
 5. The system unmounts all filesystems other than the root. 
 6. The system remounts the root filesystem read-only. 
 7. The system writes all buffered data out to the filesystem with the sync program. 
 8. The final step is to tell the kernel to reboot or stop with the reboot(2) system call. This can be done by init or an auxiliary program such as reboot, halt or poweroff. 

The reboot and halt programs behave differently depending on how they're called, which may cause confusion. By default, these programs call shutdown with the -r or -h options. However, if the system is already at a halt or reboot runlevel, the programs tell the kernel to shut itself off immediately. If you really want to shut your machine down in a hurry, regardless of any potential damage from a disorderly shutdown, use the -f (force) option.

The Initial RAM Filesystem

The Linux boot process is, for the most part, fairly straightforward. However, one component has always been somewhat confounding: initramfs, or the initial RAM filesystem. Think of this as a little user-space wedge that goes in front of the normal user mode start. But first, let's talk about why it exists. The problem stems from the availability of many different kinds of storage hardware. Remember, the Linux kernel does not talk to the PC BIOS or EFI interfaces to get data from disks, so in order to mount its root filesystem, it needs driver support for the underlying storage mechanism. For ex.: if the root is on a RAID array connected to a third-party controller, the kernel needs the driver for that controller first. Unfortunately, there are so many storage controller drivers that distributions can't include all of them in their kernels, so many drivers are shipped as loadable modules. But loadable modules are files, and if your kernel doesn't have a filesystem mounted in the first place, it can't load the driver modules that it needs.

The workaround is to gather a small collection of kernel driver modules along with a few other utilities into an archive. The boot loader loads this archive into memory before running the kernel. Upon start, the kernel reads the contents of the archive into a temporary RAM filesystem (the initramfs), mounts it at /, and performs the user-mode handoff to the init on the initramfs. Then, the utilities included in the initramfs allow the kernel to load the necessary driver modules for the real root filesystem. Finally, the utilities mount the real root filesystem and start the true init.

Implementations vary and are ever evolving. On some distributions, the init on the initramfs is a fairly simple shell script that starts a udevd to load drivers, then mounts the real root and executes the init there. On distributions that use systemd, you'll typically see an entire systemd installation there with no unit configuration files and just a few udevd configuration files.

One basic characteristic of the initial RAM filesystem that has (so far) remained unchanged since its inception is the ability to bypass it if you don't need it. That is, if your kernel has all the drivers it needs to mount your root filesystem, you can omit the initial RAM filesystem in your boot loader configuration. When successful, eliminating the initial RAM filesystem shortens boot time, usually by a couple of seconds. Try it yourself at boot time by using the GRUB menu editor to remove the initrd line. (It's best not to experiment by changing the GRUB configuration file, as you can make a mistake that will be difficult to repair.) Recently, it has become a little more difficult to bypass the initial RAM filesystem because features such as mount-by-UUID may not be available with generic distribution kernels. It's easy to see the contents of your initial RAM filesystem because, on most modern systems, they are simple gzip-compressed cpio archives (see the cpio(1) manual page). First, find the archive file by looking at your boot loader configuration (for ex.: grep for initrd lines in your grub.cfg configuration file). Then use cpio to dump the contents of the archive into a temporary directory somewhere and peruse the results. For ex.:

 $ mkdir /tmp/myinitrd
 $ cd /tmp/myinitrd
 $ zcat /boot/initramfs-linux-lts.img | cpio -i --no-absolute-filenames

One particular piece of interest is the "pivot" near the very end of the init process on the initial RAM filesystem. This part is responsible for removing the contents of the temporary filesystem (to save memory) and permanently switching to the real root.

You won't typically create your own initial RAM filesystem, as this is a painstaking process. There are a number of utilities for creating initial RAM filesystem images, and your distribution likely comes with one. Two of the most common are dracut and mkinitramfs.
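
A minimal sketch of invoking each for the currently running kernel (the output filenames are assumptions; write to a test image rather than the image your boot loader currently uses):

 # dracut /boot/initramfs-test.img
 # mkinitramfs -o /boot/initrd.img-test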

Note: The term initial RAM filesystem (initramfs) refers to the implementation that uses the cpio archive as the source of the temporary filesystem. There is an older version called the initial RAM disk, or initrd, that uses a disk image as the basis of the temporary filesystem. This has fallen into disuse because it's much easier to maintain a cpio archive. However, you'll often see the term initrd used to refer to a cpio-based initial RAM filesystem. Often, as in the preceding example, the filenames and configuration files will still contain initrd.

Emergency Booting and Single-User Mode

When something goes wrong with the system, the first recourse is usually to boot the system with a distribution's live image (most distributions' installation images double as live images) or with a dedicated rescue image such as SystemRescueCD that you can put on removable media. Common tasks for fixing a system include the following:

 * Checking filesystems after a system crash. 
 * Resetting a forgotten root password. 
 * Fixing problems in critical files, such as /etc/fstab and /etc/passwd. 
 * Restoring from backups after a system crash. 

Another option for booting quickly to a usable state is single-user mode. The idea is that the system quickly boots to a root shell instead of going through the whole mess of services. In System V init, single-user mode is usually runlevel 1, and you can also enter the mode with the -s parameter to the boot loader. You may need to type the root password to enter single-user mode.

The biggest problem with single-user mode is that it doesn't offer many amenities. The network almost certainly won't be available (and if it is, it will be hard to use), you won't have a GUI, and your terminal may not even work correctly. For this reason, live images are nearly always considered preferable.

System configuration: Logging, System Time, Batch Jobs, and Users

In this chapter we're going to look at the following:

 * Configuration files that the system libraries access to get server and user information. 
 * Server programs (sometimes called daemons) that run when the system boots. 
 * Configuration utilities that can be used to tweak the server programs and configuration files. 
 * Administration utilities. 

The structure of /etc

Most system configuration files on a Linux system are found in /etc. Historically, each program had one or more configuration files there, and because there are so many packages on a Unix system, /etc would accumulate files quickly. There were two problems with this approach: it was hard to find particular configuration files on a running system, and it was difficult to maintain a system configured this way. For ex.: if you wanted to change the system logger configuration, you'd have to edit /etc/syslog.conf. But after your change, an upgrade to your distribution could wipe out your customizations.

The trend for many years now has been to place system configuration files into subdirectories under /etc, as you've already seen for the boot directories (/etc/init for Upstart and /etc/systemd for systemd). There are still a few individual configuration files in /etc, but for the most part, if you run ls -F, you'll see that most of the items there are now subdirectories. To solve the problem of overwriting configuration files, you can now place customizations in separate files in the configuration subdirectories, such as the ones in /etc/grub.d.

What kind of configuration files are found in /etc? The basic guideline is that customizable configurations for a single machine, such as user information (/etc/passwd) and network details (/etc/network), go into /etc. However, general application details, such as a distribution's defaults for a user interface, don't belong in /etc. And you'll often find that noncustomizable system configuration files may be found elsewhere, as with the prepackaged systemd unit files in /usr/lib/systemd.

You've already seen some of the configuration files that pertain to booting. Now we'll look at a typical system service and how to view and specify its configuration.

System Logging

Most system programs write their diagnostic output to the syslog service. The traditional syslogd daemon waits for messages and, depending on the type of message received, funnels the output to a file, the screen, users, or some combination of these, or just ignores it.

The System Logger

The system logger is one of the most important parts of the system. When something goes wrong and you don't know where to start, check the system log files first. Here is a sample log file message:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 Aug 23 12:12:23 duplex sshd[484]: Server listening on 0.0.0.0 port 22

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Most Linux distributions run a new version of syslogd called rsyslogd that does much more than simply write log messages to files. For ex.: you can use it to load a module to send log messages to a database. But when starting out with system logs, it's easiest to start with the log files normally stored in /var/log. Check out some log files; once you know what they look like, you'll be ready to find out how they got there. Many of the files in /var/log aren't maintained by the system logger. The only way to know for sure which ones belong to rsyslogd is to look at its configuration file.
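
For ex.: to see what's there and watch new messages arrive (the main log file is /var/log/messages on some distributions and /var/log/syslog on others):

 $ ls /var/log
 $ tail -f /var/log/messages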

Configuration Files

The base rsyslogd configuration file is /etc/rsyslog.conf, but you'll find certain configurations in other directories, such as /etc/rsyslog.d. The configuration format is a blend of traditional rules and rsyslog-specific extensions. One rule of thumb is that anything beginning with a dollar sign ($) is an extension. A traditional rule has a selector and an action to show how to catch logs and where to send them, respectively. For ex.:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 kern.*                       /dev/console
 *.info;authpriv.none         /var/log/messages
 authpriv.*                   /var/log/secure,root
 mail.*                       /var/log/maillog
 cron.*                       /var/log/cron
 *.emerg                      *
 local7.*                     /var/log/boot.log

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

The selector is on the left. It's the type of information to be logged. The list on the right is the action: where to send the log. Most actions are normal files, with some exceptions. For ex.: /dev/console refers to a special device for the system console, root means send a message to the superuser if that user is logged in, and * means message all users currently on the system. You can also send messages to another network host with @host.

Facility and Priority The selector is a pattern that matches the facility and priority of log messages. The facility is a general category of message (see rsyslog.conf(5) for a list of all facilities). The function of most facilities will be fairly obvious from their names. For ex.: the configuration file catches messages carrying the kern, authpriv, mail, cron, and local7 facilities. In this same listing, the asterisk (*) is a wildcard that catches output related to all facilities. The priority follows the dot (.) after the facility. The order of priorities from lowest to highest is debug, info, notice, warning, err, crit, alert, and emerg.

Note: To exclude log messages from a facility in rsyslog.conf, specify a priority of none, as shown in the above example.

When you put a specific priority in a selector, rsyslogd sends messages with that priority and all higher priorities to the destination on that line. Therefore, the *.info selector actually catches most log messages and puts them into /var/log/messages because info is a relatively low priority.
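
For ex.: this hypothetical rule would send mail-facility messages of priority err and higher (err, crit, alert, emerg) to a file of your choosing:

 mail.err                     /var/log/mail-errors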

Extended Syntax

As previously mentioned, the syntax of rsyslogd extends the traditional syslogd syntax. The configuration extensions are called directives and usually begin with a $. One of the most common extensions allows you to load additional configuration files. Check your rsyslog.conf file for a directive like this, which causes rsyslogd to load all .conf files in /etc/rsyslog.d into the configuration:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 $IncludeConfig /etc/rsyslog.d/*.conf

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Most of the other extended directives are fairly self-explanatory. For ex.: these directives deal with users and permissions:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 $FileOwner syslog
 $FileGroup adm
 $FileCreateMode 0640
 $DirCreateMode 0755
 $Umask 0022

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Note: Additional rsyslogd configuration file extensions define output templates and channels. If you need to use them, the rsyslog.conf(5) manual page is fairly comprehensive, but the web-based documentation is more complete.

Troubleshooting

One of the easiest ways to test the system logger is to send a log message manually with the logger command, as shown here:

 $ logger -p daemon.info something bad just happened

Very little can go wrong with rsyslogd. The most common problems occur when a configuration doesn't catch a certain facility or priority or when log files fill their disk partitions. Most distributions automatically trim the files in /var/log with automatic invocations of logrotate or a similar utility, but if too many messages arrive in a brief period, you can still fill the disk or end up with a high system load.

Note: The logs caught by rsyslogd are not the only ones recorded by various pieces of the system. We discussed the startup log messages captured by systemd and Upstart in Chapter 6, but you'll find many other sources, such as the Apache web server, which normally records its own access and error logs. To find those logs, see the server configuration.

Logging: Past and Future

The syslog service has evolved over time. For ex.: there was once a daemon called klogd that trapped kernel diagnostic messages for syslogd (these messages are the ones you see with the dmesg command). This capability has been folded into rsyslogd.

It's a near certainty that Linux system logging will change in the future. Unix system logging has never had a true standard, but efforts are underway to change that.

User Management Files

A Unix system allows for multiple independent users. At the kernel level, users are simply numbers (user IDs), but because it's much easier to remember a name than a number, you'll normally work with usernames (or login names) instead of user IDs when managing Linux. Usernames exist only in user space, so any program that works with a username generally needs to be able to map the username to a user ID if it wants to refer to a user when talking to the kernel.

The /etc/passwd file

The plaintext file /etc/passwd maps usernames to user IDs. It looks something like this:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 root:x:0:0:root:/root:/usr/bin/zsh
 bin:x:1:1:bin:/bin:/usr/bin/nologin
 daemon:x:2:2:daemon:/:/usr/bin/nologin
 alice:x:1000:1000:alice:/home/alice:/bin/bash

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Each line represents one user and has seven fields separated by colons. The fields are as follows:

 * The username. 
 * The user's encrypted password. On most Linux systems, the password is not actually stored in the passwd file, but rather in the shadow file. An x in the second passwd 
   file field indicates that the encrypted password is stored in the shadow file. A star (*) indicates that the user cannot log in, and if the field is blank (that is, you see two 
   colons in a row, like ::), no password is required to log in. 
 * The user ID (UID), which is the user's representation in the kernel. 
 * The group ID (GID). This should be one of the numbered entries in the /etc/group  file. Groups determine file permissions and little else. This group is also called the user's primary group. 
 * The user's real name (often called the GECOS field). 
 * The user's home directory. 
 * The user's shell (the program that runs when the user runs a terminal session). 

The /etc/passwd file syntax is fairly strict, allowing for no comments or blank lines.
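
A quick way to see these mappings without reading the file directly is the id command. The output below is illustrative, based on the sample alice entry above:

 $ id alice
 uid=1000(alice) gid=1000(alice) groups=1000(alice),7(lp),10(wheel)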

Note: A user in /etc/passwd and a corresponding home directory are collectively known as an account.

Special Users

You will find a few special users in /etc/passwd. The superuser (root) always has UID 0 and GID 0. Some users, such as daemon, have no login privileges. The nobody user is an unprivileged user. Some processes run as nobody because the nobody user cannot write to anything on the system. The users that cannot log in are called pseudo-users. Although they can't log in, the system can start processes with their user IDs. Pseudo-users such as nobody are usually created for security reasons.

The /etc/shadow File

The shadow password file (/etc/shadow) on a Linux system normally contains user authentication information, including the encrypted passwords and password expiration information that correspond to the users in /etc/passwd. The shadow file was introduced to provide a more flexible (and more secure) way of storing passwords. It included a suite of libraries and utilities, many of which were soon replaced by pieces of PAM (see Section 7.10). Rather than introduce an entirely new set of files for Linux, PAM uses /etc/shadow, but not certain corresponding configuration files such as /etc/login.defs.

Manipulating Users and Passwords

Regular users interact with /etc/passwd using the passwd command. By default, passwd changes the user's password, but you can also use the -f option to change the user's real name or -s to change the user's shell to one listed in /etc/shells. (You can also use the commands chfn and chsh to change the real name and shell.) The passwd command is a suid-root program, because only the superuser can change the /etc/passwd file.
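
For ex.: run as a regular user (the shell you pass to chsh must be listed in /etc/shells; /bin/bash is just an illustration):

 $ passwd
 $ chsh -s /bin/bash
 $ chfn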

Changing /etc/passwd as the Superuser

Because /etc/passwd is plaintext, the superuser may use any text editor to make changes. To add a user, simply add an appropriate line and create a home directory for the user; to delete, do the opposite. However, to edit the file, you'll most likely want to use the vipw program, which backs up and locks /etc/passwd while you're editing it as an added precaution. To edit /etc/shadow instead of /etc/passwd, use vipw -s. (You'll likely never need to do this, though.)

Most organizations frown on editing passwd directly because it's too easy to make a mistake. It's much easier (and safer) to make changes to users using separate commands available from the terminal or through the GUI. For ex.: to set a user's password, run passwd user as the superuser. Use adduser and userdel to add and remove users.
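
A minimal sketch of the whole cycle for a hypothetical user bob (option details vary between distributions; check the adduser(8) and userdel(8) manual pages):

 # adduser bob
 # passwd bob
 # userdel -r bob

The -r option makes userdel remove the user's home directory as well.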

Working with Groups

Groups in Unix offer a way to share files with certain users but deny access to all others. The idea is that you can set read and write permission bits for a particular group, excluding everyone else. This feature was once important because many users shared one machine, but it has become less significant in recent years as workstations are shared less often. The /etc/group file defines the group IDs (such as the ones found in the /etc/passwd file). Here is an example:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 root:x:0:root
 bin:x:1:root,bin,daemon
 daemon:x:2:root,bin,daemon
 sys:x:3:root,bin
 adm:x:4:root,daemon
 tty:x:5:
 disk:x:6:root
 lp:x:7:daemon,alice
 mem:x:8:
 kmem:x:9:
 wheel:x:10:root,alice

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Like the /etc/passwd file, each line in /etc/group is a set of fields separated by colons. The fields in each entry are as follows, from left to right:

 * The group name: This appears when you run a command like ls -l. 
 * The group password: This is hardly ever used, nor should you use it (use sudo instead). Use * or any other default value. 
 * The group ID (a number): The GID must be unique within the group file. This number goes into a user's group field in that user's /etc/passwd entry. 
 * An optional list of users that belong to the group: In addition to the users listed here, users with the corresponding group ID in their passwd file entries also belong to the group. 

To see the groups you belong to, run groups.

Note: Linux distributions often create a new group for each new user added, with the same name as the user.

getty and login

getty is a program that attaches to terminals and displays a login prompt. On most Linux systems, getty is uncomplicated because the system only uses it for logins on virtual terminals. In a process listing, it usually looks something like this (for ex.: when running on /dev/tty1):

 $ ps ao args | grep getty
 /sbin/getty  38400 tty1

In this example, 38400 is the baud rate. Some getty programs don't need the baud rate setting. (Virtual terminals ignore the baud rate; it's only there for backward compatibility with software that connects to real serial lines.) After you enter your login name, getty replaces itself with the login program, which asks for your password. If you enter the correct password, login replaces itself (using exec()) with your shell. Otherwise, you get a "Login incorrect" message.

You now know what getty and login do, but you'll probably never need to configure or change them. In fact, you'll rarely even use them, because most users now log in either through a GUI such as gdm or remotely with SSH, neither of which uses getty or login. Much of the login program's real authentication work is handled by PAM (see Section 7.10).

Setting the Time

Unix machines depend on accurate timekeeping. The kernel maintains the system clock, which is the clock consulted when you run commands like date. You can also set the system clock using the date command, but it's usually a bad idea to do so, because you'll never get the time exactly right. Your system clock should be as close to the correct time as possible. PC hardware has a battery-backed real-time clock (RTC). The RTC isn't the best clock in the world, but it's better than nothing. The kernel usually sets its time based on the RTC at boot time, and you can reset the system clock to the current hardware time with hwclock. Keep your hardware clock in Coordinated Universal Time (UTC) in order to avoid any trouble with time zone or daylight saving time corrections. You can set the RTC to your kernel's UTC clock using this command:

 # hwclock --systohc --utc

Unfortunately, the kernel is even worse at keeping time than the RTC, and because Unix machines often stay up for months or years on a single boot, they tend to develop time drift. Time drift is the current difference between the kernel time and the true time (as defined by an atomic clock or another very accurate clock).

You should not try to fix the drift with hwclock because time-based system events can get lost or mangled. You could run a utility like adjtimex to smoothly update the clock, but usually it's best to keep your system time correct with a network time daemon (see Section 7.5.2).

Kernel Time Representation and Time Zones

The kernel's system clock represents the current time as the number of seconds since 12:00 midnight on January 1, 1970, UTC. To see this number at the moment, run:

 $ date +%s

To convert this number into something that humans can read, user-space programs change it to local time and compensate for daylight saving time and any other strange circumstances (such as living in Indiana). The local time zone is controlled by the file /etc/localtime. (Don't bother trying to look at it; it's a binary file.)

The time zone files on your system are in /usr/share/zoneinfo. You'll find that this directory contains a lot of time zones and a lot of aliases for time zones. To set your system's time zone manually, either copy one of the files in /usr/share/zoneinfo to /etc/localtime (or make a symbolic link) or change it with your distribution's time zone tool. (The command-line program tzselect may help you identify a time zone file.) To use a time zone other than the system default for just one shell session, set the TZ environment variable to the name of a file in /usr/share/zoneinfo and test the change, like this:

 $ export TZ=US/Central
 $ date  

As with other environment variables, you can also set the time zone for the duration of a single command like this:

 $ TZ=US/Central date

Network Time

If your machine is permanently connected to the Internet, you can run a Network Time Protocol (NTP) daemon to maintain the time using a remote server. Many distributions have built-in support for an NTP daemon, but it may not be enabled by default. You might need to install an ntpd package to get it working.

If you need to do the configuration by hand, you'll find help on the main NTP web page, but if you'd rather not read through the mounds of documentation there, do this:

 1. Find the closest NTP time server from your ISP or from the ntp.org web page. 
 2. Put that time server in /etc/ntp.conf (see the sketch after this list). 
 3. Run ntpdate server at boot time. 
 4. Run ntpd at boot time, after the ntpdate command. 
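
A minimal sketch of Steps 2 through 4, assuming you chose the public pool server 0.pool.ntp.org (substitute the server you found in Step 1). The configuration line goes in the file from Step 2:

 server 0.pool.ntp.org

Then, at boot time:

 # ntpdate 0.pool.ntp.org
 # ntpd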

If your machine doesn't have a permanent Internet connection, you can use a daemon like chronyd to maintain the time during disconnections. You can also set your hardware clock based on the network time in order to help your system maintain time coherency when it reboots. (Many distributions do this automatically.) To do so, set your system time from the network with ntpdate (or ntpd), then run the command:

 # hwclock --systohc --utc

Scheduling Recurring Tasks with cron

The Unix cron service runs programs repeatedly on a fixed schedule. Most experienced administrators consider cron to be vital to the system because it can perform automatic system maintenance. For ex.: cron runs log file rotation utilities to ensure that your hard drive doesn't fill up with old log files. You should know how to use cron because it's just plain useful. You can run any program with cron at whatever times suit you. The program running through cron is called a cron job. To install a cron job, you'll create an entry line in your crontab file, usually by running the crontab command. For ex.: this crontab entry schedules the /home/juser/bin/spmake command daily at 9:15 AM:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 15 09 * * * /home/juser/bin/spmake

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

The five fields at the beginning of this line, delimited by whitespace, specify the scheduled time. The fields are as follows, in order:

 * Minute (0 through 59). The cron job above is set for minute 15. 
 * Hour (0 through 23). The job above is set for the ninth hour (9 AM). 
 * Day of month (1 through 31). 
 * Month (1 through 12). 
 * Day of week (0 through 7). The numbers 0 and 7 are Sunday. 

A star (*) in any field means to match every value. The preceding example runs spmake daily because the day of the month, month, and day of week fields are all filled with stars, which cron reads as "run this job every day, of every month, on every day of the week".

To run spmake only on the 14th day of each month, you would use this crontab line:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 15 09 14 * * /home/juser/bin/spmake

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

You can select more than one time for each field. For ex.: to run the program on the 5th and the 14th day of each month, you could enter 5,14 in the third field:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 15 09 5,14 * * /home/juser/bin/spmake

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Note: If the cron job generates standard output or an error or exits abnormally, cron should mail this information to you. Redirect the output to /dev/null or some other log file if you find the email annoying.
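
For ex.: this variant of the earlier entry discards both standard output and errors, so cron has nothing to mail:

 15 09 * * * /home/juser/bin/spmake > /dev/null 2>&1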

The crontab(5) manual page provides complete information on the crontab format.

Installing Crontab Files

Each user can have their own crontab file, which means that every system may have multiple crontabs, usually found in /var/spool/cron/crontabs. Normal users can't write to this directory; the crontab command installs, lists, edits, and removes a user's crontab.

The easiest way to install a crontab is to put your crontab entries into a file and then use crontab file to install file as your current crontab. The crontab command checks the file format to make sure that you haven't made any mistakes. To list your cron jobs, run crontab -l. To remove the crontab, use crontab -r.
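
For ex.: with your entries in a hypothetical file named myjobs.txt:

 $ crontab myjobs.txt
 $ crontab -l
 $ crontab -r

The first command installs myjobs.txt as your crontab, the second lists your cron jobs, and the third removes your crontab.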

However, after you've created your initial crontab, it can be a bit messy to use temporary files to make further edits. Instead, you can edit and install your crontab in one step with the crontab -e command. If you make a mistake, crontab should tell you where the mistake is and ask if you want to try editing again.

System Crontab Files

Rather than use the superuser's crontab to schedule recurring system tasks, Linux distributions normally have an /etc/crontab file. Don't use crontab to edit this file, because this version has an additional field inserted before the command to run: the user that should run the job. For ex.: this cron job defined in /etc/crontab runs at 6:42 AM as the superuser (root):

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 42 6 * * * root /usr/local/bin/cleansystem > /dev/null 2>&1

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Note: Some distributions store system crontab files in the /etc/cron.d directory. These files may have any name, but they have the same format as /etc/crontab.

The Future of cron

The cron utility is one of the oldest components of a Linux system; it's been around for decades (predating Linux itself), and its configuration format hasn't changed much for many years. When something gets to be old, it becomes fodder for replacement, and there are efforts underway to do exactly that.

The proposed replacements are actually just parts of the newer versions of init: For systemd, there are timer units, and for Upstart, the idea is to be able to create recurring events to trigger jobs. After all, both versions of init can run tasks as any user, and they offer certain advantages, such as custom logging.

However, the reality is that neither systemd nor Upstart currently has all of the capabilities of cron. Furthermore, when they do become capable, backward compatibility will be necessary to support everything that relies on cron. For these reasons, it's unlikely that the cron format will go away anytime soon.

Scheduling One-time tasks with at

To run a job once in the future without using cron, use the at service. For ex.: to run myjob at 10:30 PM, enter this command:

 $ at 22:30
 at> myjob

End the input with CTRL-D. (The at utility reads the commands from standard input.) To check that the job has been scheduled, use atq. To remove it, use atrm. You can also schedule jobs days into the future by adding the date in DD.MM.YY format. For ex.: at 22:30 30.09.2015. There isn't much else to the at command. Though at isn't used that often, it can be handy for that odd time when you need to tell the system to shut down in the future.
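
A short illustrative session (the job number and timestamp in the atq output will differ on your system):

 $ at 22:30 30.09.2015
 at> myjob
 at> <EOT>
 $ atq
 1       Wed Sep 30 22:30:00 2015 a juser
 $ atrm 1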

Understanding User IDs and User Switching

We've discussed how setuid programs such as sudo and su allow you to change users, and we've mentioned system components like login that control user access. Perhaps you're wondering how these pieces work and what role the kernel plays in user switching.

There are two ways to change a user ID, and the kernel handles both. The first is with a setuid executable. The second is through the setuid() family of system calls. There are a few different versions of this system call to accommodate the various user IDs associated with a process, as you'll learn in Section 7.8.1.

The kernel has basic rules about what a process can or can't do, but here are the three basics:

 * A process running as root (userid 0) can use setuid() to become any other user. 
 * A process not running as root has severe restrictions on how it may use setuid(); in most cases, it cannot. 
 * Any process can execute a setuid program as long as it has adequate file permissions. 
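
You can spot a setuid executable by the s in the owner-execute position of a long listing. For ex.: passwd, discussed earlier, is setuid root (the size and date below are illustrative):

 $ ls -l /usr/bin/passwd
 -rwsr-xr-x 1 root root 47032 Jan 27  2015 /usr/bin/passwd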

Note: User switching has nothing to do with passwords or usernames. Those are strictly user-space concepts, as you first saw in the /etc/passwd file.

Process Ownership, Effective UID, Real UID, and Saved UID

Our discussion of user IDs so far has been simplified. In reality, every process has more than one user ID. We've described the effective user ID (euid), which defines the access rights for a process. A second user ID, the real user ID (ruid), indicates who initiated a process. When you run a setuid program, Linux sets the effective user ID to the program's owner during execution, but it keeps your original user ID in the real user ID.

On modern systems, the difference between effective and real user IDs is confusing, so much so that a lot of documentation regarding process ownership is incorrect.

Think of the effective user ID as the actor and the real user ID as the owner. The real user ID defines the user that can interact with the running process; most significantly, it determines which user can kill and send signals to a process. For ex.: if user A starts a new process that runs as user B (based on setuid permission), user A still owns the process and can kill it.

On normal Linux systems, most processes have the same effective user ID and real user ID. By default, ps and other system diagnostic programs show the effective user ID. To view both the effective and real user IDs on your system, try this, but don't be surprised if you find that the two user ID columns are identical for all processes on your system:

 $ ps -eo pid,euser,ruser,comm

To add to the confusion, in addition to the real and effective user IDs, there is also a saved user ID (which is usually not abbreviated). A process can switch its effective user ID to the real or saved ID during execution. (To make things even more complicated, Linux has yet another user ID: the filesystem user ID (fsuid), which defines the user accessing the filesystem but is rarely used.)

Typical setuid program behavior

The idea of the real user ID might contradict your previous experience. Why don't you have to deal with the other user IDs very frequently? For ex.: after starting a process with sudo, if you want to kill it, you still use sudo; you can't kill it as your own regular user. Shouldn't your regular user be the real user ID in this case, giving you the correct permissions? The cause of this behavior is that sudo and many other setuid programs explicitly change the effective and real user IDs with one of the setuid() system calls. These programs do so because there are often unintended side effects and access problems when all of the user IDs do not match.

Note: If you're interested in the details and rules regarding user ID switching, read the setuid(2) manual page and check the other manual pages listed in the SEE ALSO section. There are many different system calls for diverse situations.

Some programs don't like to have a real user ID of root. To prevent sudo from changing the real user ID, add this line to your /etc/sudoers file (and beware of side effects on other programs you want to run as root!):

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 Defaults stay_setuid

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Security Implications

Because the Linux kernel handles all user switches (and as a result, file access permissions) through setuid programs and subsequent system calls, system developers and administrators must be extremely careful with two things:

 * The programs that have setuid permissions
 * What those programs do

If you make a copy of the base shell that is setuid root, any local user can execute it and have complete run of the system. It's really that simple. Furthermore, even a special-purpose program that is setuid root can pose a danger if it has bugs. Exploiting weaknesses in programs running as root is a primary method of system intrusion, and there are too many such exploits to count. Because there are so many ways to break into a system, preventing intrusion is a multifaceted affair. One of the most essential ways to keep unwanted activity off your system is to enforce user authentication with usernames and passwords.

User Identification and Authentication

A multiuser system must provide basic support for user security in terms of identification and authentication. The identification portion of security answers the question of who users are. The authentication piece asks users to prove that they are who they say they are. Finally, authorization is used to define and limit what users are allowed to do.

When it comes to user identification, the Linux kernel knows only the numeric user IDs for process and file ownership. The kernel knows the authorization rules for how to run setuid executables and how user IDs may run the setuid() family of system calls to change from one user to another. However, the kernel does not know anything about authentication: usernames, passwords, and so on. Practically everything related to authentication happens in user space.

We've discussed the mapping between user IDs and passwords in Section 7.3.1; now we'll explain how user processes access this mapping. We'll begin with an oversimplified case, in which a user process wants to know its username (the name corresponding to the effective user ID). On a traditional Unix system, a process could do something like this to get its username:

 1. The process asks the kernel for its effective user ID with the getuid() system call. 
 2. The process opens the /etc/passwd file and starts reading at the beginning. 
 3. The process reads a line of the /etc/passwd file. If there's nothing left to read, the process has failed to find the username. 
 4. The process parses the line into fields (breaking out everything between the colons). The third field is the user ID for the current line. 
 5. The process compares the ID from Step 4 to the ID from Step 1. If they're identical, the first field in Step 4 is the desired username, and the process can stop searching and use this name. 
 6. The process moves on to the next line in /etc/passwd and goes back to Step 3. 

This is a long procedure that's usually much more complicated in reality.

Using Libraries for User Information

If every developer who needed to know the current username had to write all of the code you've just seen, the system would be a horrifyingly disjointed, buggy, bloated, and unmaintainable mess. Fortunately, we can use standard libraries to perform repetitive tasks, so all you'd normally need to do to get a username is call a function like getpwuid() in the standard library after you have the answer from getuid(). (See the manual pages for these calls for more on how they work.)
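
For ex.: here's a minimal C sketch of that two-call pattern (error handling reduced to the essentials); the library does the /etc/passwd searching for you:

 /* whoami.c -- look up the username for the effective user ID. */
 #include <stdio.h>
 #include <pwd.h>
 #include <unistd.h>
 
 int main(void)
 {
     struct passwd *pw = getpwuid(geteuid());  /* library does the search */
 
     if (pw == NULL) {
         fprintf(stderr, "no passwd entry for this user ID\n");
         return 1;
     }
     printf("username: %s\n", pw->pw_name);
     return 0;
 }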

When the standard library is shared, you can make significant changes to the implementation without changing any other programs. For ex.: you can move away from using /etc/passwd for your users and use a network service such as LDAP instead.

This approach has worked well for identifying usernames associated with user IDs, but passwords have proven more troublesome. Section 7.3.1 describes how, traditionally, the encrypted password was part of the /etc/passwd file, so if you wanted to verify a password that a user entered, you'd encrypt whatever the user typed and compare it to the contents of the /etc/passwd file. This traditional implementation has the following limitations:

 * It doesn't set a system-wide standard for the encryption protocol. 
 * It assumes that you have access to the encrypted password. 
 * It assumes that you want to prompt the user for a password every time the user wants to access something that requires authentification (which gets annoying). 
 * It assumes that you want to use passwords. If you want to use one-time tokens, smart cards, biometrics, or some other form of user authentication, you have to add that support yourself. 

Some of these limitations contributed to the development of the shadow password package discussed in Section 7.3.3, which took the first step in allowing system-wide password configuration. But the solution to the bulk of the problems came with the design and implementation of PAM.

PAM

To accommodate flexibility in user authentication, in 1995 Sun Microsystems proposed a new standard called Pluggable Authentication Modules (PAM), a system of shared libraries for authentication (Open Software Foundation RFC 86.0, October 1995). To authenticate a user, an application hands the user to PAM to determine whether the user can successfully identify itself. This way, it's relatively easy to add support for additional authentication techniques, such as two-factor and physical keys. In addition to authentication mechanism support, PAM also provides a limited amount of authorization control for services (for ex.: if you'd like to deny a service like cron to certain users). Because there are many kinds of authentication scenarios, PAM employs a number of dynamically loadable authentication modules. Each module performs a specific task; for ex.: the pam_unix.so module can check a user's password. This is tricky business, to say the least. The programming interface isn't easy, and it's not clear that PAM actually solves all of the existing problems. Nevertheless, PAM support is in nearly every program that requires authentication on a Linux system, and most distributions use PAM. And because it works on top of the existing Unix authentication API, integrating support into a client requires little, if any, extra work.
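
To give a feel for the client side of that API, here's a minimal C sketch (assumptions: Linux-PAM with the libpam_misc helper library installed; "login" and "alice" are placeholder service and user names; link with -lpam -lpam_misc):

 /* pam_demo.c -- ask PAM to authenticate a user. */
 #include <stdio.h>
 #include <security/pam_appl.h>
 #include <security/pam_misc.h>
 
 static struct pam_conv conv = { misc_conv, NULL };  /* prompt on the tty */
 
 int main(void)
 {
     pam_handle_t *pamh = NULL;
     int ret = pam_start("login", "alice", &conv, &pamh);
 
     if (ret != PAM_SUCCESS) {
         fprintf(stderr, "pam_start failed\n");
         return 1;
     }
     ret = pam_authenticate(pamh, 0);   /* run the service's auth stack */
     if (ret == PAM_SUCCESS)
         ret = pam_acct_mgmt(pamh, 0);  /* run the account stack */
 
     printf("%s\n", ret == PAM_SUCCESS ? "ok" : pam_strerror(pamh, ret));
     pam_end(pamh, ret);
     return ret == PAM_SUCCESS ? 0 : 1;
 }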

PAM configuration

We'll explore the basics of how PAM works by examining its configuration. You'll normally find PAM's application configuration files in the /etc/pam.d directory (older systems may use a single /etc/pam.conf file). Most installations include many files, so you may not know where to start. Some filenames should correspond to parts of the system that you know already, such as cron and passwd. Because the specific configuration in these files varies significantly between distributions, it can be difficult to find a common example. We'll look at an example configuration line that you might find for chsh (the change shell program):

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 auth       requisite    pam_shells.so

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

This line says that the user's shell must be listed in /etc/shells in order for the user to successfully authenticate with the chsh program. Let's see how. Each configuration line has three fields: a function type, control argument, and module, in that order. Here's what they mean for this example:

 Function type    The function that a user application asks PAM to perform. Here, it's auth, the task of authenticating the user. 
 Control argument This setting controls what PAM does after success or failure of its action for the current line (requisite in this example). We'll get to this shortly. 
 Module           The authentication module that runs for this line, determining what the line actually does. Here, the pam_shells.so module checks to see whether the user's current 
                  shell is listed in /etc/shells. 

PAM configuration is detailed on the pam.conf(5) manual page. Let's look at a few of the essentials.

Function types

A user application can ask PAM to perform one of the following four functions:

 auth Authenticate a user (see if the user is who they say they are).
 account Check user account status (whether the user is authorized to do something, for ex.). 
 session Perform something only for the user's current session (such as displaying a message of the day).
 password Change a user's password or other credentials. 

For any configuration line, the module and function together determine PAM's action. A module can have more than one function type, so when determining the purpose of a configuration line, always remember to consider the function and module as a pair. For ex.: the pam_unix.so module checks a password when performing the auth function, but it sets a password when performing the password function.

Control arguments and Stacked rules

One important feature of PAM is that the rules specified by its configuration lines stack, meaning that you can apply many rules when performing a function. This is why the control argument is important: The success or failure of an action in one line can impact following lines or cause the entire function to succeed or fail.

There are two kinds of control arguments: the simple syntax and a more advanced syntax. Here are the three major simple syntax control arguments that you'll find in a rule:

 sufficient If this rule succeeds, the authentication is successful, and PAM does not need to look at any more rules. If the rule fails, PAM proceeds to additional rules.  
 requisite  If this rule succeeds, PAM proceeds to additional rules. If the rule fails, the authentication is unsuccessful, and PAM does not need to look at any more rules. 
 required   If this rule succeeds, PAM proceeds to additional rules. If the rule fails, PAM proceeds to additional rules but will always return an unsuccessful authentication 
            regardless of the end result of the additional rules. 

Continuing with the preceding example, here is an example stack for the chsh authentication function:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 auth       sufficient   pam_rootok.so
 auth       requisite    pam_shells.so
 auth       sufficient   pam_unix.so
 auth       required     pam_deny.so

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

With this configuration, when the chsh command asks PAM to perform the authentication function, PAM does the following:

 1. The pam_rootok.so module checks to see if the root user is the one trying to authenticate. If so, it immediately succeeds and attempts no further authentication. This works because 
    the control argument is set to sufficient, meaning that success from this action is good enough for PAM to immediately report success back to chsh. Otherwise, it proceeds to Step 2. 
 2. The pam_shells.so module checks to see if the user's shell is in /etc/shells. If the shell is not there, the module returns failure, and the requisite control argument indicates
    that PAM must immediately report this failure back to chsh and attempt no further authentication. Otherwise, the shell is in /etc/shells, so the module returns success and 
    fulfills the control flag of requisite; proceed to Step 3. 
 3. The pam_unix.so module asks the user for the user's password and checks it. The control argument is set to sufficient, so success from this module (a correct password) is enough for PAM to report 
    success to chsh. If the password is incorrect, PAM continues to Step 4. 
 4. The pam_deny.so module always fails, and because the required control argument is present, PAM reports failure back to chsh. This is a default for when there's nothing 
    left to try. (Note that a required control argument does not cause PAM to fail its function immediately —— it will run any lines left on its stack —— but the report back to the application 
    will always be of failure.)

Note: Don't confuse the terms function and action when working with PAM. The function is the high-level goal: what the user application wants PAM to do (authenticate a user, for example). An action is a specific step that PAM takes in order to reach that goal. Just remember that the user application invokes the function first and that PAM takes care of the particulars with actions.

The advanced control argument syntax, denoted inside square brackets ([]), allows you to manually control a reaction based on the specific return value of the module (not just success or failure). For details, see the pam.conf(5) manual page; once you understand the simple syntax, you'll have no trouble with the advanced syntax.

Module Arguments

PAM modules can take arguments after the module name. You'll often encounter this example with the pam_unix.so module:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 auth       sufficient   pam_unix.so   nullok

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

The nullok argument here says that the user may have no password (the default would be to fail if the user has no password).

Notes on PAM

Due to its control flow capability and module argument syntax, the PAM configuration syntax has many features of a programming language and a certain degree of power. We've only scratched the surface so far, but here are a few more tips on PAM:

 * To find out which PAM modules are present on your system, try man -k pam_. It can be difficult to track down the location of modules. Try the locate pam_unix.so command and see 
   where that leads you. 
 * The manual pages contain the functions and arguments for each module. 
 * Many distributions automatically generate certain PAM configuration files, so it may not be wise to change them directly in /etc/pam.d. Read the comments in your /etc/pam.d files before editing them; 
   if they're generated files, the comments will tell you where they came from.
 * The /etc/pam.d/other configuration file defines the default configuration for any application that lacks its own configuration file. The default is often to deny everything.
 * There are different ways to include additional configuration files in a PAM configuration file. The @include syntax loads an entire configuration file, but you can also use a control argument to load only 
   the configuration for a particular function. The usage varies among distributions. 
 * PAM configuration doesn't end with module arguments. Some modules can access additional files in /etc/security, usually to configure per-user restrictions. 

PAM and Passwords

Due to the evolution of Linux password verification over the years, a number of password configuration artifacts remain that can cause confusion at times. The first to be aware of is the file /etc/login.defs. This is the configuration file for the original shadow password suite. It contains information about the encryption algorithm used for the shadow password file, but it's rarely used on a modern system with PAM installed, because the PAM configuration contains this information. That said, the encryption algorithm in /etc/login.defs should match the PAM configuration in the rare case that you run into an application that doesn't support PAM.

Where does PAM get its information about the password encryption scheme? Recall that there are two ways for PAM to interact with passwords: the auth function (for verifying a password) and the password function (for setting a password). It's easiest to track down the password-setting parameter. The best way is probably just to grep it:

 $ grep password.*unix /etc/pam.d/*

The matching lines should contain pam_unix.so and look something like this:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 password           sufficient       pam_unix.so    obscure    sha512

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

The arguments obscure and sha512 tell PAM what to do when setting a password. First, PAM checks to see if the password is obscure enough (that is, the password isn't too similar to the old password, among other things), and then PAM uses the SHA512 algorithm to encrypt the new password.

But this happens only when a user sets a password, not when PAM is verifying a password. So how does PAM know which algorithm to use when authenticating? Unfortunately, the configuration won't tell you anything; there are no encryption arguments for pam_unix.so for the auth function. The manual pages also tell you nothing.

It turns out that (as of this writing) pam_unix.so simply tries to guess the algorithm, usually by asking the libcrypt library to do the dirty work of trying a whole bunch of things until something works or there's nothing left to try. Therefore, you normally don't have to worry about the verification encryption algorithm.

A Closer Look at Processes and Resource Utilization

This chapter takes you deeper into the relationship between processes, the kernel, and system resources: CPU, memory, and I/O. Processes vie for these resources, and the kernel's job is to allocate resources fairly. The kernel itself is also a resource —— a software resource that processes use to perform tasks such as creating new processes and communicating with other processes.

Many of the tools that you see in this chapter are often thought of as performance-monitoring tools. They're particularly helpful if your system is slowing to a crawl and you're trying to figure out why. However, you shouldn't get too distracted by performance; trying to optimize a system that's already working correctly is often a waste of time. Instead, concentrate on understanding what the tools actually measure, and you'll gain great insight into how the kernel works.

Tracking Processes

You learned how to use ps to list processes running on your system at a particular time. The ps command lists current processes, but it does little to tell you how processes change over time. Therefore, it won't really help you to determine which process is using too much CPU time or memory. The top program is often more useful than ps because it displays the current system status as well as many of the fields in a ps listing, and it updates the display every second. Perhaps most important is that top shows the most active processes (that is, those currently taking up the most CPU time) at the top of its display.

You can send commands to top with keystrokes. These are some of the most important commands:

 Spacebar: Updates the display immediately. 
 M: Sorts by current resident memory usage. 
 T: Sorts by total (cumulative) CPU usage. 
 P: Sorts by current CPU usage (the default).
 u: Displays only one user's processes. 
 f: Selects different statistics to display. 
 ?: Displays a usage summary for all top commands (like a help option). 

Two other utilities for Linux, similar to top, offer an enhanced set of views and features: atop and htop. Most of the extra features are available from other utilities. For ex.: htop has many of the abilities of the lsof command described in the next section.

Finding Open Files with lsof

The lsof command lists open files and the processes using them. Because Unix places a lot of emphasis on files, lsof is among the most useful tools for finding trouble spots. But lsof doesn't stop at regular files —— it can list network resources, dynamic libraries, pipes, and more.

Reading the lsof Output

Running lsof on the command line usually produces a tremendous amount of output. Below is a fragment of what you might see. This output includes open files from the init process as well as a running vim process:

 $ lsof
 COMMAND     PID   TID  USER   FD      TYPE             DEVICE SIZE/OFF       NODE NAME
 systemd    3064       alice  cwd       DIR                8,2      224         96 /
 systemd    3064       alice  rtd       DIR                8,2      224         96 /
 systemd    3064       alice  txt       REG                8,2  1645024  201495001 /usr/lib/systemd/systemd
 systemd    3064       alice  mem       REG                8,2    46912  134291898 /usr/lib/libnss_files-2.26.so
 systemd    3064       alice  mem       REG                8,2    46936  134291895 /usr/lib/libnss_nis-2.26.so
 --snip--
 vim       10522       alice  cwd       DIR                8,1     4096       2452 /home/alice
 vim       10522       alice  rtd       DIR                8,2      224         96 /
 vim       10522       alice  txt       REG                8,2  2991888  135621440 /usr/local/bin/vim
 vim       17373       alice    4u      REG                8,1     4096  273625927 /home/alice/.config/i3/.config.swp
 --snip--

The output shows the following fields (listed in the top row):

 COMMAND :The command name for the process that holds the file descriptor. 
 PID     :The process ID. 
 USER    :The user running the process. 
 FD      :This field can contain two kinds of elements. In the output above, the FD column shows the purpose of the file. The FD field can also list the file descriptor of the open 
          file —— a number that a process uses together with the system libraries and kernel to identify and manipulate a file. 
 TYPE    :The file type (regular file, directory, socket, and so on). 
 DEVICE  :The major and minor number of the device that holds the file. 
 SIZE    :The file's size. 
 NODE    :The file's inode number. 
 NAME    :The filename. 

The lsof(1) manual page contains a full list of what you might see for each field, but you should be able to figure out what you're looking at just by looking at the output. For ex.: look at the entries with cwd in the FD field. These lines indicate the current working directories of the processes. Another example is the very last line, which shows a file that the user is currently editing with vim.

Using lsof

There are two basic approaches to running lsof:

 * List everything and pipe the output to a command like less, and then search for what you're looking for. This can take a while due to the amount of output generated. 
 * Narrow down the list that lsof provides with command-line options. 

You can use command-line options to provide a filename as an argument and have lsof list only the entries that match the argument. For ex.: the following command displays entries for open files in /usr:

 $ lsof /usr

To list the open files for a particular PID, run:

 $ lsof -p pid

For a brief summary of lsof's many options, run lsof -h. Most options pertain to the output format.

Note: lsof is highly dependent on kernel information. If you upgrade your kernel and you're not routinely updating everything, you might need to upgrade lsof. In addition, if you perform a distribution update to both the kernel and lsof, the updated lsof might not work until you reboot with the new kernel.

Tracing Program Execution and System Calls

The tools we've seen so far examine active processes. However, if you have no idea why a program dies almost immediately after starting up, even lsof won't help you. In fact, you'd have a difficult time even running lsof concurrently with a failed command. The strace (system call trace) and ltrace (library trace) commands can help you discover what a program attempts to do. These tools produce extraordinarily large amounts of output, but once you know what to look for, you'll have more tools at your disposal for tracking down problems.

strace

Recall that a system call is a privileged operation that a user-space process asks the kernel to perform, such as opening and reading data from a file. The strace utility prints all the system calls that a process makes. To see it in action, run this command:

 $ strace cat /dev/null

In Chapter 1, you learned that when one process wants to start another process, it invokes the fork() system call to spawn a copy of itself, and then the copy uses a member of the exec() family of system calls to start running a new program. The strace command begins working on the new process (the copy of the original process) just after the fork() call. Therefore, the first lines of the output from this command should show execve() in action, followed by a memory initialization call, brk(), as follows:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 execve("/usr/bin/cat", ["cat", "/dev/null"], 0x7ffc2cddfe78 /* 47 vars */) = 0
 brk(NULL)                               = 0x55bee9330000

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

The next part of the output deals primarily with loading shared libraries. You can ignore this unless you really want to know what the shared library system does.

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
 openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
 fstat(3, {st_mode=S_IFREG|0644, st_size=223145, ...}) = 0
 --snip--

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

In addition, skip past the mmap output until you get to the lines that look like this:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 fstat(1, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
 openat(AT_FDCWD, "/dev/null", O_RDONLY) = 3
 fstat(3, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
 fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
 read(3, "", 131072)                     = 0
 close(3)                                = 0
 close(1)                                = 0
 close(2)                                = 0
 exit_group(0)                           = ?

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

This part of the output shows the command at work. First, look at the openat() call, which opens a file. The 3 is a result that means success (3 is the file descriptor that the kernel returns after opening the file). Below that, you see where cat reads from /dev/null (the read() call, which also has 3 as the file descriptor). Then there's nothing more to read, so the program closes the file descriptor and exits with exit_group(). What happens when there's a problem? Try running strace cat on a file that doesn't exist and examine the openat() call in the resulting output:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 openat(AT_FDCWD, "/home/alice/thisnotexist.txt", O_RDONLY) = -1 ENOENT (No such file or directory)

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Because openat() couldn't open the file, it returned -1 to signal an error. You can see that strace reports the exact error and gives you a small description of it. Missing files are the most common problem with Unix programs, so if the system log and other log information aren't very helpful and you have nowhere else to turn, strace can be of great use. You can even use it on daemons that detach themselves. For ex.:

 $ strace -o crummyd_strace -ff crummyd

In this example, the -o option to strace logs the action of any child process that crummyd spawns into crummyd_strace.pid, where pid is the process ID of the child process.

Ex.:

 $ strace -o firefox_strace -ff firefox
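
To connect this back to the fork() and exec() discussion above, here's a minimal C sketch (an illustration, not from the book) of the spawn pattern that strace -f follows into the child; cat /dev/null is just a stand-in program:

 /* spawn.c -- fork a copy, then exec a new program in the copy. */
 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/wait.h>
 #include <unistd.h>
 
 int main(void)
 {
     pid_t child = fork();           /* clone the current process */
 
     if (child == -1) {
         perror("fork");
         exit(1);
     }
     if (child == 0) {
         /* In the copy: replace ourselves with a new program. */
         execlp("cat", "cat", "/dev/null", (char *)NULL);
         perror("execlp");           /* reached only if exec failed */
         _exit(127);
     }
     waitpid(child, NULL, 0);        /* parent waits for the child */
     return 0;
 }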

ltrace

The ltrace command tracks shared library calls. The output is similar to that of strace, which is why we're mentioning it here, but it doesn't track anything at the kernel level. Be warned that there are many more shared library calls than system calls. You'll definitely need to filter the output, and ltrace itself has many built-in options to assist you.

Threads

In Linux, some processes are divided into pieces called threads. A thread is very similar to a process —— it has an identifier (TID, or thread ID), and the kernel schedules and runs threads just like processes. However, unlike separate processes, which usually do not share system resources such as memory and I/O connections with other processes, all threads inside a single process share their system resources and some memory.

Single-Thread and Multithread Processes

Many processes have only one thread. A process with one thread is single-threaded, and a process with more than one thread is multithreaded. All processes start out single-threaded. This starting thread is usually called the main thread. The main thread may then start new threads in order for the process to become multithreaded, similar to the way a process can call fork() to start a new process.

Note: It's rare to refer to threads at all when a process is single-threaded. We will not mention threads unless multithreaded processes make a difference in what you see or experience.

The primary advantage of a multithreaded process is that when the process has a lot to do, threads can run simultaneously on multiple processors, potentially speeding up computation. Although you can also achieve simultaneous computation with multiple processes, threads start faster than processes, and it is often easier and/or more efficient for threads to intercommunicate using their shared memory than it is for processes to communicate over a channel such as a network connection or a pipe. Some programs use threads to overcome problems managing multiple I/O resources. Traditionally, a process would sometimes use fork() to start a new subprocess in order to deal with a new input or output stream. Threads offer a similar mechanism without the overhead of starting a new process.
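
Here's a minimal C sketch of a process becoming multithreaded with POSIX threads (compile with cc threads.c -pthread; the sleep is only there so you have time to observe the extra threads from another terminal, as shown in the next section):

 /* threads.c -- the main thread starts three more threads. */
 #include <pthread.h>
 #include <stdio.h>
 #include <unistd.h>
 
 static void *worker(void *arg)
 {
     printf("thread %ld running\n", (long)arg);
     sleep(30);                  /* stay alive long enough to observe */
     return NULL;
 }
 
 int main(void)
 {
     pthread_t tids[3];
 
     for (long i = 0; i < 3; i++)
         pthread_create(&tids[i], NULL, worker, (void *)i);
     for (int i = 0; i < 3; i++)
         pthread_join(tids[i], NULL);
     return 0;
 }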

Viewing Threads

By default, the output from the ps and top commands shows only processes. To display the thread information in ps, add the m option. Here is some sample output:

 $ ps m
   PID TTY      STAT   TIME COMMAND
 3623 pts/1    -      0:00 /bin/bash
     - -        Ss+    0:00 -
 20508 pts/0    -      0:00 /bin/bash
     - -        Ss     0:00 -
 12287 pts/8    -      0:54 /usr/bin/python /usr/bin/gm-notify
     - -        SL1    0:48 -
     - -        SL1    0:00 -
     - -        SL1    0:06 -
     - -        SL1    0:00 -

It shows processes along with threads. Each line with a number in the PID column represents a process, as in the normal ps output. The lines with dashes in the PID column represent the threads associated with the process. In this output, the processes have only one thread each, but process 12287 is multithreaded with four threads. If you would like to view the thread IDs with ps, you can use a custom output format. This example shows only the process IDs, thread IDs, and command:

 $ ps m -o pid,tid,command  
   PID   TID COMMAND
 3623      - /bin/bash
     -  3623 -
 24726     - /bin/bash
     - 24726 -
 12287     - /usr/bin/python /usr/bin/gm-notify
     - 12287 -
     - 12288 -
     - 12289 -
     - 12295 -

The sample output corresponds to the threads shown with the ps m command. Notice that the thread IDs of the single-threaded processes are identical to the PIDs; this is the main thread. For the multithreaded process 12287, thread 12287 is also the main thread.

Note: Normally, you won't interact with individual threads as you would processes. You need to know a lot about how a multithreaded program was written in order to act on one thread at a time, and even then, doing so might not be a good idea.

Threads can confuse things when it comes to resource monitoring because individual threads in a multithreaded process can consume resources simultaneously. For ex.: top doesn't show threads by default; you'll need to press H to turn it on. For most of the resource monitoring tools that you're about to see, you'll have to do a little extra work to turn on the thread display.

Introduction to Resource Monitoring

Now we'll discuss some topics in resource monitoring, including processor (CPU) time, memory and disk I/O. We'll examine utilization on a systemwide scale, as well as on a per-process basis.

Measuring CPU Time

To monitor one or more specific processes over time, use the -p option to top, with this syntax:

 $ top -p pid1 [-p pid2 ...]

To find out how much CPU time a command uses during its lifetime, use time. Most shells have a built-in time command that doesn't provide extensive statistics, so you'll probably need to run /usr/bin/time. For ex.: to measure the CPU time used by ls, run:

 $ /usr/bin/time ls

After ls terminates, time should print output like that below. The key fields are in boldface:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 real	0m0,003s
 user	0m0,000s
 sys	0m0,000s

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 User time The number of seconds that the CPU has spent running the program's own code. On modern processors, some commands run so quickly, and therefore the CPU time is so low, that time 
           rounds down to zero. 
 System time How much time the kernel spends doing the process's work (for ex.: reading files and directories). 
 Real time The total time it took to run the process from start to finish, including the time that the CPU spent doing other tasks. This number is normally not very useful for performance 
           measurement, but subtracting the user and system time from real time can give you a general idea of how long the process spends waiting for system resources. 
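
If you want the same user/system split from inside a program, the kernel exposes it through getrusage(). Here's a minimal C sketch (the busy loop just burns user time so there's something to report):

 /* cputime.c -- report our own user and system CPU time. */
 #include <stdio.h>
 #include <sys/resource.h>
 
 int main(void)
 {
     struct rusage ru;
 
     for (volatile long i = 0; i < 100000000; i++)
         ;                            /* burn some user-mode CPU time */
 
     if (getrusage(RUSAGE_SELF, &ru) == 0) {
         printf("user   %ld.%06lds\n",
                (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
         printf("system %ld.%06lds\n",
                (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
     }
     return 0;
 }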

Adjusting Process Priorities

You can change the way the kernel schedules a process in order to give the process more or less CPU time than other processes. The kernel runs each process according to its scheduling priority, which is a number between -20 and 20, with -20 being the foremost priority. (Yes, this can be confusing.)

The ps -l command lists the current priority of a process, but it's a little easier to see the priorities in action with the top command, as shown here:

 $ top
 Tasks: 153 total,   1 running, 151 sleeping,   0 stopped,   1 zombie
 %Cpu0  :   0,0/0,0     0[                                                                                                ]
 %Cpu1  :   5,8/0,6     6[|||||||                                                                                         ]
 %Cpu2  :   0,0/0,0     0[                                                                                                ]
 %Cpu3  :   1,4/0,0     1[|                                                                                               ]
 GiB Mem : 10,8/15,606   [                                                                                                ]
 GiB Swap:  0,0/7,978    [                                                                                                ]
  PID USER      PR  NI    VIRT    RES  %CPU %MEM     TIME+ S COMMAND                                                       
    1 root      20   0  153,3m   7,8m   0,0  0,0   0:03.36 S systemd                                                       
  179 root      20   0  100,7m  24,7m   0,0  0,2   0:00.56 S  `- systemd-journal                                           
  208 root      20   0   85,3m   7,9m   0,0  0,0   0:00.33 S  `- systemd-udevd                                             
  316 root      20   0   68,7m   5,6m   0,0  0,0   0:00.89 S  `- systemd-logind                                            
  317 dbus      20   0   38,2m   3,9m   0,0  0,0   0:01.21 S  `- dbus-daemon                                               
  319 root      20   0   19,0m   2,9m   0,0  0,0   0:00.09 S  `- crond                                                     
  320 root      20   0  458,1m  17,3m   0,0  0,1   0:00.43 S  `- NetworkManager                                            
  344 mpd       20   0  791,3m  43,7m   4,0  0,3   5:32.40 S  `- mpd                                                       
  349 root      20   0  347,2m   6,5m   0,0  0,0   0:00.03 S  `- lightdm                                                   
  362 root      20   0  277,3m  77,7m   4,0  0,5  10:10.65 S      `- Xorg                                                  
 3048 root      20   0  259,9m   7,5m   0,0  0,0   0:00.02 S      `- lightdm                                               
 3081 alice     20   0  111,3m   7,9m   0,0  0,0   0:02.53 S          `- i3                                                
  351 ntp       20   0  106,6m   4,2m   0,0  0,0   0:00.86 S  `- ntpd                                                      
  363 root      20   0  277,1m   6,5m   0,0  0,0   0:00.20 S  `- accounts-daemon                                           
  367 polkitd   20   0  521,4m  17,9m   0,0  0,1   0:00.44 S  `- polkitd                                                   
 3064 alice     20   0   80,8m   7,6m   0,0  0,0   0:00.03 S  `- systemd                                                   
 3065 alice     20   0  132,3m   2,2m   0,0  0,0   0:00.00 S      `- (sd-pam)                                              
 3094 alice     20   0   38,0m   3,8m   0,0  0,0   0:00.13 S      `- dbus-daemon                                           
 3131 alice     20   0  274,6m   6,7m   0,0  0,0   0:00.02 S      `- gvfsd                                                 
 3137 alice     20   0  404,0m   5,7m   0,0  0,0   0:00.00 S      `- gvfsd-fuse                                            
 3147 alice     20   0  336,7m   5,8m   0,0  0,0   0:00.01 S      `- at-spi-bus-laun                                       
 3152 alice     20   0   37,7m   3,1m   0,0  0,0   0:00.50 S          `- dbus-daemon                                       
 3154 alice     20   0   55,6m   4,8m   0,0  0,0   0:00.01 S      `- xfconfd                    

In the top output above, the PR (priority) column lists the kernel's current schedule priority for the process. The higher the number, the less likely the kernel is to schedule the process if others need CPU time. The schedule priority alone does not determine the kernel's decision to give CPU time to a process, and it changes frequently during program execution according to the amount of CPU time that the process consumes. Next to the priority column is the nice value (NI) column, which gives a hint to the kernel's scheduler. This is what you care about when trying to influence the kernel's decision. The kernel adds the nice value to the current priority to determine the next time slot for the process.

By default, the nice value is 0. Now, say you're running a big computation in the background that you don't want to bog down your interactive session. To have that process take a backseat to other processes and run only when the other tasks have nothing to do, you could change the nice value to 20 with the renice command (where pid is the process ID of the process you want to change):

 $ renice 20 pid

If you're the superuser, you can set the nice value to a negative number, but doing so is almost always a bad idea because system processes may not get enough CPU time. In fact, you probably won't need to alter nice values much because many Linux systems have only a single user, and that user does not perform much real computation. (The nice value was much more important back when there were many users on a single machine.)
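
A program can also take a backseat on its own; renice is essentially a wrapper around the setpriority() system call. A minimal C sketch:

 /* background.c -- politely lower our own priority before heavy work. */
 #include <stdio.h>
 #include <sys/resource.h>
 
 int main(void)
 {
     /* PRIO_PROCESS with pid 0 means "the calling process";
      * 19 is the nicest (lowest-priority) value a regular user can set. */
     if (setpriority(PRIO_PROCESS, 0, 19) == -1)
         perror("setpriority");
 
     /* ... the big background computation would go here ... */
     return 0;
 }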

Load Averages

CPU performance is one of the easier metrics to measure. The load average is the average number of processes currently ready to run. That is, it is an estimate of the number of processes that are capable of using the CPU at any given time. When thinking about a load average, keep in mind that most processes on your system are usually waiting for input (from the keyboard, mouse, or network, for example), meaning that most processes are not ready to run and should contribute nothing to the load average. Only processes that are actually doing something affect the load average.

Using uptime

The uptime command tells you three load averages in addition to how long the kernel has been running:

 $ uptime
 15:10:48 up  5:26,  1 user,  load average: 0,08, 0,03, 0,01

The three bolded numbers are the load averages for the past 1 minute, 5 minutes, and 15 minutes, respectively. As you can see, this system isn't very busy: An average of only 0,01 processes have been running across all processors for the past 15 minutes. In other words, if you had just one processor, it was only running user-space applications for 1 percent of the last 15 minutes. (Traditionally, most desktop systems would exhibit a load average of about 0 when you were doing anything except compiling a program or playing a game. A load average of 0 is usually a good sign, because it means that your processor isn't being challenged and you're saving power.)
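
Incidentally, uptime gets these numbers from the kernel, which exports them in /proc/loadavg, so a program can read them directly. A minimal C sketch:

 /* loadavg.c -- read the 1-, 5-, and 15-minute load averages. */
 #include <stdio.h>
 
 int main(void)
 {
     double one, five, fifteen;
     FILE *f = fopen("/proc/loadavg", "r");
 
     if (f == NULL || fscanf(f, "%lf %lf %lf", &one, &five, &fifteen) != 3) {
         perror("/proc/loadavg");
         return 1;
     }
     fclose(f);
     printf("1m: %.2f  5m: %.2f  15m: %.2f\n", one, five, fifteen);
     return 0;
 }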

Note: User interface components on current desktop systems tend to occupy more of the CPU than those in the past. For ex.: on Linux systems, a web browser's Flash plugin can be a particularly notorious resource hog, and Flash applications can easily occupy much of a system's CPU and memory due to poor all-around implementation.

If a load average goes up to around 1, a single process is probably using the CPU nearly all of the time. To identify that process, use the top command; the process will usually rise to the top of the display. Most modern systems have more than one processor core or CPU, so multiple processes can easily run simultaneously. If you have two cores, a load average of 1 means that only one of the cores is likely active at any given time, and a load average of 2 means that both cores have just enough to do all of the time.

High Loads

A high load average does not necessarily mean that your system is having trouble. A system with enough memory and I/O resources can easily handle many running processes. If your load average is high and your system still responds well, don't panic: The system just has a lot of processes sharing the CPU. The processes have to compete with each other for processor time, and as a result they'll take longer to perform their computations than they would if they were each allowed to use the CPU all of the time. Another case where you might see a high load average as normal is a web server, where processes can start and terminate so quickly that the load average measurement mechanism can't function effectively.

However, if you sense that the system is slow and the load average is high, you might be running into memory performance problems. When the system is low on memory, the kernel can start to thrash, or rapidly swap memory for processes to and from disk. When this happens, many processes will become ready to run, but their memory might not be available, so they will remain in the ready-to-run state (and contribute to the load average) for much longer than they normally would. We'll look at memory in much more detail.

Memory

One of the simplest ways to check your system's memory status as a whole is to run the free command or view /proc/meminfo to see how much real memory is being used for caches and buffers. As we've just mentioned, if little cache/buffer memory is being used (and the rest of the real memory is taken), you may need more memory. However, it's too easy to blame a shortage of memory for every performance problem on your machine.

How Memory Works

Recall from Chapter 1 that the CPU has a memory management unit (MMU) that translates the virtual memory addresses used by processes into real ones. The kernel assists the MMU by breaking the memory used by processes into smaller chunks called pages. The kernel maintains a data structure, called a page table, that maps a process's virtual page addresses to real page addresses in memory. As a process accesses memory, the MMU translates the virtual addresses used by the process into real addresses based on the kernel's page table.

A user process does not actually need all of its pages to be immediately available in order to run. The kernel generally loads and allocates pages as a process needs them; this system is known as on-demand paging or just demand paging. To see how this works, consider how a program starts and runs as a new process (a short sketch of demand paging in action follows the list):

 1. The kernel loads the beginning of the program's instruction code into memory pages. 
 2. The kernel may allocate some working-memory pages to the new process. 
 3. As the process runs, it might reach a point where the next instruction in its code isn't in any of the pages that the kernel initially loaded. At this point, the kernel takes over, loads the necessary pages 
    into memory, and then lets the program resume execution.
 4. Similarly, if the program requires more working memory than was initially allocated, the kernel handles it by finding free pages (or by making room) and assigning them to the process. 
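
The following C sketch makes demand paging visible: it maps a large anonymous region, which costs almost nothing up front, and the kernel only allocates each page when the loop first touches it (watch the RES column grow in top while it runs; the 4096 assumes 4KB pages):

 /* touch_pages.c -- demand paging in action. */
 #include <stdio.h>
 #include <sys/mman.h>
 #include <unistd.h>
 
 int main(void)
 {
     size_t len = 512 * 1024 * 1024;     /* 512MB of address space */
     char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 
     if (p == MAP_FAILED) {
         perror("mmap");
         return 1;
     }
     for (size_t off = 0; off < len; off += 4096) {
         p[off] = 1;                     /* first touch faults the page in */
         if (off % (64 * 1024 * 1024) == 0)
             sleep(1);                   /* slow down so you can watch */
     }
     munmap(p, len);
     return 0;
 }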

Page Faults

If a memory page is not ready when a process wants to use it, the process triggers a page fault. In the event of a page fault, the kernel takes control of the CPU from the process in order to get the page ready. There are two kinds of page faults: minor and major.

Minor Page Faults

A minor page fault occurs when the desired page is actually in main memory but the MMU doesn't know where it is. This can happen when the process requests more memory or when the MMU doesn't have enough space to store all of the page locations for a process. In this case, the kernel tells the MMU about the page and permits the process to continue. Minor page faults aren't such a big deal, and many occur as a process runs. Unless you need maximum performance from some memory-intensive program, you probably shouldn't worry about them.


Major Page Faults

A major page fault occurs when the desired memory page isn't in main memory at all, which means that the kernel must load it from the disk or some other slow storage mechanism. A lot of major faults will bog the system down because the kernel must do a substantial amount of work to provide the pages, robbing normal processes of their chance to run. Some major page faults are unavoidable, such as those that occur when you load the code from disk when running a program for the first time. The biggest problems happen when you start running out of memory (OOM) and the kernel starts to swap pages of working memory out to the disk in order to make room for new pages.

Watching Page Faults

You can drill down to the page faults for individual processes with the ps, top, and time commands. The following command shows a simple example of how the time command displays page faults. (The output of the cal command doesn't matter, so we're discarding it by redirecting it to /dev/null.)

 $ /usr/bin/time cal > /dev/null
 0.00user 0.00system 0:00.00elapsed 0%CPU (0avgtext+0avgdata 2544maxresident)k
 0inputs+0outputs (2major+138minor)pagefaults 0swaps

As you can see from the bolded text, when this program ran, there were 2 major page faults and 138 minor ones. The major page faults occurred when the kernel needed to load the program from the disk for the first time. If you ran the command again, you probably wouldn't get any major page faults because the kernel would have cached the pages from the disk. If you'd rather see the page faults of processes as they're running, use top or ps. When running top, use f to change the displayed fields and u to display the number of major page faults. (The results will show up in a new, nFLT column. You won't see the minor page faults.)

When using ps, you can use a custom output format to view the page faults for a particular process. Here's an example for process ID 3081:

 $ ps -o pid,min_flt,maj_flt 3081
   PID  MINFL  MAJFL
  3081   3726     16

The MINFL and MAJFL columns show the numbers of minor and major page faults. Of course, you can combine this with any other process selection options, as described in the ps(1) manual page.
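
The same counters are available to a process about itself through getrusage(). Here's a minimal C sketch that generates minor faults on purpose by touching freshly allocated memory:

 /* faults.c -- report our own minor and major page fault counts. */
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sys/resource.h>
 
 int main(void)
 {
     struct rusage ru;
     size_t len = 32 * 1024 * 1024;
     char *buf = malloc(len);
 
     if (buf == NULL)
         return 1;
     memset(buf, 1, len);           /* first touches cause minor faults */
 
     if (getrusage(RUSAGE_SELF, &ru) == 0)
         printf("minor: %ld  major: %ld\n", ru.ru_minflt, ru.ru_majflt);
     free(buf);
     return 0;
 }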

Viewing page faults by process can help you zero in on certain problematic components. However, if you're interested in your system performance as a whole, you need a tool to summarize CPU and memory action across all processes.

Monitoring CPU and Memory Performance with vmstat

Among the many tools available to monitor system performance, the vmstat command is one of the oldest, with minimal overhead. You'll find it handy for getting a high-level view of how often the kernel is swapping pages in and out, how busy the CPU is, and I/O utilization. The trick to unlocking the power of vmstat is to understand its output. For ex.: here's some output from vmstat 2, which reports statistics every 2 seconds:

 $ vmstat 2
 procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
  r  b   swpd    free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
  0  0      0 9720380  35232 5055412    0    0     1     1    2  261 12  2 86  0  0
  0  0      0 9726300  35232 5055308    0    0     0   130  462 1168  1  1 97  0  0
  0  0      0 9725708  35232 5055308    0    0     0     0  539 1972  1  1 98  0  0
  0  0      0 9725724  35232 5055308    0    0     0     0  437 1434  1  1 98  0  0
  0  0      0 9725676  35232 5055308    0    0     0     0  503 1545  4  2 94  0  0
  0  0      0 9726048  35232 5055308    0    0     0     0  467 1323  2  1 97  0  0
  0  0      0 9725676  35232 5055300    0    0     0     0  597 3085  2  1 97  0  0

The output falls into categories: procs for processes, memory for memory usage, swap for the pages pulled in and out of swap, io for disk usage, system for the number of times the kernel switches into kernel code, and cpu for the time used by different parts of the system.

The preceding output is typical for a system that isn't doing much. You'll usually start looking at the second line of the output —— the first one is an average for the entire uptime of the system. For ex.: here the system has 0KB of memory swapped out to the disk (swpd) and around 9720380KB (9GB) of real memory free. Even though some swap space is in use, the zero-valued si (swap in) and so (swap out) columns report that the kernel is not currently swapping anything in or out from the disk buffers (see Section 4.2.5).

On the far right, under the cpu heading, you see the distribution of CPU time in the us, sy, id, and wa columns. These list (in order) the percentage of time that the CPU is spending on user tasks, system (kernel) tasks, idle time, and waiting for I/O. In the preceding example, there aren't too many user processes running (they're using a maximum of a few percent of the CPU); the kernel is doing practically nothing, and the CPU is sitting around doing nothing about 97 to 98 percent of the time.

Now, watch what happens when a big program starts up sometime later (the first two lines occur right before the program runs):

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

0  0      0 9651908  35244 5091892    0    0     0     0  544 2284  2  1 97  0  0
0  0      0 9695744  35244 5091892    0    0     0     0  926 7054  7  1 92  0  0
0  1      0 9652288  35244 5091904  202    0    66     2 1348 7917 15  4 81  0  0
0  2      0 9695248  35244 5092660   10    0   384     1  476 2007  3  1 96  0  0

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

As you can see, the CPU starts to see more usage for an extended period, especially from user processes. Because there is enough free memory, the amount of cache and buffer space used starts to increase as the kernel starts to use the disk more. Later on, we see something interesting: The kernel pulls some pages into memory that were once swapped out (the si column). This means that the program that just ran probably accessed some pages shared by another process. This is common; many processes use the code in certain shared libraries only when starting up.

Also notice from the b column that a few processes are blocked (prevented from running) while waiting for memory pages. Overall, the amount of free memory is decreasing, but it's nowhere near being depleted. There's also a fair amount of disk activity, as seen by the increasing numbers in the bi (blocks in) and bo (blocks out) columns.

The output is quite different when you run out of memory (OOM). As the free space depletes, both the buffer and cache sizes decrease because the kernel increasingly needs the space for user processes. Once there is nothing left, you'll start to see activity in the so (swapped out) column as the kernel starts moving pages onto the disk, at which point nearly all of the other output columns change to reflect the amount of work that the kernel is doing. You see more system time, more data going in and out of the disk, and more processes blocked because the memory they want to use is not available (it has been swapped out).

We haven't explained all of the vmstat output columns. You can dig deeper into them in the vmstat(8) manual page, but you might have to learn more about kernel memory management first from a class or a book like Operating System Concepts in order to understand them.

I/O Monitoring

By default, vmstat shows some general I/O statistics. Although you can get very detailed per-partition resource usage with vmstat -d, you'll get a lot of output from this option, which might be overwhelming. Instead, try starting out with a tool just for I/O called iostat.

Using iostat

Like vmstat, when running without any options, iostat shows the statistics for your machine's current uptime:

 $ iostat
 Linux 4.9.62-1-lts (game) 	19/11/17 	_x86_64_	(4 CPU)
 
 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           11,45    0,00    1,88    0,04    0,00   86,63

 Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
 sdb               1,25       133,11         0,27    3543769       7108
 sda               2,43        45,01       376,67    1198384   10028274
 sdc               0,01         0,18         0,00       4716          0

The avg-cpu part at the top reports the same CPU utilization information as other utilities that you've seen in this chapter, so skip down to the bottom, which shows you the following for each device:

 tps       Average number of data transfers per second. 
 kB_read/s Average number of kilobytes read per second. 
 kB_wrtn/s Average number of kilobytes written per second. 
 kB_read   Total number of kilobytes read.
 kB_wrtn   Total number of kilobytes written. 

Another similarity to vmstat is that you can give an interval argument, such as iostat 2, to give an update every 2 seconds. When using an interval, you might want to display only the device report by using the -d option (such as iostat -d 2).

By default, the iostat output omits partition information. To show all of the partition information, use the -p ALL option. Because there are many partitions on a typical system, you'll get a lot of output. Here's part of what you might see:

 $ iostat -p ALL 
 Linux 4.9.62-1-lts (game) 	19/11/17 	_x86_64_	(4 CPU)
 
 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           11,35    0,00    1,88    0,04    0,00   86,73
 
 Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
 sdb               1,24       131,68         0,27    3564141       7176
 sdb1              1,23       131,60         0,27    3562017       7176
 sda               2,42        44,27       371,15    1198384   10046020
 sda1              0,68        12,01        45,95     325197    1243640
 sda2              1,73        32,10        16,16     868863     437327
 sda3              0,00         0,08       309,05       2216    8365052
 sdc               0,01         0,17         0,00       4716          0
 sdc1              0,00         0,14         0,00       3680          0
 sdc2              0,00         0,00         0,00          0          0

In this example, sda1, sda2, and sda3 are all partitions of the sda disk, so there is some overlap between the read and write columns. However, the sum of the partition columns won't necessarily add up to the disk column: although a read from sda1 also counts as a read from sda, keep in mind that you can also read from sda directly, such as when reading the partition table.

Per-Process I/O Utilization and Monitoring: iotop

If you need to dig even deeper to see the I/O resources used by individual processes, the iotop tool can help. Using iotop is similar to using top: it shows a continuously updating display of the processes using the most I/O, with a general summary at the top:

 $ iotop
 Total DISK READ :       0.00 B/s | Total DISK WRITE :       0.00 B/s
 Actual DISK READ:       0.00 B/s | Actual DISK WRITE:       0.00 B/s
   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                     
 23061 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.04 % [kworker/u8:1]
     1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init
     2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
     3 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/0]
     7 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [rcu_sched]
     8 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [rcu_bh]
     9 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/0]
    10 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [lru-add-drain]
    11 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [cpuhp/0]
    12 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [cpuhp/1]
    13 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/1]
    14 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/1]
    16 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker/1:0H]
    17 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [cpuhp/2]
    18 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/2]
    19 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/2]
    22 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [cpuhp/3]
    23 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/3]
    24 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/3]
    26 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker/3:0H]
    27 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kdevtmpfs]
    28 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [netns]

Along with the user, command, and read/write columns, notice that there is a TID column (thread ID) instead of a process ID. The iotop tool is one of the few utilities that displays threads instead of processes. The PRIO (priority) column indicates the I/O priority. It's similar to the CPU priority that you've already seen, but it affects how quickly the kernel schedules I/O reads and writes for the process. In a priority such as be/4, the be part is the scheduling class, and the number is the priority level. As with CPU priorities, lower numbers are more important; for ex.: the kernel allows more I/O time for a process with be/3 than for one with be/4. The kernel uses the scheduling class to add more control over I/O scheduling. You'll see three scheduling classes from iotop:

 be   Best-effort. The kernel does its best to fairly schedule I/O for this class. Most processes run under this I/O scheduling class. 
 rt   Real-time. The kernel schedules any real-time I/O before any other class of I/O, no matter what. 
 idle Idle. The kernel performs I/O for this class only when there is no other I/O to be done. There is no priority level for the idle scheduling class. 

You can check and change the I/O priority for a process with the ionice utility; see the ionice(1) manual page for details. You probably will never need to worry about the I/O priority, though.
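If you do want to poke at it, here's a minimal sketch (the PID 320 is just a placeholder; substitute a process ID from your own system):

 $ ionice -p 320         # show the I/O scheduling class and priority of PID 320
 # ionice -c 3 -p 320    # move PID 320 into the idle class (run as root)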

Per-Process Monitoring with pidstat

You've seen how you can monitor specific processes with utilities such as top and iotop. However, those displays refresh over time, and each update erases the previous output. The pidstat utility allows you to see the resource consumption of a process over time, in the style of vmstat. Here's a simple example for monitoring process 320, updating every second:

 $ pidstat -p 320 1
 Linux 4.9.62-1-lts (game) 	19/11/17 	_x86_64_	(4 CPU)
 
 18:45:19      UID       PID    %usr %system  %guest    %CPU   CPU  Command
 18:45:20        0       320    0,00    0,00    0,00    0,00     0  NetworkManager
 18:45:21        0       320    0,00    0,00    0,00    0,00     0  NetworkManager
 18:45:22        0       320    0,00    0,00    0,00    0,00     0  NetworkManager
 18:45:23        0       320    0,00    0,00    0,00    0,00     0  NetworkManager
 18:45:24        0       320    0,00    0,00    0,00    0,00     0  NetworkManager
 18:45:25        0       320    0,00    0,00    0,00    0,00     0  NetworkManager
 18:45:26        0       320    0,00    0,00    0,00    0,00     0  NetworkManager
 18:45:27        0       320    0,00    0,00    0,00    0,00     0  NetworkManager
 18:45:28        0       320    0,00    0,00    0,00    0,00     0  NetworkManager

The default output shows the percentages of user and system time and the overall percentage of CPU time, and it even tells you which CPU the process was running on. (The %guest column here is somewhat odd: it's the percentage of time that the process spent running something inside a virtual machine. Unless you're running a virtual machine, don't worry about this.)

Although pidstat shows CPU utilization by default, it can do much more. For ex.: you can use the -r option to monitor memory and the -d option to turn on disk monitoring. Try them out, and then look at the pidstat(1) manual page to see even more options for threads, context switching, and just about anything else that we've talked about in this chapter.
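For instance, reusing process 320 from the example above (substitute your own PID), you could get vmstat-style per-second reports like this:

 $ pidstat -r -p 320 1   # memory statistics (page faults, RSS, and so on)
 $ pidstat -d -p 320 1   # disk read/write statistics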

Understanding Your Network and Its Configuration

Networking is the practice of connecting computers and sending data between them. That sounds simple enough, but to understand how it works, you need to ask two fundamental questions:

 * How does the computer sending the data know where to send its data? 
 * When the destination computer receives the data, how does it know what it just received? 

A computer answers these questions by using a series of components, with each one responsible for a certain aspect of sending, receiving, and identifying data. The components are arranged in groups that form network layers, which stack on top of each other in order to form a complete system. The Linux kernel handles networking in a similar way to the SCSI subsystem described in Chapter 3. Because each layer tends to be independent, it's possible to build networks with many different combinations of components. This is where network configuration can become very complicated. For this reason, we'll begin this chapter by looking at the layers in very simple networks. You'll learn how to view your own network settings, and when you understand the basic workings of each layer, you'll be ready to learn how to configure those layers by yourself. Finally, you'll move on to more advanced topics like building your own networks and configuring firewalls.

Network Basics

Before getting into the theory of network layers, take a look at the simple network shown in Figure 9-1.

Figure9-1.png

Figure 9-1: A typical local area network with a router that provides Internet access.

This type of network is ubiquitous; most home and small office networks are configured this way. Each machine connected to the network is called a host. The hosts are connected to a router, which is a host that can move data from one network to another. These machines (here, Hosts A, B, C, and D) and the router form a local area network (LAN). The connections on the LAN can be wired or wireless. The router is also connected to the Internet (the earth in the figure). Because the router is connected to both the LAN and the Internet, all machines on the LAN also have access to the Internet through the router. One of the goals of this chapter is to see how the router provides this access. Your initial point of view will be from a Linux-based machine such as Host A on the LAN in Figure 9-1.

Packets

A computer transmits data over a network in small chunks called packets, which consist of two parts: a header and a payload. The header contains identifying information such as the origin/destination hosts and the basic protocol. The payload, on the other hand, is the actual application data that the computer wants to send (for ex.: HTML or image data). Packets allow a host to communicate with others "simultaneously", because hosts can send, receive, and process packets in any order, regardless of where they came from or where they're going. Breaking messages into smaller units also makes it easier to detect and compensate for errors in transmission. For the most part, you don't have to worry about translating between packets and the data that your application uses, because the OS has facilities that do this for you. However, it is helpful to know the role of packets in the network layers that you're about to see.

Network Layers

A fully functioning network includes a full set of network layers called a network stack. Any functional network has a stack. The typical Internet stack, from the top to the bottom layer, looks like this:

 Application layer: Contains the "language" that applications and servers use to communicate; usually a high-level protocol of some sort. Common application layer protocols 
                    include Hypertext Transfer Protocol (HTTP, used for the web), Secure Socket Layer (SSL), and File Transfer Protocol (FTP). Application layer protocols can often 
                    be combined. For ex.: SSL is commonly used in conjunction with HTTP. 
 
 Transport layer: Defines the data transmission characteristics of the application layer. This layer includes data integrity checking, source and destination ports, and specifications 
                  for breaking application data into packets (if the application layer has not already done so). Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) are the 
                  most common transport layer protocols. The transport layer is also sometimes called the protocol layer. 
 
 Network or Internet layer: Defines how to move packets from a source host to a destination host. The particular packet transit rule set for the Internet is known as Internet Protocol (IP). 
                            Because we'll only talk about Internet networks, we'll really only be talking about the Internet layer. However, because network layers are meant to be hardware independent, 
                            you can simultaneously configure several independent network layers (such as IP, IPv6, IPX, and AppleTalk) on a single host. 
 
 Physical layer: Defines how to send raw data across a physical medium, such as Ethernet or a modem. This is sometimes called the link layer or host-to-network layer. 

It's important to understand the structure of a network stack because your data must travel through these layers at least twice before it reaches a program at its destination. For ex.: if you're sending data from Host A to Host B, as shown in Figure 9-1, your bytes leave the application layer on Host A and travel through the transport and network layers on Host A; then they go down to the physical medium, across the medium, and up again through the various lower levels to the application layer on Host B in much the same way. If you're sending something to a host on the Internet through the router, it will go through some (but usually not all) of the layers on the router and anything else in between. The layers sometimes bleed into each other in strange ways because it can be inefficient to process all of them in order. For ex.: devices that historically dealt with only the physical layer now sometimes look at the transport and Internet layer data to filter and route data quickly.

We'll begin by looking at how your Linux machine connects to the network in order to answer the where question at the beginning of the chapter. This is the lower part of the stack: the physical and network layers. Later, we'll look at the upper two layers that answer the what question.

Note: You might have heard of another set of layers known as the Open Systems Interconnection (OSI) Reference Model. This is a seven-layer network model often used in teaching and designing networks. We won't cover the OSI model because you'll be working directly with the four layers described here.

The Internet Layer

Rather than start at the very bottom of the network stack with the physical layer, we'll start at the network layer because it can be easier to understand. The Internet as we currently know it is based on the Internet Protocol, version 4 (IPv4), though version 6 (IPv6) is gaining adoption. One of the most important aspects of the Internet layer is that it's meant to be a software network that places no particular requirements on hardware or OS. The idea is that you can send and receive Internet packets over any kind of hardware, using any OS. The Internet's topology is decentralized: it's made up of smaller networks called subnets. The idea is that all subnets are interconnected in some way. For ex.: in Figure 9-1, the LAN is normally a single subnet.

A host can be attached to more than one subnet. As you saw in Section 9.1, that kind of host is called a router if it can transmit data from one subnet to another (another term for router is gateway). Figure 9-2 refines Figure 9-1 by identifying the LAN as a subnet, as well as Internet addresses for each host and the router. The router in the figure has two addresses, one on the local subnet (10.23.2.1) and one on the link to the Internet (but this Internet link's address is not important right now, so it's just marked "Uplink Address").

We'll look first at the addresses and then at the subnet notation. Each Internet host has at least one numeric IP address in the form of a.b.c.d, such as 10.23.2.37. An address in this notation is called a dotted-quad sequence. If a host is connected to multiple subnets, it has at least one IP address per subnet. Each host's IP address should be unique across the entire Internet, but as you'll see later, private networks and NAT can make this a little confusing.

Figure9-2.png

Note: Technically, an IP address consists of 4 bytes (or 32 bits), a.b.c.d. Bytes a and d are numbers from 1 to 254, and b and c are numbers from 0 to 255. A computer processes IP addresses as raw bytes. However, it's much easier for a human to read and write a dotted-quad address, such as 10.23.2.37, instead of something ugly like the hexadecimal 0x0A170225.

IP addresses are like postal addresses in some ways. To communicate with another host, your machine must know that other host's IP address. Let's take a look at the address on your machine.

Viewing Your Computer's IP Addresses

One host can have many IP addresses. To see the addresses that are active on your Linux machine, run:

 $ ifconfig  #or with ip command: ip a s

There will probably be a lot of output, but it should include something like this:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 enp0s31f6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
         inet 10.23.2.4  netmask 255.255.255.0  broadcast 10.23.2.255
         inet6 fe80::30a9:9625:4298:4e3b  prefixlen 64  scopeid 0x20<link>
         ether 38:d5:47:1b:ae:b4  txqueuelen 1000  (Ethernet)
         RX packets 324876  bytes 163980102 (156.3 MiB)
         RX errors 0  dropped 0  overruns 0  frame 0
         TX packets 327267  bytes 107707100 (102.7 MiB)
         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
         device interrupt 16  memory 0xf7200000-f7220000  

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

The ifconfig command's output includes many details from both the Internet layer and the physical layer. (Sometimes it doesn't even include an Internet address at all!) We'll discuss the output in more detail later, but for now, concentrate on the second line, which reports that the host is configured to have an IPv4 address (inet) of 10.23.2.4. On the same line, a netmask is reported as being 255.255.255.0. This is a subnet mask, which defines the subnet that an IP address belongs to. Let's see how that works.

Note: The ifconfig command, as well as some of the others you'll see later in this chapter (such as route and arp), has been technically supplanted by the newer ip command. The ip command can do more than the old commands, and it is preferable when writing scripts. However, most people still use the old commands when manually working with the network, and these commands can also be used on other versions of Unix. For this reason, we'll use the old-style commands.

Subnets

A subnet is a connected group of hosts with IP addresses in some sort of order. Usually, the hosts are on the same physical network, as shown in Figure 9-2. For ex.: the hosts between 10.23.2.1 and 10.23.2.254 could comprise a subnet, as could all hosts between 10.23.1.1 and 10.23.255.254. You define a subnet with two pieces: a network prefix and a subnet mask (such as the one in the output of ifconfig in the previous section). Let's say you want to create a subnet containing the IP addresses between 10.23.2.1 and 10.23.2.254. The network prefix is the part that is common to all addresses in the subnet; in this example, it is 10.23.2.0, and the subnet mask is 255.255.255.0. Let's see why those are the right numbers.

It's not immediately clear how the prefix and mask work together to give you all possible IP addresses on a subnet. Looking at the numbers in binary form helps clear it up. The mask marks the bit locations in an IP address that are common to the subnet. For ex.: here are the binary forms of 10.23.2.0 and 255.255.255.0:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 Network:   10.23.2.0            00001010.00010111.00000010. 00000000
 Netmask:   255.255.255.0 = 24   11111111.11111111.11111111. 00000000

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Now, let's mark the bit locations in 10.23.2.0 that are 1s in 255.255.255.0; in this plain-text rendering, those are the first 24 bits (everything before the final octet):

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 Network:   10.23.2.0            00001010.00010111.00000010. 00000000

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Look at the remaining bits (the final octet). You can set any number of these bits to 1 to get a valid IP address in this subnet, with the exception of all 0s (the network address itself) or all 1s (reserved for broadcast).

Putting it all together, you can see how a host with an IP address of 10.23.2.1 and a subnet mask of 255.255.255.0 is on the same subnet as any other computers that have IP addresses beginning with 10.23.2. You can denote this entire subnet as 10.23.2.0/255.255.255.0.
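Because applying the mask is just a bitwise AND between the address and the netmask, you can check the prefix for 10.23.2.37 with shell arithmetic (a quick sketch, assuming a bash-like shell):

 $ echo $((10 & 255)).$((23 & 255)).$((2 & 255)).$((37 & 0))
 10.23.2.0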

Common Subnet Masks and CIDR Notation

If you're lucky, you'll only deal with easy subnet masks like 255.255.255.0 or 255.255.0.0, but you may be unfortunate and encounter stuff like 255.255.255.192, where it isn't quite so simple to determine the set of addresses that belong to the subnet. Furthermore, it's likely that you'll also encounter a different form of subnet representation called Classless Inter-Domain Routing (CIDR) notation, where a subnet such as 10.23.2.0/255.255.255.0 is written as 10.23.2.0/24.

To understand what this means, look at the mask in binary form (as in the example you saw in the preceding section). You'll find that nearly all subnet masks are just a bunch of 1s followed by a bunch of 0s. For ex.: you just saw that 255.255.255.0 in binary form is 24 1-bits followed by 8 0-bits. The CIDR notation identifies the subnet mask by the number of leading 1s in the subnet mask. Therefore, a combination such as 10.23.2.0/24 includes both the subnet prefix and its subnet mask.

Here are some examples of subnet masks and their CIDR forms:

 Long Form:                 CIDR Form: 
 255.0.0.0                  8        
 255.255.0.0                16
 255.240.0.0                12
 255.255.255.0              24
 255.255.255.192            26

Note: If you aren't familiar with conversion between decimal, binary, and hexadecimal formats, you can use a calculator utility such as bc (or dc) to convert between different radix representations. For ex.: in bc, you can run the command obase=2; 240 to print the number 240 in binary (base 2) form.
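For ex.: here's a noninteractive way to run that same conversion by piping the command into bc:

 $ echo 'obase=2; 240' | bc
 11110000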

Identifying subnets and their hosts is the first building block to understanding how the Internet works. However, you still need to connect the subnets.

Routes and the Kernel Routing Table

Connecting Internet subnets is mostly a process of identifying the hosts connected to more than one subnet. Returning to Figure 9-2, think about Host A at IP address 10.23.2.4. This host is connected to a local network of 10.23.2.0/24 and can directly reach hosts on that network. To reach hosts on the rest of the Internet, it must communicate through the router at 10.23.2.1. How does the Linux kernel distinguish between these two different kinds of destinations? It uses a destination configuration called a routing table to determine its routing behavior. To show the routing table, use the route -n command. Here's what you might see for a simple host such as 10.23.2.4:

 $ route -n # or with ip command: ip route
 Kernel IP routing table
 Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
 0.0.0.0         10.23.2.1       0.0.0.0         UG    100    0        0 enp0s31f6
 10.23.2.0       0.0.0.0         255.255.255.0   U     100    0        0 enp0s31f6

The last two lines here contain the routing information. The Destination column tells you a network prefix, and the Genmask column is the netmask corresponding to that network. There are two networks defined in this output: 0.0.0.0/0 (which matches every address on the Internet) and 10.23.2.0/24. Each network has a U under its Flags column, indicating that the route is active ("up"). Where the destinations differ is in the combination of their Gateway and Flags columns. For 0.0.0.0/0, there is a G in the Flags column, meaning that communication for this network must be sent through the gateway in the Gateway column (10.23.2.1, in this case). However, for 10.23.2.0/24, there is no G in Flags, indicating that the network is directly connected in some way. Here, 0.0.0.0 is used as a stand-in under Gateway. You can ignore the other columns of output for now.

There's one tricky detail: say the host wants to send something to 10.23.2.132, which matches both rules in the routing table, 0.0.0.0/0 and 10.23.2.0/24. How does the kernel know to use the second one? It chooses the longest destination prefix that matches. This is where CIDR network form comes in particularly handy: 10.23.2.0/24 matches, and its prefix is 24 bits long; 0.0.0.0/0 also matches, but its prefix is 0 bits long (that is, it has no prefix), so the rule for 10.23.2.0/24 takes priority.
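If you want to see which route the kernel would actually choose for a particular destination, the newer ip command can ask it directly (a sketch; the output here is abbreviated and will vary by system):

 $ ip route get 10.23.2.132
 10.23.2.132 dev enp0s31f6 src 10.23.2.4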

Note: The -n option tells route to show IP addresses instead of showing hosts and networks by name. This is an important option to remember because you'll be able to use it in other network-related commands such as netstat.

The Default Gateway

An entry for 0.0.0.0/0 in the routing table has special significance because it matches any address on the Internet. This is the default route, and the address configured under the Gateway column (in the route -n output) in the default route is the default gateway. When no other rules match, the default route always does, and the default gateway is where you send messages when there is no other choice. You can configure a host without a default gateway, but it won't be able to reach hosts outside the destinations in the routing table.

Note: On most networks with a netmask of 255.255.255.0, the router is usually at address 1 of the subnet (for ex.: 10.23.2.1 in 10.23.2.0/24). Because this is simply a convention, there can be exceptions.

Basic ICMP and DNS Tools

Now it's time to look at some basic practical utilities to help you interact with hosts. These tools use two protocols of particular interest: Internet Control Message Protocol (ICMP), which can help you root out problems with connectivity and routing, and the Domain Name Service (DNS) system, which maps names to IP addresses so that you don't have to remember a bunch of numbers.

ping

ping (see http://ftp.arl.mil/~mike/ping.html) is one of the most basic network debugging tools. It sends ICMP echo request packets to a host, asking the recipient host to return the packet to the sender. If the recipient host gets the packet and is configured to reply, it sends an ICMP echo response packet in return.

For ex.: say that you run ping 10.23.2.1 and get this output:

 $ ping 10.23.2.1
 PING 10.23.2.1 (10.23.2.1) 56(84) bytes of data.
 64 bytes from 10.23.2.1: icmp_seq=1 ttl=64 time=1.413 ms
 64 bytes from 10.23.2.1: icmp_seq=2 ttl=64 time=2.317 ms
 64 bytes from 10.23.2.1: icmp_seq=4 ttl=64 time=1.305 ms
 64 bytes from 10.23.2.1: icmp_seq=5 ttl=64 time=1.304 ms

The first line says that you're sending 56-byte packets (84 bytes, if you include the headers) to 10.23.2.1 (by default, one packet per second), and the remaining lines indicate responses from 10.23.2.1. The most important parts of the output are the sequence number (icmp_seq) and the round-trip time (time). The number of bytes returned is the size of the packet sent plus 8. (The content of the packets isn't important to you.)

A gap in the sequence numbers, such as the one between 2 and 4, usually means there's some kind of connectivity problem. It's possible for packets to arrive out of order, and if they do, there's some kind of problem, because ping sends only one packet a second. If a response takes more than a second (1,000ms) to arrive, the connection is extremely slow.

The round-trip time is the total elapsed time between the moment that the request packet leaves and the moment that the response packet arrives. If there's no way to reach the destination, the final router to see the packet returns an ICMP "host unreachable" packet to ping.

On a wired LAN, you should expect absolutely no packet loss and very low numbers for the round-trip time. (The preceding example output is from a wireless network.) You should also expect no packet loss from your network to and from your ISP, and reasonably steady round-trip times.

Note: For security reasons, not all hosts on the Internet respond to ICMP echo request packets, so you might find that you can connect to a website on a host but not get a ping response.

traceroute

The ICMP-based program traceroute will come in handy when you reach the material on routing later in this chapter. Use traceroute host to see the path your packets take to a remote host. (traceroute -n host will disable hostname lookups.)

One of the best things about traceroute is that it reports round-trip times at each step in the route, as demonstrated in this output fragment:

 $ traceroute -n 8.8.8.8
 traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
 1  192.168.250.254  0.301 ms  0.329 ms  0.367 ms
 2  192.168.0.1  2.295 ms  4.628 ms  5.309 ms
 3  * * *
 4  213.224.202.177  28.292 ms  27.079 ms  28.472 ms
 5  213.224.250.107  29.495 ms  28.845 ms  29.214 ms
 6  213.224.125.3  31.762 ms  41.569 ms  41.808 ms
 7  216.239.50.199  38.946 ms 216.239.63.187  42.598 ms  42.789 ms
 8  8.8.8.8  27.869 ms  25.556 ms  27.376 ms

Because this output shows a big latency jump between hops 6 and 7, that part of the route is probably some sort of long-distance link. The output from traceroute can be inconsistent. For ex.: the replies may time out at a certain step, only to "reappear" in later steps. The reason is usually that the router at that step refused to return the debugging output that traceroute wants, but routers in later steps were happy to return the output. In addition, a router might choose to assign a lower priority to the debugging traffic than it does to normal traffic.

DNS and host

IP addresses are difficult to remember and subject to change, which is why we normally use names such as www.example.com instead. The DNS library on your system normally handles this translation automatically, but sometimes you'll want to manually translate between a name and an IP address. To find the IP address behind a domain name, use the host command:

 $ host www.example.com
 www.example.com has address 93.184.216.34
 www.example.com has IPv6 address 2606:2800:220:1:248:1893:25c8:1946

Notice how this example has both the IPv4 address 93.184.216.34 and the much larger IPv6 address. This means that this host also has an address on the next-generation version of the Internet. You can also use host in reverse: enter an IP address instead of a hostname to try to discover the hostname behind the IP address. But don't expect this to work reliably. Many hostnames can represent a single IP address, and DNS doesn't know how to determine which hostname should correspond to an IP address. The domain administrator must manually set up this reverse lookup, and often the administrator does not. (There is a lot more to DNS than the host command. We'll cover basic client configuration later, in Section 9.12.)
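When a reverse lookup has been set up, it looks something like this (the pointer name shown is just an illustration; what you get depends entirely on the administrator's configuration):

 $ host 8.8.8.8
 8.8.8.8.in-addr.arpa domain name pointer dns.google.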

The Physical Layer and Ethernet

One of the key things to understand about the Internet is that it's a software network. Nothing we've discussed so far is hardware specific, and indeed, one reason for the Internet's success is that it works on almost any kind of computer, OS, and physical network. However, you still have to put a network layer on top of some kind of hardware, and that interface is called the physical layer.

We'll look at the most common kind of physical layer: an Ethernet network. The IEEE 802 family of standards documents defines many different kinds of Ethernet networks, from wired to wireless, but they all have a few things in common, in particular, the following:

 * All devices on an Ethernet network have a Media Access Control (MAC) address, sometimes called a hardware address. This address is independent of a host's 
   IP address, and it is unique to the host's Ethernet network (but not necessarily to a larger software network such as the Internet). A sample MAC address is 10:78:d2:eb:76:97.  
 
 * Devices on an Ethernet network send messages in frames, which are wrappers around the data sent. A frame contains the origin and destination MAC addresses. 

Ethernet doesn't really attempt to go beyond hardware on a single network. For ex.: if you have two different Ethernet networks with one host attached to both networks (and two different network interface devices), you can't directly transmit a frame from one Ethernet network to the other unless you set up a special Ethernet bridge. And this is where higher network layers (such as the Internet layer) come in. By convention, each Ethernet network is also usually an Internet subnet. Even though a frame can't leave one physical network, a router can take the data out of a frame, repackage it, and send it to a host on a different physical network, which is exactly what happens on the Internet.

Understanding Kernel Network Interfaces

The physical and the Internet layers must be connected in a way that allows the Internet layer to retain its hardware-independent flexibility. The Linux kernel maintains its own division between the two layers and provides a communication standard for linking them, called a (kernel) network interface. When you configure a network interface, you link the IP address settings from the Internet side with the hardware identification on the physical device side. Network interfaces have names that usually indicate the kind of hardware underneath, such as eth0 (the first Ethernet card in the computer) and wlan0 (a wireless interface). In Section 9.3.1, you learned the most important command for viewing or manually configuring the network interface settings: ifconfig. Recall this output:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 enp0s31f6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
         inet 10.23.2.4  netmask 255.255.255.0  broadcast 10.23.2.255
         inet6 fe80::30a9:9625:4298:4e3b  prefixlen 64  scopeid 0x20<link>
         ether 38:d5:47:1b:ae:b4  txqueuelen 1000  (Ethernet)
         RX packets 324876  bytes 163980102 (156.3 MiB)
         RX errors 0  dropped 0  overruns 0  frame 0
         TX packets 327267  bytes 107707100 (102.7 MiB)
         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
         device interrupt 16  memory 0xf7200000-f7220000  

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

For each network interface, the left side of the output shows the interface name, and the right side contains settings and statistics for the interface. In addition to the Internet layer pieces that we've already covered, you also see the MAC address on the physical layer (the ether line). The lines containing UP and RUNNING tell you that the interface is working. Although ifconfig shows some hardware information (in this case, even some low-level device settings such as the interrupt and memory used), it's designed primarily for viewing and configuring the software layers attached to the interfaces. To dig deeper into the hardware and physical layer behind a network interface, use something like the ethtool command to display or change the settings on Ethernet cards. (We'll look briefly at wireless networks in Section 9.23.)
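For ex.: running ethtool on the interface above reports physical-layer details such as link speed and duplex mode (a trimmed sketch; the exact fields depend on your hardware and driver):

 $ ethtool enp0s31f6
 Settings for enp0s31f6:
         Speed: 1000Mb/s
         Duplex: Full
         Link detected: yes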

Introduction to Network Interface Configuration

You've now seen all of the basic elements that go into the lower levels of a network stack: the physical layer, the network (Internet) layer, and the Linux kernel's network interfaces. In order to combine these pieces to connect a Linux machine to the Internet, you or a piece of software must do the following:

 1. Connect the network hardware and ensure that the kernel has a driver for it. If the driver is present, ifconfig -a displays a kernel network interface corresponding to the hardware. 
 2. Perform any additional physical layer setup, such as choosing a network name or password. 
 3. Bind an IP address and netmask to the kernel network interface so that the kernel's device drivers (physical layer) and Internet subsystems (Internet layer) can talk to each other. 
 4. Add any additional necessary routes, including the default gateway. 

When all machines were big stationary boxes wired together, this was relatively straightforward: the kernel did step 1, you didn't need step 2, and you'd do step 3 with the ifconfig command and step 4 with the route command. To manually set the IP address and netmask for a kernel network interface, you'd do this:

 # ifconfig interface address netmask mask

Here, interface is the name of the interface, such as eth0. Once the interface is up, you're ready to add routes, which is typically just a matter of setting the default gateway, like this:

 # route add default gw gw-address

The gw-address parameter is the IP address of your default gateway; it must be an address in a locally connected subnet defined by the address and mask settings of one of your network interfaces.
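Putting steps 3 and 4 together with the sample addresses from Figure 9-2, a manual setup for Host A might look like this (the interface name eth0 is just an assumption for this sketch):

 # ifconfig eth0 10.23.2.4 netmask 255.255.255.0
 # route add default gw 10.23.2.1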

Manually Adding and Deleting Routes

To remove a default gateway, run

 # route del -net default

You can easily override the default gateway with other routes. For ex.: say your machine is on subnet 10.23.2.0/24, you want to reach a subnet at 192.168.45.0/24, and you know that 10.23.2.44 can act as a router for that subnet. Run this command to send traffic bound for 192.168.45.0 to that router:

 # route add -net 192.168.45.0/24 gw 10.23.2.44

You don't need to specify the router in order to delete a route:

 # route del -net 192.168.45.0/24

Now, before you go crazy with routes, you should know that messing with routes is often more complicated than it appears. For this particular example, you also have to make sure that the routing for all hosts on 192.168.45.0/24 can lead back to 10.23.2.0/24, or the first route you add is basically useless.

Normally, you should keep things as simple as possible for your clients, setting up networks so that their hosts need only a default route. If you need multiple subnets and the ability to route between them, it's usually best to configure the routers acting as the default gateways to do all of the work of routing between different local subnets. (You'll see an example in Section 9.17.)

Boot-Activated Network Configuration

We've discussed ways to manually configure a network, and the traditional way to ensure the correctness of a machine's network configuration was to have init run a script that performs that manual configuration at boot time. This boils down to running tools like ifconfig and route somewhere in the chain of boot events. Many servers still do it this way.

There have been many attempts in Linux to standardize configuration files for boot-time networking. The tools ifup and ifdown attempt to do so; for ex.: a boot script can (in theory) run ifup eth0 to run the correct ifconfig and route commands for the eth0 interface. Unfortunately, different distributions have completely different implementations of ifup and ifdown, and as a result, their configuration files are also completely different. Ubuntu, for ex.: uses the ifupdown suite with configuration in /etc/network, and Fedora uses its own set of scripts with configuration in /etc/sysconfig/network-scripts.

You don't need to know the details of these configuration files, and if you insist on doing it all by hand and bypassing your distribution's configuration tools, you can just look up the formats in manual pages such as ifup(8) and interfaces(5). But it is important to know that this type of boot-activated configuration is often not even used. You'll most often see it for the localhost (or lo) network interface but nothing else, because it's too inflexible to meet the needs of modern systems.

Problems with Manual and Boot-Activated Network Configuration

Although most systems used to configure the network in their boot mechanisms, and many servers still do, the dynamic nature of modern networks means that most machines don't have static (unchanging) IP addresses. Rather than storing the IP address and other network information on your machine, your machine gets this information from somewhere on the local physical network when it first attaches to that network. Most normal network client applications don't particularly care what IP address your machine uses, as long as it works. Dynamic Host Configuration Protocol (DHCP, described in Section 9.16) tools do the basic network layer configuration on typical clients.

There's more to the story, though. For ex.: wireless networks add additional dimensions to interface configuration, such as network names, authentication, and encryption techniques. When you step back to look at the bigger picture, you see that your system needs a way to answer the following questions:

 * If the machine has multiple physical network interfaces (such as a notebook with wired and wireless Ethernet), how do you choose which one(s) to use? 
 * How should the machine set up the physical interface? For wireless networks, this includes scanning for network names, choosing a name, and negotiating authentication. 
 * Once the physical network interface is connected, how should the machine set up the software network layers, such as the Internet layer?
 * How can you let a user choose connectivity options? For ex.: how do you let a user choose a wireless network? 
 * What should the machine do if it loses connectivity on a network interface? 

Answering these questions is usually more than simple boot scripts can handle, and it's a real hassle to do it all by hand. The answer is to use a system service that can monitor physical networks and choose (and automatically configure) the kernel network interfaces based on a set of rules that makes sense to the user. The service should also be able to respond to requests from users, who should be able to change the wireless network they're on without having to become root just to fiddle around with network settings every time something changes.

Network Configuration Managers

There are several ways to automatically configure networks in Linux-based systems. The most widely used option on desktops and notebooks is NetworkManager. Other network configuration management systems are mainly targeted at smaller embedded systems, such as OpenWRT's netifd, Android's ConnectivityManager service, ConnMan, and Wicd. We'll briefly discuss NetworkManager because it's the one you're most likely to encounter. We won't go into a tremendous amount of detail, though, because after you see the big picture, NetworkManager and other configuration systems will be more transparent.

NetworkManager Operation

NetworkManager is a daemon that the system starts upon boot. Like all daemons, it does not depend on a running desktop component. Its job is to listen to events from the system and users and to change the network configuration based on a bunch of rules. When running, NetworkManager maintains two basic levels of configuration. The first is a collection of information about available hardware devices, which it normally collects from the kernel and maintains by monitoring udev over the Desktop Bus (D-Bus). The second configuration level is a more specific list of connections: hardware devices and additional physical and network layer configuration parameters. For ex.: a wireless network can be represented as a connection.

To activate a connection, NetworkManager often delegates the tasks to other specialized network tools and daemons, such as dhclient, to get Internet layer configuration from a locally attached physical network. Because network configuration tools and schemes vary among distributions, NetworkManager uses plugins to interface with them, rather than imposing its own standard. There are plugins for both the Debian/Ubuntu and Red Hat-style interface configurations, for example.

Upon startup, NetworkManager gathers all available network device information, searches its list of connections, and then decides whether to try to activate one. Here's how it makes that decision for Ethernet interfaces:

 1. If a wired connection is available, try to connect using it. Otherwise, try the wireless connections. 
 2. Scan the list of available wireless networks. If a network is available that you've previously connected to, NetworkManager will try it again. 
 3. If more than one previously connected wireless network is available, select the most recently connected one. 

After establishing a connection, NetworkManager maintains it until the connection is lost, a better network becomes available (for ex.: you plug in a network cable while connected over wireless), or the user forces a change.

Interacting with NetworkManager

Most users interact with NetworkManager through an applet on the desktop; it's usually an icon in the upper or lower right that indicates the connection status (wired, wireless, or not connected). When you click on the icon, you get a number of connectivity options, such as a choice of wireless networks and an option to disconnect from your current network. Each desktop environment has its own version of this applet, so it looks a little different on each one.

In addition to the applet, there are a few tools that you can use to query and control NetworkManager from your shell. For a very quick summary of your current connection status, use the nm-tool command with no arguments. You'll get a list of interfaces and configuration parameters. In some ways, this is like ifconfig except that there's more detail, especially when viewing wireless connections.

 $ nm-tool

To control NetworkManager from the command line, use the nmcli command. This is a somewhat extensive command; see the nmcli(1) manual page for more information. Finally, the utility nm-online will tell you whether the network is up or down. If the network is up, the command returns zero as its exit code; it's nonzero otherwise. (For more on how to use an exit code in a shell script, see Chapter 11.)
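For ex.: two reasonably safe nmcli invocations to start with (assuming a recent NetworkManager; the exact subcommands vary somewhat between versions):

 $ nmcli device            # list network devices and their state
 $ nmcli connection show   # list the known connections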

NetworkManager Configuration

The general configuration directory for NetworkManager is usually /etc/NetworkManager, and there are several different kinds of configuration. The general configuration file is NetworkManager.conf. The format is similar to the XDG-style .desktop and Microsoft .ini files, with key-value parameters falling into different sections. You'll find that nearly every configuration file has a [main] section that defines the plugins to use. Here's an example that activates the ifupdown plugin used by Ubuntu and Debian:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 [main]
 plugins=ifupdown,keyfile

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Other distribution-specific plugins are ifcfg-rh (for Red Hat-style distributions) and ifcfg-suse (for SuSE). The keyfile plugin that you also see here supports NetworkManager's native configuration file format. When using this plugin, you can see the system's known connections in /etc/NetworkManager/system-connections.

For the most part, you won't need to change NetworkManager.conf because the more specific configuration options are found in other files.

Unmanaged Interfaces

Although you may want NetworkManager to manage most of your network interfaces, there will be times when you want it to ignore certain interfaces. For ex.: there's no reason why most users would need any kind of dynamic configuration on the localhost (lo) interface, because its configuration never changes. You also want to configure this interface early in the boot process, because basic system services often depend on it. Most distributions keep NetworkManager away from localhost. You can tell NetworkManager to disregard an interface by using plugins. If you're using the ifupdown plugin (for ex.: in Ubuntu and Debian), add the interface configuration to your /etc/network/interfaces file and then set the value of managed to false in the ifupdown section of the NetworkManager.conf file:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 [ifupdown]
 managed=false 

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

For the ifcfg-rh plugin, look for a line like this in the ifcfg-* configuration files in the /etc/sysconfig/network-scripts directory:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 NM_CONTROLLED=yes

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

If this line is not present or the value is set to no, NetworkManager ignores the interface. For ex.: you'll find it deactivated in the ifcfg-lo file. You can also specify a hardware address to ignore, like this:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 HWADDR=10:78:d2:eb:76:97

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

If you don't use either of these network configuration schemes, you can still use the keyfile plugin to specify the unmanaged device directly inside your NetworkManager.conf file using the MAC address. Here's how that might look:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 [keyfile]
 unmanaged-devices=mac:10:78:d2:eb:76:97;mac:14:78:ff:cb:76:99

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Dispatching

One final detail of NetworkManager configuration relates to specifying additional system actions for when a network interface goes up or down. For ex.: some network daemons need to know when to start or stop listening on an interface in order to work correctly (such as the secure shell daemon discussed in the next chapter).

When the network interface status on a system changes, NetworkManager runs everything in /etc/NetworkManager/dispatcher.d with an argument such as up or down. This is relatively straightforward, but many distributions have their own network control scripts, so they don't place the individual dispatcher scripts in this directory. Ubuntu, for ex.: has just one script named 01ifupdown that runs everything in an appropriate subdirectory of /etc/network, such as /etc/network/if-up.d.

As with the rest of the NetworkManager configuration, the details of these scripts are relatively unimportant; all you need to know is how to track down the appropriate location if you need to make an addition or change. As ever, don't be shy about looking at scripts on your system.

Resolving Hostnames

One of the final basic tasks in any network configuration is hostname resolution with DNS. You've already seen the host resolution tool that translates a name such as www.example.com to an IP address. DNS differs from the network elements we've looked at so far because it's in the application layer, entirely in user space. Technically, it is slightly out of place in this chapter alongside the Internet and physical layer discussion, but without proper DNS configuration, your Internet connection is practically worthless. No one in their right mind advertises IP addresses for websites and email addresses, because a host's IP address is subject to change and it's not easy to remember a bunch of numbers. Automatic network configuration services such as DHCP nearly always include DNS configuration.

Nearly all network applications on a Linux system perform DNS lookups. The resolution process typically unfolds like this:

 1. The application calls a function to look up the IP address behind a hostname. This function is in the system's shared library, so the application doesn't need to know the details of how it works or 
    whether the implementation will change. 
 2. When the function in the shared library runs, it acts according to a set of rules (found in /etc/nsswitch.conf) to determine a plan of action on lookups. For ex.: the rules usually say that even
    before going to DNS, check for a manual override in the /etc/hosts file. 
 3. When the function decides to use DNS for the name lookup, it consults an additional configuration file to find a DNS name server. The name server is given as an IP address. 
 4. The function sends a DNS lookup request (over the network) to the name server. 
 5. The name server replies with the IP address for the hostname, and the function returns this IP address to the application. 

This is the simplified version. In a typical modern system, there are more actors attempting to speed up the transaction and/or add flexibility. Let's ignore that for now and take a closer look at the basic pieces.

/etc/hosts

On most systems, you can override hostname lookups with the /etc/hosts file. It usually looks like this:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 127.0.0.1	localhost
 127.0.1.1	game game.oswincorp.pw
 10.23.2.4     media media.oswincorp.pw

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

You'll nearly always see the entry for localhost here.

Note: In the bad old days, there was one central host file that everyone copied to their own machine in order to stay up-to-date (see RFCs 606, 608, 623, and 625), but as the ARPANET/Internet grew, this quickly got out of hand.

resolv.conf

The traditional configuration file for DNS servers is /etc/resolv.conf. When things were simpler, a typical example might have looked like this, where the ISP's name server addresses are 10.32.45.23 and 10.3.2.3:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 search mydomain.example.com example.com
 nameserver 10.32.45.23
 nameserver 10.3.2.3

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

The search line defines rules for incomplete hostnames (just the first part of the hostname; for ex.: myserver instead of myserver.example.com). Here, the resolver library would try to look up host.mydomain.example.com and host.example.com. But things are usually no longer this straightforward. Many enhancements and modifications have been made to the DNS configuration.

Caching and Zero-Configuration DNS

There are two main problems with the traditional DNS configuration. First, the local machine does not cache name server replies, so frequent repeated network access may be unnecessarily slow due to name server requests. To solve this problem, many machines (and routers, if acting as name servers) run an intermediate daemon to intercept name server requests and return a cached answer to name service requests if possible; otherwise, requests go to a real name server. Two of the most common daemons for Linux are dnsmasq and nscd. You can also set up BIND (the standard Unix name server daemon) as a cache. You can often tell that you're running a name server caching daemon when you see 127.0.0.1 (lo) in your /etc/resolv.conf file or when you see 127.0.0.1 show up as the server if you run nslookup -debug host.

It can be tricky to track down your configuration if you're running a name server caching daemon. By default, dnsmasq has the configuration file /etc/dnsmasq.conf, but your distribution may override that. For ex.: in Ubuntu, if you've manually set up an interface that's handled by NetworkManager, you'll find the appropriate configuration file in /etc/NetworkManager/system-connections, because when NetworkManager activates a connection, it also starts dnsmasq with that configuration. (You can override all of this by uncommenting the dnsmasq part of your NetworkManager.conf.)

The other problem with the traditional name server setup is that it can be particularly inflexible if you want to be able to look up names on your local network without messing around with a lot of network configuration. For ex.: if you set up a network appliance on your network, you'll want to be able to call it by name immediately. This is part of the idea behind zero-configuration name service systems such as Multicast DNS (mDNS) and Simple Service Discovery Protocol (SSDP). If you want to find a host by name on the local network, you just broadcast a request over the network; if the host is there, it replies with its address. These protocols go beyond hostname resolution by also providing information about available services. The most widely used Linux implementation of mDNS is called Avahi. You'll often see mdns as a resolver option in /etc/nsswitch.conf, which we'll now look at in more detail.

/etc/nsswitch.conf

The /etc/nsswitch.conf file controls several name-related precedence settings on your system, such as user and password information, but we'll only talk about the DNS settings in this chapter. The file on your system should have a line like this:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 hosts:                  files dns

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Putting files ahead of dns here ensures that your system checks the /etc/hosts file for the hostname of your requested IP address before asking the DNS server. This is usually a good idea (especially for looking up localhost, as discussed below), but your /etc/hosts file should be as short as possible. Don't put anything in there to boost performance; doing so will burn you later. If you want name entries for all hosts on a small private LAN, put them in your DNS server configuration; they have no place in /etc/hosts. (The /etc/hosts file is also useful for resolving hostnames in the early stages of booting, when the network may not be available.)
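
You can watch this precedence in action with getent, which resolves names according to the /etc/nsswitch.conf order (here reusing the media entry from the example /etc/hosts file earlier):

 $ getent hosts media
 10.23.2.4       media media.oswincorp.pw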

Note: DNS is a broad topic. If you have any responsibility for domain names, read DNS and BIND, 5th edition.

Localhost

When running ifconfig, you'll notice the lo interface:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
         inet 127.0.0.1  netmask 255.0.0.0
         inet6 ::1  prefixlen 128  scopeid 0x10<host>
         loop  txqueuelen 1  (Local Loopback)
         RX packets 3223889  bytes 183830538 (175.3 MiB)
         RX errors 0  dropped 0  overruns 0  frame 0
         TX packets 3223889  bytes 183830538 (175.3 MiB)
         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

The lo interface is a virtual network interface called the loopback because it loops back to itself. The effect is that connecting to 127.0.0.1 is connecting to the machine that you're currently using. When outgoing data to localhost reaches the kernel network interface for lo, the kernel just repackages it as incoming data and sends it back through lo.
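
You can see this for yourself by pinging the loopback address; the replies never leave your machine:

 $ ping -c 3 127.0.0.1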

The lo loopback interface is often the only place you'll see static network configuration in boot-time scripts. For ex.: Ubuntu's ifup command reads /etc/network/interfaces and Fedora uses /etc/sysconfig/network-scripts/ifcfg-lo. You can often find the loopback device configuration by digging around in /etc with grep.

The Transport Layer: TCP, UDP, and Services

So far, we've only seen how packets move from host to host on the Internet —— in other words, the where question from the beginning of the chapter. Now let's start to answer the what question. It's important to know how your computer presents the packet data it receives from other hosts to its running processes. It's difficult and inconvenient for user-space programs to deal with a bunch of raw packets the way that the kernel can. Flexibility is especially important: more than one application should be able to talk to the network at the same time (for ex.: you might have email and several web clients running). Transport layer protocols bridge the gap between the raw packets of the Internet layer and the refined needs of applications. The two most popular transport protocols are the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP). We'll concentrate on TCP because it's by far the most common protocol in use, but we'll also take a quick look at UDP.

TCP Ports and Connections

TCP provides for multiple network applications on one machine by means of network ports. A port is just a number. If an IP address is like the postal address of an apartment building, a port is like a mailbox number —— it's a further subdivision. When using TCP, an application opens a connection (not to be confused with NetworkManager connections) between one port on its own machine and a port on a remote host. For ex.: an application such as a web browser could open a connection between port 36406 on its own machine and port 80 on a remote host. From the application's point of view, port 36406 is the local port and port 80 is the remote port. You can identify a connection by the pair of IP addresses and port numbers. To view the connections currently open on your machine, use netstat. Here's an example that shows TCP connections: the -n option disables hostname (DNS) resolution, and -t limits the output to TCP.

 $ netstat -nt 
 Active Internet connections (w/o servers)
 Proto Recv-Q Send-Q Local Address           Foreign Address         State      
 tcp        0      0 10.23.2.4:44841         138.201.81.199:5222     ESTABLISHED
 tcp        1      1 10.23.2.4:50414         51.255.197.8:80         ESTABLISHED
 tcp        1      1 10.23.2.4:50432         51.255.197.8:80         ESTABLISHED

The Local Address and Foreign Address fields show connections from your machine's point of view, so the machine here has an interface configured at 10.23.2.4, and ports 44841, 50414, and 50432 on the local side are all connected. The first connection here shows port 44841 connected to port 5222 of 138.201.81.199.

Establishing TCP Connections

To establish a transport layer connection, a process on one host initiates the connection from one of its local ports to a port on a second host with a special series of packets. In order to recognize the incoming connection and respond, the second host must have a process listening on the correct port. Usually, the connecting process is called the client, and the listener is called the server. The important thing to know about the ports is that the client picks a port on its side that isn't currently in use, but it nearly always connects to some well-known port on the server side. Recall this output from the netstat command in the preceding section:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 Proto Recv-Q Send-Q Local Address           Foreign Address         State      
 tcp        0      0 10.23.2.4:44841         138.201.81.199:5222     ESTABLISHED

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

With a little help, you can see that this connection was probably initiated by a local client to a remote server because the port on the local side (44841) looks like a dynamically assigned number, whereas the remote port (5222) is a well-known service (the Jabber or XMPP messaging service, to be specific).

Note: A dynamically assigned port is called an ephemeral port.

However, if the local port in the output is well-known, a remote host probably initiated the connection. In this example, remote host 172.24.54.243 has connected to port 80 (the default web port) on the local host.

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 Proto Recv-Q Send-Q Local Address           Foreign Address         State      
 tcp        0      0 10.23.2.4:80            172.24.54.243:43035     ESTABLISHED

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

A remote host connecting to your machine on a well-known port implies that a server on your local machine is listening on this port. To confirm this, list all TCP ports that your machine is listening on with netstat:

 $ netstat -ntl
 Active Internet connections (only servers)
 Proto Recv-Q Send-Q Local Address           Foreign Address         State      
 tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN     
 tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN       
 tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN    
 tcp        0      0 127.0.0.1:53            0.0.0.0:*               LISTEN
 --snip--

The line with 0.0.0.0:80 as the local address shows that the local machine is listening on port 80 for connections from any remote machine. (A server can restrict the access to certain interfaces, as shown in the last line, where something is listening for connections only on the localhost interface.) To learn more, use lsof to identify the specific process that's listening.

Here's an example of lsof usage:

 $ lsof -i :portNumber            # everything using the given port
 $ lsof -i tcp:portNumber         # TCP connections on the port only
 $ lsof -i udp:portNumber         # UDP only
 $ lsof -i :80                    # everything on port 80
 $ lsof -i :80 | grep LISTEN      # just the process listening on port 80

Port Numbers and /etc/services

How do you know if a port is a well-known port? There's no single way to tell, but one good place to start is to look in /etc/services, which translates well-known port numbers into names. This is a plain-text file. You should see entries like this:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 ssh		22/tcp
 smtp		25/tcp
 domain        53/udp

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

The first column is a name, and the second column indicates the port number and the specific transport layer protocol (which can be a protocol other than TCP).
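
To check a particular port quickly, just search the file (the exact names and any trailing comments vary between systems):

 $ grep -w 22 /etc/services
 ssh             22/tcp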

Note: In addition to /etc/services, an online registry for ports at http://www.iana.org/ is governed by the RFC 6335 network standards document.

On Linux, only processes running as the superuser can use ports 1 through 1023. All user processes may listen on and create connections from ports 1024 and up.

Characteristics of TCP

TCP is popular as a transport layer protocol because it requires relatively little from the application side. An application process only needs to know how to open (or listen for), read from, write to, and close a connection. To the application, it seems as if there are incoming and outgoing streams of data; the process is nearly as simple as working with a file. However, there's a lot of work to do behind the scenes. For one, the TCP implementation needs to know how to break an outgoing data stream from a process into packets. However, the hard part is knowing how to convert a series of incoming packets into an input data stream for a process to read, especially when incoming packets don't necessarily arrive in the correct order. In addition, a host using TCP must check for errors: packets can get lost or mangled when sent across the Internet, and a TCP implementation must detect and correct these situations. Figure 9-3 shows a simplification of how a host might use TCP to send a message. Luckily, you need to know next to nothing about this mess other than that the Linux TCP implementation is primarily in the kernel and that utilities that work with the transport layer tend to manipulate kernel data structures. One example is the iptables packet-filtering system discussed in Section 9.21.

UDP

UDP is a far simpler transport layer than TCP. It defines a transport only for single messages; there is no data stream. At the same time, unlike TCP, UDP won't correct for lost or out-of-order packets. In fact, although UDP has ports, it doesn't even have connections! One host simply sends a message from one of its ports to a port on a server, and the server sends something back if it wants to. However, UDP does have error detection for data inside a packet; a host can detect if a packet gets mangled, but it doesn't have to do anything about it.
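
You can get a feel for UDP's single-message style with netcat, if it's installed (a sketch; port 8900 is an arbitrary choice, and some netcat variants want -l -p 8900 instead of -l 8900):

 $ nc -u -l 8900                        # terminal 1: wait for UDP messages on port 8900
 $ echo hello | nc -u localhost 8900    # terminal 2: send one UDP message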

Where TCP is like having a telephone conversation, UDP is like sending a letter, telegram, or instant message (except that instant messages are more reliable). Applications that use UDP are often concerned with speed —— sending messages as quickly as possible. They don't want the overhead of TCP because they assume the network between two hosts is generally reliable. They don't need TCP error correction because they either have their own error detection systems or simply don't care about errors.

One example of an application that uses UDP is the Network Time Protocol (NTP). A client sends a short and simple request to a server to get the current time, and the response from the server is equally brief. Because the client wants the response as quickly as possible, UDP suits the application; if the response from the server gets lost somewhere in the network, the client can just resend a request or give up. Another example is video chat —— in this case, pictures are sent with UDP —— and if some pieces get lost along the way, the client on the receiving end compensates as best it can.

Figure9-3.png

Revisiting a Simple Local Network

We're now going to look at additional components of the simple network introduced in Section 9.3. Recall that this network consists of one local area network as one subnet and a router that connects the subnet to the rest of the Internet. You'll learn the following:

 * How a host on the subnet automatically gets its network configuration. 
 * How to set up routing.
 * What a router really is. 
 * How to know which IP addresses to use for the subnet. 
 * How to set up firewalls to filter out unwanted traffic from the Internet. 

Let's start by learning how a host on the subnet automatically gets its network configuration.

Understanding DHCP

When you set a network host to get its configuration automatically from the network, you're telling it to use the Dynamic Host Configuration Protocol (DHCP) to get an IP address, subnet mask, default gateway, and DNS servers. Aside from not having to enter these parameters by hand, DHCP has other advantages for a network administrator, such as preventing IP address clashes and minimizing the impact of network changes. It's very rare to see a modern network that doesn't use DHCP.

For a host to get its configuration with DHCP, it must be able to send messages to a DHCP server on its connected network. Therefore, each physical network should have its own DHCP server, and on a simple network (such as the one in Section 9.3), the router usually acts as the DHCP server.

Note: When making an initial DHCP request, a host doesn't even know the address of a DHCP server, so it broadcasts the request to all hosts (usually all hosts on its physical network).

When a machine asks a DHCP server for an IP address, it's really asking for a lease on an address for a certain amount of time. When the lease is up, a client asks to renew the lease.

The Linux DHCP Client

Although there are many different kinds of network manager systems, nearly all use the Internet Software Consortium (ISC) dhclient program to do the actual work. You can test dhclient by hand on the command line, but before doing so you must remove any default gateway route. To run the test, simply specify the network interface name (here, it's eth0):

 # dhclient eth0 

Upon startup, dhclient stores its process ID in /var/run/dhclient.pid and its lease information in /var/state/dhclient.leases.
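
Putting the test together, a by-hand session might look like this (a sketch; the ip command is assumed to be available, and lease file locations vary by distribution, with some using /var/lib/dhcp/dhclient.leases):

 # ip route del default       # remove the existing default gateway route first
 # dhclient eth0              # request a lease on eth0
 # cat /var/run/dhclient.pid  # process ID of the running dhclient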

Linux DHCP Servers

You can task a Linux machine with running a DHCP server, which provides a good amount of control over the addresses that it gives out. However, unless you're administering a large network with many subnets, you're probably better off using specialized router hardware that includes built-in DHCP servers. Probably the most important thing to know about DHCP servers is that you want only one running on the same subnet in order to avoid problems with clashing IP addresses or incorrect configurations.

Configuring Linux as a Router

Routers are essentially just computers with more than one physical network interface. You can easily configure a Linux machine as a router. For ex.: say you have two LAN subnets, 10.23.2.0/24 and 192.168.45.0/24. To connect them, you have a Linux router machine with three network interfaces: two for the LAN subnets and one for an Internet uplink, as shown in Figure 9-4. As you can see, this doesn't look very different from the simple network example that we've used in the rest of this chapter.

Figure9-4.png

The router's IP addresses for the LAN subnets are 10.23.2.1 and 192.168.45.1. When those addresses are configured, the routing table looks something like this (the interface names might vary in practice; ignore the Internet uplink for now):

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 Kernel IP routing table
 Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
 10.23.2.0       0.0.0.0         255.255.255.0   U     0      0        0 eth0
 192.168.45.0    0.0.0.0         255.255.255.0   U     0      0        0 eth1

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Now let's say that the hosts on each subnet have the router as their default gateway (10.23.2.1 for 10.23.2.0/24 and 192.168.45.1 for 192.168.45.0/24). If 10.23.2.4 wants to send a packet to anything outside of 10.23.2.0/24, it passes the packet to 10.23.2.1. For ex.: to send a packet from 10.23.2.4 (Host A) to 192.168.45.61 (Host F), the packet goes to 10.23.2.1 (the router) via its eth0 interface, then back out through the router's eth1 interface. However, by default, the Linux kernel does not automatically move packets from one subnet to another. To enable this basic routing function, you need to enable IP forwarding in the router's kernel with this command:

 # sysctl -w net.ipv4.ip_forward=1

As soon as you enter this command, the machine should start routing packets between the two subnets, assuming that the hosts on those subnets know to send their packets to the router you just created. To make this change permanent upon reboot, you can add it to your /etc/sysctl.conf file. Depending on your distribution, you may have the option to put it into a file in /etc/sysctl.d so that distribution updates won't overwrite your changes.
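
For instance, a single line like this makes the setting persistent (the filename under /etc/sysctl.d is only an example; any name ending in .conf works there):

 net.ipv4.ip_forward=1

Put it either directly in /etc/sysctl.conf or in a file such as /etc/sysctl.d/99-ip-forward.conf.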

Internet Uplinks

When the router also has the network interface with an Internet uplink, this same setup allows Internet access for all hosts on both subnets because they're configured to use the router as the default gateway. But that's where things get more complicated. The problem is that certain IP addresses such as 10.23.2.4 are not actually visible to the whole Internet; they're on so-called private networks. To provide for Internet connectivity, you must set up a feature called Network Address Translation (NAT) on the router. The software on nearly all specialized routers does this, so there's nothing out of the ordinary here, but let's examine the problem of private networks in a bit more detail.

Private Networks

Say you decide to build your own network. You have your machines, router, and network hardware ready. Given what you know about a simple network so far, your next question is: "What IP subnet should I use?" If you want a block of Internet addresses that every host on the Internet can see, you can buy one from your ISP. However, because the range of IPv4 addresses is very limited, this costs a lot and isn't useful for much more than running a server that the rest of the Internet can see. Most people don't really need this kind of service because they access the Internet as a client. The conventional, inexpensive alternative is to pick a private subnet from the addresses in the RFC 1918/6761 Internet standards documents:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 Network       Subnet Mask      CIDR Form

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 10.0.0.0      255.0.0.0        10.0.0.0/8
 192.168.0.0   255.255.0.0      192.168.0.0/16
 172.16.0.0    255.240.0.0      172.16.0.0/12

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

You can carve up private subnets as you wish. Unless you plan to have more than 254 hosts on a single network, pick a small subnet like 10.23.2.0/24, as we've been using throughout this chapter. (Networks with this netmask are sometimes called class C subnets. Although the term is technically somewhat obsolete, it's still useful.)

What's the catch? Hosts on the real Internet know nothing about private subnets and will not send packets to them, so without some help, hosts on private subnets cannot talk to the outside world. A router connected to the Internet (with a true, nonprivate address) needs to have some way to fill in the gap between that connection and the hosts on a private network.

Network Address Translation (IP Masquerading)

NAT is the most commonly used way to share a single IP address with a private network, and it's nearly universal in home and small office networks. In Linux, the variant of NAT that most people use is known as IP masquerading. The basic idea behind NAT is that the router doesn't just move packets from one subnet to another; it transforms them as it moves them. Hosts on the Internet know how to connect to the router, but they know nothing about the private network behind it. The hosts on the private network need no special configuration; the router is their default gateway. The system works roughly like this:

 1. A host on the internal private network wants to make a connection to the outside world, so it sends its connection request packets through the router. 
 2. The router intercepts the connection request packet rather than passing it out to the Internet (where it would get lost because the public Internet knows nothing about private networks). 
 3. The router determines the destination of the connection request packet and opens its own connection to the destination. 
 4. When the router obtains the connection, it fakes a "connection established" message back to the original host. 
 5. The router is now the middleman between the internal host and the destination. The destination knows nothing about the internal host; the connection on the remote host looks like it came from the 
    router. 

This isn't quite as simple as it sounds. Normal IP routing knows only source and destination IP addresses in the Internet layer. However, if the router dealt only with the Internet layer, each host on the internal network could establish only one connection to a single destination at one time (among other limitations), because there is no information in the Internet layer part of a packet to distinguish multiple requests from the same host to the same destination. Therefore, NAT must go beyond the Internet layer and dissect packets to pull out more identifying information, particularly the UDP and TCP port numbers from the transport layers. UDP is fairly easy because there are ports but no connections, but the TCP transport layer is complex.

In order to set up a Linux machine to perform as a NAT router, you must activate all of the following inside the kernel configuration: network packet filtering ("firewall support"), connection tracking, iptables support, full NAT, and MASQUERADE target support. Most distribution kernels come with this support.
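
On many distributions you can verify this by grepping the installed kernel configuration, if one is present under /boot (a sketch; the path and the option names shown are typical, not universal):

 $ grep -E 'CONFIG_(NETFILTER|NF_CONNTRACK|NF_NAT)' /boot/config-$(uname -r)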

Next you need to run some complex-looking iptables commands to make the router perform NAT for its private subnet. Here's an example that applies to an internal Ethernet network on eth1 sharing an external connection at eth0 (you'll learn more about iptables syntax in Section 9.21):

 # sysctl -w net.ipv4.ip_forward=1
 # iptables -P FORWARD DROP
 # iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
 # iptables -A FORWARD -i eth0 -o eth1 -m state --state ESTABLISHED,RELATED -j ACCEPT
 # iptables -A FORWARD -i eth1 -o eth0 -j ACCEPT

Note: Although NAT works well in practice, remember that it's essentially a hack used to extend the lifetime of the IPv4 address space. In a perfect world, we would all be using IPv6 (the next-generation Internet) and using its larger and more sophisticated address space without any pain.

You likely won't ever need to use the commands above unless you're developing your own software, especially with so much special-purpose router hardware available. But the role of Linux in a network doesn't end here.

Routers and Linux

In the early days of broadband, users with less demanding needs simply connected their machine directly to the Internet. But it didn't take too long for many users to want to share a single broadband connection with their own networks, and Linux users in particular would often set up an extra machine to use as a router running NAT. Manufacturers responded to this new market by offering specialized router hardware consisting of an efficient processor, some flash memory, and several network ports —— with enough power to manage a typical simple network, run important software such as a DHCP server, and use NAT. When it came to software, many manufacturers turned to Linux to power their routers. They added the necessary kernel features, stripped down the user-space software, and created GUI-based administration interfaces.

Almost as soon as the first of these routers appeared, many people became interested in digging deeper into the hardware. One manufacturer, Linksys, was required to release the source code for its software under the terms of the license of one of its components, and soon specialized Linux distributions such as OpenWRT appeared for routers. (The "WRT" in these names came from the Linksys model number.) Aside from the hobbyist aspect, there are good reasons to use these distributions: they're often more stable than the manufacturer firmware, especially on older router hardware, and they typically offer additional features. For ex.: to bridge a network with a wireless connection, many manufacturers require you to buy matching hardware, but with OpenWRT installed, the manufacturer and age of the hardware don't really matter. This is because you're using a truly open OS on the router that doesn't care what hardware you use as long as your hardware is supported.

You can use much of the knowledge you've learned to examine the internals of custom Linux firmware, though you'll encounter differences, especially when logging in. As with many embedded systems, open firmware tends to use BusyBox to provide many shell features. BusyBox is a single executable program that offers limited functionality for many Unix commands such as the shell, ls, grep, cat, and more. (This saves a significant amount of memory.) In addition, the boot-time init tends to be very simple on embedded systems. However, you typically won't find these limitations to be a problem, because custom Linux firmware often includes a web administration interface similar to what you'd see from a manufacturer.

Firewalls

Routers in particular should always include some kind of firewall to keep undesirable traffic out of your network. A firewall is a software and/or hardware configuration that usually sits on a router between the Internet and a smaller network, attempting to ensure that nothing "bad" from the Internet harms the smaller network. You can also set up firewall features for each machine, where the machine screens all of its incoming and outgoing data at the packet level (as opposed to the application layer, where server programs usually try to perform some access control of their own). Firewalling on individual machines is sometimes called IP filtering.

A system can filter packets when it:

 * receives a packet, 
 * sends a packet, or
 * forwards (routes) a packet to another host or gateway. 

With no firewalling in place, a system just processes packets and sends them on their way. Firewalls put checkpoints for packets at the points of data transfer identified above. The checkpoints drop, reject, or accept packets, usually based on some of these criteria:

 * The source or destination IP address or subnet
 * The source or destination port (in the transport layer information)
 * The firewall's network interface

Firewalls provide an opportunity to work with the subsystem of the Linux kernel that processes IP packets. Let's look at that now.

Linux Firewall Basics

In Linux, you create firewall rules in a series known as a chain. A set of chains makes up a table. As a packet moves through the various parts of the Linux networking subsystem, the kernel applies the rules in certain chains to the packets. For ex.: after receiving a new packet from the physical layer, the kernel activates rules in chains corresponding to input. All of these data structures are maintained by the kernel. The whole system is called iptables, with an iptables user-space command to create and manipulate the rules.

Note: There is a newer system called nftables that aims to replace iptables, but iptables is still the dominant system for firewalls.

Because there can be many tables —— each with its own set of chains, each of which can contain many rules —— packet flow can become quite complicated. However, you'll normally work primarily with a single table named filter that controls basic packet flow. There are three basic chains in the filter table:

 INPUT: for incoming packets. 
 OUTPUT: for outgoing packets. 
 FORWARD: for routed packets. 

Figures 9-5 and 9-6 show simplified flowcharts for where rules are applied to packets in the filter table. There are two figures because packets can either come into the system from a network interface (Figure 9-5) or be generated by a local process (Figure 9-6). As you can see, an incoming packet from the network can be consumed by a user process and may not reach the FORWARD chain or the OUTPUT chain. Packets generated by user processes won't reach the INPUT or FORWARD chains.

Figure9-5.png
Figure9-6.png

This gets more complicated because there are many steps along the way other than just these three chains. For ex.: packets are subject to PREROUTING and POSTROUTING chains, and chain processing can also occur at any of the three lower network levels. For a big diagram of everything that's going on, see the diagram below, but remember that such diagrams try to include every possible scenario for packet input and flow. It often helps to break the diagrams down by packet source, as in Figures 9-5 and 9-6.

Diagrama linux netfilter iptables.png

Setting Firewall Rules

Let's look at how the iptables system works in practice. Start by viewing the current configuration with this command:

 # iptables -L

The output is usually an empty set of chains, as follows:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 Chain INPUT (policy ACCEPT)
 target     prot opt source               destination         
 
 Chain FORWARD (policy ACCEPT)
 target     prot opt source               destination         
 
 Chain OUTPUT (policy ACCEPT)
 target     prot opt source               destination         

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Each firewall chain has a default policy that specifies what to do with a packet if no rule matches the packet. The policy for all three chains in this example is ACCEPT, meaning that the kernel allows the packet to pass through the packet-filtering system. The DROP policy tells the kernel to discard the packet. To set the policy on a chain, use iptables -P like this:

 # iptables -P FORWARD DROP

Warning: Don't do anything rash with the policies on your machine until you've read through the rest of this section.

Say that someone at 192.168.34.63 is annoying you. To prevent them from talking to your machine, run this command:

 # iptables -A INPUT -s 192.168.34.63 -j DROP

The -A INPUT parameter appends a rule to the INPUT chain. The -s 192.168.34.63 part specifies the source IP address in the rule, and -j DROP tells the kernel to discard any packet matching the rule. Therefore, your machine will throw out any packet coming from 192.168.34.63.

To see the rule in place, run iptables -L:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 Chain INPUT (policy ACCEPT)
 target     prot opt source               destination  
 DROP       all  --  192.168.34.63        anywhere

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Unfortunately, your friend at 192.168.34.63 has told everyone on his subnet to open connections to your SMTP port (TCP port 25). To get rid of that traffic as well, run:

 # iptables -A INPUT -s 192.168.34.0/24 -p tcp --destination-port 25 -j DROP

This example adds a netmask qualifier to the source address as well as -p tcp to specify TCP packets only. A further restriction, --destination-port 25, says that the rule should only apply to traffic to port 25. The rule list for INPUT now looks like this:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 Chain INPUT (policy ACCEPT)
 target     prot opt source               destination  
 DROP       all  --  192.168.34.63        anywhere
 DROP       all  --  192.168.34.0/24      anywhere       tcp dpt:smtp

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

All is well until you hear from someone you know at 192.168.34.37 saying that they can't send you email because you blocked their machine. Thinking that this is a quick fix, you run this command:

 # iptables -A INPUT -s 192.168.34.37 -j ACCEPT

However, it doesn't work. To see why, look at the new chain:

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 Chain INPUT (policy ACCEPT)
 target     prot opt source               destination  
 DROP       all  --  192.168.34.63        anywhere
 DROP       all  --  192.168.34.0/24      anywhere       tcp dpt:smtp
 ACCEPT     all  --  192.168.34.37        anywhere

—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

The kernel reads the chain from top to bottom, using the first rule that matches. The first rule does not match 192.168.34.37, but the second does, because it applies to all hosts from 192.168.34.1 to 192.168.34.254, and this second rule says to drop packets. When a rule matches, the kernel carries out the action and looks no further down in the chain. (You might notice that 192.168.34.37 can send packets to any port on your machine except port 25, because the second rule only applies to port 25.) The solution is to move the third rule to the top. First, delete the third rule with this command:

 # iptables -D INPUT 3

Then insert that rule at the top of the chain with iptables -I:

 # iptables -I INPUT -s 192.168.34.37 -j ACCEPT

To insert a rule elsewhere in a chain, put the rule number after the chain name (for ex.: iptables -I INPUT 4 ...).
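
When deciding which rule to delete or where to insert, it helps to see the rule numbers, which iptables can display:

 # iptables -L INPUT --line-numbers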

Firewall Strategies

Although the tutorial above showed you how to insert rules and how the kernel processes IP chains, we haven't seen firewall strategies that actually work. Let's talk about that now.

There are two kinds of firewall scenarios: one for protecting individual machines (where you set rules in each machine's INPUT chain) and one for protecting a network of machines (where you set rules in the router's FORWARD chain). In both cases, you can't have serious security if you use a default policy of ACCEPT and continuously insert rules to drop packets from sources that start to send bad stuff. You must allow only the packets that you trust and deny everything else. For example, say your machine has an SSH server on TCP port 22. There's no reason for any random host to initiate a connection to any other port on your machine, and you shouldn't give any such host a chance. To set that up, first set the INPUT chain policy to DROP:

 # iptables -P INPUT DROP

To enable ICMP traffic (for ping and other utilities), use this line:

 # iptables -A INPUT -p icmp -j ACCEPT

Make sure that you can receive packets you send to both your own network IP address and 127.0.0.1 (lo). Assuming your host's IP address is my_addr, do this:

 # iptables -A INPUT -s 127.0.0.1 -j ACCEPT
 # iptables -A INPUT -s my_addr -j ACCEPT

If you control your entire subnet (and trust everything on it), you can replace my_addr with your subnet address and subnet mask, for ex.: 10.23.2.0/24.

Now, although you want to deny incoming TCP connections, you still need to make sure that your host can make TCP connections to the outside world. Because all TCP connections start with a SYN (connection request) packet, if you let all TCP packets through that aren't SYN packets, you're still okay:

 # iptables -A INPUT -p tcp '!' --syn -j ACCEPT

Next, if you're using remote UDP-based DNS, you must accept traffic from your name server so that your machine can look up names with DNS. Do this for all DNS servers in /etc/resolv.conf. Use this command (where the name server's address is ns_addr):

 # iptables -A INPUT -p UDP --source-port 53 -s ns_addr -j ACCEPT

And finally, allow SSH connections from anywhere:

 # iptables -A INPUT -p tcp --destination-port 22 -j ACCEPT

The preceding iptables settings work for many situations, including any direct connection (especially broadband) where an intruder is much more likely to port-scan your machine. You could also adapt these settings for a firewalling router by using the FORWARD chain instead of INPUT and using source and destination subnets where appropriate. For more advanced configurations, you may find a configuration tool such as Shorewall to be helpful.
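
Gathered in one place, the personal-machine policy built up in this section looks like this (a sketch; my_addr and ns_addr are the placeholders defined above):

 # iptables -P INPUT DROP
 # iptables -A INPUT -p icmp -j ACCEPT
 # iptables -A INPUT -s 127.0.0.1 -j ACCEPT
 # iptables -A INPUT -s my_addr -j ACCEPT
 # iptables -A INPUT -p tcp '!' --syn -j ACCEPT
 # iptables -A INPUT -p udp --source-port 53 -s ns_addr -j ACCEPT
 # iptables -A INPUT -p tcp --destination-port 22 -j ACCEPT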

This discussion has only touched on security policy. Remember that the key idea is to permit only the things that you find acceptable, not to try to find and exclude the bad stuff. Furthermore, IP firewalling is only one piece of the security picture. (You'll see more in the next chapter.)

Ethernet, IP, and ARP

There is one interesting basic detail in the implementation of IP over Ethernet that we have yet to cover. Recall that a host must place an IP packet inside an Ethernet frame in order to transmit the packet across the physical layer to another host. Recall, too, that frames themselves do not include IP address information: they use MAC (hardware) addresses. The question is this: when constructing the Ethernet frame for an IP packet, how does the host know which MAC address corresponds to the destination IP address? We don't normally think about this question much because networking software includes an automatic system for looking up MAC addresses called Address Resolution Protocol (ARP). A host using Ethernet as its physical layer and IP as the network layer maintains a small table called the ARP cache that maps IP addresses to MAC addresses. In Linux, the ARP cache is in the kernel. To view your machine's ARP cache, use the arp command.

 $ arp -n 
 Address                  HWtype  HWaddress           Flags Mask            Iface
 10.1.2.141               ether   0d:95:cb:86:78:c8   C                     enp0s31f6


When a machine boots, its ARP cache is empty. So how do these MAC addresses get in the cache? It all starts when the machine wants to send a packet to another host. If the target IP address is not in the ARP cache, the following steps occur:

 1. The origin host creates a special Ethernet frame containing an ARP request packet for the MAC address that corresponds to the target IP address. 
 2. The origin host broadcasts this frame to the entire physical network for the target's subnet. 
 3. If one of the other hosts on the subnet knows the correct MAC address, it creates a reply packet and frame containing the address and sends it back to the origin. Often, the host that replies is 
    the target host and is simply replying with its own MAC address. 
 4. The origin host adds the IP-MAC address pair to the ARP cache and can proceed. 

Note: Remember that ARP only applies to machines on local subnets. To reach destinations outside your subnet, your host sends the packet to the router, and it's someone else's problem after that. Of course, your host still needs to know the MAC address for the router, and it can use ARP to find it.

The only real problem you can have with ARP is that your system's cache can get out-of-date if you're moving an IP address from one network interface card to another because the cards have different MAC addresses (for ex.: when testing a machine). Unix systems invalidate ARP cache entries if there's no activity after a while, so there shouldn't be any trouble other than a small delay for invalidated data, but you can delete an ARP cache entry immediately with this command:

 # arp -d host

You can also view the ARP cache for a single network interface with:

 $ arp -i interface

The arp(8) manual page explains how to manually set ARP cache entries, but you shouldn't need to do this.

Wireless Ethernet

In principle, wireless Ethernet ("WiFi") networks aren't much different from wired networks. Much like any wired hardware, they have MAC addresses and use Ethernet frames to transmit and receive data, and as a result the Linux kernel can talk to a wireless network interface much as it would a wired network interface. Everything at the network layer and above is the same; the main differences are additional components in the physical layer, such as frequencies, network IDs, security, and so on.

Unlike wired network hardware, which is very good at automatically adjusting to nuances in the physical setup without much fuss, wireless network configuration is much more open-ended. To get a wireless interface working properly, Linux needs additional configuration tools. Let's take a quick look at the additional components of wireless networks:

 Transmission details:   These are physical characteristics, such as the radio frequency. 
 Network identification: Because more than one wireless network can share the same basic medium, you have to be able to distinguish between them. 
                         The SSID (Service Set Identifier, also known as the "network name") is the wireless network identifier. 
 Management:             Although it's possible to configure wireless networking to have hosts talk directly to each other, most wireless networks are managed by one or more access points 
                         that all traffic goes through. Access points often bridge a wireless network with a wired network, making both appear as one single network. 
 Authentication:         You may want to restrict access to a wireless network. To do so, you can configure access points to require a password or other authentication key before 
                         they'll even talk to a client. 
 Encryption:             In addition to restricting the initial access to a wireless network, you normally want to encrypt all traffic that goes out across radio waves. 

The Linux configuration and utilities that handle these components are spread out over a number of areas. Some are in the kernel: Linux features a set of wireless extensions that standardize user-space access to hardware. As far as user space goes, wireless configuration can get complicated, so most people prefer to use GUI frontends, such as the desktop applet for NetworkManager, to get things working. Still, it's worth looking at a few of the things happening behind the scenes.

iw

You can view and change kernel space device and network configuration with a utility called iw. To use iw, you normally need to know the network interface name for the device, such as wlan0. Here's an example that dumps a scan of available wireless networks. (Expect a lot of output if you're in an urban area.)

 # iw dev wlan0 scan

Note: The network interface must be up for this command to work (if it's not, run ifconfig wlan0 up), but you don't need to configure any network layer parameters, such as an IP address.

If the network interface has joined a wireless network, you can view the network details like this:

 # iw dev wlan0 link

The MAC address in the output of this command is from the access point that you're currently talking to.

Note: The iw command distinguishes between physical device names such as phy0 and network interface names such as wlan0 and allows you to change various settings for each. You can even create more than one network interface for a single physical device. However, in nearly all basic cases, you'll just use the network interface name.

Use iw to connect a network interface to an unsecured wireless network as follows:

 # iw wlan0 connect network_name

Connecting to secure networks is a different story. For the rather insecure Wired Equivalent Privacy (WEP) system, you can use the keys parameter with the iw connect command. However, you shouldn't use WEP if you're serious about security.

Wireless Security

For most wireless security setups, Linux relies on a daemon called wpa_supplicant to manage both authentication and encryption for a wireless network interface. This daemon can handle both the WPA (WiFi Protected Access) and WPA2 schemes of authentication, as well as nearly any kind of encryption technique used on wireless networks. When the daemon first starts, it reads a configuration file (by default, /etc/wpa_supplicant.conf) and attempts to identify itself to an access point and establish communication based on a given network name. The system is well documented; in particular, the wpa_supplicant(1) and wpa_supplicant(5) manual pages are very detailed.
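
A minimal by-hand session might look like this (a sketch; the network name and passphrase are placeholders, and wpa_passphrase just generates a configuration stanza from them):

 # wpa_passphrase mynetwork 'mypassphrase' > /etc/wpa_supplicant.conf
 # wpa_supplicant -B -i wlan0 -c /etc/wpa_supplicant.conf   # -B runs the daemon in the background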

Running the daemon by hand every time you want to establish a connection is a lot of work. In fact, just creating the configuration file is tedious due to the number of possible options. To make matters worse, all of the work of running iw and wpa_supplicant simply allows your system to join a wireless physical network; it doesn't even set up the network layer. And that's where automatic network configuration managers such as NetworkManager take a lot of pain out of the process. Although they don't do any of the work on their own, they know the correct sequence and required configuration for each step toward getting a wireless network operational.

Network Applications and Services

SOON!