Unix troubleshooting guide

Q: What is meant by NFS?

A: This tips sheet gives useful information on NFS as implemented in the SunOS and Solaris operating systems.  It is meant both as an introduction to NFS and as a guide to the most commonly encountered problems.  Sections 7.4 and 7.5 mention a few of the many complete references on NFS.

To understand NFS, the following terms are important:

NFS SERVER:  The machine that makes file systems available to the network.  It does so by either EXPORTING (the SunOS term) or SHARING (the Solaris term) them.

NFS CLIENT:  The machine that accesses the available file systems.  It does so by MOUNTING them.

Several types of daemons are involved with NFS:

RPC.MOUNTD:  Runs only on NFS servers.  It answers clients' initial requests for file systems.

NFSD:  Runs on NFS servers.  These daemons handle most NFS requests from clients.

BIODS (block I/O daemons):  On SunOS 4.1.X, these daemons help clients with their NFS requests.  They do not exist on Solaris 2.X.

LOCKD and STATD:  This pair of daemons keeps track of locks on NFS files.  Typically, both run on the client and on the server.

NFS partitions can be mounted in one of two ways: hard or soft.

HARD MOUNTS:  Permanent mounts designed to look just like a normal local file system.  If a hard-mounted partition becomes unavailable, client programs keep trying to access it forever.  As a result, local processes lock up when a hard-mounted disk disappears.  Hard mounts are the default type of mount.

SOFT MOUNTS:  Soft mounts fail after a few retries if the remote partition is unavailable.  Local processes therefore do not lock up when the partition disappears, but you can never be certain that a write to the partition was actually processed.  In general, use soft mounts only if you intend solely to read from a disk.  Even then, understand that soft mounts are an unreliable type of mount; anyone who writes to a soft-mounted partition is almost certain to have problems.

Q: What are NFS error codes (e.g. NFS write error 49)?

A: On SunOS, a list of error codes can be found in the intro(2) man page:

# man 2 intro

On Solaris, the error codes can be found in the /usr/include/sys/errno.h file.  SRDB #10946, available through SunSolve, also lists some of the NFS error codes.

Q: Why do I get the following error message: Stale NFS file handle?

A1: This happens when a file or directory that your client had open has been removed or replaced on the server.  It often occurs after a drastic change to the file system on the server, for example when it was moved to a new disk or completely erased and restored.  To clear a Stale NFS file handle, reboot the affected client.

A2: If you prefer not to reboot, create a new mount point on the client for the mount that has the Stale NFS file handle.

Q: Why do I get the following error messages: NFS Server <server> not responding, NFS Server ok?


A1: If this problem happens intermittently while some NFS traffic is still moving, though slowly, you have run into the performance limits of either your current network setup or your current NFS server.  SunService does not support this kind of issue.  Sections 7.4 and 7.5 list some very good references that can help you tune NFS performance, and Section 9.0 indicates where you can obtain additional support on this issue from Sun.

A2: If the problem persists for a long period during which no NFS traffic moves at all, your NFS server may no longer be available.

Q: How can we create an OS backup in AIX?

A: The mksysb command creates a backup of the operating system (the root volume group).  If a system becomes corrupted, you can use this backup to reinstall it to its original state.  If the backup is created on tape, the tape is bootable and includes the installation programs needed to install from the backup.

The file-system image is in backup-file format.  The tape format includes a boot image, a bosinstall image and an empty table of contents, followed by the system backup (root volume group) image.  The root volume group image is in backup-file format; it starts with the data files, followed by any optional map files.

Q: How do I rename an existing file system in AIX 5.2?

A:  Once the copy is complete, unmount both file systems and mount the new one on the old mount point (directory).  Then, to make the change permanent, edit /etc/filesystems.

Q: How do I recover the Solaris root password?

A: Physical access to the machine’s console is required.

First note which partition is the root partition.

Solaris uses:

* /dev/dsk/c0t0d0s0 on the Ultra5/10 and Blade 100
* /dev/dsk/c0t1d0s0 for Blade 1000.

Press the STOP and A keys simultaneously (or, on an ASCII terminal or emulator, send a <BREAK>) to halt the operating system, if it is running.

Boot single-user from CD-ROM (boot cdrom -s) or from a network install/jumpstart server (boot net -s).  For CD media, use the CD-ROM labeled “Installation”.  If a PROM password is set, you will need to know it.

Mount the root partition on “/a”, an empty mount point that exists at this stage of the installation procedure.  For example:

# mount /dev/dsk/c0t0d0s0 /a

If the mount command fails (“/a” always exists), either you typed the wrong device or the system sees the root partition as something else.

Run “ls /tmp/dev/dsk” and see what is there.  The “c0t6” entries are the CD-ROM; whatever is left is what you need to try.  On a Blade 1000/2000, choose /dev/dsk/c1t1d0s0 and run: # mount /dev/dsk/c1t1d0s0 /a

Set your terminal type so that you can use a full-screen editor such as vi.  You can skip this step if you know how to use “ex” or “vi” in open mode.

* If you are on a Sun console, type “TERM=sun; export TERM”.
* If you are using an ASCII terminal or a terminal emulator on a PC as your console, set TERM to the terminal type, for example: “TERM=vt100; export TERM”.

Edit the password file, /a/etc/shadow (or, in older versions, /a/etc/passwd), and remove the encrypted password entry for root.
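The edit itself amounts to emptying the second colon-separated field of the root line.  The following is only a sketch of that idea, run against a throwaway sample file rather than the real /a/etc/shadow; the sample entries and the hash in them are made up for illustration.

```shell
# Hypothetical sample standing in for /a/etc/shadow; the hash is invented.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
root:lXhDkf93jAbcd:6445::::::
daemon:NP:6445::::::
EOF

# Empty only the second field (the encrypted password) of the root entry;
# every other line is printed unchanged.
awk -F: 'BEGIN { OFS = ":" } $1 == "root" { $2 = "" } { print }' "$tmp" > "$tmp.new"

root_line=$(grep '^root:' "$tmp.new")
echo "$root_line"
rm -f "$tmp" "$tmp.new"
```

On the real system the same change is normally made by hand in the editor, which also lets you confirm that no other field was touched.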

Type “cd /”, then “umount /a”.

Reboot as normal in single-user mode (“boot -s”).  Since the root account now has no password, give it a new one with the passwd command.

PROM passwords:  Naturally, you may not want anyone with physical access to the machine to be able to do the above to erase the root password.  Suns have a security password mechanism in the PROM which can be set (it is turned off by default).  This feature is described in the man page for the eeprom command.

If security-mode is set to “command”, the machine boots from the default device without the PROM password, but booting from CD-ROM or from an install server requires it.  Changing the root password in this case means moving the default device (e.g. the boot disk) to a different SCSI target (or equivalent) and replacing it with an equally bootable device whose root password you know.  If security-mode is set to “full”, the machine cannot be booted without the PROM password, even from the default device.  Defeating this requires replacing the NVRAM on the motherboard.  “Full” security has its drawbacks: if the machine is power-cycled (e.g. by a power outage) or halted (e.g. by STOP-A) during normal operation, it cannot reboot until the PROM password is entered.

Q: How do I find out which MQSeries version is installed on a Solaris server?

A: The command  pkginfo -l mqm  displays detailed information about the MQSeries mqm package installed on the Solaris server, including the version of MQSeries installed.

If pkginfo -l mqm displays nothing, MQSeries is not installed on the Solaris server.

Q: How do I find out the name of the default Queue Manager?

A: The command  awk '/DefaultQueueManager:/,/Name=/' /var/mqm/mqs.ini  displays the name stanza of the default Queue Manager.

If the command displays nothing, no default Queue Manager has been defined (i.e., the Queue Manager’s name must be given explicitly in all MQSeries commands).  The command is rather long, so it is a good candidate for an alias defined in .profile.
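To see what the awk range pattern actually selects, here is a small sketch that runs it against a made-up file.  The sample contents and the queue manager name QM.SAMPLE are assumptions for illustration only; a real mqs.ini will contain more stanzas and attributes.

```shell
# Hypothetical sample standing in for /var/mqm/mqs.ini.
ini=$(mktemp)
cat > "$ini" <<'EOF'
AllQueueManagers:
   DefaultPrefix=/var/mqm
DefaultQueueManager:
   Name=QM.SAMPLE
QueueManager:
   Name=QM.SAMPLE
   Prefix=/var/mqm
EOF

# The range pattern prints from the "DefaultQueueManager:" line through
# the first following line that contains "Name=".
stanza=$(awk '/DefaultQueueManager:/,/Name=/' "$ini")
echo "$stanza"

# Keep only the value after "Name=" if just the name is wanted.
qm_name=$(printf '%s\n' "$stanza" | sed -n 's/^ *Name=//p')
echo "$qm_name"
rm -f "$ini"
```

A .profile alias could then look like  alias dqm="awk '/DefaultQueueManager:/,/Name=/' /var/mqm/mqs.ini"  where the alias name dqm is an arbitrary choice.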

Q: How do I find out the logging type of a Queue Manager?

A: The command  grep LogType /var/mqm/qmgrs/queue_manager/qm.ini  displays the logging type stanza of the specified Queue Manager.  (Circular logging supports only restart recovery, while linear logging supports both restart and media recovery; the logging type cannot be changed after the Queue Manager has been created.)

Q: How do I configure AIX 5.3 to use an SMTP relay server?

A: To relay all local e-mail traffic, configure a mail hub by changing the ‘DH‘ entry in sendmail.cf.  To relay all e-mail traffic that is not for the local domain, define a Smart Relay using the ‘DS‘ entry there.  Restart sendmail after making the changes (refresh -s sendmail).

Q: What does high availability mean?

A: High availability simply means making an application highly available.  It does not mean making the hardware highly available.  In almost all cases users do not care whether the server is running; what they care about is whether the application(s) on the server are running.  The focus should therefore be on making the application highly available, and the hardware should be regarded merely as a tool for achieving that.

While teaching a fair number of HACMP courses for IBM, I have observed that many students struggle with the basic HACMP concepts until they realize that it is the application, not the hardware, that is being made highly available.  Once they grasp this, the rest of the course comes easily.

Q: How can the Solaris device tree be rebuilt?

A: If you ever move bootable drives around within a Sun Solaris box, you may run into one of two situations: (1) the device names (cxtxd0sx) no longer correspond to the disk positions within the server, or (2) the system cannot boot because it is unable to mount the other disk slices.

For example, suppose we are booting off target 8 (c1t8d0s0) and want to move that disk to the appropriate slot to make it target 0 (c1t0d0s0).  We have modified all references in the /etc/vfstab file to reflect the new disk position, physically moved the drive from the target 8 slot to the target 0 slot, and changed the boot-device variable in the OBP to the appropriate disk.  We are now all set to boot from the disk in target 0, right?

Not quite yet.

Solaris creates a device tree with links to all the disks it knows about, and these links are not rebuilt on reboot.  If we simply tried to boot the disk now in target 0, it would find the kernel but would be unable to mount any of the other file systems, because the device links still point to the disk slices on target 8.

To boot off the drive in its new position, these device links have to be removed and rebuilt, as follows:

Insert a Solaris 8, 9 or 10 CD into the host’s CD-ROM drive.

From the ok prompt, enter boot cdrom -s.

Mount the root slice on /mnt.  This example uses the disk now in the slot for target 0 (c1t0d0); keep in mind that your boot disk, and therefore your root slice, may differ from the one in this example.

Move path_to_inst aside.

Remove all old device links.

Rebuild path_to_inst and the device links.

Unmount the root slice and reboot.

Now boot off your old drive in its new slot.

Q: How do I remove an LPAR through the HMC from the command line?  The HMC is not accessible through the GUI.

A: First, investigate why the HMC cannot be reached through the GUI.  That situation should not be tolerated for long!

Removing an LPAR from the CLI, however, is not at all difficult.

Just log in to the HMC as ‘hscroot‘ using ssh and issue:

rmsyscfg -r lpar -m [managed-system] -n [partition-name]

where managed-system is the name of your managed system and partition-name is the name of your partition.

If you do not remember those names, you can list the resources by issuing:

lssyscfg -r sys    to list all managed systems, or:
lssyscfg -r lpar -m [managed-system]      to list all lpars in a managed system

Q: What are good principles, methods or best practices for debugging general Unix problems?

A: The methods depend on the type of problem.

The log files that your system or application writes are your first source of information when debugging.  They usually go to your terminal or to a file in /var/log/.  If you cannot find any usable messages, raise the log level, which many applications support.  A -v (verbose) switch often produces more messages.

If there is still nothing usable, check your configuration files and the permissions on the files that the application needs.  You may also need to modify the configuration of your system logger, for instance /etc/syslog-ng.conf.

If you do get an error message, a Google search for it will often turn up message board entries or Usenet postings, quite possibly with a solution to your problem.  The project’s user mailing lists, message boards and IRC channels may also offer helpful tips.

Applications sometimes crash without any message.  Besides reading and changing the code, strace is a great tool for following what the application does.

It traces system calls and signals.  Problems often show up in the strace output even when the application’s own error reporting gives no useful hint.

gdb offers another approach to debugging the application, but you need advanced knowledge of how to use it.

A good general principle for debugging is this: try to understand clearly (a) how the system works, (b) each component of the system, and (c) the failure modes of every component.  Always keep track of (a) which components you have changed recently and (b) which components have changed or failed on their own.

Q: What should people who troubleshoot Unix know and understand?

A: Unix is a language written by human beings, so it naturally contains ambiguities and inconsistencies as well as history and culture.  You need an operating system; your computer does not.  Computers are electronic devices that respond only to high and low voltages.  Operating systems like Unix were written by human beings, full of interesting ideas, in order to make use of the computer’s resources.  Unix therefore reflects their eccentricities.  Do not expect full consistency in the order of commands and options; the syntax varies from command to command.  The operating system is also full of cultural artifacts from the geek culture in which it was born and grew up.

Q: How should people view and tackle a Unix problem?

A: Expect errors, mistakes, problems and odd behavior in Unix and in computers; some of it will remain a mystery that you have to live with.  Keep in mind that you are dealing with a complex combination of hardware, software, networks and operating systems.  When a problem arises, apply some troubleshooting techniques right away rather than digging deep to find out exactly what happened, which is time-consuming.  Given the nature of Unix systems, by the time you work out what really happened, new problems may have arisen and left you in greater trouble.  Do read the error messages, though, as they may give you a useful clue.

Q: Can a solution always be found for a problem?

A:  Bear in mind that there are usually many ways to solve a problem.  Do not insist on finding the one and only way to perform your task, or on uncovering every possible way of performing it.  Most Unix users settle into their own set of habits for using commands, in line with their individual thought patterns.  If you see others doing things differently, do not worry; try to learn something from them instead.  Resist the urge to keep telling others that they could do something in a particular way, because there is no end to such suggestions.  The best approach is to find the method that best suits the way you think and the tools you have.

Q: Am I wrong to think that Unix is outdated?

A: A command-line operating system like Unix may seem outdated.  The important point, however, is that Unix gets your task done.  Quite apart from pipes, filters, shell programming and so on, the basic truth is that Unix is a very powerful tool that can carry out a large amount of precise work very efficiently.

Q: How do I know when it is time to quit?

A:  When typing a Unix command at the shell prompt, you may enter a wrong character that the backspace or delete keys cannot correct, since they may not work as expected.  If you get funny or strange characters, cancel the command line by holding down the Control key and pressing C.  Otherwise you may create a strangely named file or cause some unknown effect.  Be especially careful when typing the name of a new file.

Q: On a Unix or Linux server in a NIS domain, RALUS has been installed, but Backup Exec cannot browse resources on the server.  What should I do?

A: Verify the nsswitch.conf file configuration as follows:

If the passwd and group lines in the nsswitch.conf file are set to compat mode, the /etc/passwd and /etc/group files need additional configuration.  For more information on configuring nsswitch.conf to use compat mode, refer to the nsswitch.conf man pages.

Alternatively, change the passwd and group lines to “nis files” so that the UNIX or Linux system validates the user through NIS.  If the NIS server is unavailable or the user is not found, the local files are used for validation.
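As a rough sketch of that check and change, the commands below work on a throwaway sample file rather than on the real /etc/nsswitch.conf.  The sample contents are assumptions for illustration; on a live system, back up the real file before editing it.

```shell
# Hypothetical sample standing in for /etc/nsswitch.conf.
conf=$(mktemp)
cat > "$conf" <<'EOF'
passwd: compat
group:  compat
hosts:  files dns
EOF

# Step 1: show how the passwd and group lines are currently set.
grep -e '^passwd:' -e '^group:' "$conf"

# Step 2: write a copy in which both lines read "nis files";
# all other lines pass through unchanged.
sed -e 's/^passwd:.*/passwd: nis files/' \
    -e 's/^group:.*/group: nis files/' "$conf" > "$conf.new"

passwd_line=$(grep '^passwd:' "$conf.new")
group_line=$(grep '^group:' "$conf.new")
hosts_line=$(grep '^hosts:' "$conf.new")
echo "$passwd_line"
echo "$group_line"
rm -f "$conf" "$conf.new"
```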

Q: I cannot load the beremote agent.  When I try to load beremote in console mode with “./beremote --log-console”, I get the following message: “ACE_SV_Semaphore_Complex: no space left on device.”  What should I do now?

A:  This issue occurs when the computer reaches its maximum limit of allowable semaphores.  An unexpected termination of the beremote agent can cause it: when the agent terminates unexpectedly, it cannot clean up some of the semaphore resources it was using.  Other processes may also have driven semaphore use up to the limit.  Restarting the computer is the only safe way to recover from this situation.

If other processes running on the computer make a restart impossible, you can use a series of commands to list and then clean up the semaphores in use by the operating system.  Unfortunately, there is no way to tell which semaphores belong to the beremote agent, so take great care to select the right semaphores for cleanup.  Cleaning up semaphores that other programs are using can make those applications unstable.

Use the following command to list semaphores:

ipcs -a

To remove the semaphore for each identifier listed by the ipcs -a command, use the following command:

ipcrm -s <id>
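Because removing the wrong semaphore can destabilize other programs, it may help to print the planned ipcrm commands and review them before running anything.  The sketch below does only that.  The helper name plan_ipcrm, the idea of filtering by owner, and the sample listing are all assumptions for illustration, and the column layout assumed is the one printed by ipcs -s on Linux (key, semid, owner, perms, nsems); other systems order the columns differently.

```shell
# Print, but do not run, one ipcrm command per semaphore set owned by $1.
# Reads ipcs-style output on stdin; assumes the Linux "ipcs -s" columns.
plan_ipcrm() {
    awk -v owner="$1" '$1 ~ /^0x/ && $3 == owner { print "ipcrm -s " $2 }'
}

# Made-up sample listing, used here so the sketch runs anywhere.
sample='
------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x00000000 98304      root       600        1
0x0052e2c1 131073     postgres   600        17
0x00000000 163842     root       600        1
'

plan=$(printf '%s\n' "$sample" | plan_ipcrm root)
echo "$plan"
```

On a live Linux system the same function could be fed with  ipcs -s | plan_ipcrm root  and the printed commands run by hand only after checking each identifier.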


Q: I cannot load the Remote Agent.  When I try to load it in console mode with ./beremote --log-console, I receive the following message:

Error while loading shared libraries: libstdc++.so.5: cannot open shared object file: No such file or directory.

What should I do now?

A: This error occurs because the libstdc++.so.5 library is not in the /usr/lib directory.  The Remote Agent needs this library to start and function.  To resolve the issue, install the libstdc++5 package.

This package can be installed from the media from which your copy of Linux was installed.  Alternatively, run the following command on a computer with internet access:  apt-get install libstdc++5

For SUSE Linux Enterprise Server 11, the following command can be run:

zypper install libstdc++5

Q: What is nohup and how is it used?

A: nohup is a Unix command that runs another command while making it ignore the HUP (hangup) signal, so that the command keeps running even after the user who issued it has logged out.

Can the SIGHUP (hangup) signal be handled?  Fortunately, yes: it can be caught and made to call a function, be ignored, or have its default action restored.  On POSIX-compliant systems the default action is abnormal termination of the program.
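The two non-default dispositions can be tried out safely in a child shell.  This is a minimal sketch using the shell's trap builtin; each test runs under sh -c so that the hangup signal is sent only to that child and never to your login shell.

```shell
# Catch SIGHUP: the trap action runs, then the script carries on.
caught=$(sh -c '
    trap "echo caught HUP" HUP
    kill -HUP $$
    echo "still running"
')
echo "$caught"

# Ignore SIGHUP: an empty action means the signal is discarded.
# nohup arranges the same disposition for the command it starts.
ignored=$(sh -c '
    trap "" HUP
    kill -HUP $$
    echo "survived"
')
echo "$ignored"
```

Writing  trap - HUP  restores the default action, after which a hangup terminates the shell again.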

A program starts with open descriptors for the stdin, stdout and stderr streams.  You may want the program to close those descriptors or redirect them to /dev/null, and use “nohup” to keep it from terminating abnormally when it receives SIGHUP.  For example, start a Java program with nohup, redirect all three descriptors to /dev/null and run it in the background:

nohup java -cp . Test </dev/null >/dev/null 2>&1 &

Background jobs can make the shell hang on logout because of a race condition, so it is advisable to start them with nohup.  The hang can also be avoided by redirecting all three I/O streams, e.g. stdout to foo.out, stderr to foo.err and stdin to /dev/null:

nohup myprogram > foo.out 2> foo.err < /dev/null &

Q: How do I see what is inside a gzipped file without extracting it?

A: Suppose we want to view the contents of a gzip-compressed file and search for a string in it as if it were a normal file.  One method is to gunzip it and then view it with the less command at the shell prompt.  This method has two drawbacks.

1 – Gunzipping a file takes time, and after viewing the contents you have to gzip it again.

2 – Unzipping the file is not always an option; it depends on the size of the compressed file and the free space on the hard drive.

With Unix pipes, the task can be done as follows:

gunzip -c filename.gz | less -ni

You can then view or search everything the file contains.
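The same pipe works with any filter, not only less.  As a small sketch, the commands below build a throwaway compressed file (its name and contents are made up) and count matching lines with grep without ever writing the uncompressed text to disk.

```shell
# Build a small compressed sample in a scratch directory.
dir=$(mktemp -d)
printf 'alpha\nBeta error\ngamma\n' > "$dir/sample.log"
gzip "$dir/sample.log"          # leaves only sample.log.gz behind

# Stream the decompressed text straight into grep.
# -i ignores case, -c prints the number of matching lines.
hits=$(gunzip -c "$dir/sample.log.gz" | grep -ic 'error')
echo "$hits"

# The compressed file is still the only copy on disk.
remaining=$(ls "$dir")
rm -rf "$dir"
```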


Q: What are the standard process streams?

A: A process starts with three open I/O connections (channels), called standard input, standard output and standard error.  Standard input carries input from the user through an input device such as the keyboard.  Standard output and standard error go to an output medium such as the screen and carry useful information and error reports.

Standard input (stdin)

This is the data going into the program.  The stream is already open when the program starts, and any data sent to it waits to be read.  Not all programs need input.  For example, the Unix ls command and the Windows dir command do not need any stream input to do their work.

Unless redirected, input is expected from the keyboard of the text terminal that started the program.  The file descriptor for standard input is 0 (zero).

Standard output (stdout)

This is the stream to which the program writes its output data.  Not all programs generate output.  A file-renaming command, for example, is silent on success.

Unless redirected, standard output is the text terminal that started the program.  The file descriptor for standard output is 1 (one).

Standard error (stderr)

Standard error is another output stream, typically used by programs for error messages and diagnostics.  It is independent of standard output and can be redirected separately.  The usual destination is the text terminal that started the program.

Standard output and standard error are often sent to the same destination, such as the text terminal.  Messages then appear in the order in which the program writes them, unless buffering is involved.

The file descriptor for standard error is 2 (two).
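The three descriptors can be seen in action with a few lines of shell.  In this sketch the function name emit and the file names are arbitrary; the point is only that descriptors 1 and 2 can be sent to different places and that descriptor 0 can be fed from a pipe.

```shell
# Write one line to each output stream.
emit() {
    echo "result"          # goes to file descriptor 1 (stdout)
    echo "warning" >&2     # goes to file descriptor 2 (stderr)
}

dir=$(mktemp -d)

# Redirect the two output streams independently of each other.
emit > "$dir/out.txt" 2> "$dir/err.txt"
out=$(cat "$dir/out.txt")
err=$(cat "$dir/err.txt")

# Feed file descriptor 0 (stdin) from a pipe instead of the keyboard.
count=$(printf 'a\nb\n' | wc -l | tr -d ' ')

echo "stdout got: $out, stderr got: $err, lines read on stdin: $count"
rm -rf "$dir"
```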

Q: What is meant by /dev/null?

A: In Unix and similar operating systems, the devices in the computer are referred to through files such as /dev/sda (hard drive), /dev/cdrom (CD drive) and so on.  The null device (/dev/null) is a special file that discards all data written to it (while reporting that the write succeeded) and provides no data to any process that reads from it.

In Unix and programmer’s jargon it is also known as the bit bucket or the black hole.  In effect, nothing written to it is kept and nothing can be read from it.

Because /dev/null is a special file and not a directory, files cannot be moved into it with the Unix mv command, and the same goes for other file-manipulation commands.


So why have it at all?

The null device is typically used to dispose of a process’s unwanted output streams.  It can also serve as a convenient empty file for input streams.  This is usually done by redirection.

Q: How do I redirect output and errors from a process to /dev/null?


A: Much of the time we have no interest in the output of a script we run (for instance, from a cron job).  How can that output be ignored by redirecting it to /dev/null?

The null device (/dev/null) is a special file that discards all data written to it (while reporting that the write succeeded) and provides no data to any process that reads from it (returning EOF immediately).

So when running a script or program from the command line, you can append >/dev/null to the command line to send its output to the bit bucket instead.

However, when the script runs from a cron job, output that it writes to stderr is mailed to the owner of the cron job.  If the script runs regularly this becomes very annoying: mail full of output you neither need nor want, and about which you can do nothing.

You can handle this neatly in any Bourne-style shell by redirecting stderr to stdout and sending the combined output to /dev/null:

./script.sh >/dev/null 2>&1

Here is how it works:

1. The first part, > /dev/null, redirects standard output to /dev/null.
2. The second part, 2>&1, redirects standard error (file descriptor 2) to wherever standard output (file descriptor 1) is going at that moment, which by then is the bit bucket.
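Since redirections are processed from left to right, the order of the two parts matters.  The sketch below compares both orders using a throwaway function (the name noisy is arbitrary) inside command substitution, so that whatever escapes to the original standard output is captured in a variable.

```shell
# Write one line to each output stream.
noisy() {
    echo "to stdout"
    echo "to stderr" >&2
}

# stdout goes to /dev/null first, then stderr follows it: nothing escapes.
quiet=$(noisy >/dev/null 2>&1)

# Reversed order: stderr is pointed at the original stdout first, and only
# afterwards is stdout sent to /dev/null, so the error text still escapes.
leaky=$(noisy 2>&1 >/dev/null)

echo "quiet captured: [$quiet]"
echo "leaky captured: [$leaky]"
```

This is why the form  >/dev/null 2>&1  is the one to use when the goal is complete silence.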

Q: Why am I kicked off #unix when I use IRC as root?

A: IRCing as root is not a good idea.  Logging in as root in the first place is not a good idea either.  Unix has multiuser capabilities, and you should use them.  If you accidentally run rm -rf / as an ordinary user rather than as root, your system is protected from most of the damage.  As for IRCing as root, some scripts and IRC clients contain bugs or “trojan horses” that can harm your computer or undermine its security.  Even if you are certain that your client has no holes, our policy is to protect you from yourself.  Therefore, if you IRC as root and join #unix, you will be kicked, if root is not already banned.  A number of IRC servers do not let root connect to them at all.


Q: How do I unshar a file?


A: A shar file is a set of files packed into a shell script.  First, load the script into your editor and make sure that the first line says “#! /bin/sh”.  If it does not, delete lines until it does.  Then save the file and run sh file.shar.  The files are unpacked into your current directory.
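The trimming step can also be done with sed instead of an editor.  The following sketch builds a tiny made-up archive with mail-style header lines in front of it, strips everything before the “#! /bin/sh” line and unpacks the result in a scratch directory.  The file names and contents are assumptions for illustration.

```shell
dir=$(mktemp -d)

# Hypothetical shar file with header lines before the script proper.
cat > "$dir/file.shar" <<'OUTER'
From: someone@example.com
Subject: tiny archive

#! /bin/sh
cat > hello.txt <<'INNER'
hello from the archive
INNER
OUTER

# Keep only the part from the "#! /bin/sh" line to the end of the file.
sed -n '/^#! *\/bin\/sh/,$p' "$dir/file.shar" > "$dir/clean.shar"
first_line=$(head -n 1 "$dir/clean.shar")

# Unpack inside the scratch directory; a subshell keeps our own cwd intact.
(cd "$dir" && sh clean.shar)
unpacked=$(cat "$dir/hello.txt")
echo "$unpacked"
rm -rf "$dir"
```

A shar file is an ordinary shell script, so it is worth reading through it before running it.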

Q:  Since UNIX can multitask, how can I switch between programs?

A: UNIX job control uses the commands fg and bg and the control sequence Control-Z (^Z).  If you are currently running a program, ^Z suspends it and returns you to your shell.  Type jobs to see which jobs you have.  To resume a job in the background, type bg %# where # is the job’s number in the jobs list.

The fg command brings a job back to the foreground.  This is really quite simple, although it may sound complex.  More information is available in the man page for your shell (csh, tcsh and bash are common ones).  It is also worth checking the man pages for kill and ps.


