Identifying hardware -
An important step in troubleshooting potential hardware issues is knowing exactly which hardware is present in a system. For virtual systems, this might seem less useful than for a physical system, but it can tell an admin whether the correct virtual devices have been added.
Identifying CPUs -
The CPU(s) in a running system can be identified with the lscpu command from the util-linux package.
# lscpu
Another useful piece of info is which flags a CPU supports. These flags indicate whether a CPU supports certain extended technologies, such as AES acceleration, hardware-assisted virtualization, and many more. These flags can be inspected in /proc/cpuinfo.
# cat /proc/cpuinfo
Point to Note -
The fact that a CPU supports a certain flag doesn't always mean that the feature is available. For example, the vmx flag on an Intel CPU indicates that the processor is capable of supporting hardware virtualization, but the feature itself might be disabled in the system firmware.
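Scanning the full /proc/cpuinfo output by eye gets tedious, so here is a minimal sketch of a flag check. The helper name has_cpu_flag is made up for this example, and it assumes the x86 layout where the line starts with "flags" (some other architectures use a different label). A hit only means the CPU advertises the flag, not that the firmware has the feature enabled.

```shell
# has_cpu_flag FLAG [FILE]
# Hypothetical helper: succeeds if FLAG appears on the first "flags" line.
# FILE defaults to /proc/cpuinfo; passing a file makes it easy to test.
has_cpu_flag() {
    awk -v want="$1" '
        /^flags[[:space:]]*:/ {
            # $1 is "flags", $2 is ":", the flag names start at $3
            for (i = 3; i <= NF; i++) if ($i == want) found = 1
            exit
        }
        END { exit found ? 0 : 1 }
    ' "${2:-/proc/cpuinfo}"
}

# Example: report a few commonly checked flags on the running system.
if [ -r /proc/cpuinfo ]; then
    for flag in vmx svm aes; do
        if has_cpu_flag "$flag"; then
            echo "$flag: advertised by the CPU"
        else
            echo "$flag: not advertised"
        fi
    done
fi
```

vmx is the Intel virtualization flag and svm is the AMD one, so on most systems at most one of the two will show up.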
Identifying memory -
The dmidecode tool can be used to retrieve info about physical memory banks, including the type, speed, and location of each bank. To retrieve this information, run the following command as root:
# dmidecode -t memory
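The dmidecode output is long, so a one-line-per-DIMM summary can help. The sketch below assumes the usual "Memory Device" record layout with Size, Locator, Type, and Speed fields, and that empty slots report "No Module Installed". The function name summarize_dimms is made up for this example.

```shell
# summarize_dimms: read `dmidecode -t memory` output on stdin and print
# "locator size type speed" for every populated slot.
# Assumes records start with "Memory Device" and end with a blank line.
summarize_dimms() {
    awk -F': *' '
        function emit() {
            if (size != "" && size != "No Module Installed")
                print loc, size, type, speed
        }
        /^Memory Device/ { in_dev = 1; size = loc = type = speed = ""; next }
        in_dev && /^[[:space:]]*Size:/    { size = $2 }
        in_dev && /^[[:space:]]*Locator:/ { loc = $2 }
        in_dev && /^[[:space:]]*Type:/    { type = $2 }
        in_dev && /^[[:space:]]*Speed:/   { speed = $2 }
        in_dev && /^[[:space:]]*$/        { emit(); in_dev = 0 }
        END { if (in_dev) emit() }
    '
}

# Typical use (as root):
#   dmidecode -t memory | summarize_dimms
```

Slots that are listed but skipped in the summary are empty, which is handy when checking whether a DIMM has dropped out.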
Identifying disks -
To identify physical disks, an administrator can use the command lsscsi from the lsscsi package. This tool can list all physical SCSI (and USB, SATA, and SAS) drives attached to a system.
# apt-get install lsscsi
# lsscsi -v
For more information, the hdparm command from the hdparm package can be used on individual disks.
# hdparm -I /dev/sda
Identifying PCI hardware -
Attached PCI hardware can be identified with the lspci command. Adding one or more -v options will increase the verbosity.
# lspci
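A common follow-up question is whether a PCI device actually has a kernel driver bound to it. lspci -k prints a "Kernel driver in use:" line for bound devices, so a rough sketch can list the devices that lack one. The function name pci_without_driver is made up for this example, and some devices (certain bridges, for instance) legitimately have no driver, so treat the result as a starting point.

```shell
# pci_without_driver: read `lspci -k` output on stdin and print the
# header line of every device with no "Kernel driver in use:" line.
# Assumes device header lines start in column one and detail lines are indented.
pci_without_driver() {
    awk '
        function flush() { if (dev != "" && !bound) print dev }
        /^[^[:space:]]/         { flush(); dev = $0; bound = 0; next }
        /Kernel driver in use:/ { bound = 1 }
        END { flush() }
    '
}

# Typical use:
#   lspci -k | pci_without_driver
```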
Identifying USB hardware -
USB hardware can be identified using the lsusb command. Just like with lspci, verbosity can be increased by adding -v options.
# lsusb
Hardware error reporting -
Modern systems can typically keep watch for various hardware failures, alerting an admin when a hardware fault occurs. While some of these solutions are vendor-specific and require a remote management card, others can be read from the OS in a standardized fashion.
There are two mechanisms for logging hardware faults, mcelog and rasdaemon.
mcelog -
mcelog provides a framework for catching and logging machine check exceptions on x86 systems. On supported systems, it can also automatically mark bad areas of RAM so that they will not be used.
To install and enable mcelog, use the following procedure:
1. Install the mcelog package.
# apt-get install mcelog
or
# yum install mcelog
Note - On Ubuntu 18.04 onwards, the functionality of the mcelog package has been replaced by rasdaemon.
2. Start and enable the mcelog.service service.
# systemctl enable --now mcelog.service
From now on, hardware errors caught by the mcelog daemon will show up in the system journal. Messages can be queried using the command journalctl -u mcelog.service. If the abrt daemon is installed and active, it will also trigger on various mcelog messages.
Alternatively, for administrators who do not wish to run a separate service, a cron job is set up, but
commented out, in /etc/cron.hourly/mcelog.cron that will dump events into /var/log/mcelog.
rasdaemon -
rasdaemon is a modern replacement for mcelog that hooks into the kernel trace subsystem. RAS stands for Reliability, Availability, and Serviceability. rasdaemon hooks into the Error Detection and Correction (EDAC) mechanism for DIMM modules and reports those errors to user space, along with RAS messages that come from the kernel trace subsystem.
To enable rasdaemon, use the following steps:
1. Install the rasdaemon package.
# apt-get install rasdaemon
or
# yum install rasdaemon
2. Start and enable the rasdaemon.service service.
# systemctl enable --now rasdaemon.service
Information about the various memory banks can be queried using the ras-mc-ctl tool.
Of special interest are ras-mc-ctl --status to show the current status, and ras-mc-ctl --errors to view any logged errors.
Memory testing -
When a physical memory error is suspected, an administrator might want to run an exhaustive
memory test. In these cases, the memtest86+ package can be installed.
Since memory testing on a live system is less than ideal, the memtest86+ package will install a
separate boot entry that runs memtest86+ instead of a regular Linux kernel.
The following steps outline how to enable this boot entry -
1. Install the memtest86+ package; this will install the memtest86+ application into /boot.
2. Run the command memtest-setup. This will add a new template into /etc/grub.d/ to enable memtest86+.
# memtest-setup
3. Update the grub2 boot loader configuration.
# grub2-mkconfig -o /boot/grub2/grub.cfg
There is another utility called memtester, which tests memory from user space on a running system.
# apt install memtester
Digging into multiple logs -
dmesg lets you spot errors and warnings in the kernel's most recent messages. To page through the output, use the dmesg | more command:
# dmesg | more
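Paging through everything is slow when you only care about hardware faults. Here is a rough sketch of a filter for kernel log lines that often point to hardware trouble. The keyword list is an assumption chosen for illustration, not a complete catalogue, and the function name hw_error_lines is made up.

```shell
# hw_error_lines: keep only log lines matching a few hardware-related
# keywords (case-insensitive). The keyword list is illustrative only.
hw_error_lines() {
    # "|| true" keeps "no matches" from being treated as a failure
    grep -E -i 'hardware error|machine check|mce:|edac|i/o error' || true
}

# Typical use:
#   dmesg | hw_error_lines
#   dmesg --level=err,warn | hw_error_lines   # only errors and warnings
```

An empty result does not prove the hardware is healthy; it only means none of these keywords appeared in the messages you fed in.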
You can also look at the general system log in /var/log/messages (or /var/log/syslog on Debian and Ubuntu), which is where you'll find errors related to specific issues. It's worthwhile to monitor the messages in real time with the tail command when you make modifications to your hardware.
# tail -f /var/log/messages
Analyzing networking functions -
You may have a large number of cloud-native applications serving business services in a complex networking environment; this may include virtualization, multicloud, and hybrid cloud.
This means you should analyze whether networking connectivity is working correctly as part of your troubleshooting. Useful commands to figure out networking functions in the Linux server include ip addr, traceroute, nslookup, dig, and ping, among others.
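A quick first pass before reaching for traceroute or dig is to check which interfaces are down or have no address. The sketch below reads the brief output of the ip command (name, state, then addresses) and lists the interfaces worth a closer look. The function name ifaces_needing_attention is made up, and skipping the loopback interface is a choice made for this example.

```shell
# ifaces_needing_attention: read `ip -brief addr show` output on stdin and
# print interfaces that are DOWN or have no address assigned.
# Assumes the brief column layout: NAME STATE [ADDRESS...]
ifaces_needing_attention() {
    awk '$1 != "lo" && ($2 == "DOWN" || NF < 3) { print $1 }'
}

# Typical use:
#   ip -brief addr show | ifaces_needing_attention
```

Anything this prints is only a hint; an interface can be intentionally unconfigured, so follow up with ping or traceroute on the ones you expect to carry traffic.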
Conclusion -
Troubleshooting Linux hardware requires considerable knowledge, including how to use powerful command-line tools and how to read system logs. You should also know how to diagnose the kernel space, which is where you can find the root cause of many hardware problems.
Hope you like the thread. If yes, retweet it. You can follow me for more such content.
Thanks!