Here's an example of the kind of computer problem I see every so often - the kind where training and vendor tech support probably aren't much use.
There's an old HP DL360e Gen8 server with one LUN built from two HDDs, plus two more standalone HDDs. This place likes to put NVIDIA graphics cards into servers.
In this case, doing so makes booting into rescue mode from a DVD impossible - it hangs while discovering hardware. So, the card has to come out.
When I started, I mentally noted it was running from about 8 partitions on /dev/sda, except /boot, which was on /dev/sdc1.
A long time ago the host ran RHEL 6.5 (booting into rescue mode identifies /dev/sda7 as holding RHEL 6.10 and /dev/sdb7
as holding RHEL 6.5). At some point the motherboard was replaced, I'm told.
The person patching it got the kernel up to RHEL 6.10 plus the latest patch version, but at boot you see a much older menu of 5 kernel versions,
and you have to pick one, maybe from 6.7 or so, to get the box up. Why you're presented with an older list is mystery #1.
I tried a yum upgrade and found /boot was 100% full, so the new vmlinuz and initramfs couldn't be built correctly. No problem: go into
/boot, delete the two oldest entries and their files, and yum upgrade succeeds. Reboot the server - same old menu, still have to pick
an old entry. Why didn't the menu change? I hand-edited it myself.
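The /boot cleanup step can be sketched like this. The package names below are invented for illustration (RHEL 6-style versioning); the useful trick is `sort -V`, which orders RPM version strings correctly so you reliably pick the oldest kernels:

```shell
# Illustrative kernel list, as 'rpm -q kernel' might print it on a RHEL 6 box.
# sort -V orders version strings numerically; head -2 picks the two oldest.
printf '%s\n' \
  kernel-2.6.32-573.el6.x86_64 \
  kernel-2.6.32-754.35.1.el6.x86_64 \
  kernel-2.6.32-642.el6.x86_64 | sort -V | head -2
# On the real host: check 'df -h /boot' first, then 'yum remove' the two
# packages this prints, which also cleans up their files under /boot.
```

Removing old kernels through yum/rpm rather than deleting files by hand keeps the RPM database consistent, which matters later when the menu is rebuilt.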
Looking at storage: there are a lot of partitions on /dev/sda and /dev/sdb, and two on /dev/sdc.
blkid shows *identical* UUIDs for:
(1) swap - there's one for each HDD.
(2) /boot - /dev/sda1 and /dev/sdc1. (That's what I remember, anyway).
That's mystery #2.
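Spotting the duplicates is a one-liner over blkid output. The sample below uses invented UUIDs and device names standing in for what the host showed; on the real box you'd pipe `blkid` itself:

```shell
# Hypothetical blkid output (UUIDs invented); on the real host, just run: blkid
blkid_out='/dev/sda1: UUID="0f31a6cc-6a8a-4d2e-9b1e-000000000001" TYPE="ext4"
/dev/sdc1: UUID="0f31a6cc-6a8a-4d2e-9b1e-000000000001" TYPE="ext4"
/dev/sda2: UUID="2b7e1516-28ae-d2a6-abf7-000000000002" TYPE="swap"
/dev/sdb2: UUID="2b7e1516-28ae-d2a6-abf7-000000000002" TYPE="swap"'

# Pull out just the UUID fields; 'uniq -d' prints only the ones that repeat.
printf '%s\n' "$blkid_out" | grep -o 'UUID="[^"]*"' | sort | uniq -d
```

Anything this prints is a UUID claimed by more than one partition - exactly the situation on this host.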
After moving the contents of /dev/sdc1 to /dev/sda1, I changed the UUIDs of the swap partitions and of /dev/sdc1.
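The re-UUID step looks roughly like this. The device names are from the story; the commands are wrapped in echo so the sketch is safe to run anywhere - on the real host you'd drop the echo and run as root:

```shell
# Linux hands out a fresh random UUID from this file on every read.
NEW_UUID=$(cat /proc/sys/kernel/random/uuid)

# tune2fs -U rewrites an ext2/3/4 filesystem UUID in place; re-running
# mkswap -U rebuilds the swap header with the given UUID (contents are lost,
# which is fine for swap). Echoed here so nothing is actually modified.
echo "tune2fs -U $NEW_UUID /dev/sdc1"
echo "mkswap -U $(cat /proc/sys/kernel/random/uuid) /dev/sdb2"
# Afterwards: update any UUID= lines in /etc/fstab and re-check with blkid.
```

The follow-up fstab edit is the easy part to forget - a stale UUID= entry there is a guaranteed boot failure.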
Reboot.
Get a grub> prompt.
Long story short, I think at this stage I've booted into something with a stale view of the host.
Through trial and error, I've figured out that when booting manually, root (hd0,0) actually lands on /dev/sdb1.
I copied all of /dev/sda1's contents to /dev/sdb1 and rebooted, but I still cannot boot the host,
because I can't load a vmlinuz and an initramfs of the same version without errors.
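For reference, manual booting at the legacy grub prompt goes roughly like this - the kernel version and root= device below are my guesses for illustration, not taken from the host (RHEL 6 ships grub 0.97, which counts disks and partitions from zero):

```
grub> root (hd0,0)
grub> kernel /vmlinuz-2.6.32-754.35.1.el6.x86_64 ro root=/dev/sda7
grub> initrd /initramfs-2.6.32-754.35.1.el6.x86_64.img
grub> boot
```

The catch is that (hd0,0) means "first partition of whatever disk the BIOS enumerates first" - which, on this host, turned out to be /dev/sdb.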
I tried the easy way out: boot the DVD and upgrade the existing OS. It looks like the installer writes a grub.conf, but
evidently after the reboot grub can't find it.
It's baffling why someone would make UUIDs identical. You're at the mercy of obscure edge cases there:
how does the kernel decide what to mount when partitions are mounted by UUID in /etc/fstab *and* several partitions
share that UUID? grub must refer to them a different way (I have a hunch).
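For what my hunch is worth: legacy grub (0.97) has no notion of filesystem UUIDs at all - it addresses storage positionally, by BIOS enumeration order. A sketch of the two views, with an invented UUID and kernel version:

```
# /etc/fstab - the kernel/mount resolves UUID= by scanning devices, so with
# duplicates, whichever matching device happens to be found first wins.
UUID=0f31a6cc-6a8a-4d2e-9b1e-000000000001  /boot  ext4  defaults  1 2

# /boot/grub/grub.conf - grub legacy is purely positional:
title Red Hat Enterprise Linux 6
    root (hd0,0)
    kernel /vmlinuz-2.6.32-754.35.1.el6.x86_64 ro root=/dev/sda7
    initrd /initramfs-2.6.32-754.35.1.el6.x86_64.img
```

If the BIOS disk order ever changed - say, when the motherboard was replaced - grub would silently start reading a different /boot, which could explain a stale menu surviving hand edits.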
Don't 9 out of 10 sysadmins get away with telling $boss the host is unrecoverable and needs to be rebuilt from scratch?
Most of the time I'm that 10th sysadmin, with some sort of intuition for recovering hosts
from "irrecoverable" configurations.
Stuff that never shows up on a resume and never gets asked about in an interview. I use Google, just like the other sysadmins do.
This is only the second time I've written a post about an actual problem, though. Usually I solve the problem and move on.
These edge cases are very uninteresting to the other sysadmins, and usually something nobody's encountered before.