Real-world RHEL6 problem

Here's an example of a computer problem I see every so often (the kind where training and vendor tech support aren't much use).
There's an old HP DL360e Gen8 server with one LUN built from two HDDs, plus two more basic HDDs. This place likes to put NVIDIA graphics cards into servers.
In this case, doing so makes booting into rescue mode from a DVD impossible: it hangs while discovering hardware. So that card has to come out.
When I started, I mentally noted that the host was running from about eight partitions on /dev/sda, except that /boot was mounted from /dev/sdc1.
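(For anyone surveying a box like this, the stock RHEL6 tools are enough; nothing below is specific to this server:)
    fdisk -l          # list disks and their partition tables
    blkid             # filesystem UUIDs and types, per partition
    df -h             # what's actually mounted where
    cat /etc/fstab    # how the mounts are declared (UUID vs. device path)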

A long time ago this was a RHEL6.5 host (booting rescue mode identifies /dev/sda7 as having RHEL6.10 and /dev/sdb7
as having RHEL6.5). At some point the motherboard was replaced, I'm told.
The person patching it got the kernel up to RHEL6.10 plus the latest patches, but when booting you see a much older list of five kernel versions,
and you have to choose one, maybe from 6.7 or so, to boot the box. Why you're presented with an older list is mystery #1.

I tried a yum upgrade and found out /boot fills to 100%, so the new vmlinuz and initramfs can't be built correctly. No problem: go into
/boot and delete the two oldest menu entries and their files; yum upgrade then works. Reboot the server: still the old selections, still have to choose
an old menu entry. Why didn't the menu change? I hand-edited it myself.
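(In hindsight, removing the old kernel packages is cleaner than hand-deleting files, since the rpm scripts also drop the matching grub.conf entries; the version below is just an example of an old one:)
    df -h /boot                                # confirm it's full
    rpm -q kernel                              # list the installed kernel packages
    yum remove kernel-2.6.32-573.el6.x86_64    # remove one old kernel (example version)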

Looking at storage: there are a lot of partitions on /dev/sda and /dev/sdb, and two on /dev/sdc.
blkid shows *identical* UUIDs for:
(1) swap - there's one on each HDD.
(2) /boot - /dev/sda1 and /dev/sdc1. (That's what I remember, anyway.)
That's mystery #2.
I changed the UUIDs of the swap partitions and /dev/sdc1, after moving contents from /dev/sdc1 to /dev/sda1.
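(Roughly what that looks like on RHEL6; the swap partition name here is a stand-in, and /etc/fstab has to be updated to match any UUID you change:)
    tune2fs -U $(uuidgen) /dev/sdc1    # new random UUID for the ext filesystem
    swapoff /dev/sda2                  # swap partition name is an example
    mkswap -U $(uuidgen) /dev/sda2     # recreate swap with a fresh UUID
    swapon /dev/sda2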
Reboot.
Get grub> prompt.

Long story short, I think at this stage I've booted into something that has an old view of the host.
Through trial and error I've figured out that when I boot manually, after I root (hd0,0) I'm actually seeing /dev/sdb1.
I copied all of the /dev/sda1 contents to /dev/sdb1 and rebooted, and I still can't boot the host,
because I can't find a vmlinuz and initramfs of the same version that both load without errors.
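(For reference, manual booting at the grub> prompt goes like this in grub legacy; the kernel version is just an example:)
    grub> root (hd0,0)
    grub> kernel /vmlinuz-2.6.32-754.el6.x86_64 ro root=/dev/sda7
    grub> initrd /initramfs-2.6.32-754.el6.x86_64.img
    grub> boot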

I tried the easy way out: booting the DVD-ROM and upgrading the existing OS. It looks like it writes a grub.conf, but
evidently grub can't find it after a reboot.


It's baffling why someone would make UUIDs identical. You're at the mercy of obscure edge cases there:
how does the code decide what to mount when partitions are mounted by UUID in /etc/fstab *and* several partitions
have identical UUIDs? grub must refer to them a different way (I have a hunch).
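(The hunch, illustrated: grub legacy addresses partitions by BIOS disk ordering, not UUID, so only the fstab side is ambiguous. The UUID value below is made up:)
    /etc/fstab - mount by UUID, ambiguous when two partitions carry the same one:
        UUID=0f1b2c3d-aaaa-bbbb-cccc-123456789abc  /boot  ext4  defaults  1 2
    /boot/grub/grub.conf - grub legacy, no UUIDs, just BIOS (disk,partition) numbering:
        root (hd0,0)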
Don't 9 out of 10 sysadmins get away with telling $boss the host is unrecoverable and needs to be rebuilt from scratch?
Most of the time I'm that 10th sysadmin, the one with some sort of intuition for recovering a host
from irrecoverable configurations.
Stuff that never shows up on the resume and never gets asked about in an interview. I use Google, just like the other sysadmins.
This is only the second time I've written a post about an actual problem, though. Usually I solve the problem and move on.
The edge cases are very uninteresting to the other sysadmins, and usually something nobody's encountered before.
 
Just a shot in the dark here: can you use a boot DVD to reinstall grub, configuring it correctly for the system?
 
Going to try this next:

The following steps detail the process on how GRUB is reinstalled on the master boot record:
  • Boot the system from an installation boot medium.
  • Type linux rescue at the installation boot prompt to enter the rescue environment.
  • Type chroot /mnt/sysimage to mount the root partition.
  • Type /sbin/grub-install bootpart to reinstall the GRUB boot loader, where bootpart is the boot partition (typically, /dev/sda).
  • Review the /boot/grub/grub.conf file, as additional entries may be needed for GRUB to control additional operating systems.
  • Reboot the system.
But I think that'll put it onto a partition other than /dev/sda1. I guess getting the OS to boot is the better goal at this point.
Edit: Oh, I get to choose "bootpart". Ok. (A concrete version of the session is sketched below.)
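(What those steps look like as an actual session, assuming the target is the first disk; note that grub-install takes a whole disk to write the MBR, or a partition to write that partition's boot sector:)
    boot: linux rescue
    # chroot /mnt/sysimage
    # /sbin/grub-install /dev/sda    # /dev/sda = MBR; /dev/sda1 = partition boot sector
    # cat /boot/grub/grub.conf       # sanity-check the menu entries
    # exit                           # leave the chroot, then reboot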
 
Thanks for the write-up; this is definitely one of my weaker areas of knowledge regarding Linux servers, and one that can't really be practiced until the skill is suddenly needed.

This is why I long for the long-gone days of engineered systems like the ones Sun (now Oracle) provided.

The OBP (OpenBoot PROM, the SPARC equivalent of a BIOS) can boot any OS from any disk and partition; the default boot path is stored in an NVRAM variable on the host.

The x64 BIOS-and-grub combination is a car wreck in comparison (my lack of expertise in it notwithstanding)
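(From memory, at the ok prompt it goes roughly like this; device aliases vary by machine:)
    ok printenv boot-device      # show the default boot path stored in NVRAM
    ok setenv boot-device disk1  # make disk1 the new default
    ok boot disk2                # or boot any other device on demand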
 
There's a way to have grub search for bootable installs and build the list, but I can't recall what it is off the top of my head. On the cleanup of the old kernels, there is actually a command to do that properly, which should also remove them from the boot list (not helpful now, I know).

The command is: package-cleanup --oldkernels --count=1 (or 2 or 3 or however many kernels you want to keep). --count is the total number of kernels kept, so as written it keeps only the newest; use --count=2 to keep the running kernel plus a spare.
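(Related knob, to stop /boot from filling up in the first place: yum itself caps how many kernels it keeps, via /etc/yum.conf:)
    # /etc/yum.conf
    installonly_limit=3    # keep at most 3 kernels; the oldest is removed on upgrade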
 
The server's Dynamic RAID card says press F5 to enter its configuration utility, but pressing it does nothing.

Several hours later: it has 4 HDDs, two with OSes on them. /dev/sda is RHEL6.10; /dev/sdb is RHEL6.5 that hasn't been booted since
Jan 2015. For unknown reasons the server BIOS loads the MBR from HDD2 rather than HDD1.
(They aren't a RAID-1 mirror as I thought earlier.) It's an older (2012) BIOS; I don't get any say
in which HDD it tries. It insists on HDD2.
Now that I've made all the UUIDs unique again, it's been thrown off whatever crazy path it's been
taking for the last 5 years. My attempts at "/sbin/grub-install bootpart" slowly
made the problem worse. I don't even get a grub> prompt anymore.

Last go-around I used the DVD to upgrade the existing OS, the one whose root is /dev/sdb7, and I asked it to
put a new GRUB on /dev/sdb, with menu entries to boot either /dev/sdb7 or /dev/sda7.
The upgrade, 890 packages, went through; then anaconda threw an exception while installing the bootloader.

I think I should boot it from the DVD in rescue mode, run whatever commands I need to
get vmlinuz and initramfs built, then do the grub-install. I just ran out of time, and every reboot
on a server is 5 minutes, so it'll have to wait. I've suggested just buying a new server, building it up,
then copying the data over from the old one. The old one has a lot of problems and a twisted life.
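(That plan as commands, run from the rescue environment; the kernel version is an example, and yum reinstall assumes the repos are reachable from the chroot:)
    # chroot /mnt/sysimage
    # yum reinstall kernel-2.6.32-754.el6.x86_64    # regenerates vmlinuz and the initramfs
    # dracut -f /boot/initramfs-2.6.32-754.el6.x86_64.img 2.6.32-754.el6.x86_64   # or rebuild the initramfs alone
    # /sbin/grub-install /dev/sda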
 
This is why I long for the long-gone days of engineered systems like the ones Sun (now Oracle) provided.

Yeah, one of the things I miss is booting SPARC.
When they first went to x64 it was a real struggle.
 
Booting into the bios configuration doesn't give you the option to set the boot order of the disks?
 
Booting into the bios configuration doesn't give you the option to set the boot order of the disks?

Nope - to qualify that: it doesn't give me a choice of which HDD to boot from. (The BIOS is old; I know a Gen10 will, and I think a Gen9 might.)
This one hints that the boot order of the HDDs is determined by the hardware RAID (B120i), which F5 doesn't get me into.
This post looks promising: https://community.hpe.com/t5/prolia...-b120i-via-acu-or-ssa-help-please/m-p/7072615

" Since the B120i is SATA chipset software based RAID (the RAID function is controlled by a special driver) there really isn't a whole lot to be gained by using it. Many users just set the controller to AHCI mode and use the RAID function built into the OS so they don't have to track an additional driver update. "
Unfortunately, this server is set to Dynamic RAID mode, and it warns that changing the setting will result in data loss.
I'll just have to live with it booting from HDD2 and try to make that hand off to the OS on HDD1.
(It makes more sense to get a new server and migrate the data over.)
 
I remember a problem on a Z800 workstation a little less than a year ago, in the same area.
It had two HDDs with an OS installed on each. The way they configure their kernels in this area today is to boot with the options
rd_NO_LUKS rd_NO_MD rd_NO_DM. Most of the hosts have hardware RAID, so that's not a big deal.
This Z800 has a software RAID chip. It booted the same convoluted way, off one HDD and then off the other.
At the time I thought that was weird (and unreliable), reconfigured it to let the MD driver load, and promptly overwrote the contents
of the newer OS HDD with the contents of the older OS HDD.

They must have configured this server with software RAID when they bought it, then around Jan 2015 the new
sysadmins reconfigured it with the "rd_NO_LUKS rd_NO_MD rd_NO_DM" parameters and started the
divergence of OSes on HDD1 and HDD2.
 
I can get it to boot manually by editing the boot menu entry to say root=/dev/sda7 rather than root=UUID="", which points at the wrong partition, /dev/sda1. Tomorrow I'll edit the file to make the change permanent.
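(The edit, sketched against a grub.conf kernel line; the version is an example, and the real UUID value is omitted above:)
    before:  kernel /vmlinuz-2.6.32-754.el6.x86_64 ro root=UUID=... rd_NO_LUKS rd_NO_MD rd_NO_DM
    after:   kernel /vmlinuz-2.6.32-754.el6.x86_64 ro root=/dev/sda7 rd_NO_LUKS rd_NO_MD rd_NO_DM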

The server is incorrectly configured to make RAID-y use of the Intel C600/X79 software RAID chip, but I can't reconfigure it without data loss.

This (subscription required) article talks about how you're supposed to get a driver from HPE in order for it to work properly.
Issue: Some of the HP Gen8 and Gen9 systems are shipping with either a Smart Array B320i, B140i, B120i, B110i, or other Bxxxi controller that requires a closed source driver to make RAID functionality available to the OS.
Part of that article gives this: https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-c03732112
It's called hpvsa. HPE stopped updating it after RHEL6.4: hpvsa-1.2.6-13.rhel6u4.x86_64.dd.gz
Also, the way to install it is via a USB drive during OS install, and USB drives aren't allowed in our area.

But I can find a newer driver: kmod-hpvsa-1.2.16-122.rhel6u10.x86_64.rpm.
Probably not a good idea to install it with the two HDDs' contents having diverged so far. There's no telling whether
the current disk would copy data over the old, or the old over the current.

So, better to buy their servers with their Smart Array hardware-based RAID.
 