Tracking down a bug

A few days ago I ran into a problem with my Tumbleweed install.  In this post, I’ll describe what happened and how I investigated it.

I had just updated to the 20141225 snapshot.  The availability of that snapshot had been announced on the factory mailing list.  I updated using the command

# zypper dup

During the dialog for the update, I noticed that this included an updated kernel (3.18.1-1-desktop).

Rebooting

After the update, I rebooted to make sure that I was using the newly installed software.  And that’s where I ran into problems.  My Tumbleweed system is using an encrypted LVM.  So, as expected, I was prompted for the encryption key during the boot.  When I tried to type in the key, nothing happened.  It looked as if the keyboard was not being read.  Note that this system is using a generic USB keyboard.  It’s actually a Dell keyboard, though the computer itself is a Lenovo.

I tried CTRL-ALT-DEL.  But with the keyboard not being recognized, that had no effect.  The only way out was to power off.  So I powered off and then tried again, in case this was a one-time glitch.  However, I had the same problem on the second try.

I powered off again.  And I rebooted again.  But this time, at the “grub2” menu, I selected “Advanced Options”.  That allowed me to select a different kernel to boot.  I selected the 3.17.4-1-desktop that I had been previously using.  And the system booted fine with that kernel.  I was prompted for the encryption key and was able to supply the key.  The system continued to a full boot.

That told me that I had a kernel problem.  Apart from the new kernel, everything seemed to be doing fine.

Was this an initrd issue?

The next step was to determine whether the kernel was faulty, or the “initrd” was faulty.

Perhaps a word of explanation here.  The “initrd” is an initial ramdisk that is used during boot.  Booting the system loads the kernel and gives it the “initrd” data for the initial ramdisk.  That is supposed to give the newly booting system enough software to get started.  The kernel itself and the “initrd” file are read using BIOS calls (or UEFI firmware calls).  Beyond that, all input/output uses the services of the newly loaded kernel.  The “initrd” has to contain any necessary drivers that are used during the remainder of booting.

So the question I had to investigate was whether the kernel itself was incapable of reading the keyboard, or whether the kernel needed a driver (a kernel module) to read the keyboard and that driver was not found in the “initrd”.

I downloaded the live rescue CD for this snapshot.  That was the file “openSUSE-Tumbleweed-Rescue-CD-x86_64-Snapshot20141225-Media.iso”, which I found at the Tumbleweed download site.  I wrote that live image to a USB drive and then tried to boot from it.

The USB boot went fine.  I was using the live rescue system, because that did not depend on my encrypted LVM.  So I could probably complete the boot unless there were other problems.

Once I had the live rescue system booted, I checked whether it could read the keyboard.  It could.  So this seemed to show that the kernel itself was fine, but I was missing a driver in the “initrd”.

My next step was to get a list of the loaded kernel modules for the running system (the live rescue system).  For that, I used:

# lsmod

and I redirected the output to a file.

My guess, at this time, was that there had been a change in the kernel that required an additional driver that was not being included in the “initrd”.  So I now rebooted my system to the working kernel 3.17.4-1-desktop.  When running that kernel, I again used the “lsmod” command to get a list of loaded modules.  My aim was to see which modules were being used with the 3.18.1 kernel but not with the 3.17.4 kernel.

Taking a short cut

At this stage, I took a short cut.  From my experience, I expected the modules for USB devices to have the string “hci” as part of their name.  So I started comparing such modules from the two lists.  This was simply a matter of using the “grep” command.

The “lsmod” output for the 3.18.1 kernel included modules “xhci_pci” and “xhci_hcd”.  The output for the 3.17.4 kernel included only the second of those.
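For anyone who would rather not rely on guessing the “hci” string, the full comparison can be sketched as a small shell function.  This is only an illustration: the function name and the file names in the usage line are made up here, and it assumes both files were saved with “lsmod > file”, so that each starts with the usual header line.

```shell
# Print the module names that appear in the second lsmod listing
# but not in the first.  Both arguments are files saved with
# "lsmod > file" (the function name is a hypothetical choice).
modules_only_in_new() {
    old_list=$1
    new_list=$2
    # lsmod prints a header line first; column 1 is the module name.
    # While reading the first file, remember its module names; while
    # reading the second, print any name that was not remembered.
    awk 'NR == FNR { if (FNR > 1) seen[$1] = 1; next }
         FNR > 1 && !($1 in seen) { print $1 }' "$old_list" "$new_list" | sort
}
```

With the two saved listings, something like “modules_only_in_new lsmod-3.17.4.txt lsmod-3.18.1.txt | grep hci” would then narrow the result to the USB host controller modules.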

At this stage, I was pretty sure that I had found the problem.  The next step was to track down where things were going wrong.  The output from

cd /etc
ls | grep dracut

showed that dracut settings were in the file “/etc/dracut.conf” and in the directory “/etc/dracut.conf.d”.  Note that “dracut” is the name of the software that generates the “initrd”.

There was a line in “/etc/dracut.conf” where I could probably add the name of the kernel module to force it to be included.  However, I decided to look in “/etc/dracut.conf.d”.  One of the files there had useful comments, which suggested that I look at “/usr/lib/dracut/modules.d/”.  From that directory I was able to find “/usr/lib/dracut/modules.d/90kernel-modules/module-setup.sh” which is a shell script used to generate the list of modules to add to the “initrd”.  It was then simple to modify that script to also add the “xhci_pci” module.
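The configuration route that I passed over would probably be the less intrusive one, since it does not touch a packaged script.  As a sketch only (this is not what I did, and I have not tested it on this system): dracut reads drop-in files from “/etc/dracut.conf.d”, and its “add_drivers” setting names extra kernel modules to include.  The function name and the drop-in file name below are hypothetical.

```shell
# Write a dracut drop-in that asks for one extra kernel module.
# $1 = module name, $2 = config directory (normally /etc/dracut.conf.d).
# The file name 99-extra-<module>.conf is a hypothetical choice.
write_dracut_dropin() {
    module=$1
    conf_dir=$2
    # dracut expects a space on each side of the list inside the quotes.
    printf 'add_drivers+=" %s "\n' "$module" > "$conf_dir/99-extra-$module.conf"
}
```

Run as root, “write_dracut_dropin xhci_pci /etc/dracut.conf.d” would create a one-line file containing add_drivers+=" xhci_pci ".  The “initrd” still has to be regenerated afterwards.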

Testing

With that change made, I ran

# mkinitrd

to regenerate the “initrd” files.  I then rebooted to the 3.18.1 kernel to test.  And all was fine.  I was prompted for the encryption key.  And this time, I was able to enter the key.  The system booted up normally.
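It is also possible to check the regenerated “initrd” before rebooting.  The “lsinitrd” tool that ships with dracut lists the contents of an image, and that listing can be searched for the module.  The helper below is a sketch with a made-up name; it only filters a listing read from standard input.  Note that “lsmod” shows module names with underscores, while the file on disk may use hyphens (for example “xhci-pci.ko”), so the pattern accepts either.

```shell
# Succeed if a listing of initrd contents (read from standard input)
# mentions the given module.  lsmod shows names with underscores, but
# the .ko file on disk may use hyphens, so accept either.
listing_has_module() {
    # Turn every "-" or "_" in the name into the bracket expression [-_].
    pattern=$(printf '%s' "$1" | sed 's/[-_]/[-_]/g')
    # Match the module file name, including compressed forms like .ko.xz.
    grep -q "/${pattern}\.ko"
}
```

For example: “lsinitrd /boot/initrd-3.18.1-1-desktop | listing_has_module xhci_pci && echo present”.  The image path here is an assumption based on the kernel version; adjust it to whatever is actually in “/boot”.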

The remaining step was to post a bug report on the issue.  I reported it as Bug 911319.

Another user has a problem

Yesterday, I noticed that someone else was having what seemed to be a related problem.  He described his problem in a forum post.

His problem did not involve an encrypted LVM.  But he was booting from a USB drive, and his description indicated that the kernel was unable to access that drive.  So it looked as if it could be caused by the same problem.

I replied, suggesting that he check that possibility.  And it turns out that he was able to resolve his problem with the same work-around.

An additional note

Whether this might affect your system depends on your hardware.  Some USB devices use “xhci” drivers.  Others use “ehci” drivers or “ohci” drivers (and perhaps other drivers).  You will only run into this problem if your computer requires the “xhci” drivers.  From what I can see, it looks as if this applies to computers with USB3 support.  However, using a USB2 port might not help.  On my systems with USB3, the USB2 devices also use the “xhci” drivers.
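A quick way to see which of these drivers a running system has loaded is to filter the “lsmod” output.  The function below is a rough sketch with a hypothetical name.  It assumes the host controller modules follow the usual naming (“xhci_…”, “ehci_…”, “ohci_…”, “uhci_…”), and it will miss any driver that is built into the kernel rather than loaded as a module.

```shell
# Read lsmod output on standard input and print the USB host
# controller modules it lists (names such as xhci_hcd or ehci_pci).
usb_host_modules() {
    # Skip the header line, then keep names starting with
    # xhci_, ehci_, ohci_ or uhci_.
    awk 'NR > 1 && $1 ~ /^[xeou]hci_/ { print $1 }' | sort
}
```

Running “lsmod | usb_host_modules” on a system that works shows what the keyboard and other USB devices depend on.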

About Neil Rickert

Retired mathematician and computer scientist who dabbles in cognitive science.
