2018-11-11 4.14-4.18 Linux Kernel Upgrade Networking Woes

I seem to be having bad luck with kernel upgrades. Traditionally, upgrades have worked out smoothly...until, of course, I started this blog, at which point they started providing me with blog material. Normally there wouldn't be anything of particular interest in this blog besides giving an example of thought process for those unfamiliar with it, but necessity caused me to come up with a novel, at least in my experience, hack for manual testing. The rest was legwork.

The Saga Begins

I chose to use the latest stable release rather than a newer Long Term Support (LTS) version of my current kernel. Everything began normally: downloading source, verifying the signature, updating the config with make oldconfig, compiling, installing, and even booting the kernel, except, after booting the kernel I didn't get an IP address. Re-running the service gave me an IP address, but then running ssh would just hang indefinitely. I booted into the old kernel (always keep backups) and networking started working again. Something was broken (use the latest stable release, he said)!

The first step was to contact the appropriate people with a bug report. The kernel source code contained a nice how-to at Documentation/admin-guide/reporting-bugs.rst; this included a format for the bug report as well as a pointer to a script, scripts/get_maintainer.pl, which gave me names and e-mail addresses to submit the bug report to. So I gathered the data, placed it into the bug report format, and fired off an e-mail.

Not content to simply wait for the maintainers to look at my report, I decided to do a little digging to see if I could find which commit broke networking. Since I had moved from 4.14.12 to 4.18.5 and the latest upstream at the time was 4.19-rc3 it seemed prudent to try the latest -rc kernel to see if the bug had been fixed. To my surprise, it worked! In order to compare apples to apples I then tried the 4.18 version, which failed. It seemed that the 4.18 stable kernel series was missing a patch from the 4.19 series.

My first approach was to simply git log | grep for my driver, r8169, and look for relevant fixes. After I had counted 9 relevant commits with many more to go, and realized that each commit would require 2 compilations (one to test if networking is fixed on the commit, another to test that it is broken before the commit), I decided that a better method would be to run a git bisect on the source in order to find the fix.
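A path-limited git log is a tidier way to run that kind of search than grepping the full log. The sketch below is an illustration, not the exact command used at the time; the function name is made up, and the driver path in the example is where r8169 lived in trees of that era, so treat it as an assumption for any other version.

```shell
# Sketch: list the commits between two revisions that touch a given path.
commits_touching() {
	local from=$1 to=$2 target_path=$3
	# "from..to" selects commits reachable from $to but not from $from;
	# everything after "--" restricts the log to that path.
	git log --oneline "${from}..${to}" -- "${target_path}"
}

# Example (revisions and path are assumptions for illustration):
# commits_touching v4.18 v4.19-rc3 drivers/net/ethernet/realtek/r8169.c
```

Each line of output is one candidate commit, which makes it easy to see how quickly a commit-by-commit approach gets out of hand.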

Now, somehow, presumably post-workday delirium, I got it into my head that, when bisecting, using good and bad for descriptive attributes wasn't good enough; thankfully git entertained my delirium and allowed me to choose my own terms, fixed and broken. Since git's "good" always names the older state and I was hunting for a fix, broken had to be the old term and fixed the new one: git bisect start --term-old=broken --term-new=fixed, then marking commits with git bisect fixed or git bisect broken as needed.
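One detail worth spelling out: git ties its "good" (or "old") term to the state before the change being hunted and "bad" (or "new") to the state after it. When hunting for a fix, the older kernel is the broken one, so the custom terms map as old=broken and new=fixed. A minimal sketch of starting such a bisection follows; the function name and the example revisions are placeholders.

```shell
# Sketch only: start a bisection that hunts for a fix rather than a regression.
# $1 = older revision known to be broken, $2 = newer revision known to be fixed.
start_fix_bisect() {
	local broken_rev=$1 fixed_rev=$2
	# --term-old names the state before the change, --term-new the state after.
	git bisect start --term-old=broken --term-new=fixed || return 1
	git bisect broken "${broken_rev}" || return 1
	git bisect fixed "${fixed_rev}"
}

# Example (revisions are placeholders):
# start_fix_bisect v4.18 v4.19-rc3
# ...build, boot, test, then run one of:
# git bisect fixed
# git bisect broken
```

Running git bisect terms at any point prints the names currently in use, which helps after a few days away from the keyboard.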

The problem with trying to do this, though, is that each kernel compilation took about half an hour and required a manual reboot of my system in order to run the test. Since I was doing this in my free time after work and in-between getting ready for the next day, this meant firing off a single compilation when I got home and testing it right before bed. This dragged out the process over many days. As I homed in on the final commit it seemed to be in the wrong location (the nearby fixes were for an unrelated driver), and upon testing it I found that the commit I'd homed in on was indeed incorrect (networking was still broken, though I'd run into two fixed commits earlier). One thing had become clear: I had fucked up.

Thankfully I had been smart enough to take notes and save all of my kernels and their configurations in case this had happened. Since I wasn't sure what to do at this point I began to peruse the git-bisect man page and found the useful git bisect log to show me what I had marked, and, indeed, I had missed a commit in my logs (though I had thankfully saved the kernel). Alas, testing the commit showed that my bisect choices were correct; my notes had missed a single entry but were otherwise accurate. Thus I was left with the annoying task of running through each kernel and checking them for the bug based on my list:

	54dbe75bbf1e - broken
	307797159ac2 - broken
	ee090756962c - broken
	d972604f6f87 - broken
	c81c7012e0c7 - fixed
	2a8a2b7c49d6 - broken
	aba16dc5cf93 - broken
	cf1acec008f8 - fixed
	ac4a5b52f597 - broken
	1eb43fc75448 - broken
	785e76d7a205 - broken
	43f8b22450f0 - broken
	c08eebad4ac5 - broken
	a9910c088647 - broken

Before doing this, however, I decided to get smart: it had become apparent after much testing that judging whether or not networking was working from whether or not I received a DHCP lease during system initialization was only accurate about 90% of the time for whatever reason, so I wrote a quick test that would work all of the time using ping -c 3. Test in hand, running through the pre-compiled kernels was quite fast and I quickly learned that commit cf1acec008f8 was actually broken, not fixed.
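A check along those lines can be as small as the sketch below. It is not the original test, just an illustration of the idea; the function name, the overridable PING variable, and the example address are all assumptions.

```shell
# Sketch only: report whether basic networking works under the booted kernel.
# PING can be overridden to dry-run the check; it defaults to the real ping.
network_ok() {
	local target=$1
	# Send three echo requests; ping exits non-zero when no replies arrive.
	"${PING:-ping}" -c 3 "${target}" >/dev/null 2>&1
}

# Example (192.0.2.1 is a placeholder, not an address from the original setup):
# if network_ok 192.0.2.1; then echo fixed; else echo broken; fi
```

Pinging a host on the local network, such as the gateway, keeps the result independent of DNS and of whatever the DHCP client happened to do during boot.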

Fantastic as it was to find the broken commit, I was still left with a big problem: this process was taking forever (and I'd had a week's vacation to slow it all down on top of that), and compiling another 6 kernels (since the ones past cf1acec008f8 were now irrelevant) would take another week at least. From perusing the logs I'd learned earlier about the git bisect run command, which could automate testing when provided with the appropriate script, but there were multiple problems with that: first, I had to boot the machine into the kernel under test, which would kill the automated bisection; second, even if I could automatically reboot, the test would hang when trying to decrypt my hard drive; third, I selected a few non-default options during kernel configuration and it wasn't clear how to programmatically select them. The third problem was feasibly solvable, the first one might be solvable but would require a non-trivial amount of work, and the second one seemed impossible to solve without an unacceptable security compromise. Automation was out of the question.

Forward Compilation

"Necessity is the mother of all invention" as the old saying goes. The (only) useful thing about refusing to give up using my Pentium 4 is that it necessitates finding clever solutions for problems rather than throwing more compute power at terrible code.

In this case my problem was that kernel compilation took half an hour and testing couldn't be automated, so each test cost many hours of wall-clock time. Yet my machine sat idle for most of the day while I was sleeping or working; how was I to put that time to good use? I couldn't know which kernel to compile next...if only I could speculate which kernel to compile, I could pre-compile it while I was at work and then knock out two tests in short order. Alas, I had no means of speculation as both possibilities were equally likely! Then it struck me: why speculate? I could instead compile both kernels and then discard the unneeded one, which would let me perform two tests in one sitting without any need to speculate. Indeed, this logic could be extended to 3, 4, or even more tests! Eureka!

As a visualization, consider bisecting the following theoretical series of commits named "1" to "64" after their chronological order; this would take 6 steps and can be represented as a binary tree:

[Figure: the bisection of commits 1 to 64 drawn as a binary tree, with an orange box around each individual commit]

A bisection would travel downward along the tree, following the green line on a fixed commit and the red line on a broken commit. The orange boxes represent the number of commits tested in a single "session"; the default is to test a single commit in a single session, represented by an orange box around each individual commit; it would thus take 6 sessions to traverse to the bottom of the tree. Using forward compilation to compile 3 commits' worth of kernels at once then produces the following graph:

[Figure: the same binary tree with each orange box enclosing 3 levels of commits]

In this case it takes a total of 2 sessions rather than 6 to work down the tree. The downside, of course, is that the number of compilations grows exponentially with respect to the number of tests to be run: 7 kernels per session in order to test 3 commits, and 14 kernels over both sessions to test the whole tree. Trying to jam the entire tree into a single session (an orange box around the entire tree) would require compiling all 64 kernels.
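To make the session idea concrete, here is a small sketch that lists which commits would be compiled in one session. It uses a simplified model: commits are numbered along a straight line, the two endpoints are already known, and bisection always tests the midpoint. Real git bisect chooses midpoints from the commit graph, so this only mirrors the numbered example; the function name and the endpoint convention are assumptions.

```shell
# Simplified model: print every commit that must be pre-compiled to cover
# DEPTH consecutive tests between the known endpoints LO and HI (exclusive).
speculative_builds() {
	local lo=$1 hi=$2 depth=$3
	# Stop when the session depth is used up or no untested commit remains.
	if [ "${depth}" -lt 1 ] || [ $((hi - lo)) -lt 2 ]; then
		return
	fi
	local mid=$(( (lo + hi) / 2 ))
	echo "${mid}"
	# One branch covers the case where the answer lies below the midpoint,
	# the other the case where it lies above; both get compiled in advance.
	speculative_builds "${lo}" "${mid}" $((depth - 1))
	speculative_builds "${mid}" "${hi}" $((depth - 1))
}
```

With the hypothetical endpoints 0 and 64, speculative_builds 0 64 3 prints 32, 16, 8, 24, 48, 40 and 56, one per line: the 7 kernels of the first session. Only 3 of them end up being booted; the other 4 are the price of not having to guess.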

Mathematically, the number of tests n that can be run over a given time period t is given by taking the floor of the base 2 logarithm of t divided by the average compilation time c:

$n = \left\lfloor \log_2\left(\frac{t}{c}\right) \right\rfloor$

In my case, 30-minute compilations with a 20-hour compilation period meant 5 tests could be done in a day! While this is no academic breakthrough due to its usefulness being limited by exponential growth (a 40-hour compilation period would only give 6 tests), the ability to run 5 tests at once is a huge improvement over running tests one at a time.
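The formula is easy to evaluate with shell integer arithmetic. The sketch below assumes both times are given in the same unit (minutes here); the function name is made up for illustration.

```shell
# Sketch: n = floor(log2(t / c)), computed with integer arithmetic.
# $1 = available compilation time t, $2 = average time per compilation c.
max_tests() {
	local ratio=$(( $1 / $2 )) n=0
	# Halve the ratio until it reaches 1; each halving adds one test level.
	# A ratio below 2 leaves n at 0.
	while [ "${ratio}" -gt 1 ]; do
		ratio=$((ratio / 2))
		n=$((n + 1))
	done
	echo "${n}"
}
```

max_tests 1200 30 prints 5 and max_tests 2400 30 prints 6, matching the 20-hour and 40-hour figures above.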

The Final Stretch

Excited to try my new technique, I was still delayed by the necessity of automating kernel configuration. Someone suggested that I try using the MIN_CONFIG option of tools/testing/ktest/ktest.pl. I quickly found the tool to be rather unwieldy for my relatively simple task; this was clearly a tool meant for much more than to apply a simple configuration and build. I figured out enough of the tool to have it read my configuration file and apply the minimum configuration, but it then tried to configure again and erred out claiming that the directory was "not clean". Impatient, I decided to simply use make olddefconfig; although not entirely accurate it would most likely be enough for my current issue.
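For reference, the olddefconfig route boils down to copying a saved configuration into the tree and letting the build system fill in anything new with defaults. The sketch below is an illustration under assumptions: the function name and both paths are placeholders, and MAKE is made overridable purely so the function can be dry-run.

```shell
# Sketch only: reuse a saved kernel configuration without any prompts.
# $1 = saved config file, $2 = kernel source tree (both are placeholders).
apply_saved_config() {
	local saved=$1 tree=$2
	cp "${saved}" "${tree}/.config" || return 1
	# olddefconfig keeps the saved answers and takes the default value for
	# every symbol that is new in this tree, so it never asks a question.
	"${MAKE:-make}" -C "${tree}" olddefconfig
}

# Example (paths are placeholders):
# apply_saved_config ~/kernels/config-4.14.12 ~/src/linux
```

The "not entirely accurate" part is that any option introduced between the saved kernel and the commit under test silently gets its default, which may differ from what an interactive make oldconfig session would have chosen.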

Configuration automation in hand, I wrote a quick test script and, after a couple of tries, of course, was able to crank out 2 tests' worth of kernels, and another 4 tests' worth (the remaining amount) the next day while I was at work. Below is a (simplified) version of the script I was able to use:

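forward_compile() {
	# Abbreviated hash of the commit that bisect currently has checked out.
	local commit=$(git log --oneline -n 1 | cut -f 1 -d ' ')
	local log="${LOG}-${commit}"
	# Build kernel.
	# Save results.
	# Compile forward.
	if [ "$1" -lt 1 ]; then
		return
	fi
	# Snapshot the bisect state so each speculative branch can be undone.
	git bisect log > "${log}"
	# Branch 1: pretend this commit tested as fixed and build what follows.
	git bisect fixed
	forward_compile $(($1 - 1))
	git bisect replay "${log}"
	# Branch 2: pretend this commit tested as broken and build what follows.
	git bisect broken
	forward_compile $(($1 - 1))
	git bisect replay "${log}"
}

RESULTS="fwdcmpl"
LOG="${RESULTS}/.tmp_bisect_log"
# Make sure the directory that holds the saved bisect logs exists.
mkdir -p "${RESULTS}"

# Parse arguments.
if [ $# -lt 1 ]; then
	echo "USAGE: $0 DEPTH"
	echo "  DEPTH: Tree depth to traverse (2^(DEPTH+1) - 1 builds will be done)."
	exit 1
fi

# Perform forward compilation.
forward_compile "$1"
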
I was lucky that, during the second run, there were exactly 8 commits left so I did not need to test corner-cases. Thus I was able to finish testing 6 kernels in two days rather than six, giving me the following list:

	54dbe75bbf1e - broken
	307797159ac2 - broken
	ee090756962c - broken
	d972604f6f87 - broken
	c81c7012e0c7 - fixed
	2a8a2b7c49d6 - broken
	aba16dc5cf93 - broken
	cf1acec008f8 - broken
	15c480efab01 - broken
	6e0bb04d0e4f - fixed
	6a5d39aa9ac2 - broken
	9a07efa9aea2 - broken
	31fabbee8f5c - fixed
	05212ba8132b - fixed

Alas, it had taken me so long to run all of these tests that the latest stable kernel had moved all the way to 4.18.14. I decided to test this kernel, only to find that networking now worked again! Someone else must have discovered the bug and backported the patch before I could, the bastard! My efforts to find the bug fix were thus rendered futile, but at least I learned a useful trick along the way.

I never did hear back from the maintainers, though I did update them letting them know that the bug had been fixed; annoyingly, my message to the mailing list was rejected with the error "Your address is not liked source for email". Rude! Things were working again, though, and I wasn't keen on fighting the mailing list at this point, especially when I could be generating visuals for this blog...hopefully I don't need to upgrade my kernels before I finish writing it.

