The first step was to contact the appropriate people with a bug report. The kernel source code contained a nice how-to at Documentation/admin-guide/reporting-bugs.rst; this included a format for the bug report as well as a pointer to a script, scripts/get_maintainer.pl, which gave me names and e-mail addresses to submit the bug report to. So I gathered the data, placed it into the bug report format, and fired off an e-mail.
Not content to simply wait for the maintainers to look at my report, I decided to do a little digging to see if I could find which commit broke networking. Since I had moved from 4.14.12 to 4.18.5, and the latest upstream at the time was 4.19-rc3, it seemed prudent to try the latest -rc kernel to see if the bug had been fixed. To my surprise, it worked! To compare apples to apples, I then tried the 4.18 version, which failed. It seemed that the 4.18 stable kernel series was missing a patch from the 4.19 series.
My first approach was to simply run git log | grep for my driver, r8169, and look for relevant fixes. After I had counted 9 relevant commits with many more to go, and realized that each commit would require 2 compilations (one to test whether networking is fixed on the commit, another to test that it is broken before the commit), I decided that a better method would be to run a git bisect on the source in order to find the fix.
Now, somehow, presumably post-workday delirium, I got it into my head that, when bisecting, using good and bad for descriptive attributes wasn't good enough; thankfully git entertained my delirium and allowed me to rename good to fixed and bad to broken by using git bisect start --term-good=fixed --term-bad=broken, then marking them with git bisect fixed or git bisect broken as needed.
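To make the custom-term workflow concrete, here is a minimal sketch on a throwaway repository. Everything in it (the eight toy commits, the state file, the function name) is invented for illustration. It uses the --term-old/--term-new spellings that the git-bisect documentation suggests when hunting for a fix, where the "old" (ancestor) side is the broken one; that mapping is an assumption of this toy example, not a record of the invocation above.

```shell
#!/usr/bin/env bash
# Toy demonstration of bisecting for a FIX with custom terms.
# The repository, file names and commit count are all hypothetical.

demo_bisect_for_fix() {
    local repo
    repo=$(mktemp -d) || return 1
    (
        cd "$repo" || exit 1
        git init -q .
        git config user.email "demo@example.invalid"
        git config user.name "demo"

        # Eight commits; the pretend bug is fixed from commit 6 onward.
        for i in 1 2 3 4 5 6 7 8; do
            if [ "$i" -ge 6 ]; then echo fixed > state; else echo broken > state; fi
            echo "$i" > counter
            git add state counter
            git commit -q -m "commit $i"
        done
        first=$(git rev-list --max-parents=0 HEAD)

        # "old" is the state of the ancestor side, "new" the state after the change.
        git bisect start --term-old=broken --term-new=fixed >/dev/null
        git bisect fixed HEAD >/dev/null
        git bisect broken "$first" >/dev/null

        # Exit status 0 marks a commit with the old term (broken),
        # exit status 1 marks it with the new term (fixed).
        git bisect run sh -c 'grep -q broken state' >/dev/null

        # The ref named after the "new" term points at the first fixed commit.
        git log -1 --format=%s refs/bisect/fixed
        git bisect reset >/dev/null 2>&1
    )
    rm -rf "$repo"
}
```

Calling demo_bisect_for_fix should report the subject of the first fixed toy commit ("commit 6"). On a real kernel tree the automated git bisect run step would be replaced by manual git bisect fixed / git bisect broken calls after each boot.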
The problem with this, though, was that each kernel compilation took about half an hour, and each test required a manual reboot of my system. Since I was doing this in my free time after work and in between getting ready for the next day, this meant firing off a single compilation when I got home and testing it right before bed. This dragged the process out over many days, and, as I homed in on the final commit, it seemed to be in the wrong location (the nearby fixes were for an unrelated driver). Upon testing it, I found that the commit I'd homed in on was indeed incorrect (networking was still broken, though I'd run into two fixed commits earlier). One thing had become clear: I had fucked up.
Thankfully I had been smart enough to take notes and save all of my kernels and their configurations in case this happened. Since I wasn't sure what to do at this point, I began to peruse the git-bisect man page and found the useful git bisect log command, which showed me what I had marked; indeed, I had missed a commit in my notes (though I had thankfully saved the kernel). Alas, testing the commit showed that my bisect choices were correct; my notes had missed a single entry but were otherwise accurate. Thus I was left with the annoying task of running through each kernel and checking them for the bug based on my list:
54dbe75bbf1e - broken
307797159ac2 - broken
ee090756962c - broken
d972604f6f87 - broken
c81c7012e0c7 - fixed
2a8a2b7c49d6 - broken
aba16dc5cf93 - broken
cf1acec008f8 - fixed
ac4a5b52f597 - broken
1eb43fc75448 - broken
785e76d7a205 - broken
43f8b22450f0 - broken
c08eebad4ac5 - broken
a9910c088647 - broken

Before doing this, however, I decided to get smart: it had become apparent after much testing that judging whether or not networking was working from whether or not I received a DHCP lease during system initialization was only accurate about 90% of the time for whatever reason, so I wrote a quick test that would work all of the time using ping -c 3. Test in hand, running through the pre-compiled kernels was quite fast and I quickly learned that commit cf1acec008f8 was actually broken, not fixed.
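A check along those lines might look like the sketch below. The original test script is not shown, so this is only a guess at its shape: TARGET (a documentation-only address here) and the PING_CMD override are placeholders introduced for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical network check built around "ping -c 3".
# TARGET and PING_CMD are placeholders, not taken from the original script.
TARGET="${TARGET:-192.0.2.1}"    # replace with a host that should answer, e.g. your gateway
PING_CMD="${PING_CMD:-ping}"

network_state() {
    # Send three echo requests; ping exits 0 only if at least one reply came back.
    if "$PING_CMD" -c 3 "$TARGET" >/dev/null 2>&1; then
        echo fixed
    else
        echo broken
    fi
}
```

Running network_state after booting each saved kernel prints the word to record next to that commit, which avoids relying on whether a DHCP lease happened to arrive during boot.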
Fantastic as it was to find the broken commit, I was still left with a big problem: this process was taking forever (and I'd had a week's vacation to slow it all down on top of that), and compiling another 6 kernels (since the ones past cf1acec008f8 were now irrelevant) would take another week at least. From perusing the logs I'd learned earlier about the git bisect run command, which could automate testing when provided with the appropriate script, but there were multiple problems with that: first, I had to boot the machine into the kernel under test, which would kill the automated bisection; second, even if I could automatically reboot, the test would hang when trying to decrypt my hard drive; third, I selected a few non-default options during kernel configuration and it wasn't clear how to programmatically select them. The third problem was feasibly solvable, the first one might be solvable but would require a non-trivial amount of work, and the second one seemed impossible to solve without an unacceptable security compromise. Automation was out of the question.
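For the third problem, one plausible approach (an assumption, not necessarily what was eventually used) is to start from a saved .config, toggle the extra options with the kernel tree's scripts/config helper, and let make olddefconfig fill in defaults for anything new. The function name, the saved-config path and the option names passed in are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: run from the top of a kernel source tree.
# The saved config path and the option list are hypothetical placeholders.
# Set RUN=echo to print the commands instead of executing them.

configure_kernel() {
    local saved_config=$1
    shift
    local opt

    # Start from a known configuration, e.g. one saved from an earlier build.
    ${RUN:-} cp "$saved_config" .config || return 1

    # Enable each extra option by name (without the CONFIG_ prefix).
    for opt in "$@"; do
        ${RUN:-} scripts/config --file .config --enable "$opt" || return 1
    done

    # Accept the default answer for any symbol this revision adds.
    ${RUN:-} make olddefconfig
}
```

For example, RUN=echo configure_kernel ~/kernels/saved.config R8169 prints the three commands without touching the tree, which is a cheap way to check the option list before a long build.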
In this case my problem was that kernel compilation took half an hour and testing couldn't be automated, so it took many hours to compile and run tests. Yet I was away, sleeping or working, for most of the day while my machine idled; how was I to put that time to good use? I couldn't know which kernel to compile next...if only I could speculate which kernel to compile, I could pre-compile it while I was at work and then knock out two tests in short order. Alas, I had no means of speculation, as both possibilities were equally likely! Then it struck me: why speculate? If they were equally likely, I could instead compile both kernels and then discard the unneeded one, which would let me perform two tests at once without the need to speculate. Indeed, this logic could be extended to 3, 4, or even more tests! Eureka!
As a visualization, consider bisecting the following theoretical series of commits named "1" to "64" after their chronological order; this would take 6 steps and can be represented as a binary tree:
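One way to get a feel for that tree is to walk a single root-to-leaf path through it. The sketch below simulates a bisection over commits 1 to 64, assuming (for illustration only) that a state just before commit 1 is known broken and commit 64 is known fixed; bisect_path and its argument are made-up names.

```shell
#!/usr/bin/env bash
# Simulate one path through the bisection tree for commits 1..64.
# Argument: the (hypothetical) number of the first fixed commit.

bisect_path() {
    local first_fixed=$1 lo=0 hi=64 mid steps=0
    # Invariant: lo is known broken, hi is known fixed.
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        steps=$((steps + 1))
        if [ "$mid" -ge "$first_fixed" ]; then
            echo "step $steps: test $mid -> fixed"
            hi=$mid
        else
            echo "step $steps: test $mid -> broken"
            lo=$mid
        fi
    done
    echo "first fixed commit: $hi ($steps steps)"
}

bisect_path 37
```

With 37 as the first fixed commit, the path tests 32, 48, 40, 36, 38 and 37. Each test picks one of two children in the tree, which is why pre-compiling both children of the current node guarantees the next kernel is ready whichever way the test goes.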
Mathematically, the number of tests n that can be run over a given time period t is given by taking the floor of the base 2 logarithm of t divided by average time of compilation c:

$$ n = \left\lfloor \log_2 \frac{t}{c} \right\rfloor $$
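As a quick worked example of that relationship, the sketch below evaluates it with integer arithmetic. The function names and the sample figures (an eight-hour window and the half-hour compile time mentioned earlier, both in minutes) are chosen for illustration.

```shell
#!/usr/bin/env bash
# floor(log2(x)) for a positive integer x, by repeated halving.
floor_log2() {
    local x=$1 n=0
    while [ "$x" -gt 1 ]; do
        x=$((x / 2))
        n=$((n + 1))
    done
    echo "$n"
}

# Number of tests' worth of kernels that fit in a window of t minutes
# when one compilation takes c minutes on average.
tests_in_window() {
    local t=$1 c=$2
    floor_log2 $((t / c))
}

tests_in_window 480 30   # 480 / 30 = 16 compilations, so 4 tests
tests_in_window 120 30   # 120 / 30 = 4 compilations, so 2 tests
```

Because the number of kernels doubles with every extra test, tripling the available time only buys about one and a half additional tests.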
Configuration automation in hand, I wrote a quick test script and, after a couple of tries, of course, was able to crank out 2 tests' worth of kernels, and another 4 tests' worth (the remaining amount) the next day while I was at work. Below is a (simplified) version of the script I was able to use:
    forward_compile() {
        local commit=$(git log --oneline -n 1 | cut -f 1 -d ' ')
        local log="${LOG}-${commit}"

        # Build kernel.
        # Save results.

        # Compile forward.
        if [ $1 -lt 1 ]; then
            return
        fi
        git bisect log > "${log}"
        git bisect fixed
        forward_compile $(($1 - 1))
        git bisect replay "${log}"
        git bisect broken
        forward_compile $(($1 - 1))
        git bisect replay "${log}"
    }

    RESULTS="fwdcmpl"
    LOG="${RESULTS}/.tmp_bisect_log"

    # Parse arguments.
    if [ $# -lt 1 ]; then
        echo "USAGE: $0 DEPTH"
        echo "  DEPTH: Tree depth to traverse (2^DEPTH builds will be done)."
        exit 1
    fi

    # Perform forward compilation.
    forward_compile $1

I was lucky that, during the second run, there were exactly 8 commits left so I did not need to test corner-cases. Thus I was able to finish testing 6 kernels in two days rather than six, giving me the following list:
54dbe75bbf1e - broken
307797159ac2 - broken
ee090756962c - broken
d972604f6f87 - broken
c81c7012e0c7 - fixed
2a8a2b7c49d6 - broken
aba16dc5cf93 - broken
cf1acec008f8 - broken
15c480efab01 - broken
6e0bb04d0e4f - fixed
6a5d39aa9ac2 - broken
9a07efa9aea2 - broken
31fabbee8f5c - fixed
05212ba8132b - fixed
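To see how much work a given DEPTH implies for the forward-compilation script above, the recursion can be dry-run without git or a compiler. The sketch below mirrors only the script's control flow; dry_forward_compile and BUILDS are illustrative names, and the build and bisect steps are replaced by a counter and a printed path.

```shell
#!/usr/bin/env bash
# Dry run of the forward-compilation recursion: no git, no compiler,
# just the order in which bisect states would be visited and built.
BUILDS=0

dry_forward_compile() {
    local depth=$1 path=${2:-"(start)"}
    echo "build: $path"
    BUILDS=$((BUILDS + 1))
    if [ "$depth" -lt 1 ]; then
        return
    fi
    # Pretend the current commit tested as fixed, then as broken.
    dry_forward_compile $((depth - 1)) "$path > fixed"
    dry_forward_compile $((depth - 1)) "$path > broken"
}

dry_forward_compile 2
echo "total builds: $BUILDS"
```

Because the simplified function also builds the commit it starts on, a depth of d visits 2^(d+1) - 1 states in this dry run (7 for a depth of 2), assuming the bisection does not run out of commits first.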
Alas, it had taken me so long to run all of these tests that the latest stable kernel had moved all the way to 4.18.14. I decided to test this kernel, only to find that networking now worked again! Someone else must have discovered the bug and backported the patch before I could, the bastard! My efforts to find the bug fix were thus rendered futile, but at least I learned a useful trick along the way.
I never did hear back from the maintainers, though I did update them letting them know that the bug had been fixed; annoyingly, my message to the mailing list was rejected with the error "Your address is not liked source for email". Rude! Things were working again, though, and I wasn't keen on fighting the mailing list at this point, especially when I could be generating visuals for this blog...hopefully I don't need to upgrade my kernels before I finish writing it.