Back in November we started testing Spark on CentOS 7 and ran into a nasty issue where our nodes would appear to reboot, leaving no core dump in /var/crash. This initially led me to believe there might be a hardware incompatibility or a power issue; it was the first time we had used that combination of motherboard and CPU. We had also had a ton of issues with the Adaptec 7 series cards ejecting drives from the array under heavy I/O (a neat feature, really), and sometimes the card itself hanging under high I/O. Strangely enough, we found that disabling write caching (specifically in front of SSDs) would stop the drive ejections. Still, not a great situation.
With all that history, I approached the issue with the misguided assumption that it was a firmware problem - not without some well-deserved doubts about the Adaptec controller from previous experience. I updated the aacraid kernel module to the latest available, then the controller firmware, then even the BIOS, and things appeared to go well for a little while... then the crashes returned.
So at this point, I'm thinking it's definitely not a RAID card issue. I remembered another CentOS 7 testing box had rebooted a week or two prior under a light/medium workload. Maybe it could shed some light on the situation, if it had captured a core dump. Sure enough it had (that box had no RAID card), and there was even an upstream kernel fix for the issue (cfq-iosched: Fix wrong children_weight calculation). It appeared my investigation had come to a close, so I backported the fix to the latest available CentOS 7 kernel and asked everyone testing the cluster to shove as many jobs down its throat as possible - the cluster smiled happily for a few hours, so I went for a coffee, and then... more failures.
Finally, I decided to start looking into the aacraid module and kdump themselves. I added aacraid to the extra_modules line in /etc/kdump.conf and added the root disk's UUID for good measure, then triggered a panic, hoping the issue was a simple misconfiguration of kdump (something I should have tried from the beginning). Alas, the capture kernel could not find the disk (or the RAID controller, for that matter), so I set kdump to drop to a shell on failure and tried again. Once in the shell I checked dmesg for aacraid messages: the driver loads, but there is no output saying it found any drives or even a controller, and it eventually fails after a long timeout with:
AAC0: adapter kernel failed to start, init status = 0
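For reference, the kdump.conf changes described above look roughly like this (a sketch, not our exact file - the filesystem type and UUID here are placeholders):

```shell
# /etc/kdump.conf - sketch of the configuration described above.
# The UUID below is a placeholder, not a real value.

# Load the Adaptec RAID driver in the capture kernel's initramfs
extra_modules aacraid

# Point kdump at the root disk explicitly by UUID
ext4 UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

# Drop to a shell instead of rebooting when the dump target cannot
# be mounted, so we can poke around (dmesg, lsmod, and so on)
default shell
```

With sysrq enabled, a test panic can be triggered manually with `echo c > /proc/sysrq-trigger` to exercise the capture path.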
It was time to start digging deeper. I looked through the changelogs for the EL7 kernel for anything involving the aacraid module, and it turns out there have been multiple patches and bugfixes over the past couple of years for aacraid not playing well with the reset_devices boot flag. At this point I thought, "What the hell, why not just remove it and see what happens?" I removed reset_devices from the KDUMP_COMMANDLINE_APPEND variable in /etc/sysconfig/kdump, restarted the kdump service (which rebuilt the boot image), and issued another panic.
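The change itself is tiny. In /etc/sysconfig/kdump it looks roughly like this (the surrounding flags are illustrative - the exact default command line varies by release):

```shell
# /etc/sysconfig/kdump - illustrative sketch; your other flags may differ.
# Before:
#   KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 reset_devices"
# After removing reset_devices:
KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1"
```

Restarting the kdump service then rebuilds the capture initramfs with the new command line.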
This time.... SUCCESS! The drive was mounted and the dump began.
After all the extra work I gave myself through misdiagnosis, it was the removal of 13 characters from one file that finally let me see what was causing this box to panic - an issue that almost forced us back to CentOS 6.
As an aside, the root issue really was these RAID cards. After much more investigation and a ton of late nights (during which PagerDuty and I became the best of friends), we were getting nowhere. The RAID card would not only eject drives but would actually become unavailable and bring the entire box down.
We reprovisioned with LSI cards and the nodes haven't had an issue since.