Why I hate Dell servers:
Every Dell machine which my clients have purchased and paid big money for has caused problems. I’m not very happy with my Dell experience overall. Note that I didn’t choose Dell. I recommended against it. The organization footing the bill chose Dell. I get to install and manage the Dells and get paid for my time. But I would prefer to get paid for time being productive and not fighting the hardware.
On the bright side: Future employers: I have LOADS of experience with Dell hardware and have found workarounds for many of their warts! :)
Now, pardon my rant as I blow off some frustration:
First was the sales process. I don’t want to have to haggle for a week to get a good price. But that’s what we did. And the price came down a fair bit. Probably not as much as the time it cost us though. I don’t want to have to pay extortionate prices for RAM or hard drives either. I hate that a 6 bay hot swap machine comes with blanks instead of drive trays. If you want more trays you have to buy them from Dell with marked up Dell drives. I understand wanting to only support drives known to work but tell me what model number that is so I can get them wherever I want and give me drive trays with the machine. If the machine has 6 drive bays it better come with 6 drive trays in those bays. It’s games like this…
We bought a memory upgrade from Dell for our 2970’s to bring them up to 32G of RAM. After I installed the RAM and rebooted, the computer said the memory configuration was not optimal and prompted me to press F1 to continue. It would then boot up just fine. But I can’t have the servers requiring human intervention for a reboot. So I had to figure out what the problem was. I called Dell support and it turned out that the BIOS did not properly support 32G without a BIOS upgrade.
We were told they supported up to 32G when we bought them but it turns out the BIOS they were shipped with didn’t properly support 32G. So…that’s broken at time of purchase in my book.
Every one of our Dell servers has required a BIOS upgrade. The 610’s would spontaneously reboot after a couple of months in operation at first. They all did it. Then I upgraded the BIOS. Now it has been at least 9 months since that happened and I hope it is cured. Now standard practice is a BIOS upgrade right out of the box. I really don’t expect to ever have to upgrade BIOS in a server. If I do that means it was broken when I bought it. Bugs don’t appear by themselves over time, they are there at time of shipment. Not only that but there is mainboard BIOS firmware, DRAC/BMC firmware, and RAID controller firmware all in need of updating. That’s just too much stuff requiring post-sale fixing.
As for the process of doing the BIOS upgrade there is room for improvement. First, I am happy that there are Linux executables for doing this. It used to be that only DOS binaries were distributed for stuff like this. But the process for obtaining and executing the upgrade is rather obtuse.
The first step is to download the BIOS update. I was given this url by tech support:
http://support.dell.com/support/downloads/download.aspx?c=us&cs=555&l=en&s=biz&r T&osl=en&deviceid=11598&devlib=0&typecnt=0&vercnt=11&catid=-1&impid=-1&formatcnt 362396
Wow. That’s a mess of a url. I don’t like to have to download the BIN file on a desktop or laptop and then scp the file over to the Linux server as it is inconvenient. We don’t run a web browser or any GUI desktop at all on our servers as it is a waste of resources and not best practice. But I pretty much need one to copy and paste that url and navigate the webpage it points to.
It would be nice if Dell provided a simple direct download link. Or at least didn’t wrap the Download button with a javascript function. When I am on my laptop I like to right click the download link and select “Copy link location”, then paste the url into an ssh terminal on the server and pull the binary directly down to it. Currently when I right click the download button and copy the link I get:
javascript:downloadslink('http://ftp.us.dell.com/bios/PE2970_BIOS_LX_4.1.1_1.BIN verDownloadManager.application?c=us&l=en&fileid=362790&fileloc=ftp://ftp.us.dell alse','PE2970_BIOS_LX_4.1.1_1.BIN');
Ugly and unusable. However, from this I can see that the actual path to the file is:
http://ftp.us.dell.com/bios/PE2970_BIOS_LX_4.1.1_1.BIN
So on the server I can do:
# wget http://ftp.us.dell.com/bios/PE2970_BIOS_LX_4.1.1_1.BIN
and download the file directly onto the server.
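Pulling the real URL out of that javascript: wrapper by eye gets old, so it could be scripted. The following is only a minimal sketch: it assumes the copied link always looks like javascript:downloadslink('<url> ...', ...) as in the one quoted above, the function name extract_bios_url is made up for illustration, and the sample link is a shortened copy of the one above.

```shell
#!/bin/sh
# extract_bios_url is a hypothetical helper, not a Dell tool.
# It reads a copied "javascript:downloadslink('...')" link on stdin and
# prints the text between the opening quote and the first space or quote,
# which in the link above is the direct download URL.
extract_bios_url() {
    sed -n "s/^javascript:downloadslink('\([^ ']*\).*/\1/p"
}

# Sample input: a shortened copy of the link quoted above.
link="javascript:downloadslink('http://ftp.us.dell.com/bios/PE2970_BIOS_LX_4.1.1_1.BIN verDownloadManager.application?c=us','PE2970_BIOS_LX_4.1.1_1.BIN');"

url=$(printf '%s\n' "$link" | extract_bios_url)
echo "$url"

# On the server the result could then be passed straight to wget:
#   wget "$url"
```

If Dell changes the shape of the link the pattern simply prints nothing, so an empty result means it is time to look at the link by hand again.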
Much more convenient. I can even type that by hand without copy and paste if I really have to. The firmware upgrade executables never work on CentOS. This is a gratuitous limitation since it is functionally the same as RHEL. I can usually just change one line in the shar file and make it work but I shouldn’t have to.
When I execute this BIN file it produces an error indicating that it wants another program called lockfile to be installed on the system. It took me a while to remember this program. I had seen it before somewhere. Turns out it is part of the procmail mail filtering program which we do not normally install onto our servers. Most people shouldn’t be installing that unless they need it as part of a mail server. I had to install it to get the file to run.
Then I find that I also have to install compat-libstdc++-33-3.2.3-47.3.i386.rpm
but at least the BIN file gives me a useful error directing me to install it. This is only needed for executables compiled against the old C++ library. Moving to the newer one (why wouldn’t they just use straight C for a firmware installer?) would remove a barrier to getting the firmware update done.
This is pretty sweet:
Continue? Y/N:y
Executing update...
WARNING: DO NOT STOP THIS PROCESS OR INSTALL OTHER DELL PRODUCTS WHILE
UPDATE IS IN PROGRESS.
THESE ACTIONS MAY CAUSE YOUR SYSTEM TO BECOME UNSTABLE!
.../tmp/PE2970_BIOS_LX_4.1.1_1.BIN-6001-9159/./UpdRollBack: error
while loading shared libraries: libxml2.so.2: cannot open shared
object file: No such file or directory
Oops…looks like it is complaining that it can’t find libxml2.so.2 so I guess there is some XML nuttiness in this firmware somewhere. Installing libxml2 with yum resolved that.
Then the firmware update installed and I rebooted. Yay.
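The three stumbling blocks above (lockfile, the old C++ library, libxml2) could be checked for before running the BIN file rather than discovered one failure at a time. This is a rough sketch, not something from Dell: check_bios_prereqs is a made-up name, and the assumption that compat-libstdc++-33 is the package providing libstdc++.so.5 is mine.

```shell
#!/bin/sh
# check_bios_prereqs is a hypothetical pre-flight check, not part of the
# Dell update package. It prints one line per missing prerequisite and
# exits non-zero if anything is missing.
check_bios_prereqs() (
    missing=0

    # lockfile comes from the procmail package.
    if ! command -v lockfile >/dev/null 2>&1; then
        echo "missing: lockfile (install procmail)"
        missing=1
    fi

    # Ask the dynamic linker cache about the two shared libraries.
    # libstdc++.so.5 is assumed to be what compat-libstdc++-33 provides.
    cache=$(ldconfig -p 2>/dev/null || /sbin/ldconfig -p 2>/dev/null || true)
    for lib in libxml2.so.2 libstdc++.so.5; do
        if ! printf '%s\n' "$cache" | grep -qF "$lib"; then
            echo "missing: $lib"
            missing=1
        fi
    done

    exit "$missing"
)

check_bios_prereqs || echo "install the missing pieces before running the BIN file"
```

One caveat: the check only looks for the library names and does not tell a 32-bit copy from a 64-bit one, so on a 64-bit install it can report success while the i386 libraries the updater wants are still absent.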
So that covers firmware.
The RAID card management tools leave MUCH to be desired as well. As far as I can tell, the MegaCli package is the way to manage the PERC from the command line in Linux. To work with it you have to hunt down the MegaCli-1.01.39-0.i386.rpm tools since the tools are proprietary to LSI and don’t ship with RHEL.
[omstorage stuff is the right way to do this but that isn’t clear at first]
Then you RPM install it and go looking for the software it installed. MegaCli is rarely used. Only when setting up disks. They didn’t call it megacli or something I might remember. They called it MegaCli64 (case sensitive) which is installed in /opt/MegaRAID/MegaCli/MegaCli64.
Then you have to figure out how to use it.
# /opt/MegaRAID/MegaCli/MegaCli64
Fatal error - Command Tool invoked with wrong parameters
hmm…ok
# /opt/MegaRAID/MegaCli/MegaCli64 --help
Invalid input at or near token -
hmmm
# /opt/MegaRAID/MegaCli/MegaCli64 -h
Whoa! This gets you a massive amount of cryptic command line options with no explanation as to their purpose. I have pasted the output here:
This is their idea of “help”. I’m a command line commando of 20+ years and this scares even me! It would have been nice if they at least tried to make it work somewhat like the Linux mdadm command or at least provided some examples of common use cases etc. Because of the oddity of this command various people out on the net have compiled “cheat sheets” to help poor souls like me figure out how to use this thing:
Usually I avoid using this command and just reboot the server into the BIOS and configure the RAID card from there but often it is not a convenient time for a server reboot. I also avoid it because it is so complicated and one wrong command can lose all of the data in the server. Yes, there are backups, but I would really rather not have to restore them.
I needed to add a couple of disks on the fly and did not want to reboot. The command line I seemed to need and the response it gave me were:
# /opt/MegaRAID/MegaCli/MegaCli64 -CfgLdAdd -r0 [32:4] -a0
Adapter 0: Configured the adapter!!
Not a very reassuring response. Configured it how with what? It would be nice if it said “Added virtual disk number 4 as a RAID 0” since that is what that command told it to do.
Using the command: /opt/MegaRAID/MegaCli/MegaCli -LDInfo -Lall -aALL
I was able to verify that it had in fact created virtual disk number 4 as a RAID 0. However, I didn’t have a file to work with in /dev representing the disk. The operating system simply refused to see the disk so that I could actually do something with it. I spent some time trying to figure out why but couldn’t come up with a solution. So I called tech support.
Dell tech support people are always friendly and, thankfully, seem to be US based. That is a big help when the tech support person and I are yelling instructions at each other over a noisy datacenter on a mobile phone. They don’t always have the solution, though. In this case with the RAID controller I had added a disk and was trying to make it usable/visible to the OS. The guy first guessed that I needed to partition the disks. I explained that the disks were not visible to the OS to be partitioned. Then he guessed at some MegaCli commands which were not useful. Eventually I had to get off the phone and head out for an appointment. Later I got an email explaining that he had the solution: I needed to run partprobe. That command finds partitions. You can’t find partitions on disks which you can’t see. Way off the mark. Eventually it became more convenient to reboot the server. So that is what I did and the disks appeared. Problem solved, sort of. Although with this hot swap stuff it really should be possible to add disks on the fly. That’s the whole point.
Speaking of RAID controllers, we have a pair of identical R410’s. And they BOTH consistently produce these errors:
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
They produce these errors at a rate of around 10 per day throughout the day. Both machines produce the exact same error. Same hex codes, etc. Identical. I don’t think it is actually a drive failing because the chances of both machines failing at the same time in exactly the same way are slim. One of these machines had what looked like a RAID controller crash which lost data and didn’t do our filesystems any good.
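To put a number on "around 10 per day" and to compare the two machines, the log lines can be tallied by error code. A small sketch follows, assuming the messages look exactly like the ones quoted above; count_mptbase is a hypothetical name.

```shell
#!/bin/sh
# count_mptbase is a hypothetical helper. It reads syslog text on stdin and
# prints "<count> <LogInfo code> <Code text>" for each distinct mptbase error.
count_mptbase() {
    sed -n 's/.*mptbase: ioc[0-9]*: LogInfo(\(0x[0-9a-fA-F]*\)).*Code={\([^}]*\)}.*/\1 \2/p' |
        sort | uniq -c | sed 's/^ *//'
}

# Sample input: the six lines quoted above.
count_mptbase <<'EOF'
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
EOF
```

On a real machine the input would come from the system log instead, for example count_mptbase < /var/log/messages (that path is the usual RHEL/CentOS location, but check where your syslog actually writes kernel messages).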
Whenever I call Dell tech support I always wonder why it is that Dell’s phone system always asks me for the long service code number instead of the shorter service tag which is just the base-36 encoding (therefore much shorter) of the service code. Sometimes I have one but not the other on hand. They are clearly the same thing. Lots of people have even put up little webpages (which I have used) that will convert from one to the other for you:
Why would they ever ask for or deal in the long version and make me yell it at them over a mobile phone in a noisy datacenter?
Then the next person I talk to wants the service tag again even though I just told the phone system the service code.
Then the NEXT person wants to confirm the service tag.
At least they tend to understand the ICAO phonetic alphabet so we don’t have to haggle over whether I said b, c, d, e, g, p, t, v, z or 3.
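Since the service tag is just the service code written in base 36, converting between the two needs nothing more than shell arithmetic. Here is a sketch of both directions, assuming the digits 0-9 followed by A-Z and a shell with 64-bit arithmetic; the function names are made up for illustration.

```shell
#!/bin/sh
DIGITS=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ

# tag_to_code TAG: print the decimal service code for a base-36 service tag.
tag_to_code() (
    tag=$(printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]')
    [ -n "$tag" ] || { echo "empty service tag" >&2; exit 1; }
    code=0
    while [ -n "$tag" ]; do
        rest=${tag#?}                  # everything after the first character
        ch=${tag%"$rest"}              # the first character itself
        prefix=${DIGITS%%"$ch"*}       # digits that come before ch
        # If ch is not one of the 36 digits, nothing was stripped.
        [ "$prefix" != "$DIGITS" ] || { echo "invalid character: $ch" >&2; exit 1; }
        code=$((code * 36 + ${#prefix}))
        tag=$rest
    done
    echo "$code"
)

# code_to_tag CODE: print the base-36 service tag for a decimal service code.
code_to_tag() (
    # Strip leading zeros so the shell does not read the number as octal.
    code=$(printf '%s\n' "$1" | sed 's/^0*\(.\)/\1/')
    case $code in
        ''|*[!0-9]*) echo "not a decimal number: $1" >&2; exit 1 ;;
    esac
    tag=
    while :; do
        digit=$(printf '%s\n' "$DIGITS" | cut -c "$((code % 36 + 1))")
        tag=$digit$tag
        code=$((code / 36))
        [ "$code" -gt 0 ] || break
    done
    echo "$tag"
)

tag_to_code 10    # two base-36 digits: 1*36 + 0, prints 36
code_to_tag 36    # prints 10
```

A real seven-character tag works the same way; it just produces a much longer number, which is exactly the one the phone system wants read out.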
I hate those pointless bezels that come with the machines. I try not to pay the small amount of extra money for them anymore because they just go in a pile. These machines sit in a datacenter, not a showroom.
Apparently there are at least two different kinds of DRAC: iDRAC Enterprise and iDRAC Express. I suspect they are exactly the same hardware, perhaps with different licensing or firmware.
My machines have iDRAC Express. iDRAC used to be something called BMC. Not sure why they changed the name. The iDRAC stuff is nice. It took me a while to get around to learning how to use it but it is worthy. Reminds me of some old systems I had worked with in the past such as Sun, HP, and even Pyramid which had service processors. I have long awaited the day that x86 servers got this feature.
However, it has some weird limitations and is expensive compared to the latest stuff from Supermicro. For example, it is odd that iDRAC Enterprise supports public key auth and Express does not. The DRAC is a little processor (MIPS or ARM on most platforms) running Linux or Busybox. Why not support public key? We do everything with ssh keys. Without public key auth I have another password to worry about.
A Java applet for the console in the DRAC web interface? With all of the 0-day exploits for the JVM they want me to have the Java plugin enabled in my browser? Why can’t I just VNC? Or RDP? Tunnel it over SSL or ssh if you must. The Java app is flaky. The JVM says “Downloading application” …after a couple of minutes that window will go away and be replaced by a window which says “Unable to launch the application.” It has “Ok” and “Details” as menu options. If I click Details it says “Error: Malformed reply from SOCKS server” and shows a window full of XML. This happens sometimes. Restarting my browser doesn’t help. Hmm…I tunnel all of my web browsing through a SOCKS proxy with ssh -D. I have an exception for the 10/8 network which doesn’t get proxied. That works for accessing the DRAC web interface itself. But the console Java applet is apparently somehow trying to use the proxy and failing. If I disable the proxy in Firefox the console applet works again. I would really rather just VNC…
Once up and running, the arrow keys don’t work in the DRAC Java console. This is a real problem when navigating the BIOS and configuring things on, say, the ESX console. Turns out that you have to do some work to get them working:
- http://ceph.github.com/sepia/drac/remote-console-keys/
- http://www.anchor.com.au/blog/2011/03/evil-hack-to-make-arrow-and-sysreq-keys-work-with-a-dell-idrac-kvm-and-linux-desktop/
This last url says:
The KVM software makes a connection back to the iDRAC on the standard VNC port (5900) (with the single use credentials that were provided to it in the .jnlp file).
At this point, you could easily be mistaken into thinking, “Ah, VNC, that’s got to work well right. Such a simple thing and all“. Unfortunately you would be mightily wrong :( .
Whilst the iDRAC is using the standard VNC port, it appears that the implementation has been somewhat customised.
So this is all based on VNC but Dell took standard VNC and fsckd with it! :(
All of the Dell DRAC SSL certs are the same with the same serial number. This causes Firefox to freak out and not accept them. The workaround is to delete cert8.db (Firefox’s stored cert cache) and restart Firefox.
The virtual media functionality in the DRAC doesn’t seem to work properly in Safari. I click virtual media and the little window where I can mount the media never shows up. Works ok in Firefox/Linux.
Sometimes the DRAC web interface gets confused and all of the menu items become labeled “undefined”. Have to clear cache and try again and it works.
I’ve been using the Dell DRAC console quite a bit lately to remotely install OS etc. It has terrible stuck-key/repeat problems. Typing slowly helps but quite often it is simply impossible to enter a 10-character password. Others have had this problem:
This is mostly a function of network latency and the fact the protocol sends keydown/keyup messages. So if you get some latency longer than the interval between your keydown/keyup the keyboard auto-repeat starts.
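To put rough numbers on that explanation, here is a toy model of what the remote side sees. It is purely illustrative: chars_seen is a made-up function, and the 500 ms repeat delay and 33 ms repeat interval in the example calls are assumed values, not anything measured on a DRAC.

```shell
#!/bin/sh
# chars_seen is a hypothetical model, not anything taken from the DRAC.
# Arguments (all in milliseconds):
#   $1 how long the key was really held
#   $2 network delay of the keydown message
#   $3 network delay of the keyup message
#   $4 auto-repeat delay on the remote side    (assumed value)
#   $5 auto-repeat interval on the remote side (assumed value)
# Prints how many characters the remote side would register.
chars_seen() (
    apparent=$(( $1 + $3 - $2 ))   # hold time as the remote side sees it
    if [ "$apparent" -lt "$4" ]; then
        echo 1
    else
        # the first repeat fires at the repeat delay, then one per interval
        echo $(( 2 + (apparent - $4) / $5 ))
    fi
)

chars_seen 80 20 20 500 33     # steady latency: the keyup arrives in time
chars_seen 80 20 600 500 33    # the keyup is held up by a latency spike
```

In this model the same 80 ms keypress comes out as one character on a steady link and as several when the keyup message alone is delayed, which matches the way a password turns into a row of repeated letters.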
I added a Dell MD1120 disk array to an R415 with redundant external SAS connections…BIOS complained:
Number of devices exceeded the maximum limit of the devices per quad
Please remove the extra drives and reboot system to avoid losing data
System has halted due to unsupported configuration.
Firmware upgrade fixed it. So again, it was broken when we bought it.
I was asked to evaluate the Dell H700 controller with Self-Encrypting Disks for an organization with very serious security requirements. You key the controller and then the controller keeps the key forever until manually cleared. So if someone steals the server they can boot it right up and get the data. You can’t configure it to lose the key on power-off and require re-entry. I called Dell and they took a couple of days to find the right people internally to ask and confirmed that this is the case. Not very useful. This is only useful if the server and the disks are separated.
I was trying to set up a new Dell R620. Got it racked, started to configure the DRAC, couldn’t ping it. Double checked everything, cabling, switch port, on the right VLAN, went through all of the DRAC options. Noticed a little message slyly hidden among the various config options:
NIC Selection: Dedicated: A required license is missing or expired.
WTF? Dell sold us hardware (NIC on the DRAC) which is completely useless without an additional license?
WHY!?!?!
Dell DRACs use Java and have vulnerable OpenSSL implementations, which is driving our security auditors nuts. And we can’t fix it other than to turn off the web interface because Dell doesn’t yet have a fix. Turning off the web interface is fine as I prefer to use ipmitool and serial over LAN but a lot of my clients don’t want to do that. So they run vulnerable SSL and have Java enabled in all of their browsers, including those of the most security sensitive employees with the highest levels of access. “What could go wrong?”
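For anyone who has not used the ipmitool route, a serial over LAN session looks roughly like the sketch below. It assumes IPMI over LAN is enabled on the DRAC and the serial console is redirected. The wrapper name drac_sol, the address, and the user name are placeholders of mine.

```shell
#!/bin/sh
# drac_sol is a hypothetical wrapper around ipmitool's serial-over-LAN mode.
# Usage: drac_sol HOST USER
# The password is read from the IPMI_PASSWORD environment variable (-E)
# so it does not show up in the process list.
# Set DRY_RUN=1 to print the command instead of running it.
drac_sol() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo ipmitool -I lanplus -H "$1" -U "$2" -E sol activate
    else
        ipmitool -I lanplus -H "$1" -U "$2" -E sol activate
    fi
}

# The address and user name below are placeholders.
DRY_RUN=1 drac_sol 10.0.0.5 root
```

No browser, no Java, and it works fine through an ssh session to a jump host.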
My Supermicro gear is SO much simpler. I’ve never upgraded BIOS or had IPMI or RAID problems on any of them. It just works.