February 14, 2006

On "No DHCP reply received" errors

This is the inaugural entry on Linux issues in my weblog. Though I have never written much about my current work on my website or in my weblog, I will say that I work as a systems administrator for a VLICA (Very Large Industrial Corporation of America). A bit more specifically, I administer a very large set of servers which are used for seismic data processing in the prospecting of oil and gas deposits. I also help administer much smaller clusters that are used for reservoir simulation studies and for visualization. Since all the easy oil and gas in the world has been found and dug up, we in the oil and gas industry now have to work for our $60 per barrel oil and $10 per 1,000 cubic feet of natural gas.

Recently, we at VLICA added 1,000 computers to our seismic exploration cluster. We have used Red Hat Linux from day one of cluster building. To image such a large number of machines, we use a standard Red Hat technique called "Kickstart", whereby you hook a USB floppy drive and a CD drive to a plain Jane server with nothing on it. You power the machine up and tell the BIOS to look for an OS on the floppy drive first, then on the CD-ROM. The floppy holds the instructions on where to find a web server, which in turn serves the image used to install the server. The CD-ROM contains a gigabit Ethernet driver which aids and abets the process of getting the server on the VLICA network.

We were able to image 99% of our servers with no problem. At the very end of our imaging process, I was told by my boss to reimage some older servers which were exactly like the new servers (same model, same manufacturer, using the same network switches with the exact same network port settings), but which had older versions of Red Hat Linux on them. Update them, my boss said. And so I did.

Or so I tried. I hooked up the kickstart floppies and the gigabit Ethernet module CD-ROM to the servers and powered them up. The servers would POST (power-on self-test) and start the imaging process. When it came to getting the machines on the network, I could see the network driver being loaded from the CD-ROM, but I kept getting a strange message on the Linux consoles saying:

pump told us: No DHCP reply received.

As those of you who have ever imaged a Linux machine know, at this point the install process drops to a prompt where you are expected to manually enter the networking information, including the IP address, subnet information and so forth. For those of you who are not familiar with Linux, the kickstart process is a form of automation in which the installer looks for a file called ks.cfg that automatically answers the questions a person would otherwise have to answer during a manual install.

So what to do? Like many sysadmins, I was puzzled by this sour and unexpected turn of events. I naturally went to Google and punched in the term "pump told us" "no DHCP reply received" and started looking at the various mailing lists where this problem had been encountered before. Here is a partial sampling of such pages:

Here at Redhat

Here is one mailing list whose answers I found at several sites. For the record, I will repeat this fellow's advice as it is fairly good advice:

Could be any number of things. Basically, it's telling you anaconda
can't renew the dhcp lease. Examples of things that might cause it:

- Listing the wrong interface in ks.cfg (try eth1, if you have 2 nics)
- portfast being disabled on your switch (causes STP delays past the
anaconda threshold)
- dhcpd not running on dhcpd/pxe server (unlikely if you already
grabbed the initrd)
- not using the correct driver (what is your nic?)
- not having the correct driver on your ramdisk (try initrd-everything)

Just a few of the things that have bitten me over the years in our
pxe/kickstart/nfs build environment.

--
Jason Dixon, RHCE
DixonGroup Consulting
http://www.dixongroup.net

Now then, the ideas and advice given in the links above are all very good and worthwhile. I would highly recommend investigating what they have to say. However, none of these items worked for me. My networking was just the same as on every other server we had imaged, but I went ahead and checked all the network connections anyway: bounced ports, reseated DRAC cards (the servers being imaged were Dell PowerEdge 1855 "blade" servers), and so forth. Again, nothing worked.

Finally I went to my boss. He suggested looking at my boot floppies to see if there was anything wrong with them. Some background: we had ordered such a large number of computers that one idea we had was simply to have the vendor image them for us. Technical issues which I won't get into prevented this idea from being acted on, and we had plenty of experience with imaging large numbers of computers ourselves. So we simply ordered something like 200 boot floppies, and when we were finished imaging one batch of machines, we edited the boot floppies so that they had a new server name on them and reused them. There should have been nothing wrong with the boot floppies I was using.

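But there was. It turned out that the boot floppies I was using had apparently been edited with a Windows program such as WordPad, Notepad, or TextPad. Windows editors end each line with a carriage return followed by a line feed, where Unix and Linux use a line feed alone. Using one of these programs instead of the classic Unix / Linux vi editor on a Linux boot floppy therefore added a carriage return, which vi displays as ^M, to the end of each line of the ks.cfg file, like this: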

zerombr yes^M
clearpart --all --initlabel^M
part raid.01 --size=120 --ondisk=sda^M
part raid.03 --size=8000 --ondisk=sda^M
part raid.05 --size=2000 --ondisk=sda^M
part raid.07 --size=2000 --ondisk=sda^M

Those of you who are familiar with ks.cfg "kickstart" files will recognize the code above as being from a kickstart file. The code above is supposed to reinitialize the partition table on the disk (zerombr), clear out any existing partitions (clearpart), and write new partitions to the first SCSI disk on a server (part). There are four RAID partitions, with their sizes given in megabytes.
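To make the effect of that stray character concrete, here is a small Python sketch. It is only an illustration and makes no claim about how the installer actually parses ks.cfg; the helper last_token is a made-up name. It shows that a reader which strips only the newline ends up with a different final word on a line saved with Windows line endings.

```python
# Illustration only: the same kickstart line saved with Unix and Windows line endings.
unix_line = b"zerombr yes\n"
dos_line = b"zerombr yes\r\n"  # the \r is what vi displays as ^M


def last_token(raw: bytes) -> bytes:
    """Strip only the trailing newline, then return the last space-separated word."""
    return raw.rstrip(b"\n").split(b" ")[-1]


if __name__ == "__main__":
    print(last_token(unix_line))  # the word as intended
    print(last_token(dos_line))   # the same word with a carriage return stuck to it
    print(last_token(unix_line) == last_token(dos_line))
```

The two lines look identical in most Windows editors, yet the second one carries an extra byte at the end of every line.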

It turned out that my boss was correct. These boot floppies had probably been edited with a Windows program and had the ^M characters at the end of each line. That threw the kickstart process for a loop when it came time to look for the CD-ROM, load the Ethernet module, and find a DHCP server to get an IP address so that the machine could get on the network. Incidentally, if you put the boot floppy in a Windows machine, these ^M characters do not show up. They were only visible when I looked at the boot floppy from one of the Linux virtual consoles (the Alt-F2 console, if I remember correctly) during the kickstart process. To see these characters in the ks.cfg file, switch to the virtual console while your kickstart is in progress, mount the floppy drive, change to the directory where the ks.cfg file is located, and open ks.cfg in vi. If the file is affected, you will see ^M at the end of each line. If that happens to you, I would suggest simply getting a fresh boot floppy and creating a new kickstart file with the vi editor. When I did this, voilà! My kickstarts worked perfectly.
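For those who would rather check a ks.cfg for these characters before it goes anywhere near a server, here is a minimal Python sketch. It is not what was used in the fix described above; the function names and the example path are purely illustrative assumptions. It counts the lines that end in carriage return plus line feed and rewrites the file with plain line feeds.

```python
from pathlib import Path


def count_crlf_lines(data: bytes) -> int:
    """Count line endings that are carriage return + line feed."""
    return data.count(b"\r\n")


def to_unix_line_endings(data: bytes) -> bytes:
    """Replace every carriage return + line feed pair with a bare line feed."""
    return data.replace(b"\r\n", b"\n")


def fix_file(path: str) -> int:
    """Rewrite the file in place if needed; return how many lines were affected."""
    target = Path(path)
    data = target.read_bytes()
    affected = count_crlf_lines(data)
    if affected:
        target.write_bytes(to_unix_line_endings(data))
    return affected


if __name__ == "__main__":
    # Hypothetical mount point for the boot floppy; adjust to your own setup.
    ks_path = "/mnt/floppy/ks.cfg"
    if Path(ks_path).exists():
        print(f"{fix_file(ks_path)} line(s) had Windows line endings")
    else:
        print(f"{ks_path} not found")
```

Reading and writing the file as bytes is deliberate: opening it in text mode would let Python translate the line endings on its own and hide the very thing being looked for.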

Good luck and regards,

TMW

Posted by The Mighty Wizard at February 14, 2006 06:27 AM