Scalable Cluster Administration - Chiba City Approach and Lessons Learned
John-Paul Navarro, Narayan Desai, Remy Evard, Dan Nurmi
When systems administration activities need to be performed hundreds or thousands of times, cluster administrators look for ways to automate them. For example, automating Linux installation and configuration often involves using network services like DHCP and TFTP to configure network interfaces and deliver a boot image, and protocols like FTP, HTTP, and NFS to serve files used in the build and configuration process. By combining these and other network services with remote power and console control, cluster administrators can automate many administrative tasks. Scalable Cluster Administration addresses the challenge: what architectures and techniques can cluster designers use to automate cluster administration on very large clusters? In this paper we describe an approach we used in the Math & Computer Science Division of Argonne National Laboratory on Chiba City, a 314-node Linux cluster. We analyze the scalability, flexibility, and reliability benefits and limitations of our approach, and present ideas we are exploring to address these limitations.