Sunday, 22 July 2007

Feature/admin: Cluster file system support for server farms

Completed implementation and began production testing of cluster file system synchronisation support for server farm architectures such as Fourmilab's. Cluster support is implemented in the new {\tt Cluster} module, through functions such as {\tt clusterCopy}, {\tt clusterDelete}, {\tt clusterMkdir}, etc. Immediately after a database file or directory is modified (for example, after the {\tt close()} when writing back a file), the corresponding cluster function is called with the full path name of the modified file. This in turn calls {\tt enqueueClusterTransaction} with the specified operation and path name, which creates one or more synchronisation transaction files in the {\tt ClusterSync} directory, within subdirectories bearing the names of the servers defined in ``Cluster Member Hosts''. (Transactions are never queued for the server executing the transaction, nor for servers named as cluster members for which no server subdirectory exists. This allows you to have identical directory structures on all servers, or to exercise fine-grained control over which servers are updated automatically [for example, if you wish to reserve one server for testing new releases and not have changes made on it propagated back to the production server].)
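In outline, the hand-off from the {\tt Cluster} functions to the queueing machinery looks something like the following sketch. The host list, directory name, and the {\tt writeTransactionFile} helper are illustrative placeholders rather than the actual code; the transaction file format is sketched after the next paragraph.

    use strict;
    use warnings;
    use Sys::Hostname;

    #   Hypothetical settings; the real application obtains these from
    #   "Cluster Member Hosts" and "Host System Properties".
    my @clusterHosts = ('server0', 'server1');
    my $syncDir = '/server/pub/hackdiet/ClusterSync';

    #   clusterCopy  --  Queue propagation of a modified file to cluster members
    sub clusterCopy {
        my ($path) = @_;
        enqueueClusterTransaction('copy', $path);
    }

    #   enqueueClusterTransaction  --  Queue a transaction for each eligible host
    sub enqueueClusterTransaction {
        my ($operation, $path) = @_;
        my $self = hostname();
        foreach my $host (@clusterHosts) {
            next if $host eq $self;             # Never queue for this server
            next unless -d "$syncDir/$host";    # Skip members with no queue directory
            writeTransactionFile("$syncDir/$host", $operation, $path);
        }
    }

    #   writeTransactionFile  --  Placeholder; naming is sketched below
    sub writeTransactionFile {
        my ($queueDir, $operation, $path) = @_;
        print "Queue $operation $path in $queueDir\n";
    }

    clusterCopy('/server/pub/hackdiet/Users/sample.hdb');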

Synchronisation transaction files are named with the current date and time to the microsecond; a journal sequence number, incremented for each transaction generated during a given execution of the CGI application (to preserve transaction order in case the time does not advance between two consecutive transactions); and, for easy examination of the synchronisation directory, the operation and path name, the latter with slashes translated to underscores. The contents of the transaction file are a version number, the operation, and the full path name.
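Here is a sketch of how such a transaction file might be generated. The precise name layout and the {\tt writeTransactionFile} helper are assumptions for illustration; the essential properties are a fixed-width, lexically-sortable time stamp plus journal number, and the version/operation/path contents.

    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday);

    my $journal = 0;            # Journal sequence number for this execution

    #   writeTransactionFile  --  Create one synchronisation transaction file
    sub writeTransactionFile {
        my ($queueDir, $operation, $path) = @_;
        my ($sec, $usec) = gettimeofday();
        my @t = localtime($sec);
        #   Fixed-width time stamp, so a lexical sort yields time order
        my $stamp = sprintf('%04d-%02d-%02d_%02d%02d%02d.%06d',
                            $t[5] + 1900, $t[4] + 1, $t[3],
                            $t[2], $t[1], $t[0], $usec);
        (my $flat = $path) =~ s:/:_:g;      # Slashes become underscores
        my $tname = sprintf('%s/%s_%06d_%s%s',
                            $queueDir, $stamp, $journal++, $operation, $flat);
        open(my $tf, '>', $tname) or die "Cannot create $tname: $!";
        print $tf "1\n$operation\n$path\n"; # Version, operation, full path name
        close($tf) or die "Error writing $tname: $!";
    }

    writeTransactionFile('/tmp', 'copy', '/server/pub/hackdiet/Users/sample.hdb');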

Actual synchronisation is accomplished by a separate, stand-alone program, {\tt ClusterSync.pl}, which runs under user and group {\tt apache}, the owner of the {\tt ClusterSync} transaction directory and its contents. This program is started automatically from the {\tt init} script and runs as a daemon, saving its process ID in a {\tt ClusterSync.pid} file in the {\tt ClusterSync} directory.
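The start-up might be sketched as follows, assuming conventional Unix daemonisation; the actual program may differ in detail.

    use strict;
    use warnings;
    use POSIX qw(setsid);

    my $pidFile = '/server/pub/hackdiet/ClusterSync/ClusterSync.pid';

    #   daemonise  --  Detach from the terminal and record the process ID
    sub daemonise {
        my $pid = fork();
        die "Cannot fork: $!" unless defined $pid;
        exit(0) if $pid;                    # Parent exits; child continues
        setsid() or die "Cannot create session: $!";
        chdir('/');
        open(STDIN,  '<', '/dev/null');
        open(STDOUT, '>', '/dev/null');
        open(STDERR, '>', '/dev/null');
        open(my $pf, '>', $pidFile) or die "Cannot write $pidFile: $!";
        print $pf "$$\n";                   # PID the CGI uses to send SIGUSR1
        close($pf);
    }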

When a synchronisation transaction is queued, the CGI program sends a {\tt SIGUSR1} signal to the {\tt ClusterSync.pl} process, which then traverses the server subdirectories, sorting the transactions into time and journal number order, and attempts to perform the operations they request. Synchronisation operations are performed by executing {\tt scp} and {\tt ssh} commands directed at the designated cluster host, which must be configured to permit public key access by user {\tt apache} without a password. If the synchronisation operation fails with a status indicating that the destination host is down or unreachable, the host is placed in a \verb+%failed_hosts+ hash with a timeout value of ten minutes from the time of failure. Synchronisation operations for that host will not be attempted until the timeout has expired, which prevents flailing away in vain trying to contact a down host over and over, possibly delaying synchronisation of other cluster members which are accessible. In the absence of a signal indicating newly-queued transactions, {\tt ClusterSync.pl} sweeps the transaction directory every five minutes to check for transactions queued for failed hosts which should now be retried due to expiry of the timeout.
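The daemon's main loop can be sketched along the following lines. This is a deliberately simplified rendering: only the copy and delete operations are shown, and any non-zero {\tt scp}/{\tt ssh} status is treated as a down host, whereas the actual program examines the status to distinguish unreachable hosts from other failures.

    use strict;
    use warnings;
    use File::Basename;

    my $syncDir   = '/server/pub/hackdiet/ClusterSync';
    my $retryTime = 10 * 60;        # Failed host retry timeout: ten minutes
    my $sweepTime = 5 * 60;         # Periodic sweep interval: five minutes

    my %failed_hosts;               # Host name => earliest retry time
    my $wakeup = 0;
    $SIG{USR1} = sub { $wakeup = 1; };  # CGI signals newly-queued transactions

    while (1) {
        sleep($sweepTime) unless $wakeup;   # SIGUSR1 interrupts the sleep
        $wakeup = 0;
        foreach my $hostDir (grep { -d } glob("$syncDir/*")) {
            my $host = basename($hostDir);
            next if exists($failed_hosts{$host}) &&
                    (time() < $failed_hosts{$host});    # Still in timeout
            delete($failed_hosts{$host});
            #   Transaction names sort lexically into time and journal order
            foreach my $t (sort(glob("$hostDir/*"))) {
                open(my $tf, '<', $t) or next;
                chomp(my ($version, $operation, $path) = <$tf>);
                close($tf);
                next unless defined($path); # Malformed transaction file
                my $status;
                if ($operation eq 'delete') {
                    $status = system('ssh', $host, 'rm', '-f', $path);
                } else {                    # copy; other operations elided
                    $status = system('scp', '-Bpq', $path, "$host:$path");
                }
                if ($status != 0) {         # Treat failure as host down
                    $failed_hosts{$host} = time() + $retryTime;
                    last;
                }
                unlink($t);                 # Transaction complete: dequeue
            }
        }
    }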

All of the directory names, signal, and timeout values given above are specified by items in the ``Host System Properties'' section of the configuration; I have given the default settings, which should be suitable in most circumstances.

You can check whether two cluster hosts are synchronised by logging into one host, say {\tt server1}, and then running a command like:

    rdist -overify -P /usr/bin/ssh -c \
        /server/pub/hackdiet \
        server0:/server/pub/hackdiet

This will report any discrepancies between the database directory trees on the two servers. If the servers are synchronised, you should see only a ``need to update'' message for the {\tt ClusterSync/ClusterSync.pid} file, plus any synchronisation transactions queued for failed servers awaiting retry. This operation is non-destructive and requires only read access to the database directory.
