Distributed Parsing


Sawmill can improve the performance of log processing (database builds or updates) by distributing log data, in chunks, to parsing servers: special instances of Sawmill that provide parsing services. A parsing server listens on a particular IP address and port for a parsing request, receives log data on that socket, and delivers normalized data back to the main process on the same socket. In effect, a "-a bd" (database build) task can farm out parsing to multiple other processes, potentially on other computers, over TCP/IP. When used with local parsing servers, this can improve performance on systems with multiple processors.

The distributed parsing options can be changed through the web interface in Config -> Log Processing -> Distributed Processing. The section below describes how they are arranged in the internal CFG file for the profile; the meanings of the options in the web interface are analogous.


The log.processing.distributed node in profile .cfg files controls the distribution of log parsing to parsing servers.

The simplest value of log.processing.distributed is:

      distributed = {
        method = "one_processor"
      }

In this case, Sawmill never distributes parsing; it does all parsing in the main -a bd task.
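For reference, a database build might be started from the command line like this (a sketch: the profile name "myprofile" is a placeholder, and the exact executable name varies by platform and installation):

```shell
# Build the database for the profile "myprofile".
# With method = "one_processor", all parsing happens in this one process.
sawmill -p myprofile -a bd
```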

The default value of log.processing.distributed is:

      distributed = {
        method = "all_processors"
        starting_port_auto = "9987"
      }

In this case, on a 1-processor system, Sawmill does not distribute processing; this acts like "one_processor". On an N-processor system, Sawmill spawns N+1 local parsing servers, distributes processing to them, and terminates them when the database build is done. The first server listens on the port specified by starting_port_auto; the next one listens on starting_port_auto+1, and so on.
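The port numbering can be illustrated with a short sketch (illustrative only, not Sawmill code; it assumes a 4-processor system and the default starting_port_auto of 9987):

```shell
# The N+1 auto-spawned servers on an N-processor system listen on
# consecutive ports starting at starting_port_auto.
start=9987   # starting_port_auto
n=4          # number of processors
for i in $(seq 0 "$n"); do
  echo "parsing server $i listens on port $((start + i))"
done
```

With these values, the five spawned servers listen on ports 9987 through 9991.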

If only some processors are to be used for distributed processing, it looks like this:

      distributed = {
        method = "some_processors"
        number_of_servers = "4"
        starting_port_auto = "9987"
      }

This is similar to "all_processors", but spawns only the number of parsing servers specified by the number_of_servers parameter.

The final case is this:

      distributed = {
        method = "listed_servers"
        servers = {
          0 = {
            spawn = true
            hostname = localhost
            port = 8000
          }
          1 = {
            spawn = true
            hostname = localhost
            port = 8001
          }
          2 = {
            spawn = false
            hostname = wisteria
            port = 8000
          }
          3 = {
            spawn = false
            hostname = wisteria
            port = 8001
          }
        } # servers
      } # distributed

In this case, the parsing servers are explicitly listed in the profile. Sawmill spawns those where spawn=true (which must be local), and shuts them down at completion. Those where spawn=false must be started explicitly with "-p {profilename} -a sps -psh {ip} -psp {port}". In this example, two servers are local (servers 0 and 1), and two are remote (on a machine called "wisteria").
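For the listing above, the two non-spawned servers would be started manually on wisteria before the build begins, using the command-line syntax from the source (a sketch: the profile name "myprofile" is a placeholder, and the exact executable name varies by platform and installation):

```shell
# On wisteria: start two parsing servers matching servers 2 and 3 above.
# -a sps runs the parsing-server action; -psh and -psp set the
# listening hostname and port.
sawmill -p myprofile -a sps -psh wisteria -psp 8000 &
sawmill -p myprofile -a sps -psh wisteria -psp 8001 &
```

The main "-a bd" task then connects to all four servers, spawning only the two local ones itself.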

This final case can be used to distribute log processing across a farm of servers.