As Splunk PS consultants, we often perform so-called health checks, in which we examine a customer’s Splunk installation and document opportunities for optimization. One issue that comes up in almost every health check is the sub-optimal configuration of source types. The source type configuration defines very important aspects of how Splunk processes incoming data, including line breaking (i.e. splitting large blocks of data into individual events) and timestamp recognition. Splunk ships with default settings that try to cover these aspects as well as possible, even without explicit configuration. However, there is always a risk that the defaults do not handle every special case in the logs correctly. In addition, the defaults are designed to apply to as many cases as possible, which is anything but performance-optimized. Research at Splunk has shown that, in some cases, a correct source type configuration reduced the processing time for a defined amount of data by almost 75%!
Every source type in a Splunk installation should have a corresponding configuration. The following parameters represent best practices for defining source types, more precisely for configuring line breaking and timestamp recognition; additional parameters for other aspects are of course possible. In this article we present the basic parameters; in the following articles we will show how to handle special log formats using concrete examples.
The first 6 parameters are set in props.conf on the Splunk instance that performs the parsing phase, which in most cases is an indexer. If the data passes through a heavy forwarder (HF) on its way from the source to the indexer, the parsing phase is usually performed on the HF. In this case, the corresponding configuration must be rolled out on the HF and not on the indexer:
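TIME_PREFIX: A regular expression that matches the text immediately before the timestamp that is to be used as _time.
MAX_TIMESTAMP_LOOKAHEAD: The number of characters after the TIME_PREFIX match in which Splunk looks for the timestamp; in practice, the length of the timestamp.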
TIME_FORMAT: The format of the timestamp in the strptime() notation (see https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Commontimeformatvariables).
SHOULD_LINEMERGE = false. Should always be set to false unless there are compelling reasons to enable it.
LINE_BREAKER: A regular expression that specifies the boundary between two events. Default: ([\r\n]+), i.e. one or more line breaks.
TRUNCATE = 999999. Protects long events from being cut off. The value 0 prevents truncation completely.
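As a minimal sketch of how these six parameters fit together, assume a hypothetical source type acme:app:log whose events are single lines that start with a timestamp such as 2024-01-15 10:23:45.123 +0100. The stanza name and the log format are assumptions made purely for illustration; the values have to be adapted to the actual data.

```ini
# props.conf on the parsing instance (indexer or heavy forwarder)
# Hypothetical source type; assumed event format:
#   2024-01-15 10:23:45.123 +0100 INFO Starting service
[acme:app:log]
# The timestamp sits at the very beginning of each event
TIME_PREFIX = ^
# "2024-01-15 10:23:45.123 +0100" is 29 characters long
MAX_TIMESTAMP_LOOKAHEAD = 29
# %3N = milliseconds, %z = numeric time zone offset
TIME_FORMAT = %Y-%m-%d %H:%M:%S.%3N %z
SHOULD_LINEMERGE = false
# Every line is its own event
LINE_BREAKER = ([\r\n]+)
TRUNCATE = 999999
```

For multiline events in the same assumed format, LINE_BREAKER could be tightened to something like ([\r\n]+)\d{4}-\d{2}-\d{2}\s so that a break only happens before a line that starts with a date. Only the text matched by the first capture group is dropped; the date that follows it stays part of the next event.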
The next 2 parameters are set in props.conf on the Splunk instance that executes the input phase, which in most cases is a universal forwarder:
EVENT_BREAKER_ENABLE: Enables event boundary detection on the universal forwarder (6.5 and later). As soon as it is active (per source type), the forwarder can switch to the next indexer during load balancing, even if it has not yet detected an end of file.
EVENT_BREAKER: A regular expression that specifies the boundary between two events. Default: ([\r\n]+), i.e. one or more line breaks.
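On the universal forwarder, the matching stanza might look like the following sketch. The source type name acme:app:log is again hypothetical, and the example assumes single-line events; for multiline data, EVENT_BREAKER would normally use the same pattern as the LINE_BREAKER configured for the parsing phase.

```ini
# props.conf on the universal forwarder (6.5 and later)
# Hypothetical source type with single-line events
[acme:app:log]
# Let the forwarder detect event boundaries for this source type
EVENT_BREAKER_ENABLE = true
# Same pattern as the LINE_BREAKER used in the parsing phase
EVENT_BREAKER = ([\r\n]+)
```

Keeping the two expressions identical is a design choice that helps ensure the forwarder only switches indexers at a point that the parsing phase would also treat as an event boundary.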
For very long multiline events, the following parameter helps to control the desired number of lines (again, in the parsing phase):
MAX_EVENTS: Maximum number of lines in a multiline event; Splunk breaks the event once this number of lines has been read. Default: 256.
So we are on the safe side if we start the configuration of each source type with the following template: