Friday, February 26, 2010

Splunk & Dynamic Meta Data

Splunk is highly configurable when it comes to consuming data produced by applications or devices. However, one of the concepts I've been trying to get my head around, configuration-wise, is the dynamic assignment of meta data. I still don't have a good working model.

Why? Well, it's very useful to add context to events. For example, if I'm indexing syslog data, it could be very helpful to add key-value pairs such as admin=bob, location=central. When I hit events at search time, I can use these keys without employing any search magic.
> sourcetype=syslog failure | dedup admin

The reason you want to add the keys as meta data, rather than just appending them to the raw event, is that appending messes with the integrity of the event, not to mention its readability.

Prior to 4.x you could get smart with the ***SPLUNK*** header, but it's now been relegated to the sinkhole.
Since I'm in a position to mangle events before they get indexed, I tried appending a line containing the keys and using a transform to extract them as meta data.


[sysadmin]
REGEX = sysadmin=(\w+)
FORMAT = sysadmin::$1
WRITE_META = true
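
For context, a transform like this only takes effect once it is referenced from props.conf, and an index-time field is normally also declared in fields.conf. A minimal sketch, assuming the events carry the sourcetype syslog (as in the search above); the class name "sysadmin" after TRANSFORMS- is just an arbitrary label:

```ini
# props.conf (sketch; assumes sourcetype "syslog")
[syslog]
# Apply the [sysadmin] stanza from transforms.conf at index time.
TRANSFORMS-sysadmin = sysadmin

# fields.conf
[sysadmin]
# Tell the search layer that sysadmin is an indexed field.
INDEXED = true
```

With this in place a search such as sourcetype=syslog sysadmin=bob should be answered from the indexed field rather than from the raw text.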

This worked, however trying to remove the key(s) after extracting them failed dismally.


[clean]
REGEX = (?m)(.*)sysadmin=\w+$
FORMAT = $1
DEST_KEY = _raw

Ironically, both transforms work, but not at the same time. They appear to be mutually exclusive, and I suspect my cleaning transform should be sending its output to another queue.

I also tried to use the new SED command to clean up after the transformation. This didn't work either, since it appears transforms are processed after SED commands, so the key is already gone by the time the extraction runs.
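
To illustrate the kind of SED cleanup meant here, a sketch using the SEDCMD setting in props.conf. The class name strip_sysadmin and the exact expression are assumptions for illustration, not the configuration that was actually tried:

```ini
# props.conf (sketch; class name and expression are hypothetical)
[syslog]
# Remove the appended "sysadmin=<value>" line from the end of the event.
SEDCMD-strip_sysadmin = s/[\r\n]+sysadmin=\w+$//
```

If the SED step really does run ahead of the transforms, the [sysadmin] REGEX would have nothing left to match.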

The Splunk header may be useful here, since the monitor will remove it once it has evaluated it. So you could specify your meta keys in the header, which Splunk itself will ignore, and collect and index them with a transform. The monitor will then remove the line without any further config. Given the header's status this is obviously not an ideal option, but it will do the job.

In my discussions with Splunk-a-nistas I've noticed that they struggle to see why it would be useful to do this at index time, and I've been advised to try all kinds of search-time voodoo to achieve the outcome I'm looking for. I suspect the reason is that people are used to working with other people's data. In this case, however, you sit between the producer and the consumer and have the opportunity to enrich events with additional context.

Comments welcome.

Posted to the Splunk Forum

========= Update 5/03/10 ===========

The scheme works; it just relies on the fact that transforms can be processed sequentially.
The tricky part is the cleanup, where it's quite easy to trip yourself up with the multiline regex.


[clean]
REGEX = (?m)^((.*[\r\n]+)+)key1.*
FORMAT = $1
DEST_KEY = _raw
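
A sketch of how the sequential processing might be expressed in props.conf, assuming the sourcetype syslog and a hypothetical class name "meta". Transforms listed in a single TRANSFORMS- setting are applied in the order given, so the extraction runs before the cleanup rewrites _raw:

```ini
# props.conf (sketch; class name "meta" is an arbitrary label)
[syslog]
# 1. [sysadmin] writes the indexed field from the appended key line.
# 2. [clean] then rewrites _raw without that line.
TRANSFORMS-meta = sysadmin, clean
```

Note that key1 in the [clean] regex stands for the first key on the appended line, so with the [sysadmin] transform above it would read sysadmin. Everything before that line is captured in $1 and written back to _raw.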