Saturday, September 4, 2010

Data Visualization

One of the most useful ways to communicate or understand information is to visualize it. In security this is vital given the sheer volume of data you often have to contend with, as the work of the secviz community demonstrates.

Splunk enables you to quickly find and parse data, but you often need to send the results to something like Graphviz so that you can actually see them.

I've written an app that adds a new search command to Splunk that will enable you to generate a graph from your search data.

* | viz field1=ip1 field2=ip2 label=proto flatten=true file=/tmp/network1.png
* | viz field1=ip1 field2=ip2 label=proto flatten=true file=/tmp/network2.png rankdir=RL

One of the nice features is that you can pass Graphviz options directly to the command, e.g. rankdir=RL.
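This is not the app's actual source, but a minimal sketch of the idea behind such a command: take tabular search results and emit Graphviz DOT text, which `dot -Tpng` can then render. The field names (ip1, ip2, proto) mirror the example searches above; the function name and signature are my own invention.

```python
# Sketch (not the app's code): turn rows of search results into
# Graphviz DOT text. Extra options such as rankdir are passed
# straight through to the graph, as the viz command allows.

def rows_to_dot(rows, field1, field2, label=None, flatten=True, rankdir=None):
    """Build a DOT digraph from a list of dict rows."""
    lines = ["digraph network {"]
    if rankdir:
        lines.append("  rankdir=%s;" % rankdir)  # pass-through Graphviz option
    seen = set()
    for row in rows:
        edge = (row[field1], row[field2], row.get(label) if label else None)
        if flatten and edge in seen:  # flatten=true: drop duplicate edges
            continue
        seen.add(edge)
        attrs = ' [label="%s"]' % edge[2] if edge[2] else ""
        lines.append('  "%s" -> "%s"%s;' % (edge[0], edge[1], attrs))
    lines.append("}")
    return "\n".join(lines)

rows = [
    {"ip1": "10.0.0.1", "ip2": "10.0.0.2", "proto": "tcp"},
    {"ip1": "10.0.0.1", "ip2": "10.0.0.2", "proto": "tcp"},  # duplicate, flattened away
]
print(rows_to_dot(rows, "ip1", "ip2", label="proto", rankdir="RL"))
```

Writing the file is then just a matter of piping the DOT text through the `dot` binary with the requested output path.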

The app can be obtained via the link above; comments and patches are welcome.

Monday, July 26, 2010

Design Detection Heuristics

Benford's law provides a useful heuristic for detecting data that has been produced by a person. This is very useful for detecting fraud, tampering, vote rigging and other activities where one needs a little help. It appears, though, that applying Benford's law is more of an art than a science: rather than being the smoking gun one would like, it serves as a starting point for an investigation or a trigger for caution.

I've developed a Splunk app that adds a new command to the Splunk search language to calculate the first-digit distribution, which can then be used to graph the field of interest.

* | benford field=price | table digit price benford

Other digits can be selected as follows:

* | benford field=price digit=2 | table digit price benford

Here are some sample transactions I generated.

The benford command calculates the distribution of the first digit and produces a table, which can be graphed.
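The calculation itself is simple enough to sketch. This is illustrative code rather than the app's source, and the sample prices are made up; Benford's law predicts that the leading digit d appears with probability log10(1 + 1/d).

```python
import math
from collections import Counter

# Sketch of the first-digit calculation behind the `benford` command
# (illustrative, not the app's source). For each digit 1..9 it reports
# the observed fraction alongside the Benford expectation log10(1 + 1/d).

def first_digit_distribution(values):
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v]
    counts = Counter(digits)
    n = len(digits)
    table = []
    for d in range(1, 10):
        observed = counts.get(d, 0) / n
        benford = math.log10(1 + 1 / d)
        table.append((d, observed, benford))
    return table

prices = [1.99, 12.50, 104.00, 1.10, 2.75, 19.99, 3.40, 1.25, 18.00, 1.05]
for d, observed, benford in first_digit_distribution(prices):
    print(d, round(observed, 2), round(benford, 2))
```

Second and later digits (the digit=2 option above) need a slightly different expectation formula, since Benford's law for non-leading digits sums over all possible leading prefixes.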

The following graph illustrates the digit distribution compared to the Benford distribution.

The following graph was created using real transactional data.

Friday, February 26, 2010

Splunk & Dynamic Meta Data

Splunk is infinitely configurable when it comes to consuming data produced by applications or devices. However, one concept I've been trying to get my head around, configuration-wise, is the dynamic assignment of metadata. I still don't have a good working model.

Why? Well, it's very useful to add context to events. For example, if I'm indexing syslog data it can be very helpful to add additional key-value pairs such as admin=bob, location=central. When you hit events at search time you can use these keys without employing any search magic.
> sourcetype=syslog failure | dedup admin

The reason to add the keys as metadata rather than simply appending them to the raw event is that appending messes with the integrity of the event, not to mention its readability.

Prior to 4.x you could get smart with the ***SPLUNK*** header, but it has now been relegated to the sinkhole.
Since I'm in a position to mangle events before they get indexed, I tried appending a line containing the keys and used a transform to extract them as metadata.


REGEX =  sysadmin=(\w+)
FORMAT = sysadmin::$1

This worked, however trying to remove the key(s) after extracting them failed dismally.


REGEX =(?m)(.*)sysadmin=\w+$
DEST_KEY = _raw

Ironically, both transforms work, just not at the same time. They appear to be mutually exclusive, and I suspect my cleanup transform should be sending its output to another queue.
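The pipeline interaction is the Splunk-specific part; the regexes themselves can at least be sanity-checked outside Splunk. A quick sketch with Python's re module, using a made-up syslog event:

```python
import re

# Sanity-check both transform regexes outside Splunk.
# The sample event below is made up for illustration.
raw = "Jan 1 10:00:00 host sshd[42]: auth failure\nsysadmin=bob"

# Extraction transform: REGEX = sysadmin=(\w+)
key = re.search(r"sysadmin=(\w+)", raw).group(1)

# Cleanup transform: (?m) makes $ match at each line end, so the
# trailing key line can be removed while the event body is kept.
cleaned = re.sub(r"(?m)\n?sysadmin=\w+$", "", raw)

print(key)      # bob
print(cleaned)  # the event without the sysadmin line
```

Both behave as expected in isolation, which is consistent with each transform working on its own inside Splunk.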

I also tried using the new SED command to clean up after the transformation, but this didn't work either, since it appears transforms are processed after SED commands.

The Splunk header may be useful here, since the monitor will remove it once it has been evaluated. So you could specify your meta keys in a header line, which Splunk will ignore, then collect and index them with a transform; the monitor will remove the line without any further config. This is obviously not an elegant option, but it will do the job.

In my discussions with Splunk-a-nistas I've noticed that they struggle to see why it would be useful to do this at index time, and I've been advised to try all kinds of search-time voodoo to achieve the outcome I'm looking for. I suspect the reason is that people are used to working with other people's data; in this case you sit between the producer and the consumer and have the opportunity to enrich events with additional context.

Comments welcome.

Posted to the Splunk Forum

========= Update 5/03/10 ===========

The scheme works; it just relies on the fact that transforms are processed sequentially.
The tricky part is the cleanup: it's quite easy to trip yourself up with the multiline regex.

DEST_KEY = _raw
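Putting the pieces together, here is a hedged sketch of how the sequential transforms could be wired up. The stanza names and the sourcetype are my own inventions, not from Splunk's docs or my config; the regexes are the ones from the snippets above, and as noted, the multiline cleanup is the fragile part.

```ini
# props.conf -- order of the transform names controls sequencing:
# extract the key first, then rewrite _raw to drop the key line.
[syslog]
TRANSFORMS-meta = extract_sysadmin, strip_sysadmin

# transforms.conf
[extract_sysadmin]
REGEX = sysadmin=(\w+)
FORMAT = sysadmin::$1
WRITE_META = true

[strip_sysadmin]
REGEX = (?m)(.*)sysadmin=\w+$
FORMAT = $1
DEST_KEY = _raw
```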

Monday, January 25, 2010

Goodbye RSS, welcome Twitter

Over the last year I've been trying to reduce the amount of information I touch on a day-to-day basis, and consequently I've been avoiding Twitter. It was in vain. Goodbye, RSS! You can follow my telic thoughts on marinusva.