I'll also try to follow this up in the next week or so with a post on when it actually makes sense to use the si commands.
There are two high-level perspectives that I choose between when I do summary indexing:
- Know exactly what search you want to run.
Unlike the rest of Splunk, where you've got a ton of flexibility, you want your summary index to be as small as it can be. The good news is that it's generally pretty easy to clear and backfill a summary index -- it just may take a while. If you're summarizing 5 TB of logs, that probably isn't true, so it's all the more important to really know your requirements. My general process is to build a summary index only when I'm finally ready to productionalize my app.
An example of this would be a report on the top 10 src-dst pairs for firewall denies per day. That's the report I'm tossing on the dashboard, and that's all the data I'm going to be indexing.
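To make that concrete, the scheduled search feeding the summary index might look something like this (the sourcetype and field names are placeholders -- swap in whatever your firewall data actually uses):
sourcetype=firewall action=deny earliest=-1d@d latest=@d | stats count as DenyCount by src, dst | sort - DenyCount | head 10
Ten events a day, and that's the whole summary.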
- If you don't know, generalize.
If you're indexing a data source that contains several data points at different time intervals, grab them all. As detailed below, adding extra data points for a given time interval (e.g., avg(val), max(val), min(val)) is essentially free. So go wild.
I follow this approach for daily summaries of csv files, where I can split by only one or two fields, and pull out a huge amount of data.
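A rough sketch of what that looks like for one of those csv sources (the sourcetype and field names here are invented):
sourcetype=my_csv earliest=-1d@d latest=@d | stats count as EventCount, avg(val) as AvgVal, min(val) as MinVal, max(val) as MaxVal, sum(val) as TotalVal by category
One event per category per day, but enough data points that most later questions can be answered without going back to the raw files.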
Beyond those high-level perspectives, there are a few critical technical guidelines:
- Ignore the si commands.
The si commands are where you'll turn first for Splunk summary indexing, because the docs use them in the examples. But I never, ever use them. There are benefits to the si commands, and I hope to detail them in a future post, but they only add value in specific scenarios, and they bring a complexity overhead to the summary indexing process. In essence, they only work well if you later run -exactly- the stats command you used to generate your index. If you change things around, you're going to find yourself trying to understand why on earth you can't read the contents of your index. My advice: don't start with them.
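For reference, the si commands are the prefixed versions of the reporting commands you already know -- sistats, sichart, sitimechart, sitop, and sirare -- and they take the same arguments as their plain counterparts, so an si-flavored search looks something like this (purely illustrative; the field names are made up):
YourSearch | sistats avg(val), max(val) by host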
What should you do instead? Just use a normal stats command. And make sure to...
- Rename your fields.
If you're trying to do a summary index of
YourSearch earliest=-1d@d latest=@d | stats sum(HourlyTotal), avg(HourlyTotal)
make that:
YourSearch earliest=-1d@d latest=@d | stats sum(HourlyTotal) as DailyTotal, avg(HourlyTotal) as HourlyAverage
This has two benefits: it lets you consistently give things a logical name that you'll understand later, and, more importantly, it lets you actually reference the field later. When you're looking through your summary index, Splunk will turn all those sum(HourlyTotal) fields into something like sum_HourlyTotal_, and you'll get into all manner of complexity referencing them later.
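The payoff comes when you read the summary back. Summary events typically land in index=summary with source set to the saved search name (check your own setup), so a later report can be as simple as:
index=summary source="YourScheduledSearchName" | timechart span=1d max(DailyTotal)
Try doing that against the unrenamed field and you're stuck guessing whether it ended up as sum(HourlyTotal) or sum_HourlyTotal_ in the summary.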
- Report as much as you can, splitting by as little as possible.
If I have a data source that records the number of requests coming into a web service and I want to archive daily data, I will probably make my summary index search something like:
MySearch | bucket _time span=1h | stats count as Req by _time, server | stats sum(Req) as DailyRequestTotal, avg(Req) as HourlyRequestAverage, max(Req) as BusiestHour, min(Req) as SlowestHour by server
What I will not do is:
MySearch | bucket _time span=1h | stats count as Req by _time, server,status_code,uri | stats sum(Req) as DailyRequestTotal by server,status_code,uri
Adding reported fields to the front of the stats command will increase the size of your index linearly while keeping the same number of events. Adding fields to the by clause at the end will multiply the number of events (and the size of your index) by the number of distinct values of each new field. If you had 30 different uris, 4 different status codes, and 5 servers, switching from the first query to the second would go from 5 events per day to 600 (30 x 4 x 5).
A caveat: when I say adding things on the reporting side is free, that's not entirely true. According to Gerald Kanapathy's presentation at the first user conference, the following statistical functions are free: count, avg, sum, stdev, max, min, first, last. The following are not free: median, percXX, dc, mode, top, list, values. He says, though, that if you've got fewer than 1k values per summary run, it won't be a problem.
- Backfill your index to verify success.
Backfilling the index runs your search over your old logs and fills in the summary for the time ranges before your scheduled search existed. There used to be a script with the word backfill in the filename, which you'll still readily find on Google if you go searching for the command -- that script is now outdated. The current method is to run:
cd /opt/splunk/bin/ && ./splunk cmd python fill_summary_index.py -app YourAppName -name "YourScheduledSearchName" -et -1mon@d -lt @d -j 8 -auth admin:changeme -owner YourUsername
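When it finishes, sanity-check it with something along these lines (assuming the default summary index; swap in your own index and search name):
index=summary source="YourScheduledSearchName" earliest=-1mon@d latest=@d | timechart span=1d count
You want to see a steady count for every day in the range -- a gap means one of the backfill runs didn't produce results.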
The above should get you on the road to summary indexing success. I plan to do a follow-up post on where the si commands -should- be used (in essence, the areas where you can ignore half of what's above).