This is the multi-page printable view of this section. Click here to print.
Log events
- 1: Log events overview
- 2: Errors & Warnings
- 3: Getting rid of Sense log noise
- 4: App performance monitoring
1 - Log events overview
Butler SOS can capture very detailed information about individual charts in apps - or just the high-level information about the apps themselves.
This page explains the different levels of engine performance data that Butler SOS can provide.
Three levels of log event data
Depending on the needs of the Sense admin, Butler SOS can provide three levels of log event data, ranging from high-level counters to detailed performance data for individual charts in apps.
The most high-level data is the “log event counters”, which can be enabled/disabled independently of the more detailed engine performance events.
The engine performance events are then divided into “rejected” and “accepted” events, with the latter being the most detailed.
It’s usually a good idea to start with the high-level metrics and then enable more detailed metrics if needed.
Info
The counters for top-level metrics and rejected performance log events are reset to zero every time Butler SOS is restarted, as well as every time counter data is sent to InfluxDB.
I.e. the counters are not cumulative over time, but rather show the number of events since the last restart/save.
This makes it easy to create Grafana dashboards that show the number of events during the selected time interval, for example.
Just sum the counter fields in the Grafana chart, and you’re done.
High level: Event counters
Event counters can be enabled independently of engine performance events.
The counters are configured in the Butler-SOS.qlikSenseEvents.eventCount
section in the Butler SOS config file.
When enabled, Butler SOS will count the number of log and user events that are received from Qlik Sense via UDP messages.
The counters are split across several dimensions, described in the reference section.
These counters can be used to get a general idea of which servers generate the most log events, which Windows services generate the most events, and so on.
A Grafana dashboard can look like this:
Medium: Rejected performance log events
If an event with source=qseow-qix-perf
does not meet the include filter criteria in the config file, it is considered a “rejected” event.
For rejected events a set of counters are kept, broken down by dimensions described in the reference section.
As both app id/name and the engine method (for example Global::OpenApp
) are included as a dimension in the InfluxDB data, it’s possible to create Grafana dashboards that show how long time each app takes to open.
A Grafana dashboard can look like this:
In the upper right chart above we can see that all apps except one open quickly.
Might be a good idea to investigate why the app “Training - Field indexing DEV” takes so long to open.
More Grafana examples here.
Detailed: Accepted performance log events
If an event with source=qseow-qix-perf
does meet the include filter criteria in the config file, it is considered an “accepted” event.
All accepted events will result in a set of detailed performance metrics being stored in InfluxDB.
The metrics are described in the reference section.
An example Grafana dashboard can look like this:
The upper left chart shows total “work time” per app, during the selected time interval.
Work time is the time it takes for the engine to do the actual work, like calculating the chart data after a selection is made by the user.
The chart shows that the app “Training - Field indexing” has a high work time.
This does not have to be a problem, but it’s worth investigating why the number is so high.
The bottom two charts show the work time per chart object, per app.
The chart tells us that two app objects take on the order of 5 seconds each to calculate.
Which objects are these?
Open the app in Chrome and use the “Add Sense for Chrome” extension to find out.
Turns out it’s the two tables (object 10 and 11) that take a long time to update - that’s expected (but not ideal!) as they contain a lot of complex data that takes time to index.
If the update times get even longer, it might be a good idea to look into the data model and see if it can be optimized.
2 - Errors & Warnings
Warnings lead to errors.
Errors lead to unhappy users.
Unhappy users call you.
Wouldn’t you rather get the warnings yourself, before anyone else?
Sense log events
Sense creates a large number of log files for the various parts of the larger Qlik Sense Enterprise environment.
Those logs track more or less everything - from extensions being installed, users logging in, to incorrect structure of Active Directory user directories.
Butler SOS monitors select parts of these logs and provides a way to get an aggregated, close to real-time view into warnings and errors.
The log events page has more info on this.
Grafana dashboards
Butler SOS stores the log events in InfluxDB together with all other metrics.
It is therefore possible to create Grafana dashboards that combine both operational metrics (apps, sessions, server RAM/CPU etc) with warning and error related charts and tables.
Showing warnings and errors visually is often an effective way to quickly identify developing or recurring issues.
Example 1: Circular Active Directory references
Here is an example where bursts of warnings tell us visually that something is not quite right.
50-60 warnings arriving in bursts every few minutes - something is not right.
There is also a single error during this time period - what’s going on there?
We can then look at the actual warnings (=log events) to see what’s going on.
In this case it turns out to be circular references in the Active Directory data used in Qlik Sense. Not a problem with Sense per se, but still something worth looking into:
Looking at that single error then, it turns out it’s caused by a task already running when it was scheduled to start:
Example 2: Session bursts causing proxy overload
This example is would have been hard or impossible to investigate without Butler SOS.
The problem was that the Sense environment from time to time became very slow or even unresponsive. A restart solved the problem.
Grafana was set up to send alerts when session count increase very rapidly. Thanks to this actions could be taken within minutes from the incident occuring.
Looking in the Grafana dashboard could then show something like below.
New sessions started every few seconds, all coming from a single client. The effect for this Sense user was that Sense was unresponsive.
The log events section of the Grafana dashboard showed that this user had been given 5 proxy sessions, after which further access to Sense internal resources (engine etc) was denied.
In this case the problem was caused by a mashup using iframes that when recovering from session timeouts caused race conditions, trying to re-establish a session to Sense over and over again.
Alerts
Both New Relic and Grafana includes a set of built-in alert features that can be used to set up alerts as well as forward them to destinations such as Slack, MS Teams, email, Discord, Kafka, Webhooks, PagerDuty and others.
More info about Grafana alerts here.
New Relic alerts described here.
3 - Getting rid of Sense log noise
Let’s say Butler SOS is configured to receive log events from Qlik Sense Enterprise, specifically warnings and errors.
In the best of worlds there will be few such events, but in reality there are usually more than expected. It’s not uncommon for a Sense cluster to generate hundreds or even thousands of error and warning events per hour.
By categorising these events, Butler SOS can help you focus on the most important ones, and ignore the rest.
What kind of log events are there?
Log events originate from the Qlik Sense logging framework, which is called Log4Net.
The framework is responsible for logging things to Qlik Sense’s own log files, but can also be extended in various ways - for example by sending log events as UDP messages to Butler SOS.
This is done by deploying XML files in well defined folders on the Sense servers, these files are called “log appender files”.
Log4Net assigns a log level to all log events, these are on of the following: DEBUG
, INFO
, WARN
, ERROR
and FATAL
.
The log appender files that ship with Butler SOS are configured to send WARN
and ERROR
events to Butler SOS, from the repository, proxy, engine, and scheduler services in Qlik Sense.
So, what Butler SOS receives is a stream of log events, with a log level of either WARN
or ERROR
, from the repository, proxy, engine, and scheduler services in Qlik Sense.
That’s all fine, but it’s not uncommon for a Sense cluster to generate way more warnings and errors than expected by the Sense admin person.
The challenge then becomes to focus on the most important events, and ignore the rest.
How to categorise log events
Let’s say there are 1000 log events per hour.
When looking at the log events, it turns out that 95% of them are either about
- people getting https certificate warnings when accessing the Qlik Sense Hub…
- …people getting access denied errors when trying to open an app…
- …or that Sense has detected something unusual (like a ciruclar reference) in an Active Directory user directory.
Those events are not good, but they are not critical either.
They are things that can be fixed in a controlled way, and they won’t bring the Sense cluster to its knees.
But there may also be a few log events that are related to the Sense central node not being able to connect to one of the rim nodes. Maybe the events occur at irregular intervals and therefore more difficult to detect.
These events are critical, and should be investigated immediately.
If Butler SOS is configured to categorise the certificate warnings as “user-certificate, the Active Directory warnings as “active-directory” and the access-denied warnings as “access-denied”, that information is then added as tags to the event data that is stored in InfluxDB.
This means that when creating Grafana dashboards, it’s possible to create a table that show the number of “user-certificate” events and the number of “active-directory” events - and the number of “other” events.
Another Grafana panel can show detailed event info for the “other” events, making it easy to investigate what they are about.
By categorising log events it is possible to get rid of the log noise that hides the important events.
Sample Grafana dashboard
A few charts and tables give a good overview of the log events.
Butler SOS Configuration
The following settings in the Butler SOS config file were used to categorise the log events that were used in the Grafana dashboard above.
Butler-SOS:
...
...
# Log events are used to capture Sense warnings, errors and fatals in real time
logEvents:
udpServerConfig:
serverHost: 0.0.0.0 # Host/IP where log event server will listen for events from Sense
portLogEvents: 9996 # Port on which log event server will listen for events from Sense
tags:
- name: env
value: DEV
- name: foo
value: bar
source:
engine:
enable: true # Should log events from the engine service be handled?
proxy:
enable: true # Should log events from the proxy service be handled?
repository:
enable: true # Should log events from the repository service be handled?
scheduler:
enable: true # Should log events from the scheduler service be handled?
qixPerf:
enable: true # Should log events relating to QIX performance be handled?
categorise: # Take actions on log events based on their content
enable: true
rules: # Rules are used to match log events to filters
- description: Find access denied errors
logLevel: # Log events of this Log level will be matched. WARN, ERROR, FATAL. Case insensitive.
- WARN
- ERROR
action: categorise # Action to take on matched log events. Possible values are categorise, drop
category: # Category to assign to matched log events. Name/value pairs.
# Will be added to InfluxDB datapoints as tags.
- name: qs_log_category
value: access-denied
filter: # Filter used to match log events. Case sensitive.
- type: sw # Type of filter. sw = starts with, ew = ends with, so = substring of
value: "Access was denied for User:"
- type: so
value: was denied for User
- description: Find AD issues
logLevel: # Log events of this Log level will be matched. WARN, ERROR, FATAL. Case insensitive.
- ERROR
- WARN
action: categorise # Action to take on matched log events. Possible values are categorise, drop
category: # Category to assign to matched log events. Name/value pairs.
# Will be added to InfluxDB datapoints as tags.
- name: qs_log_category
value: user-directory
filter: # Filter used to match log events. Case sensitive.
- type: sw # Type of filter. sw = starts with, ew = ends with, so = substring of
value: Duplicate entity with userId
- description: Qlik Sense service down
logLevel: # Log events of this Log level will be matched. WARN, ERROR, FATAL. Case insensitive.
- WARN
action: categorise # Action to take on matched log events. Possible values are categorise, drop
category: # Category to assign to matched log events. Name/value pairs.
# Will be added to InfluxDB datapoints as tags.
- name: qs_log_category
value: qs-service
filter: # Filter used to match log events. Case sensitive.
- type: sw # Type of filter. sw = starts with, ew = ends with, so = substring of
value: Failed to request service alive response from
- type: so # Type of filter. sw = starts with, ew = ends with, so = substring of
value: Unable to connect to the remote server
- description: Reload task failed
logLevel: # Log events of this Log level will be matched. WARN, ERROR, FATAL. Case insensitive.
- WARN
- ERROR
action: categorise # Action to take on matched log events. Possible values are categorise, drop
category: # Category to assign to matched log events. Name/value pairs.
# Will be added to InfluxDB datapoints as tags.
- name: qs_log_category
value: reload-failed
filter: # Filter used to match log events. Case sensitive.
- type: sw # Type of filter. sw = starts with, ew = ends with, so = substring of
value: Task finished with state FinishedFail
- type: sw # Type of filter. sw = starts with, ew = ends with, so = substring of
value: Task finished with state Error
- type: ew # Type of filter. sw = starts with, ew = ends with, so = substring of
value: Reload failed in Engine. Check engine or script logs.
- type: sw # Type of filter. sw = starts with, ew = ends with, so = substring of
value: Reload sequence was not successful (Result=False, Finished=True, Aborted=False) for engine connection with handle
ruleDefault: # Default rule to use if no other rules match the log event
enable: true
category:
- name: qs_log_category
value: unknown
4 - App performance monitoring
Users complain about an app being slow to open.
Or that a certain chart is slow to update.
Butler SOS can help you understand what’s going on by providing very detailed, real-time performance metrics for all apps or just a few.
For all charts or just some.
Knowing is better than guessing
Monitoring a Qlik Sense environment - or any system - comes down to knowing what’s going on.
If you don’t know what’s going on, you’re guessing.
And if you’re guessing, you’re not in control.
If you’re not in control, well - that’s a bad place to be..
Qlik’s performance analysis apps
Qlik has a couple of apps that can be used to analyse performance data from Qlik Sense Enterprise on Windows.
First, the Operations Monitor app ships with every instance of Qlik Sense Enterprise on Windows.
This app is the go-to tool for retrospective analysis of what’s going on in a Sense environment.
The app puts all the great features of Qlik Sense to good use on top of the Sense logs it has access to.
What apps were reloaded when, which reload tasks failed most, which users are most active, etc.
Then there’s the Telemetry Dashboard app.
It uses the same data as Butler SOS, but as it’s a Sense app it must be reloaded in order to get updated data.
Butler SOS on the other hand uses real-time delivery of the events from Sense’s Log4Net logging framework.
If you want to do retrospective analysis of detailed performance data, the Telemtry Dashboard app is an excellent choice.
This page provides additional info about the telemetry logging available in Qlik Sense.
Info
Qlik’s Telemetry Dashboard app requires you to enable performance logging for the Engine service in the QMC.
The effect is that you get a new set of Sense log files that are likely to quickly become BIG.
By using Butler SOS you can get the same metrics and insights, but without the need to enable performance logging in the QMC.
I.e. saving disk space and getting real-time data instead of data that’s hours or days old.
Which is better? That depends on your use case and what you want to achieve.
Lots of use cases
Categorisation of events can be done in different ways depending on what you want to achieve.
Here are some examples of what can be done using Butler SOS:
- You want high level insights into how long various app actions take for all apps.
- Turn on engine performance monitoring in general, but no need for app details
- Enable tracking of rejected performance log events. This will capture high-level metrics (such as time used and number of events) for all methods/commands used in the engine.
- Create a Grafana dashboard that shows the average time taken for each method/command.
- π You get immediate, visual insight into how fast (or slow) apps open.
- Users complain about an app being slow to open.
- Turn on performance logging for app open events from that particular app. Store data in InfluxDB.
- Create a Grafana dashboard that shows a histogram of app open times for that app.
- π You get hard numbers on how long the app usually takes to often, and what the outliers are.
- A certain chart is slow to update.
- Turn on performance logging for that specific app and chart. Store data in InfluxDB.
- Create a Grafana dashboard that shows a histogram of chart update times for that chart. Or add a new chart to an existing dashboard.
- Define alerts in Grafana so you get notified if rendering time for the chart exceeds some threshold.
- Use Butler SOS’ user events features to keep track of what users access the app when the chart is slow.
- π You know in real-time how many and which users were impacted, and get a notification when it happens.
- You suspect that an app has a poorly designed data model.
- Enable detailed performance monitoring for that specific app
- Create a Grafana dashboard that
- tracks the “traverse_time” metric. This metric shows how long it takes to traverse the data model. A poorly designed data model tend to have longer traverse times. Break it down by app object and proxy session.
- tracks the render time for each chart and table in the app. Slow charts and tables may be caused by a bad data model, but not always. Lots of (complex) data may also be a reason.
- π You get real-time insight into which objects are slow for what users, due to a bad data model in that specific app.
Performance log events
A “performance log event” is a log event that originates from the Qlik Sense engine, more specifically from the “QixPerformance.Engine.Engine” subsystem.
They are sent to Butler SOS as UDP messages, just like regular log events (using the same port).
The performance log events have INFO
log level in the Sense logs but still provide very detailed information about what the engine is doing.
Warning
The XML appender files that ship with Butler SOS don’t do any filtering of log events.
This means that all events from the “QixPerformance.Engine.Engine” subsystem are sent to Butler SOS, regardless of what they are about.
The filtering is then done (if configured) by Butler SOS.
For large Sense environments, there can therefore be a lot of UDP messages sent to Butler SOS.
This is usually not a problem, but at some point Butler SOS may become overwhelmed and start dropping messages.
Use the log message counter feature in Butler SOS to keep track of how many messages are received - this can be a good indicator of how close Butler SOS is to its limits.
Accepted vs rejected performance log events
Due to the potential high volume of performance log events sentΒ by Qlik Sense, Butler SOS has a filtering feature that can be used to only store the events that are of interest.
Events that meet the filtering criteria are called “accepted” events, and events that don’t are “rejected” events.
Accepted events are stored in InfluxDB, including detailed performance metrics for the event.
Details on what is stored for accepted events can be found in the reference section.
Rejected events are not stored in InfluxDB, but are still counted by Butler SOS.
This feature can be enabled/disabled in the Butler SOS config file. More info here.
Data available for accepted and rejected events
The full set of data stored for accepted events is described in the reference section.
Suggested workflows for different use cases
For some use high-level cases, such as identifying apps that are slow to load, the data for rejected events may be sufficient.
Other uses cases, such as monitoring the behaviour of specfic app objects (charts, tables etc), require the detailed data stored for accepted events.
The recommended, general principle is:
Start with the data captured via rejected events (counter and process_time), and then add detailed monitoring.
The detailed monitoring involves setting up filters to capture specific events for specific apps and/or app objects and/or engine methods.Also: make sure to have a reasonably short retention period (10-14 days are usually enough) for the InfluxDB database, as the performance log events can generate a lot of data that can easily affect the performance of the InfluxDB instance.
Grafana dashboard examples
High-level monitoring of app performance (rejected events)
Detailed monitoring of app object performance (accepted events)
A large set of metrics are stored for accepted performance log events, the examples below highlight just a few of them.