Logging optimization strategies in large applications

Logging is an essential piece of every software architecture, allowing for identification, debugging and management of unusual situations in production environments.

You are probably familiar with the concepts of logging, its levels (e.g. INFO, ERROR, etc.), its implementation in your everyday programming language, and most probably you leverage a logging framework in your projects.

But what actually happens when we do log, and what are the impacts on our application performance?

There’s much more to logs than levels and frameworks. A performant and centralized log management is fundamental for running today’s infrastructures and applications, and by planning our log strategy it’s important to take in account the possible shortcomings.

A quick view of how to manage your logs

A Quick Introduction to Logging Frameworks Principles

As mentioned above, logging frameworks provide an abstraction layer above the nitty gritty details so that developers can just use the API. For example, the following logging statements are pretty straightforward (they use the Java SLF4J framework):

LOGGER.debug("Customer put item {} in the cart", item.getId());
LOGGER.debug("Item {} price is {}", item.getId(), item.getPrice());
LOGGER.debug("Cart price is now {}", cart.getTotalPrice());

Most frameworks operate in the same way:

Each logging statement makes use of a logging level
The configuration file states a logging level
If the criticality of the statement’s logging level equals - or is above - the configured criticality, the log is written, otherwise it’s not.

Long Computations in Log Statements

Notice that in the statements above, parameters are method calls. Even if the log is not written - because the configured criticality is higher than the statement’s - the call will be executed. In the following statement, cart.getTotalPrice() is executed in all cases:

LOGGER.debug("Cart price is now {}", cart.getTotalPrice());

That’s fine in most cases, as executing those methods is quite fast. However, imagine that getting the cart’s total price requires complex logic e.g. calling an external service to apply discount. That means that whether logging occurs or not because of the debug level, time will be wasted to evaluate the price of the cart.

There are two ways to solve this:

Guard the statement with an if depending on the log level:

if (LOGGER.isDebugEnabled()) {
    LOGGER.debug("Cart price is now {}", cart.getTotalPrice());
}

Use a lambda expression in the statement. Lambdas are lazily evaluated, which means the cart price will only be computed if the logging level is debug:
```
LOGGER.debug(formatter -> formatter.format("Cart price is now {}", cart.getTotalPrice()));
```

Adding a guard condition also adds additional complexity: one needs to make sure that the guard’s logging level is the same as the statement’s. Obviously, it also makes the code more verbose and hence a bit harder to read.

The second way handles the problem gracefully, but the feature is not available in all logging frameworks. For example, Log4J 2 offers it, but at the time of this writing, Log4J doesn’t.

In all cases, the abstraction layer has taken a dent: developers need to be aware of that issue, in order for logging statements not to ruin the application’s performance.

Blocking Logging

Another issue related to performance is that by default, logging is blocking: when the runtime executes the log statement, the log gets written on the disk. The speed of writing onto the disk depends a lot on the underlying hardware:

The access time of a regular HDD is between 5,000 and 10,000 microseconds, while it’s from 35 to 100 for SSD. Of course, an HDD is less expensive and has a longer lifespan compared to an SSD of the same capacity.
If one makes use of the network e.g. because the destination file is on a Virtual File System - then writing can get really slow.

If the process of writing is slow, and it prevents the application from proceeding because it’s blocking, then logging effectively has become the bottleneck for the application. While it’s in general not the case in web applications, which are I/O bound, chances are much “higher” for CPU-bound applications.

In the scope of logging frameworks, appenders are components that are tasked with proper writing. Appenders hide behind the API, and are configured separately, so that they can be changed without any code update. To prevent blocking and its impact on performance, some logging implementations, as e.g. Log4J or Logback, offer so-called async appenders.

Such appenders use different implementations: some of them, as e.g. Logback, buffer log messages in a queue. A dedicated thread pop those messages from the queue, and write the associated message. Additional configuration options include:

The queue’s size
The logging level of messages, which is acceptable to discard when the queue starts filling up
The queue’s capacity threshold to start the discard process
etc.

Performance of Logs Metadata

In general, log messages are too “raw” to be really useful. Log files not only contain the log message, but also some kind of metadata to increase usefulness. Sometimes, metadata might actually be bigger than the message itself! Here’s a sample excerpt, taken from a Spring Boot application:

2018-12-17 13:56:54.906  INFO 1 --- [  restartedMain] c.e.configmgmt.demo.DemoApplication      : Started DemoApplication in 3.833 seconds (JVM running for 4.303)
2018-12-17 13:57:05.595  INFO 1 --- [nio-8080-exec-1] o.a.c.c.C.[Tomcat].[localhost].[/]       : Initializing Spring DispatcherServlet 'dispatcherServlet'
2018-12-17 13:57:05.595  INFO 1 --- [nio-8080-exec-1] o.s.web.servlet.DispatcherServlet        : Initializing Servlet 'dispatcherServlet'
2018-12-17 13:57:05.603  INFO 1 --- [nio-8080-exec-1] o.s.web.servlet.DispatcherServlet        : Completed initialization in 7 ms
2018-12-17 14:40:57.731  INFO 1 --- [       Thread-9] o.s.s.concurrent.ThreadPoolTaskExecutor  : Shutting down ExecutorService 'applicationTaskExecutor'

One can see the log message on the far right. Before the actual log message is metadata. In the sample, this metadata include:

The timestamp, down to the millisecond
The log level
The thread name
The fully-qualified Java class (abridged)

Other possible examples of useful metadata that can be written in logs are:

Any state related metadata, e.g. the connected user, if there’s one
Source IP/hostname of the coming HTTP request
The method the log message originated from
The source file the log message originated from
etc.

Because some of those options need to be computed by the logging framework, they have a definite impact on raw performance, and should definitely be avoided in that regard. Among them are information regarding the file the log comes from, the class, the method metadata, and even the line - great for debugging purposes.

Note that if the log message explicitly states any of those, then the value doesn’t need to be computed, and it has no negative impact on performance. However, it makes the generation of the log message harder for the developer, and it might become too cumbersome. Thus, if any of those metadata need to be included in the final log message, the trade-off between ease of use and performance needs to be carefully evaluated beforehand.

Logs Aggregation

Logs that sit on a machine are not very useful for production purposes. In order to investigate production issues, people working in Operations need to be able to get an high-level overview of things happening across multiple machines.

In regard log files, that requires having to read multiple log files located on different machines to correlate between events. Furthermore, Ops also need search capabilities more advanced than grep - across multiple files and machines.

To achieve that, one should look toward Log aggregation. A centralized logging service is provided, which aggregates logs from each relevant source. Here are some of the most widespread log aggregation solutions:

Probably the most widespread way to get the logs into the aggregation service is to scrape log files. For example, in the Elastic stack architecture, Filebeat is the component dedicated to watching log files and sending their content to Elasticsearch.

Additional Logs Metadata

When metadata was mentioned in the previous paragraph, it came from two sources:

Explicitly stated by the developer
Computed by the logging framework

There can actually be a third source: metadata added by the scraping component itself.

Consider a clustered application deployed on multiple nodes, each writing into its own log file. When log files are aggregated together in the centralized service, there should be a way to specify the host the log was generated on. There are two ways to achieve that: either the logging framework access the hostname, and writes it into the log file, or the scraping agent adds the hostname metadata on each log message. The result is the same, but the second option is way faster.

The scraping agent adding metadata has many applications, and can be used:

to distinguish between different log files on the same host
to set the environment e.g. production, staging, development, etc.
to add any kind of additional metadata that makes sense and can help toward analyzing production issues, such as the cloud zone e.g. DE-FRA-1, CH-GVA-2, etc.
etc.

Optimize Logs for Write or for Read

There are two ways to implement search on data:

Store the data in a structured way, so that queries are run relatively fast, although the storing itself is slower.
Store the data as it is, in order to make storing quite fast, but making querying slower as trade-off.

This quandary is known as “schema on write” vs. “schema on read”. There is a lot of content available comparing the two approaches, here’s one if you’re interested in reading further. In the scope of this post, suffice to say that while storing fast is important, it’s even more important to query fast: though the reading phase is done only when encountering production issues - which shouldn’t be that often - it’s crucial that users can resolve those issues as fast as possible. And to enable that, they need to execute queries and get results as fast as possible.

So, in order to query fast, logs needs to be stored in a structured way. Since a log message is in essence a string, it needs to be parsed, structured and then digested by the log aggregation service. At that point, we need to dive a bit into the details: we will use the Elastic stack as an example - because that’s what we are using at Exoscale - but the reasoning is quite similar for other technology stacks.

In the Elastic stack, the component dedicated to parsing a log message is called Logstash. Logstash can be seen as an ETL of sorts, which accepts data from sources, transform them accordingly, and then sends them to Elasticsearch. There are some differences with traditional ETLs, though: first, it’s not ran on demand, but is a listener service - other components send messages to it. Also, it doesn’t write to a filesystem or a database directly, but instead sends to Elasticsearch HTTP POST requests with a JSON payload.

Anyway, the core feature of Logstash is to transform an incoming log message into JSON, which will be the payload sent to Elasticsearch. Here’s an example log message:

2018-12-17 13:56:54.906  INFO 1 --- [  restartedMain] c.e.configmgmt.demo.DemoApplication      : Started DemoApplication in 3.833 seconds (JVM running for 4.303)

Through Logstash’s capabilities and a custom configuration, it can be transformed in the following JSON:

{
  "date": [[ "18-12-17" ]],
  "MONTHDAY": [[ "18" ]],
  "MONTHNUM": [[ "12" ]],
  "YEAR": [[ "17" ]],
  "time": [[ "13:56:54.906" ]],
  "HOUR": [[ "13" ]],
  "MINUTE": [[ "56" ]],
  "SECOND": [[ "54.906" ]],
  "level": [[ "INFO" ]],
  "threadName": [[ "restartedMain" ]],
  "class": [[ "c.e.configmgmt.demo.DemoApplication" ]],
  "message": [[ "Started DemoApplication in 3.833 seconds (JVM running for 4.303)" ]]
}

This JSON is the final payload that will be sent to - and ingested by - Elasticsearch.

Depending on a lot of parameters, such as the different “kind” of messages sent to Logstash, the configuration can become quite complex. As a corollary, the parsing of a message and its transformation to JSON takes time.

It might be necessary to design a clustered architecture to handle load peaks. While not necessarily bad or impossible, it makes the architecture more complex, and more expensive.

An easy alternative is to create log messages that already are in JSON format. In that case, one can ditch Logstash and directly send HTTP post requests to Elasticsearch via Filebeat - and chances are Elasticsearch is already built around a cluster.

Removing Logstash not only does simplify the architecture, it also removes a possible performance bottleneck.

To achieve that, the above logging message can be replaced with the following that doesn’t require parsing of any kind, and can be POST’ed directly to Elasticsearch:

{ "date": "18-12-17", "time": "13:56:54.906", "level": "INFO", "thread": "restartedMain", "class": "c.e.configmgmt.demo.DemoApplication", "message": "Started DemoApplication in 3.833 seconds (JVM running for 4.303)" }

Conclusion

It’s a given that logging is necessary in modern systems, as it plays a crucial role when issues arise in deployed environments (i.e. not on a developer’s machine). However, logging can have a noticeable impact on the performance of a system, and in order to leave a maximum of resources to the business part of the system, they need to be as fast as possible. There are several options one can leverage for that:

If you can, use the capabilities of your logging framework to lazily do computations in log statements
Use asynchronous logging when possible
Avoid metadata that are expensive to process - or manage them manually
Output JSON directly to avoid transformations later on

While each of those tips might have only a marginal impact individually, one can leverage all of them to improve the overall performance of the system. If your system - or part thereof - is deployed in the Cloud, this can only benefit your monthly bill!