
This article comes from actual project experience: access log files lived in the same database as the day-to-day application transactions, which caused constant, daily performance degradation that was not sustainable. Log aggregations were run on the production database as well - during the night. When the project went global and no longer handled requests only from the DACH region, trouble was on the horizon.

Alert: Your database is causing you sleepless nights - you have already indexed every table that made sense for your queries. How to fix this?


Your app or website performance is at an all-time low. The problem is usually bad database design and software architecture. Whether or not it is within your job scope to fix it, you will need a very quick solution so the database does not stall your whole application. On a Unix server, you won’t need expensive tooling or new microservices within your app to fix this problem - the terminal and a script running during performance-critical times can handle the situation for a short while until you come up with a better all-around solution.

Below you will find two stopgap solutions: the first monitors slow queries and emails you, and the second puts a quick fix in place until you can properly solve the issue. These solutions only need a handful of tools: an editor, credentials for the database, crontab access, and a transactional email provider (in this case Mailgun).

First - Monitor

As mentioned above, the first step in fixing the problem is finding the performance-critical queries. To do this you are going to set up a monitor that emails you when the server is having issues. Most guides of this type advise you to use the console mail command, which requires expert setup and often suffers tremendous delivery problems due to anti-spam solutions. Instead, we will use Mailgun and a curl command.

Mailgun’s sandbox domain allows up to 300 emails a day to “authorized recipients” without registering or setting up your own domain. By default, the email address the account is set up with is an “authorized recipient”, but you can add more; they just have to confirm that they want the emails. After you have set up a Mailgun account, you can continue with the monitoring script.

Note: If you are generating more than 300 emails a day, you probably want to deal with the underlying issues more proactively, as discussed in the second half of this article.
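
Before wiring Mailgun into the script, it can help to confirm the credentials with a one-off curl call. This is only a sketch using the same sandbox values as the script below - substitute your own sandbox domain, API key, and authorized recipient:

MAILGUN_DOMAIN=https://api.mailgun.net/v3/sandboxf79e512e2fca44b898cac747b734feef.mailgun.org
MAILGUN_APIKEY=key-a7ab252022dadc298b37e4595fafe2a79

# A short JSON reply indicates success; an error usually means the recipient
# has not confirmed yet or the API key is wrong.
curl -s --user "api:$MAILGUN_APIKEY" "$MAILGUN_DOMAIN/messages" -F from="Jill Validated <jill_is_validated@gmail.com>" -F to="Jack Approved <jack_is_approved@gmail.com>" -F subject="Mailgun test" -F text="Test from the database monitor setup."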

Let’s create a new executable file named monitorqueries.sh and fill it with the following content:

#!/bin/bash

# General settings
TOOLONG=60
USERNAME=root

# MySQL settings
HOST=127.0.0.1
PASSWORD=password

# Mailgun settings
#     At the bottom of https://app.mailgun.com/app/dashboard you will see
#     sandbox domains, click on this domain to get the following info.
MAILGUN_DOMAIN=https://api.mailgun.net/v3/sandboxf79e512e2fca44b898cac747b734feef.mailgun.org
MAILGUN_APIKEY=key-a7ab252022dadc298b37e4595fafe2a79
TO_EMAIL="Jack Approved <jack_is_approved@gmail.com>" # Must be "Authorized Recipient"
FROM_EMAIL="Jill Validated <jill_is_validated@gmail.com>"

# Fetch the full process list and drop the header row
pl=$(mysql -u$USERNAME -h$HOST -p$PASSWORD -e "SHOW FULL PROCESSLIST")
trimmedpl=$(echo "$pl" | sed '1d')

# Columns of SHOW FULL PROCESSLIST: Id, User, Host, db, Command, Time, State, Info
echo "$trimmedpl" | while read -r line; do
     typerunning=$(echo "$line" | cut -d$'\t' -f 5)
     if [ "$typerunning" == 'Query' ]
     then
        timerunning=$(echo "$line" | cut -d$'\t' -f 6)
        if [ "$timerunning" -ge "$TOOLONG" ]
        then
            id=$(echo "$line" | cut -d$'\t' -f 1)
            cmdinfo=$(echo "$line" | cut -d$'\t' -f 8)
            echo "Mailing slow query information."
            curl -s --user "api:$MAILGUN_APIKEY" "$MAILGUN_DOMAIN/messages" -F from="$FROM_EMAIL" -F to="$TO_EMAIL" -F subject="Slow Query Warning" -F text="pid: $id is running too long $timerunning >= $TOOLONG on the query: $cmdinfo"
        else
            echo "Query below the $TOOLONG second limit."
        fi
     fi
done
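
Before scheduling it, you may want to make the script executable and trigger one alert by hand. One way to do that - using the same credentials and the 60-second threshold from the settings above - is to hold an artificially slow query open in a second terminal and then run the monitor:

chmod +x monitorqueries.sh

# In a second terminal: hold a query open longer than TOOLONG (60 seconds)
mysql -uroot -h127.0.0.1 -ppassword -e "SELECT SLEEP(120);"

# Back in the first terminal, after a minute or so:
./monitorqueries.sh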

Now you simply run that script as often as makes sense for you via cron. For repeated execution every minute, type crontab -e in the terminal and add the following line to your regularly executed jobs:

* * * * * pathtoyourfile/monitorqueries.sh

Save and close crontab. You now have your query monitoring script in place and should start receiving emails for queries you defined as being too slow.
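
If you also want a local record of what the monitor prints (useful when you are not receiving mail and need to debug), redirect its output in the crontab entry; the log path here is only a suggestion:

* * * * * pathtoyourfile/monitorqueries.sh >> /tmp/monitorqueries.log 2>&1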

Second - The Quick Fix

This quick fix applies to PHP-style deployments where you have one process per HTTP request and killing off a mostly irrelevant running MySQL query is the lesser evil compared to bad application performance overall. That said, if you decide to kill running queries in your production database, you need to gracefully handle what happens next - mainly user interface feedback and subsequent database changes. Keep in mind that you might be stopping a transaction that was the successor to an earlier one, which might break data links and cause further problems. Ye be warned!

You will:

  • open a connection to the database on a remote server
  • execute a query that kills all queries running longer than a configurable threshold and logs them to a file

The purpose of the script is twofold. On the one hand, if you are not aware of which queries cause your performance drop, you will have them all in one file to examine at your convenience. On the other hand, you automatically kill the queries that take too long so that application performance does not suffer.

Warning: You will need an admin user in order to see the processes of all users and be able to shut them down. If you use a script like this in production, you should restrict the query by either excluding or including users, whichever makes more sense for your setup. In the example script, we exclude all queries running as the user admin.

The monitoring script shown above uses the command SHOW FULL PROCESSLIST, which gets its information from the MySQL-internal table INFORMATION_SCHEMA.PROCESSLIST. A more fine-grained way to kill and log slow queries is to query that table directly. That way, specific users can be singled out, which matters for making sure that automatic tasks like backups keep working and don’t get killed - which is exactly what would happen if the script above actually ran every minute. Instead, it is better to query for the one user that handles all incoming requests from your application, or to exclude the users you created for automatic database tasks. If you do kill a query for an application user, make sure that the visual feedback shown to the user reflects the failed data transaction.
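
To see what such a query returns before automating anything, you can run it by hand and filter on your application’s database user instead of excluding admin. The user name app_user below is a placeholder for whatever account your application connects with:

mysql -uroot -h127.0.0.1 -ppassword -e "SELECT ID, USER, TIME, LEFT(INFO, 120) AS QUERY_START FROM INFORMATION_SCHEMA.PROCESSLIST WHERE COMMAND = 'Query' AND USER = 'app_user' ORDER BY TIME DESC;"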

Note: Adapt the time periods to your needs - if a query that takes roughly 20 seconds is already too long for your purposes, you will need to run the script at least every 10 seconds and kill queries that run longer than 10 seconds. Also note that a new database connection is created every time this script executes, so for intervals below one minute you might want to keep the connection alive over a socket - but that goes beyond the scope of our bash script. What you can easily do, though, is run an infinite loop within the script. The following example runs every 10 seconds for an infinite period of time, checks for queries that have been running longer than 30 seconds and do not come from the user admin, kills them, and collects their SQL text in a logfile.

#!/bin/bash
LOGFILE=/tmp/logexample
KILLQUERIES=/tmp/killqueries.sql
CHECKKILLS=/tmp/checkkills
USERNAME=root
HOST=127.0.0.1
PASSWORD=password
TOOLONG=30
CHECKDELAY=10

while :
do
  # Log the SQL text of every slow query from non-admin users (-N drops the column header)
  mysql -N -u$USERNAME -h$HOST -p$PASSWORD -e "SELECT INFO FROM INFORMATION_SCHEMA.PROCESSLIST WHERE COMMAND = 'Query' AND USER != 'admin' AND TIME > $TOOLONG;" >> $LOGFILE
  # Build one KILL statement per slow query ...
  mysql -N -u$USERNAME -h$HOST -p$PASSWORD -e "SELECT CONCAT('KILL ',ID,';') FROM INFORMATION_SCHEMA.PROCESSLIST WHERE COMMAND = 'Query' AND USER != 'admin' AND TIME > $TOOLONG;" > $KILLQUERIES
  # ... and execute them, keeping any output for inspection
  mysql -u$USERNAME -h$HOST -p$PASSWORD < $KILLQUERIES > $CHECKKILLS
  sleep $CHECKDELAY
done
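
Because this script loops forever, start it in the background so it keeps running after you log out. Assuming you saved it as killslowqueries.sh (the name is arbitrary), something like this works:

chmod +x killslowqueries.sh
nohup ./killslowqueries.sh > /tmp/killslowqueries.out 2>&1 &

# Stop it later, for example with:
# pkill -f killslowqueries.sh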

MySQL already has a built-in slow query log, which is disabled by default because it adds overhead to the database. Ideally, you do not need it enabled if your database performance is good as is. You can use the above script to pinpoint which parts need fixing, but you should not keep accumulating this information on your production database without someone actually fixing the cause of the slow queries.
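
If you do want to lean on that built-in facility for a limited time, the slow query log can be toggled at runtime without restarting the server - the threshold and file path below are only examples, and remember to switch it off again once you have the information you need:

mysql -uroot -h127.0.0.1 -ppassword -e "SET GLOBAL slow_query_log = 'ON'; SET GLOBAL long_query_time = 5; SET GLOBAL slow_query_log_file = '/var/lib/mysql/slow.log';"

# Switch it off again when you are done collecting data
mysql -uroot -h127.0.0.1 -ppassword -e "SET GLOBAL slow_query_log = 'OFF';"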

Next Steps

  1. You could write a script that moves the current file to a versioned file so you don’t end up with a single log file that is several gigabytes in size - a minimal sketch follows this list.
  2. If the tables that are heavily accessed by the slow queries are not indexed yet, adding indexes is a logical next step that requires little effort. Partitioning the data by a relevant factor (in most cases the date, but it depends on what you query the most) would be the follow-up step to ease the burden on the database.
  3. Use the knowledge of which queries take too long as a basis for an improvement in your architecture, both on the software and the database side.
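
A minimal sketch for step 1, assuming the same /tmp/logexample path used above; schedule it via cron, for example once a day:

#!/bin/bash
# Move the accumulated slow-query log aside with a timestamp so the
# kill script starts writing into a fresh, empty file on its next pass.
LOGFILE=/tmp/logexample
if [ -s "$LOGFILE" ]; then
    mv "$LOGFILE" "$LOGFILE.$(date +%Y%m%d%H%M%S)"
fi

A crontab entry such as 0 3 * * * pathtoyourfile/rotatelog.sh (the file name is up to you) takes care of the scheduling.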