AWS Athena - Parsing apache, nginx and AWS ELB access logs


If something can go wrong, it probably will

We’ve encountered a problem on one of our servers. For obvious reasons, let’s call the main server “X”. At one point, the system’s performance and status needed to be checked, because some of the calls responded unexpectedly. We had many logs, “many” being an understatement, since X relied on communicating with other servers, Y and Z. So we had a collection of servers for different functionalities, each with its own logs.

Needle in a haystack

Exporting all the logs:

  • apache access logs
  • apache error logs
  • nginx logs

What do programmers do when faced with a lot of data? They use patterns, or in geekier terms: regex.

If you’re not familiar with the term, go here: https://en.wikipedia.org/wiki/Regular_expression and then come back. If you’re wondering whether you need them, remember that you’ve used them before: checking whether a string contains a substring in SQL with “like”, splitting a string, and so on.

Basically, we’ll use a regex to look for a given pattern in a string, where the string is a single log line.
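
As a quick illustration, here is a minimal Python sketch of that idea (the log line and pattern below are made up for the example): we search a single log line for a pattern and pull out a captured group.

```python
import re

# A made-up access-log line and a pattern that captures the HTTP status code.
line = '196.233.218.12 - - "GET /p1=value HTTP/1.1" 200 15804'
match = re.search(r'" ([0-9]{3}) ', line)

if match:
    print(match.group(1))  # the captured status code: 200
```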

All roads lead to Rome

Before we get started: this will take a while, so if you’re in a hurry, skip to the last chapter and try the already built solution, in case it’s what you need. Then again, if you continue reading, you might learn more for the future.

We can parse strings in many languages, but most often, we select the one most suitable for us.

So we can analyse logs in: C, Java, PHP, Swift…. you get the point.

But since we’re on AWS, the most suitable for us would be Athena, provided by … AWS.

So now we have a sample on how to parse:

CREATE EXTERNAL TABLE IF NOT EXISTS elb_logs (
request_timestamp string,
elb_name string,
request_ip string,
request_port int,
backend_ip string,
backend_port int,
request_processing_time double,
backend_processing_time double,
client_response_time double,
elb_response_code string,
backend_response_code string,
received_bytes bigint,
sent_bytes bigint,
request_verb string,
url string,
protocol string,
user_agent string,
ssl_cipher string,
ssl_protocol string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:\-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (-|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$' )
LOCATION 's3://your_log_bucket/prefix/AWSLogs/AWS_account_ID/elasticloadbalancing/';
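
Before pointing Athena at a bucket, it can save time to sanity-check the regex locally. A Python sketch (the sample line below is made up, following the classic ELB access-log format; note that the `\\\"` in the SerDe is SQL-level escaping, so in Python the plain `"` is enough):

```python
import re

# The same regex as in the SerDe, without the SQL-level escaping.
ELB_REGEX = (r'([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:\-]([0-9]*) '
             r'([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (-|[-0-9]*) (-|[-0-9]*) '
             r'([-0-9]*) ([-0-9]*) "([^ ]*) ([^ ]*) (- |[^ ]*)" ("[^"]*") '
             r'([A-Z0-9-]+) ([A-Za-z0-9.-]*)$')

# A made-up sample line in the classic ELB access-log format.
line = ('2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 '
        '10.0.0.1:80 0.000073 0.001048 0.000057 200 200 0 29 '
        '"GET http://www.example.com:80/ HTTP/1.1" "curl/7.38.0" - -')

m = re.match(ELB_REGEX, line)
if m:
    # Groups map one-to-one to the table columns, in order.
    print(m.group(3))   # request_ip: 192.168.131.39
    print(m.group(10))  # elb_response_code: 200
    print(m.group(14))  # request_verb: GET
```

If `m` comes back `None` here, the table in Athena will come back empty too, which is exactly the situation the rest of this article deals with.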

Change bucket and press run?

If that worked, you wouldn’t be here. But you are, so what went wrong? Either you didn’t get any data, or the table’s structure isn’t exactly what you want. Either way, blame the regex.

([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:\-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (-|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$

What does that mean? First of all, your logs might use a different format, so the parser didn’t find what it was looking for. Let’s build our own parser.

First, open https://regex101.com (there are other options, but I like this one) in a new tab and put it in another window.

Take a log (a single line) and use it as the test string. It will make your life easier if you know how the logs were written, but it’s OK if you don’t.

Step by step, goes the algorithm

Assuming you won’t get it all right on your first attempt, we’ll break down the log step by step.

Basic things to know:

  • “-” will appear when a field is empty
  • AWS adds a date at the start, in ISO format
  • the date might appear twice
  • characters like “ ( ) [ ] * + . are reserved in regex; if you need them literally, escape them with “\”
  • ([^ ]*) means any character until you hit a space
  • ([.0-9]*) works great for doubles and IPs – unless your logs print numbers using “,”; if so, replace “.” with “,” (but DON’T use “,” for IPs)
  • ([^ ]*)T([^ ]*)Z parses an ISO date
  • ([^<somechar>]*) ignores everything up to <somechar>; we use it to skip bits of the string we don’t need – we did mention the date appears twice, so the second copy can be skipped
  • ([^\n]*) will read the rest of the log
  • ‘serialization.format’ = ‘1’ is a SerDe property Athena expects – leave it as it is
  • you can pass a folder in the bucket and it will check subfolders (since logs are exports, use the auto-generated prefix)

Putting the first pieces into practice, here is a minimal table that splits off only the timestamp:

CREATE EXTERNAL TABLE IF NOT EXISTS webserver_proxy_access_logs (
request_date string,
request_time string,
all_else string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '([^ ]*)T([^ ]*)Z ([^\n]*)' )
LOCATION 's3://your_log_bucket/prefix/AWSLogs/AWS_account_ID/elasticloadbalancing/';

This will work on all AWS logs and can be used exactly as written above; the resulting table will support all SQL. Bookmark it (Cmd+D) until AWS decides to change something.

Are we there yet?

If you keep reading it means you have time and no one is acting like a 5 year old asking if we’re there yet.

Start with a small step, write the regex for the date and time section:

([^ ]*)T([^ ]*)Z ([^\n]*)

This looks for 3 fields:

  • the first is delimited by ” ” or “T”
  • the second is delimited by ” ” or “Z”
  • the last is everything left after the space following “Z”

Example applied to an access log:

2018-01-18T10:12:15.776Z 196.233.218.12 - - [18/Jan/2018:10:12:14 +0000] "GET /p1=value HTTP/1.1" 200 15804 "-" "Mozilla/5.0 (Linux; Android 6.0; CRO-L22 Build/HUAWEICRO-L22) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.98 Mobile Safari/537.36" "-" "upstream: 127.0.0.106:80" 0.203 0.203
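
Running the date/time regex over a line like the one above shows the three captured groups (a Python sketch; the log line is abbreviated for readability):

```python
import re

pattern = r'([^ ]*)T([^ ]*)Z ([^\n]*)'
line = ('2018-01-18T10:12:15.776Z 196.233.218.12 - - '
        '[18/Jan/2018:10:12:14 +0000] "GET /p1=value HTTP/1.1" 200 15804')

m = re.match(pattern, line)
if m:
    print(m.group(1))  # 2018-01-18
    print(m.group(2))  # 10:12:15.776
    print(m.group(3))  # everything after "Z "
```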

Assuming we need the IP, for… isolating calls from a given IP to check for spam… or any other reason.

Above, we gave a regex for an IP: ([.0-9]*), or you can use the general one, ([^ ]*).

By the way, ([.:0-9]*) can be used for IP:Port… I’ll attach a table, unless a 5-year-old pops up next to me.

So the regex will be: ([^ ]*)T([^ ]*)Z ([.0-9]*) ([^\n]*)

Go back to https://regex101.com/ and check the breakdown: a new group means a new column in the table, request_ip string.

Let’s ignore the date, AKA skip everything until the next "

  • ([^<somechar>]*) becomes ([^"]*)

New regex is:

([^ ]*)T([^ ]*)Z ([.0-9]*) ([^"]*) ([^\n]*)
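
The extended regex can be checked the same way (a Python sketch, same abbreviated sample line): group 3 now captures the IP, group 4 swallows everything up to the first quote, and group 5 keeps the rest for later steps.

```python
import re

pattern = r'([^ ]*)T([^ ]*)Z ([.0-9]*) ([^"]*) ([^\n]*)'
line = ('2018-01-18T10:12:15.776Z 196.233.218.12 - - '
        '[18/Jan/2018:10:12:14 +0000] "GET /p1=value HTTP/1.1" 200 15804')

m = re.match(pattern, line)
if m:
    print(m.group(3))  # 196.233.218.12
    print(m.group(5))  # starts at the first quote
```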

do...while

So, in short:

  • add a new group
  • add the column
  • test at https://regex101.com/
  • run the create table statement on the logs
  • check the results
  • repeat

When you have added everything, you can remove the ([^\n]*) and the all_else field. If you’re not sure, leave them in, but make sure to remove the space before the group in the regex.

If the regex ends with ([^ ]*) ([^\n]*) change it to ([^ ]*)([^\n]*)
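
The space matters because a literal space in the pattern demands a literal trailing space in the log line. A quick Python check, using a made-up three-field line:

```python
import re

line = '2018-01-18 10:12:15 200'

# With a space before the final catch-all group, the line no longer matches,
# because the pattern now requires three spaces and the line only has two:
with_space = re.match(r'([^ ]*) ([^ ]*) ([^ ]*) ([^\n]*)', line)

# Without the space, the final group simply matches the empty string:
without_space = re.match(r'([^ ]*) ([^ ]*) ([^ ]*)([^\n]*)', line)

print(with_space)              # None
print(without_space.group(3))  # 200
```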

Last step: count the rows where request_date is null. These are the lines the regex failed to parse. 200 out of 90 million can be considered an acceptable error rate.

It's peaceful here

As promised, here is what we ended up with https://github.com/HyperSense-Software/aws-athena-apache-nginx-access-logs

Here are the patterns you might need:

([^<somechar>]*) – string from the current index until the first <somechar>; the next entry is an application of it

(<some_regex>){0,1} – matches the regex only if it exists – useful for logs with multiple formats

(.*) – any string from the current index until the end

([^ ]*) – string from the current index until the first space

([^\n]*) – string from the current index until the first new line, used to build step by step

([^ ]*)T([^ ]*)Z – extracts date and time, 2 fields

([:.0-9]*) – extracts an IP (with or without port)

([.0-9]*):([0-9]*) – extracts IP and port, 2 fields

([.0-9]*) – extracts an IP or a double

([0-9]*) – extracts an int

(\"[^\"]*\") – string delimited by ", quotes included

\"([^\"]*)\" – string delimited by ", but won’t take " as part of the field

\"([^ ]*)([^\"]*)\" – string delimited by " that separates out the first word (use it to extract the request method when logs contain "GET /p1=v1")
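
A quick Python check of the three quoted-string patterns from the table (sample fragment made up; the backslashes before the quotes are only needed inside the SerDe string, not in Python):

```python
import re

fragment = '"GET /p1=v1 HTTP/1.1"'

# Whole quoted string, quotes included:
print(re.match(r'("[^"]*")', fragment).group(1))  # "GET /p1=v1 HTTP/1.1"

# Quoted string without the quotes:
print(re.match(r'"([^"]*)"', fragment).group(1))  # GET /p1=v1 HTTP/1.1

# First word split out, e.g. the request method:
m = re.match(r'"([^ ]*)([^"]*)"', fragment)
print(m.group(1))  # GET
```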

More more more

There are more regexes that can be used. If you have problems with the regex, you can use more SQL when making queries.

To download the queries used in this article please visit our aws-athena-apache-nginx-access-logs repository on Github.

If you found a pattern you needed that wasn’t listed above, leave a comment and we’ll add it shortly.