An Effective GoAccess Setup For Static Sites

This is a static site built with Hugo. Out of respect for your privacy I don't want to use Google Analytics, and I don't want to pay to host Matomo. Instead I use GoAccess, a program that parses log files to produce traffic analytics, to get basic statistics about the site. It already filters most known crawlers, but we can filter out more with grep.

My goal is to get a general overview of how many users visit my site and what pages they go to. I didn't spend a lot of time fine-tuning this to show every visitor. It will show visitors using a modern, up-to-date browser, and that's good enough for me. I'll note the places where visitors might be lost in case you want to edit those lines.

Finally, this isn't a copy/paste solution. You'll need to verify that these filters work with your nginx log format, and change the log and GoAccess paths if needed.

Here's my shell script that I run to generate statistics:

zcat -f /var/log/nginx/emersonveenstra.net-access.* \
	| grep -Piv "Petalbot|Lighthouse" \
	| grep -Piv "\s(/css/|/js/|/icons/|/images/|\?)" \
	| grep -Pi "\s200\s" \
	| grep -Pi "Mozilla/5\.0.*Gecko" \
	| grep -Piv "Chrome/[1-7]|Firefox/[1-6]" \
	| grep -P "GET /" \
	| /usr/local/bin/goaccess -p /usr/local/etc/goaccess/goaccess.conf -a -
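One way to check the filters against your own log format is to count how many lines survive each stage. The following is a minimal sketch, not part of the script above: the sample lines are made up and assume nginx's default "combined" format, and `work`, `report`, and `apply_filter` are hypothetical names introduced for illustration.

```shell
# Hypothetical sample in nginx's default "combined" format; in practice you
# could fill $work with: zcat -f /var/log/nginx/your-site-access.* > "$work"
work=$(mktemp)
cat > "$work" <<'EOF'
203.0.113.5 - - [01/Jan/2021:00:00:01 +0000] "GET /posts/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"
203.0.113.5 - - [01/Jan/2021:00:00:02 +0000] "GET /css/main.css HTTP/1.1" 200 900 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"
198.51.100.7 - - [01/Jan/2021:00:00:03 +0000] "GET /wp-login.php HTTP/1.1" 404 150 "-" "curl/7.68.0"
198.51.100.7 - - [01/Jan/2021:00:00:04 +0000] "HEAD / HTTP/1.1" 200 0 "-" "curl/7.68.0"
EOF

report=""
apply_filter() {
	# $1 is a label for the report; the remaining arguments go to grep
	label=$1
	shift
	before=$(wc -l < "$work")
	grep "$@" "$work" > "$work.next" || true   # grep exits 1 when nothing is left
	mv "$work.next" "$work"
	after=$(wc -l < "$work")
	report="${report}${label}: $((before)) -> $((after))
"
}

# Same patterns as three of the stages in the script
apply_filter "static files" -Piv '\s(/css/|/js/|/icons/|/images/|\?)'
apply_filter "status 200"   -P   '\s200\s'
apply_filter "browser UA"   -Pi  'Mozilla/5\.0.*Gecko'

printf '%s' "$report"
rm -f "$work"
```

A stage that removes far more lines than expected on real logs is a hint that its pattern doesn't line up with your log format.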

Breaking it down

Let's break this down line by line.

zcat -f /var/log/nginx/emersonveenstra.net-access.*

The -f flag tells zcat to pass files that aren't gzipped straight through to stdout, so gzipped and plain log files alike get piped to grep.
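To see that behavior in isolation, here is a small sketch that builds one plain and one gzipped file in a temporary directory and reads both with a single command. The directory and file names are hypothetical stand-ins for real log files.

```shell
# Hypothetical demo directory standing in for /var/log/nginx
demo_dir=$(mktemp -d)

# One plain log and one gzipped log
printf 'plain line\n' > "$demo_dir/site-access.log"
printf 'gzipped line\n' | gzip -c > "$demo_dir/site-access.log.1.gz"

# -f passes the plain file through unchanged and decompresses the .gz one
combined=$(zcat -f "$demo_dir"/site-access.log*)
printf '%s\n' "$combined"

rm -r "$demo_dir"
```

Without -f, zcat would complain about the plain file instead of passing it through.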

grep -Piv "Petalbot|Lighthouse"

This line filters out the bots that GoAccess doesn't know about, plus any Lighthouse tests run on the site.

grep -Piv "\s(/css/|/js/|/icons/|/images/|\?)"

I don't want requests for any static files in the output, and my site doesn't use query strings.
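Running the filter over a few made-up lines shows what it actually removes. One detail worth checking on your own logs: inside the group, `\?` comes directly after `\s`, so it only matches a question mark that immediately follows whitespace, and a query string in the middle of a path slips through. The second command is a hypothetical variant, assuming the default "combined" log format, that looks for a `?` inside the quoted request instead.

```shell
# Made-up request lines in nginx's default "combined" format
sample=$(cat <<'EOF'
203.0.113.5 - - [01/Jan/2021:00:00:01 +0000] "GET /posts/goaccess/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"
203.0.113.5 - - [01/Jan/2021:00:00:02 +0000] "GET /css/main.css HTTP/1.1" 200 900 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"
203.0.113.5 - - [01/Jan/2021:00:00:03 +0000] "GET /posts/goaccess/?utm=x HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"
EOF
)

# The filter from the script: drops the CSS request, keeps the query-string one
kept_original=$(printf '%s\n' "$sample" | grep -Piv '\s(/css/|/js/|/icons/|/images/|\?)')

# Hypothetical variant: also drops any request whose path contains a "?"
kept_variant=$(printf '%s\n' "$sample" | grep -Piv '\s(/css/|/js/|/icons/|/images/)|"[A-Z]+ \S*\?')

printf 'original keeps %s line(s)\n' "$(printf '%s\n' "$kept_original" | wc -l)"
printf 'variant keeps %s line(s)\n' "$(printf '%s\n' "$kept_variant" | wc -l)"
```

The variant anchors on the opening quote of the request so that query strings in the referrer field (search engine URLs, for example) don't cause a visit to be dropped.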

grep -Pi "\s200\s"

I only want requests that have a 200 status code. Depending on how your logs are set up, this might also match response lengths of 200 bytes, but oh well.
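If the response-length collision bothers you, a stricter pattern is possible. This is a sketch that assumes the default "combined" format, where the status code comes right after the closing quote of the request; the sample lines are made up.

```shell
# Made-up lines: a 200 response, and a 404 whose body happens to be 200 bytes
sample=$(cat <<'EOF'
203.0.113.5 - - [01/Jan/2021:00:00:01 +0000] "GET /posts/ HTTP/1.1" 200 5120 "-" "curl/7.68.0"
203.0.113.5 - - [01/Jan/2021:00:00:02 +0000] "GET /missing/ HTTP/1.1" 404 200 "-" "curl/7.68.0"
EOF
)

# The pattern from the script matches both lines
loose=$(printf '%s\n' "$sample" | grep -P '\s200\s')

# Requiring the closing quote of the request first matches only the real 200
strict=$(printf '%s\n' "$sample" | grep -P '"\s200\s')

printf 'loose keeps %s line(s)\n' "$(printf '%s\n' "$loose" | wc -l)"
printf 'strict keeps %s line(s)\n' "$(printf '%s\n' "$strict" | wc -l)"
```

With a custom log_format the status may sit somewhere else, so check a line of your own log before relying on the stricter pattern.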

grep -Pi "Mozilla/5\.0.*Gecko"

This is where we might exclude some real users. All modern browsers have both Mozilla/5.0 and Gecko somewhere in their user agent. The goal is to filter out command line tools and exploit scanners. However, this also filters out most social media and chat platform link bots (Slack, Mastodon, Facebook, etc.). This is fine with me, but if you want to see where your site is being shared, add those UAs to the grep pattern. Lynx and elinks will also be excluded.
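Adding link bots back in could look like the sketch below. The extra tokens (`Slackbot`, `facebookexternalhit`, `Mastodon`) are examples of strings those services commonly send, not a complete list, so treat them as assumptions and check your own logs for the exact user agents. The sample lines hold only the user agent field.

```shell
# Made-up user agent strings, one per line
sample=$(cat <<'EOF'
"Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"
"curl/7.68.0"
"Slackbot-LinkExpanding 1.0 (+https://api.slack.com/robots)"
EOF
)

# The filter from the script: only the browser survives
browsers_only=$(printf '%s\n' "$sample" | grep -Pi 'Mozilla/5\.0.*Gecko')

# Variant with extra alternatives for link preview bots (example tokens)
with_link_bots=$(printf '%s\n' "$sample" \
	| grep -Pi 'Mozilla/5\.0.*Gecko|Slackbot|facebookexternalhit|Mastodon')

printf 'browsers only: %s line(s)\n' "$(printf '%s\n' "$browsers_only" | wc -l)"
printf 'with link bots: %s line(s)\n' "$(printf '%s\n' "$with_link_bots" | wc -l)"
```

Whether GoAccess then counts those requests as visitors or as crawlers depends on its own crawler handling and your configuration.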

grep -Piv "Chrome/[1-7]|Firefox/[1-6]"

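Again, we're in the territory of excluding real users here. A lot of bots that fake user agent strings use outdated versions of Chrome or Firefox. This filter also excludes real users whose browsers are several versions out of date, which isn't a huge problem because both browsers update automatically. Be aware that the pattern only checks the first digit of the version, so it also matches three-digit versions such as Chrome/100 and Firefox/100 and later; if current browsers have reached those versions, match the whole major version number instead. On my site, this is the most helpful filter, so I don't mind missing a few visitors because of it.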
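Because the pattern looks at a single digit, it can't tell Chrome/79 from Chrome/120. The sketch below compares it with a hypothetical variant that matches one or two digits followed by the dot, so only major versions 1 to 79 (Chrome) and 1 to 69 (Firefox) are dropped. The cutoffs are the ones from the script; the user agent strings are made up for the example.

```shell
# Made-up user agent strings, one per line
sample=$(cat <<'EOF'
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0
EOF
)

# The filter from the script: Chrome/120 is dropped because it starts with "1"
first_digit=$(printf '%s\n' "$sample" | grep -Piv 'Chrome/[1-7]|Firefox/[1-6]')

# Hypothetical variant: match the whole major version (one or two digits, then a dot)
major_version=$(printf '%s\n' "$sample" \
	| grep -Piv 'Chrome/[1-7]\d?\.|Firefox/[1-6]\d?\.')

printf 'first-digit filter keeps %s line(s)\n' "$(printf '%s\n' "$first_digit" | wc -l)"
printf 'major-version filter keeps %s line(s)\n' "$(printf '%s\n' "$major_version" | wc -l)"
```

Either way the cutoffs age quickly, so it's worth revisiting them every so often.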

grep -P "GET /"

It's a static site, so any HTTP requests other than GET aren't legit requests.

Conclusion

I hope this helps people looking for privacy-friendly ways to look at their site analytics. I wrote it for my Hugo site, but since it only reads nginx logs, it can be adapted to any type of site.

Thanks for reading this! If you liked it, feel free to share it to places that will also like it. If you are so inclined, you can buy me a Ko-Fi as well. If you have any questions or comments, you can contact me in various ways, and I'll do my best to help you out.