Calculate a Baseline for Unique Website Hits Using Bash | Lookaway Information

Documented by kbruder tech✞ᵀᵀᴹ on May 19, 2021
Last updated on May 21, 2021



Introduction

Remember when sites had a "hits" counter back in the late 90s? Those were the days. In modern times, every public IPv4 address, and thus every website, gets bombarded with requests both legitimate and malicious. This makes it very difficult to count, from the web server's point of view, how many unique, real-life human beings are visiting a site. Of course there are many solutions to this problem at the application layer, but this is Lookaway, and we don't save vast amounts of garbage data on our visitors or members.

In this example, we are looking at a single-server website created with Lookaway CMS, running on a production system. The OS is Ubuntu 20, and the web server is Nginx. This method could be adapted across different Linux distros and web servers. It might even work in zsh! IDK, I'm a bash guy. If you are using load balancers, look at the logs on the node where all traffic for your domain arrives first.


  1. Sample a Log

    In this example we are following the Lookaway documentation examples on how to set up your own Lookaway website. If your site is different, use the log file that is specific to your domain. Copy the latest nginx-access log file to the temporary directory. The latest log file is live and will likely change while we are observing it. By making a copy, we can be reasonably certain it will not change during the analysis.

    $ cp logs/nginx-access.log /tmp
    

    Lookaway CMS Production Server - Ubuntu 20

    Learn how to deploy Lookaway CMS onto a public webserver using Ubuntu 20.04 and PostgreSQL 10.

    🌐 Configuring Logging - Documentation | Nginx
    This article describes how to configure logging of errors and processed requests in NGINX Open Source and NGINX Plus.
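    If you end up sampling more than once a day, a dated copy avoids clobbering an earlier snapshot. Here is a sketch; it creates a stand-in file so the example is self-contained, but on a real server the source would be the nginx-access log from the copy step above.

```shell
# Stand-in for the live log, so this sketch runs anywhere
echo 'stand-in log line' > /tmp/nginx-access.log

# Copy under a dated name, e.g. /tmp/nginx-access-2021-05-19.log,
# so repeated snapshots do not overwrite each other
cp /tmp/nginx-access.log "/tmp/nginx-access-$(date +%F).log"
```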

  2. Create a Crawler Bot Log Filter

    You might find a lot of search bots and attack bots in your website's access log. This is typical and nothing to panic about. We want to filter out anything that should not be counted as a hit. In this approach we are using the 'grep' command with the '-E' flag, so we need a list of strings to match against. This list will be lengthy and will likely need to change over time, so why not set it as a bash environment variable?

    This is my filter for now. Feel free to start with this one then change it to suit your needs. Add something like the following line to your bashrc file:

    Tip: This filter will not block any traffic to your site. It's just a shortcut for use on the command line.

    ~/.bashrc
    ...
    export NLOGFILTER='192.168.0.105|Petal|Google|AhrefsBot|Semrush|bingbot|crawler|CCBot|192.168.0.1|yandex|github-camo|DuckDuckBot|MJ12bot|TestBot|Baiduspider|NetSystemsResearch|IonCrawl|facebookexternalhit'
    

    Source: kbruder Tech
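    To see how the filter behaves, here is a sketch using a shortened copy of it (the three user-agent strings are made up for illustration). The '-E' flag enables the '|' alternation, and '-v' inverts the match, so any line naming one of the listed bots is dropped:

```shell
# Shortened stand-in for the full NLOGFILTER above
NLOGFILTER='Petal|Googlebot|bingbot'

# Only the line matching none of the bot strings survives
printf '%s\n' 'Mozilla/5.0' 'Googlebot/2.1' 'bingbot/2.0' | grep -E -v "$NLOGFILTER"
# → Mozilla/5.0
```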

  3. Mess Around with the Filter

    # Visitors ######################################################################
    $ grep -E -v "$NLOGFILTER" /tmp/nginx-access.log
    ...
    xx.xx.231.188 - - [18/May/2021:09:07:03 -0700] "GET / HTTP/1.1" 200 6189 "-" "Mozilla/5.0 (Linux; Android 11; SAMSUNG SM-G973U) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/14.0 Chrome/87.0.4280.141 Mobile Safari/537.36"
    xx.xx.198.189 - - [18/May/2021:12:46:21 -0700] "GET /api/search?folderIds=0 HTTP/1.1" 400 154 "-" "lkxscan/v0.1.0 (+https://leakix.net) l9explore/v1.0.0 (+https://github.com/LeakIX/l9explore)"
    ...
    
    # Looks like we missed a bot; we might want to add it to the filter! #########
    
    # 99% Legit traffic ############################################################
    $ grep -E -v "$NLOGFILTER" /tmp/nginx-access.log | grep '" 200 '  # match the status field, not just any "200"
    ...
    xx.xx.18.168 - - [11/May/2021:05:09:56 -0700] "GET /zine/information/lookaway-cms-html-templates/ HTTP/1.0" 200 56774 "http://lookaway.info" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3415.0 Safari/537.36"
    xx.xx.18.168 - - [11/May/2021:05:09:57 -0700] "GET /zine/information/lookaway-app-profiles/ HTTP/1.0" 200 16943 "http://lookaway.info" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3415.0 Safari/537.36"
    xx.xx.18.168 - - [11/May/2021:05:09:58 -0700] "GET /zine/information/lookaway-app-profile-views/ HTTP/1.0" 200 34229 "http://lookaway.info" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3415.0 Safari/537.36"
    xx.xx.18.168 - - [11/May/2021:05:09:59 -0700] "GET /zine/information/prepopulate-a-form-field-using-url-parameters-django/ HTTP/1.0" 200 13746 "http://lookaway.info" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3415.0 Safari/537.36"
    xx.xx.18.168 - - [11/May/2021:05:09:59 -0700] "GET /zine/information/prepopulate-a-form-field-using-slugs-and-integers-django-20210410/ HTTP/1.0" 200 16837 "http://lookaway.info" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3415.0 Safari/537.36"
    xx.xx.18.168 - - [11/May/2021:05:10:00 -0700] "GET /zine/information/new HTTP/1.0" 200 11394 "http://lookaway.info" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3415.0 Safari/537.36"
    xx.xx.18.168 - - [11/May/2021:05:10:01 -0700] "GET /zine/information/lookaway-app-profile-urls/ HTTP/1.0" 200 10725 "http://lookaway.info" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3415.0 Safari/537.36"
    xx.xx.18.168 - - [11/May/2021:05:10:02 -0700] "GET /zine/information/buttons-djangolookaway-cms/ HTTP/1.0" 200 15175 "http://lookaway.info" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3415.0 Safari/537.36"
    ...
    
    # Show only IPv4 addresses ####################################################
    $ grep -E -v "$NLOGFILTER" /tmp/nginx-access.log | grep '" 200 ' | grep -E -o "([0-9]{1,3}\.){3}[0-9]{1,3}"
    ...
    xx.xx.32.147
    xx.xx.32.147
    xx.xx.32.147
    ...
    
    # No more repeats #############################################################
    $ grep -E -v "$NLOGFILTER" /tmp/nginx-access.log | grep '" 200 ' | grep -E -o "([0-9]{1,3}\.){3}[0-9]{1,3}" | sort | uniq
    ...
    xx.xx.220.118
    xx.xx.70.118
    xx.xx.16.220
    xx.xx.183.147
    xx.xx.83.136
    xx.xx.98.21
    ...
    
    # Count the lines #############################################################
    $ grep -E -v "$NLOGFILTER" /tmp/nginx-access.log | grep '" 200 ' | grep -E -o "([0-9]{1,3}\.){3}[0-9]{1,3}" | sort | uniq | wc -l
    420
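
    To sanity-check the pipeline offline, here is a sketch that runs the same chain against a tiny hypothetical log. The IP addresses, user agents, and the shortened filter are all made up for illustration; the full NLOGFILTER from step 2 works the same way.

```shell
# Hypothetical five-line access log: two real visitors (one of them
# seen twice), one Googlebot hit, and one bingbot hit with a 404
cat > /tmp/sample-access.log <<'EOF'
10.0.0.1 - - [18/May/2021:09:07:03 -0700] "GET / HTTP/1.1" 200 6189 "-" "Mozilla/5.0"
10.0.0.1 - - [18/May/2021:09:07:05 -0700] "GET /about/ HTTP/1.1" 200 1200 "-" "Mozilla/5.0"
10.0.0.2 - - [18/May/2021:09:08:00 -0700] "GET / HTTP/1.1" 200 6189 "-" "Mozilla/5.0"
66.249.66.1 - - [18/May/2021:09:09:00 -0700] "GET / HTTP/1.1" 200 6189 "-" "Googlebot/2.1"
40.77.167.1 - - [18/May/2021:09:10:00 -0700] "GET / HTTP/1.1" 404 154 "-" "bingbot/2.0"
EOF

# Drop bots, keep 200s, extract IPs, dedupe, count
NLOGFILTER='Google|bingbot'
grep -E -v "$NLOGFILTER" /tmp/sample-access.log \
  | grep '" 200 ' \
  | grep -E -o '([0-9]{1,3}\.){3}[0-9]{1,3}' \
  | sort | uniq | wc -l
# → 2
```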
    

  4. Make a Script

    In order to get a baseline, we are going to need to check in often to see how fast the number increases. The more data points, the better. So let's make a script so we don't need to type in some insane line every time we check.

    Tip: Don't forget to turn on the execution bit with 'chmod u+x /path/to/script'.

    /home/user/loghits.sh
    #!/bin/bash
    # Lookaway Site Hits
    # Count unique "legit" visitor IPs in the current access log and append
    # one timestamped line to the CSV. Quoting "$NLOGFILTER" keeps the
    # pattern intact; '" 200 ' matches the status field, not just any "200".
    NLOGFILTER='192.168.0.105|Petal|Google|AhrefsBot|Semrush|bingbot|crawler|CCBot|192.168.0.1|yandex|github-camo|DuckDuckBot|MJ12bot|TestBot|Baiduspider|NetSystemsResearch|IonCrawl|facebookexternalhit'
    COUNT=$(/bin/grep -E -v "$NLOGFILTER" ~/logs/nginx-access.log | /bin/grep '" 200 ' | /bin/grep -E -o "([0-9]{1,3}\.){3}[0-9]{1,3}" | /bin/sort | /bin/uniq | /bin/wc -l)
    /bin/echo "$(/bin/date), $COUNT" >> ~/logs/loghits.csv
    

    Source: kbruder Tech
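
    The CSV line the script appends is just the current date followed by the count. Here is a minimal sketch of that append step on a throwaway file, with the count hard-wired to 42 so it runs without a real log:

```shell
# Stand-in for the pipeline result computed in the script above
count=42

# Append one "date, count" line, then show it
echo "$(date), $count" >> /tmp/loghits-demo.csv
tail -n 1 /tmp/loghits-demo.csv
```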

  5. Add a Cron Job

    Checking your hits manually is fun at first. Eventually, though, you are going to want to see the hits at all times of the day, week, month, or year. It would be nice to have a spreadsheet showing the number of hits with a timestamp at regular intervals. Our friend cron will help us with that. This example will count the total unique "legit" IPv4 addresses in the current log file and append it, along with the datetime, to a log file called "loghits.csv". A new line will be added at the top of each hour.

    # m h  dom mon dow   command
    # Lookaway Site Hits
    0 * * * * /home/user/loghits.sh
    

    Source: kbruder Tech

  6. Adjust Log Rotation (optional)

    Keep in mind that each time the log is observed using our method, we are seeing the total unique hits in a single file, and the observation is stateless. Hence each observation yields a cumulative sum over the date range of that log file, and each new log file starts back at 0 regardless of any log data that came before.

    Since we will likely be curious about the change in traffic over time, it can be handy to ensure the log uses time-based rotation. This makes it easier to combine data because the count resets to 0 at regular intervals rather than at arbitrary times, as with size-based rotation.

    The default configurations for logrotate distributed with Ubuntu 20 have this behavior, as far as I have seen, likely because modern systems have ample storage space for log files.

    Caution: Only use time-based log rotation if you have PLENTY of storage space. A sudden spike in traffic could produce very large files.

    🌐 https://linuxconfig.org/logrotate
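
    For reference, a daily rotation stanza might look something like the following. This is a sketch, not a drop-in config: the log path is an assumption based on the examples above, and you should reconcile it with whatever logrotate already manages on your system.

```
/home/user/logs/nginx-access.log {
    daily          # rotate on a time schedule, not on file size
    rotate 30      # keep a month of old logs
    compress
    delaycompress  # leave the most recent rotated log uncompressed
    missingok
    notifempty
}
```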

  7. The Results

    After a day, we already have 24 data points. After a week we can start to see the busy times. After a year we will start to see the busy season. This graph was created in LibreOffice by simply importing the CSV file and choosing the create chart option.

  8. Considerations

    Keep in mind that this is not going to give you the exact number of humans who have visited your site since a given date. It is more like an index, or a rough estimate at best. Some visitors, I have noticed, arrive through proxies, reverse proxies, or edge servers. Given the dynamic nature of HTTP traffic, results may vary.

    Insight is not so much about metrics as it is about the change in metrics against other observable factors. A spike in your data will not tell you what caused it, but it will tell you that you need to get down there and take a look.

Conclusion

This approach will not give you precise metrics or superfluous data, but if you keep your filters up to date and check in regularly, you will have your finger on the pulse of your website and will know when you are getting spikes of genuine traffic. As doctors say, "If you don't look, then you don't know."

