Wrangling Gmail Filters

I’ve long been an advocate for using filters to improve the signal-to-noise ratio of email. Ideally you want this to happen on the mail server, so that filtering is applied automatically regardless of where you actually read your mail. I like to keep most mail that isn’t directly for me out of my inbox, and automatically mark as read anything that is noisy or purely informational.

Gmail lets you create some pretty complex filters, but the UI for managing these can get quite cumbersome once you have more than a handful of rules. Fortunately gmail-yaml-filters exists to simplify the process.

I started by exporting my current rules as a backup, and then worked through the list duplicating them in the yaml syntax. I was able to take the 28 rules exported from Gmail and represent them in just 6 top level rules. Running gmail-yaml-filters on this file creates (more-or-less) exactly the same set of rules.

By combining rule definitions with the more operator, the rules become much easier to follow. For example, I like to label mailing lists and move them out of the inbox; using more, I can then selectively mark messages as read or delete them.

- list: <list.name>
  label: "Some List"
  archive: true
  not_important: true
  more:
    - subject: "Some annoying notification"
      read: true

    - from: something-noisy@example.com
      read: true
      delete: true

This generates an xml file with 3 filters:

  1. Everything from the mailing list is labeled with Some List, and archived.
  2. If the subject matches Some annoying notification it will additionally be marked as read.
  3. If the sender is something-noisy@example.com it will additionally be marked as read and deleted.

To build this inside Gmail I would need to remember to add all the conditions and actions for every rule - forgetting to add the list condition to the last rule would delete everything from that address, not just messages to that list.

It’s also easy to make fairly complex rules:

- from:
    all:
      - -work.com
      - -example.com
  more:
    - subject: 
        any:
          - webcast
          - webinar
          - workshop
          - scrum
      label: Webinars
      archive: true

    - has:
        any:
          - webcast
          - webinar
          - workshop
          - scrum
      label: Webinars
      archive: true

Because the top-level element has no actions of its own, this creates just two rules, each of which includes the negated from conditions from the top level.

Once you have a working filter set it only takes a few minutes to export it as xml, and import into Gmail. Technically you could give it access to do this for you but I don’t really trust anything to log into my email.
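The export step is a one-liner. Assuming the rules live in a file called filters.yaml (the filename is mine), it looks something like this:

```shell
# gmail-yaml-filters writes the generated filter XML to stdout, so
# redirect it to a file ready for Gmail's
# Settings → Filters and Blocked Addresses → Import filters.
gmail-yaml-filters filters.yaml > filters.xml
```

Check the project README for the exact flags if this doesn't match your installed version.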

Athena Partition Projection

We can make sure Athena only reads as much data as it needs for a particular query by partitioning our data. We do this by storing the data files in a Hive folder structure that represents the partitions we’ll use in our queries.

s3://mybucket/data/year=2021/month=06/day=27/file1.json
s3://mybucket/data/year=2021/month=06/day=27/file2.json
s3://mybucket/data/year=2021/month=06/day=28/file1.json

We can then create a table partitioned by the keys used in the folder structure.

CREATE EXTERNAL TABLE example (
    foo string,
    bar string,
    baz string
)
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/data/'

We then need to tell Athena about the partitions. We can either do this with ALTER TABLE example ADD PARTITION (year=2021, month=6, day=27);, or by running MSCK REPAIR TABLE example;, which will crawl the folder structure and add any partitions it finds. Once the partitions are loaded we can query the data, restricting the query to just the required partitions:

SELECT * FROM example
WHERE year=2021 AND month=6 AND day=27

The problem with this is that we either need to know about every partition before we can query the data, or repair the table to make sure our partitions are up to date - a process that takes longer and longer to run as our table grows.

There is a better way! By using partition projection we can tell Athena where to look for partitions. At query time, if the partition doesn’t exist, the query will just return no rows for that partition. Queries should also be faster when there are a lot of partitions, since Athena doesn’t need to query the metadata store to find them.

CREATE EXTERNAL TABLE example (
    foo string,
    bar string,
    baz string
)
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/data/'
TBLPROPERTIES (
    'projection.enabled' = 'true',
    'projection.year.type' = 'integer',
    'projection.year.range' = '2020,2021',
    'projection.month.type' = 'integer',
    'projection.month.range' = '1-12',
    'projection.month.digits' = '2',
    'projection.day.type' = 'integer',
    'projection.day.range' = '1-31',
    'projection.day.digits' = '2',
    'storage.location.template' = 's3://mybucket/data/year=${year}/month=${month}/day=${day}/'
)

We can query this table immediately, without needing to run ADD PARTITION or REPAIR TABLE, since Athena now knows what partitions can exist. Since we need to provide Athena with the range of expected values for each key, the year partition range will eventually need to be updated to keep up with new data.
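When that time comes, the range can be bumped in place with ALTER TABLE rather than recreating the table (the new upper bound here is just an example):

```sql
-- Extend the projected year range to cover new data
ALTER TABLE example SET TBLPROPERTIES ('projection.year.range' = '2020,2025');
```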

Another option is to project an actual date partition. This time we treat the date path in S3 (yyyy/MM/dd) as a single partition key, which Athena will read and convert to a date field. We call this partition date_created as date is a reserved keyword.

CREATE EXTERNAL TABLE example (
    foo string,
    bar string,
    baz string
)
PARTITIONED BY (date_created string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/data/'
TBLPROPERTIES (
    'projection.enabled' = 'true',
    'projection.date_created.type' = 'date',
    'projection.date_created.format' = 'yyyy/MM/dd',
    'projection.date_created.interval' = '1',
    'projection.date_created.interval.unit' = 'DAYS',
    'projection.date_created.range' = '2021/01/01,NOW',
    'storage.location.template' = 's3://mybucket/data/${date_created}/'
)

With a date partition we no longer need to update the partition ranges. Using NOW for the upper boundary allows new data to automatically become queryable at the appropriate UTC time. We can also now use the date() function in queries and Athena will still find the required partitions to limit the amount of data read.

SELECT * FROM example
WHERE date_created >= date('2021-06-27')

ECS Anywhere

AWS finally released ECS Anywhere last week, which allows you to use ECS to schedule tasks on on-premise hosts. The whole setup is very straightforward, and it’s quite reasonably priced at $0.01025 per hour for each managed ECS Anywhere on-premises instance - about $1.72 per week per host.

We need a couple of bits of supporting infrastructure first: an IAM role for our on-premise hosts, and an ECS cluster.

data "aws_iam_policy_document" "assume_role_policy_ssm" {
  statement {
    effect = "Allow"
    actions = ["sts:AssumeRole"]
    principals {
      type = "Service"
      identifiers = ["ssm.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "ecs_anywhere" {
  name               = "ECSAnywhere"
  assume_role_policy = data.aws_iam_policy_document.assume_role_policy_ssm.json
}

resource "aws_iam_role_policy_attachment" "amazon_ssm_managed_instance_core" {
  role       = aws_iam_role.ecs_anywhere.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

resource "aws_iam_role_policy_attachment" "amazon_ec2_container_service_for_ec2_role" {
  role       = aws_iam_role.ecs_anywhere.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}

resource "aws_ecs_cluster" "cluster" {
  name = "ECS-Anywhere"
}

Once that’s done we need to create an activation for the managed instances we want to add, with aws ssm create-activation --iam-role ECSAnywhere. This returns an ActivationId and ActivationCode, which will be used to register the instances with Systems Manager.
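Scripting the registration is easier if the activation details are captured as they are created. A sketch using the CLI's standard --query and --output options (the variable names are mine; the role name matches the Terraform above):

```shell
# One call creates the activation; read both fields from the
# tab-separated text output.
read -r ACTIVATION_ID ACTIVATION_CODE < <(aws ssm create-activation \
    --iam-role ECSAnywhere \
    --query '[ActivationId,ActivationCode]' --output text)
```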

Finally we are ready to register the instances with the cluster. On each machine we just need to download the provided install script and run it, passing in the region, cluster name and SSM activation details.

curl --proto "https" -o "/tmp/ecs-anywhere-install.sh" "https://amazon-ecs-agent.s3.amazonaws.com/ecs-anywhere-install-latest.sh"
sudo bash /tmp/ecs-anywhere-install.sh --region $REGION --cluster $CLUSTER_NAME --activation-id $ACTIVATION_ID --activation-code $ACTIVATION_CODE

That’s really all there is to it. The instances should appear in the ECS cluster console with instance IDs beginning with mi-.

ECS cluster with on-premise instances

Now that our cluster is up and running we can create a task definition and deploy it to our servers. Here I’ve just used the example task definition from the docs.

ECS Service running on on-premise instances

Accessing your current IP in Terraform

Even with Session Manager for accessing instances, sometimes it’s handy to just open up a port to your current IP address - to allow access to a load balancer, for example. One quick way to do this is with an external data source.

data "external" "current_ip" {
  program = ["bash", "-c", "curl -s 'https://api.ipify.org?format=json'"]
}

As long as the program returns JSON, we can access its properties, for example in a security group rule: cidr_blocks = ["${data.external.current_ip.result.ip}/32"].
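In context, a hypothetical rule opening HTTPS on a load balancer's security group might look like this (the security group itself is assumed to be defined elsewhere):

```hcl
resource "aws_security_group_rule" "lb_from_me" {
  type      = "ingress"
  from_port = 443
  to_port   = 443
  protocol  = "tcp"

  # /32 restricts access to just the address Terraform was run from
  cidr_blocks       = ["${data.external.current_ip.result.ip}/32"]
  security_group_id = aws_security_group.lb.id
}
```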

Don’t use this for anything other than testing though, since it’ll change if anyone else runs an apply!

Monitoring APT Updates with Grafana & Prometheus

Pending Update Metrics

APT conveniently has some hooks available to run custom scripts before, during and after patching. We can take advantage of these to publish a metrics file that can be picked up by node_exporter to monitor the status of pending updates across our servers.

First we need a script to get the number of updates available, and whether a reboot is required. We are leaning on the script in the update-notifier-common package, which outputs the number of pending updates and pending security updates.

#!/bin/bash -e
# /usr/share/apt-metrics

# apt-check prints "updates;security" to stderr, so capture both streams
APT_CHECK=$([ -x /usr/lib/update-notifier/apt-check ] && /usr/lib/update-notifier/apt-check 2>&1 || echo "0;0")

UPDATES=$(echo "$APT_CHECK" | cut -d ';' -f 1)
SECURITY=$(echo "$APT_CHECK" | cut -d ';' -f 2)
REBOOT=$([ -f /var/run/reboot-required ] && echo 1 || echo 0)

echo "# HELP apt_upgrades_pending Pending apt package updates."
echo "# TYPE apt_upgrades_pending gauge"
echo "apt_upgrades_pending ${UPDATES}"

echo "# HELP apt_security_upgrades_pending Pending apt security updates."
echo "# TYPE apt_security_upgrades_pending gauge"
echo "apt_security_upgrades_pending ${SECURITY}"

echo "# HELP node_reboot_required Node reboot is required for software updates."
echo "# TYPE node_reboot_required gauge"
echo "node_reboot_required ${REBOOT}"

We set up the APT::Update::Post-Invoke-Success and DPkg::Post-Invoke hooks to call this script, which will update our metrics after each apt update run, and after each package installation step. (sponge, from the moreutils package, writes the file atomically, so node_exporter never sees a half-written metrics file.)

# /etc/apt/apt.conf.d/60prometheus-metrics
APT::Update::Post-Invoke-Success {
  "/usr/share/apt-metrics | sponge /var/lib/node_exporter/textfile_collector/apt.prom || true"
};

DPkg::Post-Invoke {
  "/usr/share/apt-metrics | sponge /var/lib/node_exporter/textfile_collector/apt.prom || true"
};

As long as APT::Periodic::Update-Package-Lists is set in /etc/apt/apt.conf.d/10periodic, pending updates will now be exported as metrics via node_exporter. If unattended-upgrades is installed and configured, the metrics will also go back down as updates are installed automatically.
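This assumes node_exporter is already running with the textfile collector pointed at the directory the hooks write to; if not, it needs the directory flag set (the path here just matches the hooks above):

```shell
node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
```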

Automatic Update Annotations

We can take it a step further and add Grafana annotations for automatic update activity, to show what updates are being installed. These annotations are stored in Grafana, against a specific dashboard. In these examples my dashboard ID is 3. I’ve also added a Grafana API key in /etc/environment to allow us to push annotations.

We need to add a systemd drop-in for apt-daily-upgrade.service to pass in some additional options. This will run our /usr/share/annotate script when the upgrade job starts and stops.

# /etc/systemd/system/apt-daily-upgrade.service.d/environment
[Service]
EnvironmentFile=-/etc/environment
ExecStartPre=-/usr/share/annotate -d 3
ExecStartPost=-/usr/share/annotate

We also add another apt hook to record the details of each package before it is installed. This will be pushed as the body of the annotation once the apt run is complete.

# /etc/apt/apt.conf.d/60annotations
DPkg::Pre-Install-Pkgs {
	"/usr/share/annotate -p - || true";
};

The annotate script does most of the work. When updates start it creates an annotation in Grafana, and keeps a record of it under /var/run. When patching is complete the script updates the annotation to add an end time, and updates the body of the annotation with the details of the installed patches. The script calls grafana-annotation.py, a simple wrapper around the annotation API calls, to create the annotations.
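I haven't included grafana-annotation.py here, but the API calls it wraps are simple enough. A sketch of the equivalent curl calls, based on Grafana's annotations HTTP API (GRAFANA_URL is an assumption; timestamps are epoch milliseconds):

```shell
# Create an annotation on dashboard 3; the response includes the new id.
curl -s -X POST "${GRAFANA_URL}/api/annotations" \
    -H "Authorization: Bearer ${GRAFANA_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "{\"dashboardId\": 3, \"time\": $(($(date +%s) * 1000)), \"text\": \"Unattended upgrades started.\"}"

# Later, close it out with an end time and the final message.
curl -s -X PATCH "${GRAFANA_URL}/api/annotations/${ANNOTATION_ID}" \
    -H "Authorization: Bearer ${GRAFANA_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "{\"timeEnd\": $(($(date +%s) * 1000)), \"text\": \"${MESSAGE}\"}"
```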

#!/bin/bash -e
# /usr/share/annotate

while getopts ":d:p:" opt; do
    case $opt in
        d)
            DASHBOARD="$OPTARG"
            ;;
        p)
            PATCH="$OPTARG"
            ;;
        \?)
            echo "Invalid option -$OPTARG" >&2
            exit 1
            ;;
        :)
            echo "Option -$OPTARG requires an argument." >&2
            exit 1
            ;;
    esac
done

ANNOTATE=/usr/share/grafana-annotation.py
ANNOTATION_TMP=/var/run/unattended-upgrades-annotation.json
ANNOTATION_LOG=/var/run/unattended-upgrades-annotation-log

urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; }

if [[ -n "${DASHBOARD}" ]]; then
    echo "Annotating dashboard ${DASHBOARD}"
    # Create the start annotation
    ${ANNOTATE} --dashboard "${DASHBOARD}" --message "Unattended upgrades started." --output "${ANNOTATION_TMP}"
    exit 0
fi

if [[ -f ${ANNOTATION_TMP} ]]; then
    if [[ -n "${PATCH}" ]]; then
        echo "Input: ${PATCH}"
        if [[ "${PATCH}" = '-' ]]; then
            # Read from stdin
            PATCH=$(cat)
        fi
        echo "Recording applied patches"
        # Add to log and stop since we're not done.
        echo "${PATCH}" >> ${ANNOTATION_LOG}
        exit 0
    fi

    ANNOTATION_ID=$(jq --raw-output .id "${ANNOTATION_TMP}")
    if [[ -f ${ANNOTATION_LOG} ]]; then
        # Update the annotation
        echo "Completing annotation ${ANNOTATION_ID}"
        # Add an end time to the annotation
        COMMON_PREFIX="/var/cache/apt/archives/"
        PREFIX_LENGTH=$((${#COMMON_PREFIX} + 1))
        MESSAGE=$(cat ${ANNOTATION_LOG} | sort | uniq | cut -c ${PREFIX_LENGTH}-)
        ${ANNOTATE} --annotation "${ANNOTATION_ID}" --end "$(date +%s)" --message "${MESSAGE}"
    else
        echo "Deleting annotation ${ANNOTATION_ID}"
        ${ANNOTATE} --delete "${ANNOTATION_ID}"
    fi

    rm -f ${ANNOTATION_TMP} || true
    rm -f ${ANNOTATION_LOG} || true
    exit 0
fi