11 Jan 2022
I’ve long been an advocate for using filters to improve the signal to noise ratio of email. Ideally you want this stuff to happen on the mail server, so that the filtering happens automatically, regardless of where you actually read your mail. I like to keep most mail that isn’t directly for me out of my inbox, and then automatically mark as read things that are noisy or just notifications / informational.
Gmail lets you create some pretty complex filters, but the UI for managing these can get quite cumbersome once you have more than a handful of rules. Fortunately gmail-yaml-filters exists to simplify the process.
I started by exporting my current rules as a backup, and then worked through the list, duplicating them in the YAML syntax. I was able to take the 28 rules exported from Gmail and represent them in just 6 top-level rules. Running `gmail-yaml-filters` on this file creates (more or less) exactly the same set of rules.
By combining rule definitions using the `more` operator, the rules are much simpler to parse. For example, I like to label mailing lists and move them out of the inbox. Using `more`, I can then selectively mark as read or delete.
```yaml
- list: <list.name>
  label: "Some List"
  archive: true
  not_important: true
  more:
    - subject: "Some annoying notification"
      read: true
    - from: something-noisy@example.com
      read: true
      delete: true
```
This generates an XML file with 3 filters:

- Everything from the mailing list is labeled with `Some List`, and archived.
- If the subject matches `Some annoying notification`, it will be marked as read.
- If the sender is `something-noisy@example.com`, it will be deleted.
To build this inside Gmail I would need to remember to add all the conditions and actions for every rule - forgetting to add the `list` condition to the last rule would delete everything from that address, not just messages to that list.
It’s also easy to make fairly complex rules:
```yaml
- from:
    all:
      - -work.com
      - -example.com
  more:
    - subject:
        any:
          - webcast
          - webinar
          - workshop
          - scrum
      label: Webinars
      archive: true
    - has:
        any:
          - webcast
          - webinar
          - workshop
          - scrum
      label: Webinars
      archive: true
```
By not having any actions in the top-level element, this creates two rules, which both include the `not` filter at the top.
Once you have a working filter set it only takes a few minutes to export it as XML and import it into Gmail. Technically you could give gmail-yaml-filters access to do this for you, but I don’t really trust anything to log into my email.
27 Jun 2021
We can make sure Athena only reads as much data as it needs for a particular query by partitioning our data. We do this by storing the data files in a Hive folder structure that represents the partitions we’ll use in our queries.
```
s3://mybucket/data/year=2021/month=06/day=27/file1.json
s3://mybucket/data/year=2021/month=06/day=27/file2.json
s3://mybucket/data/year=2021/month=06/day=28/file1.json
```
We can then create a table partitioned by the keys used in the folder structure.
```sql
CREATE EXTERNAL TABLE example (
  foo string,
  bar string,
  baz string
)
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/data/'
```
We then need to tell Athena about the partitions. We can either do this with `ALTER TABLE example ADD PARTITION (year=2021, month=6, day=27);`, or by running `MSCK REPAIR TABLE example;`, which will crawl the folder structure and add any partitions it finds. Once the partitions are loaded we can query the data, restricting the query to just the required partitions:
```sql
SELECT * FROM example
WHERE year=2021 AND month=6 AND day=27
```
The problem with this is that we either need to know about every partition before we can query the data, or repair the table to make sure our partitions are up to date - a process that will take longer and longer to run as our table grows.
There is a better way! By using partition projection we can tell Athena where to look for partitions. At query time, if the partition doesn’t exist, the query will just return no rows for that partition. Queries should also be faster when there are a lot of partitions, since Athena doesn’t need to query the metadata store to find them.
```sql
CREATE EXTERNAL TABLE example (
  foo string,
  bar string,
  baz string
)
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/data/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2021',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1-12',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1-31',
  'storage.location.template' = 's3://mybucket/data/${year}/${month}/${day}/'
)
```
We can query this table immediately, without needing to run `ADD PARTITION` or `MSCK REPAIR TABLE`, since Athena now knows which partitions can exist. Since we need to provide Athena with the range of expected values for each key, the `year` partition range will eventually need to be updated to keep up with new data.
Another option is to project an actual `date` partition. This time we treat the date path in S3 (`yyyy/MM/dd`) as a single partition key, which Athena will read and convert to a date field. We call this partition `date_created` as `date` is a reserved keyword.
```sql
CREATE EXTERNAL TABLE example (
  foo string,
  bar string,
  baz string
)
PARTITIONED BY (date_created string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/data/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.date_created.type' = 'date',
  'projection.date_created.format' = 'yyyy/MM/dd',
  'projection.date_created.interval' = '1',
  'projection.date_created.interval.unit' = 'DAYS',
  'projection.date_created.range' = '2021/01/01,NOW',
  'storage.location.template' = 's3://mybucket/data/${date_created}/'
)
```
With a date partition we no longer need to update the partition ranges. Using `NOW` for the upper boundary allows new data to automatically become queryable at the appropriate UTC time. We can also now use the `date()` function in queries and Athena will still find the required partitions to limit the amount of data read.
```sql
SELECT * FROM example
WHERE date_created >= date('2021-06-27')
```
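To build some intuition for what this configuration does, here is a rough Python sketch (my own illustration, not Athena's actual implementation) of how a date projection with a one-day interval can enumerate candidate partition locations from the range and the storage location template:

```python
from datetime import date, timedelta

def project_date_partitions(start: date, end: date, template: str):
    """Enumerate candidate S3 locations for a projected date partition."""
    locations = []
    current = start
    while current <= end:  # one candidate per interval (1 DAY here)
        key = current.strftime("%Y/%m/%d")  # matches the 'yyyy/MM/dd' format
        locations.append(template.replace("${date_created}", key))
        current += timedelta(days=1)
    return locations

locs = project_date_partitions(
    date(2021, 6, 26), date(2021, 6, 27),
    "s3://mybucket/data/${date_created}/",
)
print(locs)
# → ['s3://mybucket/data/2021/06/26/', 's3://mybucket/data/2021/06/27/']
```

Because the candidate locations are computed from the table properties at query time, no metadata store lookup is needed; prefixes that don't exist in S3 simply contribute no rows.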
01 May 2021
AWS finally released ECS Anywhere last week, which allows you to use ECS to schedule tasks on on-premise hosts. The whole setup is very straightforward, and it’s quite reasonably priced at $0.01025 per hour for each managed ECS Anywhere on-premises instance - about $1.72 per week per host.
We need a couple of bits of supporting infrastructure first: an IAM role for our on-premise hosts, and an ECS cluster.
Once that’s done we need to create an authorisation for each managed instance we want to add with `aws ssm create-activation --iam-role ECSAnywhereRole`. This returns an `ActivationId` and `ActivationCode`, which will be used to register the instances with Systems Manager.
Finally we are ready to register the instances into the cluster. On each machine we just need to download the provided install script, and run it, passing in the region, cluster name and SSM activation details.
```bash
curl --proto "https" -o "/tmp/ecs-anywhere-install.sh" "https://amazon-ecs-agent.s3.amazonaws.com/ecs-anywhere-install-latest.sh"
sudo bash /tmp/ecs-anywhere-install.sh --region $REGION --cluster $CLUSTER_NAME --activation-id $ACTIVATION_ID --activation-code $ACTIVATION_CODE
```
That’s really all there is to it. The instances should appear in the ECS cluster console with instance IDs beginning with `mi-`.
Now that our cluster is up and running we can create a task definition and deploy it to our servers. Here I’ve just used the example task definition from the docs.
20 Apr 2021
Even with session manager for accessing instances, sometimes it’s handy to just open up a port to your current IP address - to allow access to a load balancer for example. One quick way to do this is with an external data source.
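The `external` data source definition itself appears to have been lost from this copy of the post. A minimal sketch of what it likely looked like - the `current_ip` name comes from the reference below, but the `api.ipify.org` endpoint is my assumption; any program that prints a JSON object to stdout works:

```hcl
# Hypothetical reconstruction of the missing snippet
data "external" "current_ip" {
  # ipify returns {"ip": "x.x.x.x"}
  program = ["sh", "-c", "curl -s 'https://api.ipify.org?format=json'"]
}
```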
As long as the `program` returns JSON, we can access its properties, for example in a security group rule: `cidr_blocks = "${data.external.current_ip.result.ip}/32"`.
Don’t use this for anything other than testing though, since it’ll change if anyone else runs an apply!
04 Dec 2020
Pending Update Metrics
APT conveniently has some hooks available to run custom scripts before, during and after patching. We can take advantage of these to publish a metrics file that can be picked up by `node_exporter` to monitor the status of pending updates across our servers.
First we need a script to get the number of updates available, and whether a reboot is required. We are leaning on the script in the `update-notifier-common` package, which outputs the number of pending updates and security updates.
```bash
#!/bin/bash -e
# /usr/share/apt-metrics

# apt-check prints "updates;security_updates" on stderr, so capture both streams
APT_CHECK=$([ -x /usr/lib/update-notifier/apt-check ] && /usr/lib/update-notifier/apt-check 2>&1 || echo "0;0")
UPDATES=$(echo "$APT_CHECK" | cut -d ';' -f 1)
SECURITY=$(echo "$APT_CHECK" | cut -d ';' -f 2)
REBOOT=$([ -f /var/run/reboot-required ] && echo 1 || echo 0)

echo "# HELP apt_upgrades_pending Apt package pending updates by origin."
echo "# TYPE apt_upgrades_pending gauge"
echo "apt_upgrades_pending ${UPDATES}"
echo "# HELP apt_security_upgrades_pending Apt package pending security updates by origin."
echo "# TYPE apt_security_upgrades_pending gauge"
echo "apt_security_upgrades_pending ${SECURITY}"
echo "# HELP node_reboot_required Node reboot is required for software updates."
echo "# TYPE node_reboot_required gauge"
echo "node_reboot_required ${REBOOT}"
```
We set up the `APT::Update::Post-Invoke-Success` and `DPkg::Post-Invoke` triggers to call this script, which will update our metrics after each apt update run, and after each package installation step.
```
# /etc/apt/apt.conf.d/60prometheus-metrics
APT::Update::Post-Invoke-Success {
  "/usr/share/apt-metrics | sponge /var/lib/node_exporter/textfile_collector/apt.prom || true";
};
DPkg::Post-Invoke {
  "/usr/share/apt-metrics | sponge /var/lib/node_exporter/textfile_collector/apt.prom || true";
};
```
As long as `APT::Periodic::Update-Package-Lists` is set in `/etc/apt/apt.conf.d/10periodic`, pending updates will now be exported as metrics via `node_exporter`. If unattended-upgrades is installed and configured, the metrics will also go back down as updates are installed automatically.
Automatic Update Annotations
We can take it a step further and add Grafana annotations for automatic update activity, to show what updates are being installed. These annotations are stored in Grafana, against a specific dashboard. In these examples my dashboard ID is 3. I’ve also added a Grafana API key in `/etc/environment` to allow us to push annotations.
We need to add an `environment` drop-in for `apt-daily-upgrade.service` to pass in some additional options to the `apt-daily-upgrade` service. This will run our `/usr/share/annotate` script when the update job starts and stops.
```ini
# /etc/systemd/system/apt-daily-upgrade.service.d/environment
[Service]
EnvironmentFile=-/etc/environment
ExecStartPre=-/usr/share/annotate -d 3
ExecStartPost=-/usr/share/annotate
```
We also add another apt hook to record the details of each package before it is installed. This will be pushed as the body of the annotation once the apt run is complete.
```
# /etc/apt/apt.conf.d/60annotations
DPkg::Pre-Install-Pkgs {
  "/usr/share/annotate -p - || true";
};
```
The `annotate` script does most of the work. When updates start it creates an annotation in Grafana, and keeps a record of it under `/var/run`. When patching is complete the script updates the annotation to add an end time, and updates the body of the annotation with the details of the installed patches. The script calls `grafana-annotation.py` to create the annotations, which is a simple wrapper around the annotation API calls.
```bash
#!/bin/bash -e
# /usr/share/annotate

while getopts ":d:p:" opt; do
  case $opt in
    d)
      DASHBOARD="$OPTARG"
      ;;
    p)
      PATCH="$OPTARG"
      ;;
    \?)
      echo "Invalid option -$OPTARG" >&2
      exit 1
      ;;
    :)
      echo "Option -$OPTARG requires an argument." >&2
      exit 1
      ;;
  esac
done

ANNOTATE=/usr/share/grafana-annotation.py
ANNOTATION_TMP=/var/run/unattended-upgrades-annotation.json
ANNOTATION_LOG=/var/run/unattended-upgrades-annotation-log

urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; }

if [[ -n "${DASHBOARD}" ]]; then
  echo "Annotating dashboard ${DASHBOARD}"
  # Create the start annotation
  ${ANNOTATE} --dashboard "${DASHBOARD}" --message "Unattended upgrades started." --output "${ANNOTATION_TMP}"
  exit 0
fi

if [[ -f ${ANNOTATION_TMP} ]]; then
  if [[ -n "${PATCH}" ]]; then
    echo "Input: ${PATCH}"
    if [[ "${PATCH}" = '-' ]]; then
      # Read from stdin
      PATCH=$(cat)
    fi
    echo "Recording applied patches"
    # Add to the log and stop, since we're not done yet.
    echo "${PATCH}" >> ${ANNOTATION_LOG}
    exit 0
  fi

  ANNOTATION_ID=$(jq --raw-output .id "${ANNOTATION_TMP}")
  if [[ -f ${ANNOTATION_LOG} ]]; then
    # Update the annotation
    echo "Completing annotation ${ANNOTATION_ID}"
    # Add an end time to the annotation
    COMMON_PREFIX="/var/cache/apt/archives/"
    PREFIX_LENGTH=$((${#COMMON_PREFIX} + 1))
    MESSAGE=$(cat ${ANNOTATION_LOG} | sort | uniq | cut -c ${PREFIX_LENGTH}-)
    ${ANNOTATE} --annotation "${ANNOTATION_ID}" --end "$(date +%s)" --message "${MESSAGE}"
  else
    echo "Deleting annotation ${ANNOTATION_ID}"
    ${ANNOTATE} --delete "${ANNOTATION_ID}"
  fi

  rm -f ${ANNOTATION_TMP} || true
  rm -f ${ANNOTATION_LOG} || true
  exit 0
fi
```
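The `grafana-annotation.py` wrapper itself isn't shown in the post. Here is a minimal sketch of what it might look like, assuming Grafana's annotations HTTP API (`POST`/`PATCH`/`DELETE /api/annotations`) and the flags used by the `annotate` script above; the `GRAFANA_URL` and `GRAFANA_API_KEY` environment variables are my assumptions, not the author's actual script:

```python
#!/usr/bin/env python3
# Hypothetical sketch of /usr/share/grafana-annotation.py
import argparse
import json
import os
import time
import urllib.request

GRAFANA_URL = os.environ.get("GRAFANA_URL", "http://localhost:3000")

def build_request(args):
    """Map the CLI flags onto a Grafana annotations API call."""
    headers = {"Content-Type": "application/json"}
    api_key = os.environ.get("GRAFANA_API_KEY")
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    if args.delete:
        # --delete ID: remove an annotation we no longer want
        return ("DELETE", f"{GRAFANA_URL}/api/annotations/{args.delete}", None, headers)
    if args.annotation:
        # --annotation ID --end TS --message M: close out an existing annotation
        body = {"timeEnd": int(args.end) * 1000, "text": args.message}
        return ("PATCH", f"{GRAFANA_URL}/api/annotations/{args.annotation}", body, headers)
    # --dashboard ID --message M: create a new annotation (times are epoch millis)
    body = {
        "dashboardId": int(args.dashboard),
        "time": int(time.time() * 1000),
        "text": args.message,
    }
    return ("POST", f"{GRAFANA_URL}/api/annotations", body, headers)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dashboard")
    parser.add_argument("--message")
    parser.add_argument("--annotation")
    parser.add_argument("--end")
    parser.add_argument("--delete")
    parser.add_argument("--output")
    args = parser.parse_args()
    method, url, body, headers = build_request(args)
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(url, data=data, headers=headers, method=method)
    with urllib.request.urlopen(req) as resp:
        result = resp.read().decode()
    if args.output:
        # The annotate script reads the annotation id back out of this file with jq
        with open(args.output, "w") as f:
            f.write(result)

if __name__ == "__main__":
    main()
```

Writing the create response to `--output` is what lets the bash script recover the annotation ID later via `jq .id`.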