Release of the Startup CTO’s Handbook

As of today my book is published and available on Amazon!

Sometime in mid-2021 someone recommended I read The Great CEO Within by Matt Mochary. I absolutely loved it – no fluff, direct, to the point, actionable mini-chapters on a wide range of topics relevant to a business leader. After reading it, I couldn’t help but think that there was a gap in the market for a similar manual, a handbook, for technical leaders.

I don’t mean a book covering enterprise design patterns, or how to write clean Javascript. There are plenty of excellent resources on how to do the technology part of a tech leader’s role. I was craving a resource for the leader part of “tech leader.”

And so last year I sat down and started writing, soliciting from friends, peers, and social media the topic areas that would be useful in a tech leader’s handbook. Fifteen months later, and with many thanks to the wonderful people at my publisher, WorldChangers Media, the book is finally here!

My goal is for the handbook to be a useful tool that grows and improves organically over time with community feedback and criticism. To that end, the book is NOT traditionally copyrighted; it is licensed as CC BY-NC-SA 4.0, which essentially means you’re free to make copies of the book, and even make changes, as long as you keep my name attached and don’t try to resell your work. As a result, I’m keeping a public link to the content up for free – https://github.com/ZachGoldberg/Startup-CTO-Handbook. I would very much appreciate and look forward to your pull requests/issues!

Running your README in CI

The continuous integration (CI) environment has become an essential component of modern software development workflows. I’ve seen CI most often used to run basic builds: ensuring code compiles, passes lint/security/code-scanning rules, and passes unit tests. Adventurous teams even trigger end-to-end builds or visual regression tests. In my experience, teams usually have custom build scripts set up specifically for CI, such as GitHub Actions YAML workflows or separate make files. There’s certainly nothing wrong with that; however, in this blog post I’ll endeavor to convince you to either replace that build system, or perhaps augment it, with a full developer-like build based on your README.

Motivation

If you’ve not yet read The Unicorn Project by Gene Kim, I’ll give you the quick summary: high performing teams run builds constantly and can onboard a new developer from zero to testing code in a production-like environment in minutes or hours, not days or weeks.

We use CI to verify our most important goals: that code is defect-free and can be deployed to production with confidence. Why not also use CI to ensure a developer can get up and running locally with confidence?

The Hot Take

Run a build in CI using the same steps as your developer README. If your README says

yarn
yarn start

then make sure you do that in CI. If your developer build says

docker compose up

then install Docker in your CI environment and do that. If your README has multiple pages of convoluted steps required to check out multiple repositories and run multiple commands, then you’ve got developer experience debt: simplify and automate that process until it can run unattended.
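
As a concrete sketch, here's roughly what that could look like as a GitHub Actions workflow for the yarn example above (the file name, Node version, port, and smoke-test step are my own assumptions, not a prescription):

# .github/workflows/readme-build.yml (hypothetical)
name: readme-build
on: [push]

jobs:
  readme-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: 18
      # The same two commands the README tells a new developer to run.
      - run: yarn
      # "yarn start" usually launches a long-lived dev server, so background it
      # and smoke-test that it actually comes up.
      - run: |
          yarn start &
          sleep 30
          curl --fail http://localhost:3000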

But what about secrets and configuration?

A common objection to running this kind of build in CI has to do with all the dependencies required, with secrets and configuration being the most painful. Many teams have a “secrets” file that is passed from developer to developer or distributed through some other bureaucratic process. A better path is to commit secret names or locations, not values, to your repository. You can then use runtime tools (e.g. Berglas or whisper) to authenticate with a secret store and pull the values. All that’s left is to set up your CI environment with authentication (perhaps via Workload Identity Federation) so it can pull down a non-production version of the configuration it needs.
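
To make that concrete, here's a minimal sketch using Berglas; the bucket, secret, and command names are placeholders I've made up:

# Commit the reference, not the value. An env file in the repo might contain:
#   DATABASE_PASSWORD=berglas://my-secrets-bucket/staging-db-password
# At runtime (locally or in CI) berglas resolves the reference from the secret
# store and execs the real command with the value injected:
$ berglas exec -- yarn start

CI then only needs read access to the non-production secrets, which is exactly what an authentication mechanism like Workload Identity Federation gives you.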

Benefits

Running a developer-like build in a CI environment is a valuable opportunity to validate the build process and ensure that all dependencies, configurations, and components work together correctly. It helps guarantee that a new developer can build and run the application without hiccups, reducing onboarding time and improving initial productivity.

A developer-build executed in a CI environment helps the technical team identify any integration issues or broken code early on. Early detection means we can prevent breakages from being merged, and we’re much more likely to fix the breakage efficiently if it’s caught quickly, reducing build maintenance cost as well.

Access your 3D printer from anywhere with Octoprint and Zerotier

In this article I’ll walk you through how to set up your (Octoprint-compatible) 3D printer so that you can start, stop, and monitor your prints from anywhere.

But first, some quick backstory. My wife has recently picked up glass blowing, and she asked me for help making some molds. We had a go using some modelling clay and, well, let’s just say it was pretty clear a 3D printer would get us more accurate/usable results. On Cyber Monday the Anycubic Kobra Go printer went on sale for under $200 for a printer & full spool of filament combo and I couldn’t resist.

Fast forward a week: the printer is set up in the living room and I’m running back and forth between my office and the printer with a micro SD card ten times a night. Not that I mind the exercise, but it all felt wildly inefficient.

Enter Octoprint, an open source print server and web interface that’s compatible with most consumer grade 3D printers. Octoprint is (generally) run on a Raspberry Pi and connected to a 3D printer with a USB cable. It also features first-class support for Raspberry Pi or USB webcams for monitoring prints and recording time-lapse videos. Getting Octoprint running is straightforward, and the instructions for getting started with OctoPi (a custom Raspberry Pi image with Octoprint preinstalled) worked very well for me. If you go with OctoPi, be sure to enter the “advanced” custom image menu and set up an SSH username and password, which we’ll use later.

So at this point you’ve got Octoprint running on your home network and you can control your 3D printer from anywhere in the house. Excellent. Then you start a three-hour print, and fifteen minutes in your partner asks you to run to the grocery store for some eggs. But as the diligent 3D printer operators we are, we can’t let the printer go entirely unattended; what if there’s a layer shift? You’d have to stop the print to avoid wasting filament!

There are some very user-friendly plugins that let you log in to your Octoprint instance via the cloud. Generally they work well; some are free, some are paid, but they rarely have 100% feature parity and, of course, they rely on third-party servers. Enter ZeroTier. ZeroTier is a networking tool that, in effect, lets you set up what feels like a VPN with extremely minimal effort. As they like to describe it, it’s “click, click, done.”

Step one is to set up a ZeroTier network. This is done by creating an account on zerotier.com. Once there, hit “Create a Network”, which, with a single click, will create your virtual network and give you a network ID, something like “225ab4422bc54915fd.”

Step two is to set up ZeroTier on your mobile device, be it a laptop or phone; there are clients for Windows, OS X, Linux, Android, iOS, etc. You won’t need to log in to ZeroTier on the device, you’ll only need to join a network, which just requires the network ID from step one. Once the device has joined the network, go back to zerotier.com and authorize the device. Here’s a tutorial video that walks through this process.

Step three is to set up ZeroTier on your OctoPi. You’ll need SSH access to your Octoprint server for this step, which, if you’re using OctoPi, you configured during image generation.

ssh pi@octopi.local
$ curl -s https://install.zerotier.com | sudo bash
$ sudo zerotier-cli join NETWORK_ID_FROM_STEP_ONE
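
Optionally, confirm the Pi actually joined; the network will typically show a status of ACCESS_DENIED until you authorize it in the next step:

$ sudo zerotier-cli listnetworks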

Step four: finally, return to zerotier.com and authorize the Raspberry Pi to join the network. That’s it; your mobile device and Pi are connected on what feels like a LAN, regardless of which networks they’re actually on! Zerotier.com will show you an internal IP (e.g. 172.22.1.2) for your Octoprint server, which you can use on your mobile device in a web browser, or in any compatible Octoprint app, to get full access to the Octoprint server, webcam, gcode streaming and all.

Note: this setup does require an online account with ZeroTier; however, they require no payment or personal info to set up a network as described here. Traffic between your mobile device and Octoprint will primarily be point-to-point and not routed through any ZeroTier servers, though it is possible that the first few packets on the network will be proxied by ZeroTier itself. You can learn more about ZeroTier here.

SF Engineering Leadership Annual 2022

I had the good fortune of attending the San Francisco Engineering Leadership Community’s 2022 Annual conference, their first since the start of the COVID-19 pandemic. The event took place over two days at Fort Mason in San Francisco, with about a dozen vendors with booths, three “keynote” spaces, and a handful of tables for smaller group conversations. In this blog post I’ll summarize some of my observations and learnings from the two days:

Management Tips

Build & Staff Teams by Business Priority

Most companies will naturally devote time and energy to teams in proportion to their size. If you have six engineers divided into two teams of three, then it’s natural for each team to get roughly similar energy and attention across the business, even if one of the teams is working on a problem that is meaningfully higher impact to the business. Be careful to design teams and team sizes to match business value, or to explicitly devote more energy to the higher impact groups.

Dev Experience is a Major Trend

Most successful tech organizations (Netflix, Google, Facebook, Slack etc.) spend at least 10% of all tech resources on developer experience and tools.

High Performing Teams

The Story Points fad is dead — every leader I spoke to or polled, about 20 in all, agreed that measuring and optimizing for velocity via story points (or similar) is not productive.

One way to identify high(est) performing teams: The 360 Team NPS Score. You survey other teams throughout the organization and have them take an NPS survey for your various engineering teams. You then ask each team to do an internal NPS score (commonly called an eNPS). If both the external and internal NPS scores come back good, meaning that other teams perceive the team as high performing, and the team internally is happy, then you’ve probably got a high performing team.

Team of Teams by Stanley McChrystal was referenced a few times, great book. In general there was a lot of focus on the impact of leadership empowering their teams; creating high stakes and strong mission alignment.

Remote Work

Remote is hardest on junior employees.  It’s hard for them to get interrupt driven help and mentorship.  Some common solutions include hybrid work (teams get together in person at least once a week), hiring fewer Juniors, ramping up juniors in person before going to distributed/remote.

Internal Q&A sites sound good on paper but don’t take off. Every manager I spoke to who had tried Stack Overflow for Teams (or Gleen or Threads) said it failed to catch on.

Slack conversation is for ephemeral content. Identifying and porting knowledge from Slack to a wiki is a lossy process, nobody identified any robust or reliable processes other than “keep encouraging everyone to use the wiki.”

In hybrid organizations all in person employees should join remote meetings independently. This was widely agreed upon as the only way to ensure a productive hybrid meeting. 

Donuts have been moderately successful on other remote teams. They’re better than nothing but not a silver bullet for creating serendipity and social connection.

I asked a focus group of 15 other engineering leaders what kind of team they think they’ll be managing in 10-15 years, 100% of hands went up for remote, 0% for in person.

Idea for minimizing “unread Slack channel anxiety”: designate some Slack channels as special, required reading and have people star those channels during onboarding. Then post infrequent but important updates there. Everything else should be assumed to be ephemeral.

DORA Metrics

DORA Metrics are four metrics designed to measure the speed and quality of an engineering team’s delivery: deployment frequency, lead time for changes, change failure rate, and time to restore service. Sleuth, a company that measures DORA metrics, was at the conference and happy to espouse the benefits of continuous deployment.

Interestingly, at an earlier round-table that day discussing high performing teams, I asked how many other managers were tracking these metrics, and of about 15 people none indicated they were familiar with DORA metrics. Everyone knew what continuous deployment was, though, and the universal sentiment was that it is A Good Thing to strive for.

Memorable Quotes

“You can’t a/b test organizational changes or management decisions, especially in a growing organization”

“Don’t look back, you’re not going that way” — thinking about careers

“Tools dictate your process, your process informs your culture, your culture guides tool choice.”

Misc. Conference Tips

Small Groups are Key

I found the larger talks to be, on the whole, lower value than the small group conversations. That’s not to say they had no value; it was nice to hear directly from folks who have done very respectable things, hear about their journeys, and get a sense of who they are as people. Lindsey Simon, VP of Engineering at Vercel, in particular has a great sense of humor and was a very engaging speaker. The small groups, though, were considerably more thought provoking, and that’s where I spent most of my time.

Being Curious is Key

On two occasions I went out of my way to be empathetic or curious with vendors at the conference. The first was an engineer asking hard-hitting questions of a sales rep for some SaaS software. They were fair questions, but to me it was clear the salesperson was out of their technical depth and not able to produce a satisfying answer. After listening for a minute or so, I took an educated guess at what the questioner might be looking for, throwing the salesperson a bone and letting them off the hook for that line of questioning. Needless to say, he was very thankful, and we had an extended and honest conversation afterward, both about his product and about the world of selling SaaS.

Not long thereafter I met the founder of Metaview.ai. He, being the curious, customer-focused founder he is, started asking me about my company and interview process. I gave him the rundown, and then got very curious about his business: what motivated him to solve this problem, how he thinks about interviews, how to provide fair and consistent interview experiences, his philosophy on training teams to hire, and so on. We got into it for a good few minutes, and I must have made a good impression, as he forwarded me an invitation to a dinner his company was curating that evening. I graciously accepted, and that dinner turned out to be one of the highlights of the trip for me!

Fun / Random Knowledge

All the sounds in Slack came from the company’s original idea, a video game, which they pivoted away from

Vercel is pronounced ver-sell not versil

Photos

Jon Hansley, CEO of Emerge, a product consultancy in Oregon, discussing Alignment
Free book #1
Free book #2

 

Setting up budgets for cloud usage with Terraform

From time to time your team may want to use a new service from your cloud provider. That request may come with an estimated usage cost for the service, and if it fits in the budget and seems like good ROI, it will be approved. For most startup projects, that’s where cloud cost control ends. With just a bit of extra effort, especially if resources are already being provisioned with Terraform, you can use the budgeting tools offered by Amazon, Google, etc. to ensure the actual cost aligns with expectations.

For the purposes of this example, I’ll use Google Cloud Budgets, but the analogous resources and APIs exist in AWS and Azure.

Goal: Add a budget to monitor the cost of a new Google Cloud Run service your team wants to deploy. 

Prerequisites: An operational knowledge of Terraform and editor access to a Google Cloud Project & Google Cloud Billing Account

Part 0 – Become familiar with your cloud provider’s budgeting tool

If you haven’t already, spend a few minutes creating a budget using the cloud console itself. The various parameters and options in Terraform will make a lot more sense if you’ve already got the context and perspective of how the budgeting process as a whole works. In Google Cloud, budgets are under “Budgets & alerts” in the billing section.

Part 1 – Setup the cloud run project

This is just a sample taken directly from the Terraform resource documentation:

resource "google_cloud_run_service" "default" {
  name     = "cloudrun-srv"
  location = "us-central1"

  template {
    spec {
      containers {
        image = "us-docker.pkg.dev/cloudrun/container/hello"
      }
    }
  }

  traffic {
    percent         = 100
    latest_revision = true
  }
}

Part 2 – Find the service ID

When setting up a budget with Google Cloud you have the option of having the budget monitor the cost of a specific service via a filter. The Terraform resource for creating budgets with these filters requires you to specify the service by the service’s ID. You can find the IDs of the various services in the cloud console’s billing UI.

Part 3 – Setup the budget

The Terraform code below is lightly modified from the sample code in the Google Cloud billing budget Terraform resource documentation:

data "google_billing_account" "account" {
  billing_account = "000000-0000000-0000000-000000"
}

data "google_project" "project" {
}

resource "google_billing_budget" "budget" {
  billing_account = data.google_billing_account.account.id
  display_name = "Project X Cloud Run Billing Budget"

  budget_filter {
    projects = ["projects/${data.google_project.project.number}"]
    credit_types_treatment = "EXCLUDE_ALL_CREDITS"
    services = ["services/152E-C115-5142"] # Cloud Run
  }

  amount {
    specified_amount {
      currency_code = "USD"
      units = "100" # $100 per month
    }
  }

  threshold_rules {
    threshold_percent = 0.5 # Alert at $50
  }
  threshold_rules {
    threshold_percent = 0.9 # Alert when forecast to hit $90
    spend_basis = "FORECASTED_SPEND"
  }
}

You can even set up custom alerting rules so the teams that create new infrastructure are the ones notified if/when spend exceeds the amount forecast during planning and development.
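
Here's a rough sketch of how that routing could look, assuming a hypothetical email notification channel for the team that owns the new service; the all_updates_rule block would be added to the google_billing_budget resource above:

# A notification channel for the owning team (the address is a placeholder).
resource "google_monitoring_notification_channel" "cloud_run_team" {
  display_name = "Cloud Run team budget alerts"
  type         = "email"
  labels = {
    email_address = "cloud-run-team@mycompany.com"
  }
}

# Added inside resource "google_billing_budget" "budget":
#
#   all_updates_rule {
#     monitoring_notification_channels = [
#       google_monitoring_notification_channel.cloud_run_team.id,
#     ]
#     # Alert the owning team rather than the billing account admins.
#     disable_default_iam_recipients = true
#   }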

Production SQL Server Checklist & Best Practices

Much ink has been spilled on which database you should use, or how to think about which database to use, for your project.  My aim in this post is not to sell you on any database paradigm, but rather to serve as a reference guide and checklist for how to responsibly host a SQL server, be it MySQL, PostgreSQL, or another, in production.

Before calling your SQL database plan ready-for-implementation, ask yourself if you’ve thought about all these requirements:

☐ Read only replicas
☐ Multi-zone hosting
☐ Automated daily backups
☐ One click rollback / backup restore
☐ Event based audit logging / full table histories / log replication
☐ Automatic disk expansion
☐ High quality Migration tooling
☐ Connection/IP Security
☐ Local dev versions / easy ability to download prod data
☐ Staging / recent-prod replica environments
☐ CPU, Memory, Connection monitoring / auto scaling
☐ Cost Monitoring
☐ Slow query monitoring
☐ High quality ORM / DB Connection library

In practice it’s very expensive or impossible to do all of these things yourself; your best bet is to choose a solution that comes with many of these features out of the box, such as Google Cloud SQL or Amazon RDS. Just make sure to enable the features you care about.

—————-

Read only replicas

More often than not, a production SQL server will have use cases that divide neatly into read-heavy vs. write-heavy.  The most common is perhaps the desire to do analytics processing on transactional data.  Generally this should be handled with a proper data pipeline/enterprise data warehouse, but having a real-time read-only mirror is a good practice regardless, even just for your ELT tools.
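
If you host on Cloud SQL and manage it with Terraform, a read replica is only a few lines of configuration. A rough sketch, assuming a primary instance already managed in the same configuration as google_sql_database_instance.primary (names and tier are placeholders):

resource "google_sql_database_instance" "replica" {
  name             = "prod-db-replica"
  database_version = "POSTGRES_13"
  region           = "us-west2"

  # Pointing at an existing primary makes this instance a read-only replica.
  master_instance_name = google_sql_database_instance.primary.name

  settings {
    tier = "db-custom-2-7680"
  }
}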

Multi-zone hosting

If AWS us-east-1 goes down, will your application survive?  Have a plan to ensure data is replicated in real time between zones, or even better between datacenters entirely.  

Automated daily backups

Ideally you have at least daily, if not more frequent, full backups that are sent off-site.  Depending on your requirements, perhaps that’s an exported zip file to a storage bucket with the same cloud provider, or perhaps it’s a bucket in an entirely different cloud.  Make sure that everything about this process is secure and locked down tight; these are entire copies of your database, after all.

This is a good use case for that realtime read only replica.
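
As one concrete pattern on Google Cloud (instance, bucket, and database names below are placeholders), a scheduled job can export a logical dump to a locked-down bucket, ideally in a different project or region:

# Run from a nightly cron/scheduler job; the .gz suffix compresses the dump.
$ gcloud sql export sql prod-db gs://offsite-db-backups/prod-$(date +%F).sql.gz \
    --database=appdb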

One click rollback / backup restore

Most cloud-hosted SQL options will offer one-click point-in-time restore.  At a minimum, ensure you have an entirely automated way, tested regularly, to restore from one of your hourly or daily backups.

Event based audit logging / full table histories / log replication

Different databases have different terminology for this; in Postgres they’re replication slots, in MSSQL it’s log replication.  The idea is that you want CDC (change data capture) for every mutation to every table, recorded in a data warehouse for your analytics team to use as they need.  Such data can be used to produce business audit logs, or to run point-in-time analytics queries that answer questions for users such as “what was my inventory like last week?”

Automatic disk expansion

Nobody likes getting an alarm at 3AM that their database has hit its disk storage limit.  In case it’s not obvious, very bad things happen when a database runs out of disk space.  Make sure your SQL solution never runs out of disk by using a platform/tool that will expand storage automatically.  Ideally it shrinks automatically too.
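
On Cloud SQL, for example, this is a couple of settings on the instance resource; a sketch (names and the size cap are arbitrary):

resource "google_sql_database_instance" "primary" {
  name             = "prod-db"
  database_version = "POSTGRES_13"
  region           = "us-west2"

  settings {
    tier = "db-custom-2-7680"

    # Grow the disk automatically instead of paging someone at 3AM,
    # with a cap so a runaway process can't grow costs without bound.
    disk_autoresize       = true
    disk_autoresize_limit = 500 # GB
  }
}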

High quality Migration tooling

Schema and data migrations are hard; don’t try to solve these problems yourself.  Use a tool or framework that will help you generate migrations and manage their execution across environments.  Remember that your migration has to work locally, both for developers who have used the repository before and for brand-new developers, as well as in all staging, feature-branch, and production environments.  Don’t underestimate the difficulty of this challenge.

Connection/IP Security

Often you can get away with IP-allowlisting access to a database, but in 2022 that’s going out of style (and will be flagged by PCI or SOC 2 auditors). Nowadays your database should be in a private VPC with no internet access, networked/peered with your application servers. Keep in mind that this will make access for developers challenging; that’s a good thing!  It’s still a good idea to have a strategy, either with a proxy or a bastion host, for emergencies.
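
For that emergency path, a proxy that authenticates with your cloud IAM beats opening a firewall hole. With Cloud SQL, for example, the Auth Proxy gives you a local port (the connection name below is a placeholder); for a private-IP-only instance you'd run it from a host that can reach the VPC, such as the bastion:

# Tunnel to the instance using your gcloud/IAM credentials, then connect
# to localhost:5432 with psql as usual.
$ cloud_sql_proxy -instances=my-project:us-west2:prod-db=tcp:5432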

Local dev versions / easy ability to download prod data

You’ll want tooling to download a copy of sanitized production data for testing.  Something that runs well on a local machine with 1000 rows may be unacceptably slow in production with 2 million records.  Those 2 million records may cause trouble not just due to volume, but also data heterogeneity — real world users will hit edge cases your developers may not.  

CPU, Memory, Connection monitoring / auto scaling

Ensure you have monitoring and, ideally, autoscaling for CPU, memory, and connection counts on your SQL database.  It should be somebody’s job to check from time to time that these values are within acceptable ranges for your use case.

Cost Monitoring

SQL databases are generally some of the more expensive parts of the stack. I recommend you set up a budget using tools in your cloud provider so you know how much you’re spending and can monitor growth.

Slow query monitoring

It’s easy to shoot yourself in the foot with SQL, whether using an ORM or writing raw SQL, and generate very expensive, slow queries.  You’ll want logging, and ideally alerting, for anything abnormally slow that makes it to production.
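
Most managed Postgres offerings let you set a slow-query threshold with a database flag; on Cloud SQL with Terraform that's a small addition to the instance settings (the one-second threshold is just an example):

# Inside the settings block of your google_sql_database_instance resource:
database_flags {
  name  = "log_min_duration_statement" # Postgres: log statements slower than this many ms
  value = "1000"
}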

High quality ORM / DB Connection library

Don’t forget about developer experience!  Do you want to be writing raw SQL or using an ORM/DAL?  There are tradeoffs in both directions, think through your options carefully.  Does the ORM come with a migration tool?  Does it have built-in connection pooling?  

A code-free IaC link shortener using Kutt and GKE

Goal: Deploy a link shortener on your own domain without writing any (non-infrastructure) code.

Prerequisites: An operational Kubernetes cluster, knowledge of Kubernetes & Terraform, basic knowledge of Google Cloud Platform

At every company I’ve worked at in the past decade we’ve had some mechanism to create memorable links to commonly used documents.  Internally at Google, at least when I was there around 2010, they used the internally resolving name “go”, e.g. “go/payroll” or “go/chromedashboard” would point to the internal payroll site or an internal project dashboard.  I suspect an ex-Googler liked the idea enough to make it a business, as GoLinks is a real thing you can pay for.  Below I’ll walk through how to set up Kutt (an open source link shortener) with Terraform in your own Kubernetes cluster in Google Cloud.

Kutt has several dependencies, so let’s make sure we’ve got those in order:

  • You need a domain name and the ability to set DNS records, for example go.mycompany.com.
  • You’ll need an SMTP server for sending authentication emails from the link shortener; we’ll use this just for the admin user.  Have your mail_host, mail_port, mail_user and mail_password at hand.
  • Optionally: a Google Analytics ID
  • A Redis instance (we’ll deploy one with Terraform)
  • A PostgreSQL database (we’ll deploy one with Terraform)

For starters, let’s set up our variables.tf file. There are quite a few values here, and there are more configuration options that can be passed into Kutt via environment variables down the road.

variables.tf

variable "k8s_host" {
  description = “IP of your K8S API Server”
}

variable "cluster_ca_certificate" {
  description = “K8S cluster certificate”
}

variable "region" {
  description = "Region of resources"
}

variable "project_id" {
  description = “google cloud project ID”
}

variable "google_service_account" {
  description = “JSON Service account to talk to GCP”
}

variable "namespace" {
  description = “kubernetes namespace to deploy to”
}

variable "vpc_id" {
  description = “VPC to put the database in”
}

variable "domain" {
  default = "go.mycompany.com"
}

variable “jwt_secret” {
  default = “CHANGE-ME-TO-SOMETHING-UNIQUE”
}

variable “smtp_host” {}

variable “smtp_port” {
  default = 587
}

variable “smtp_user” {}

variable “smtp_password” {}

variable “admin_emails” {}

variable “mail_from” {
  default = “lnkshortner@mycompany.com” 
}

variable “google_analytics_id” {}

Now let’s set up our database.  You can really do this any way you like, but since we’re using Google Kubernetes Engine we likely also have access to Google Cloud SQL, so this is fairly straightforward.

database.tf

resource "google_sql_database_instance" "linkshortenerdb" {
  name             = replace("linkshortener-${var.namespace}", "_", "-")
  database_version = "POSTGRES_13"
  region           = "us-west2"
  project          = var.project_id
  lifecycle {
    prevent_destroy = true
  }

  settings {
    # $7 per month
    tier = "db-f1-micro"
    backup_configuration {
      enabled                        = true
      location                       = "us"
      point_in_time_recovery_enabled = false
      backup_retention_settings {
        retained_backups = 30
      }
    }

    ip_configuration {
      ipv4_enabled = false
      # In order for private networks to work the GCP Service Network API has to be enabled
      private_network = var.vpc_id
      require_ssl     = false
    }
  }
}

resource "google_sql_database" "linkshortener" {
  name     = "${var.namespace}-linkshortener"
  instance = google_sql_database_instance.linkshortenerdb.name
  project  = var.project_id
}

resource "random_password" "psql_password" {
  length  = 16
  special = true
}

resource "google_sql_user" "linkshorteneruser" {
  project  = var.project_id
  name     = "linkshortener"
  instance = google_sql_database_instance.linkshortenerdb.name
  password = random_password.psql_password.result
}

That’s it (finally) for prerequisites. Now the fun part: setting up Kutt itself (with a Redis sidecar).

kutt.tf

provider "google" {
 project     = var.project_id
 region      = var.region
 credentials = var.google_service_account
}

provider "google-beta" {
 project     = var.project_id
 region      = var.region
 credentials = var.google_service_account
}

data "google_client_config" "default" {}

provider "kubernetes" {
 host                   = "https://${var.k8s_host}"
 cluster_ca_certificate = var.cluster_ca_certificate
 token                  = data.google_client_config.default.access_token
}

resource "random_password" "redis_authstring" {
 length  = 16
 special = false
}

resource "kubernetes_deployment" "linkshortener" {
 metadata {
   name = "linkshortener"
   labels = {
     app = "linkshortener"
   }
   namespace = var.namespace
 }

 wait_for_rollout = false

 spec {
   replicas = 1
   selector {
     match_labels = {
       app = "linkshortener"
     }
   }

   template {
     metadata {
       labels = {
         app = "linkshortener"
       }
     }

     spec {
       container {
         image = "bitnami/redis:latest"
         name  = "redis"

         port {
           container_port = 3000
         }

         env {
           name  = "REDIS_PASSWORD"
           value = random_password.redis_authstring.result
         }

         env {
           name  = "REDIS_PORT_NUMBER"
           value = 3000
         }
       }

       container {
         image = "kutt/kutt"
         name  = "linkshortener"
         port {
           container_port = 80
         }

         env {
           name  = "PORT"
           value = "80"
         }

         env {
           name  = "DEFAULT_DOMAIN"
           value = var.domain
         }

         env {
           name  = "DB_HOST"
           value = google_sql_database_instance.linkshortenerdb.ip_address.0.ip_address
         }

         env {
           name  = "DB_PORT"
           value = "5432"
         }

         env {
           name  = "DB_USER"
           value = google_sql_user.linkshorteneruser.name
         }

         env {
           name  = "DB_PASSWORD"
           value = google_sql_user.linkshorteneruser.password
         }

         env {
           name  = "DB_NAME"
           value = google_sql_database.linkshortener.name
         }

         env {
           name  = "DB_SSL"
           value = "false"
         }

         env {
           name  = "REDIS_HOST"
           value = "localhost"
         }

         env {
           name  = "REDIS_PORT"
           value = "3000"
         }

         env {
           name  = "REDIS_PASSWORD"
           value = random_password.redis_authstring.result
         }

         env {
           name  = "JWT_SECRET"
           value = var.jwt_secret
         }

         env {
           name  = "ADMIN_EMAILS"
           value = var.admin_emails
         }

         env {
           name  = "SITE_NAME"
           value = "MyCompany Links"
         }

         env {
           name  = "MAIL_HOST"
           value = var.smtp_host
         }

         env {
           name  = "MAIL_PORT"
           value = var.smtp_port
         }

         env {
           name  = "MAIL_USER"
           value = var.smtp_user
         }

         env {
           name  = "MAIL_FROM"
           value = var.mail_from
         }

         env {
           name  = "DISALLOW_REGISTRATION"
           value = "true"
         }

         env {
           name  = "DISALLOW_ANONYMOUS_LINKS"
           value = "true"
         }

         env {
           name  = "GOOGLE_ANALYTICS"
           value = var.google_analytics_id
         }

         env {
           name  = "MAIL_PASSWORD"
           value = var.smtp_password
         }
         readiness_probe {
           http_get {
             path   = "/api/v2/health"
             port   = 80
             scheme = "HTTP"
           }
           timeout_seconds = 5
           period_seconds  = 10
         }
         resources {
           requests = {
             cpu    = "100m"
             memory = "200M"
           }
         }
       }
     }
   }
 }
}

With the above you should now have a Postgres server, a Redis instance, and a Kutt deployment deployed and talking to each other. All that’s left is to expose your deployment as a service and set up your DNS records.
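
Here's a rough sketch of the service piece, reusing the labels and namespace from the deployment above; a plain LoadBalancer is shown, though an Ingress with TLS termination would be the more polished option:

resource "kubernetes_service" "linkshortener" {
  metadata {
    name      = "linkshortener"
    namespace = var.namespace
  }

  spec {
    selector = {
      app = "linkshortener"
    }

    port {
      port        = 80
      target_port = 80
    }

    # Provisions an external IP you can point go.mycompany.com at with an A record.
    type = "LoadBalancer"
  }
}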

Kaizen in infrastructure: Writing RCAs to improve system reliability and build customer trust

I shall endeavor to convince you that your company should regularly write Root Cause Analyses (RCAs), not just for yourselves but also as a tool to build trust with your customers. The subject of RCAs can be a bit dry, so allow me to motivate it with an example of how poorly-approached RCAs can easily become a hot-button issue that will cost you customers.

As the CTO of Lottery.com, I’m responsible for overseeing roughly 50 vendor relationships. For the vast majority of these vendors, Lottery.com (LDC) is a regular customer that doesn’t use their system much differently than any other customer, and things run smoothly. They charge a credit card every month, and we get a service that fulfills some business need. Yet for a small handful of these vendors, ‘business as usual’ can’t seem to shake persistent issues.

Allow me to tell you of an incident that occurred recently with a vendor; let’s call them WebCorp. WebCorp offers a service that we depend on for one of our services to be up. If WebCorp’s service is up, then our service is up. If WebCorp’s service goes down, we go down. In these situations, it’s in everyone’s interest for WebCorp to be reliable. So we codify that dependency in a contract called a service level agreement (SLA). SLAs are measured quantitatively, in terms of uptime percentage.

A side note on SLAs: a high quality service measures its uptime in “nines,” as in: three nines is 99.9% uptime. That may seem like a lot, but over the course of a year three nines of uptime translates to nearly 9 hours of downtime, or about 1.5 minutes of downtime per day. With WebCorp, Lottery.com has a five-nines SLA, which translates to about five minutes of downtime per year.

OK, back to the story: Lottery.com got an alert around noon PDT that our service was down. Within about five minutes we had determined, conclusively, that the cause of the downtime was a service failure at WebCorp. I emailed WebCorp’s emergency helpline and, to their credit, within a few minutes they acknowledged the issue and indicated they were looking into it. About an hour later they had resolved the issue and our service was back online. Total downtime was about 64 minutes.

When a vendor has an outage, it is my standard practice, once the issue is resolved, to write in and ask what went wrong and whether mitigation steps to prevent future outages are in place. In this case, WebCorp’s response was:

It would appear that the cache flushed right before the system tried to restart. That flush wiped the contents of a Varnish file, which caused Varnish to restart with an error. That probably doesn’t mean much to someone on your end of things. Essentially, it was a really unusual conflict of a couple of automatic jobs happening on the server, so we’re fairly sure it’s not something you’ll be able to reproduce from your end of things, intentionally or unintentionally. Hope that clarifies a bit!

While I appreciate the effort to lift the curtain a little bit on some of the technical details, this response doesn’t actually tell me how WebCorp is going to prevent the issue from happening again. And so I asked them what they planned to do to prevent future such outages. 

WebCorp’s response:

We try our very best to prevent these things from happening. In order to be better prepared for a situation like this in the future, we’ve added extra monitoring […]. Now, our emergency support team will be immediately alerted whenever any downtime happens […].

Since the issue [your service] encountered is one that we have no record of having seen (either before or since), it might be premature to alter our Varnish caching processes at this time. If the issue proves to be reproduce-able and / or widespread, then we may indeed make an adjustment to our infrastructure to correct for it. For now, though, it appears to be an isolated incident.

While you do have a 99.999% SLA with us, it is actually for a different […] service! The SLA agreement is tied to [Service2] and not [Service1]. However, you may be pleased to hear that the uptime of [Service1] has been at 99.87% over the last month!

Again, I apologize for the downtime yesterday. I hope this answers the questions you had for me and the rest of my team. If not, please feel free to reach out again so we can continue the conversation. I’m always happy to help!

Again to WebCorp’s credit, this is an undeniably polite and professionally written response. The substance, however, did little to reassure me on a technical level.

What I read in the response, substantively, is:

We’re doing our best and will add more ‘monitoring.’ In fact, our support team will now actually find out when downtime occurs. But this specific issue has never happened before, so it’s not in our interest to change business practices. Oh, and as a reminder, the 99.999% SLA we have for your service doesn’t technically apply here, and this service has been at 99.87%. Isn’t that great?

By signing and paying for a five-nines SLA, my expectation as a customer is to have as close to 99.999% uptime as possible for all services WebCorp offers. The fact that WebCorp’s response seems to indicate they find 99.87% to be a good uptime percentage dramatically reduces the trust I have in WebCorp’s future reliability. A far more reassuring response would indicate that they take all downtime seriously, that their team is investigating ways to improve the robustness of the system to ensure no customer experiences these outages again, and that they would reply in a few days once they understand exactly what went wrong in their procedures and how they’ll be improving.

In summary:

1) It is important that the vendor and customer have aligned expectations for service reliability.
2) If the vendor offers a contractual SLA, the customer’s expectation is that the vendor will make good faith best efforts to meet that SLA, and take any breaches seriously.

RCA The Right Way

By not performing and being transparent about a detailed RCA, it’s easy for a customer to lose faith in a company’s efforts to provide a highly-reliable service. The goal of the RCA is therefore twofold:

1) Document the failure and potential mitigations to improve service quality and reliability.
2) Provide a mechanism for being transparent about failures to build confidence, and trust, with customers.

A good RCA has a template roughly as follows:

Incident Start Time/Date:
Incident Received Time/Date:
Complete Incident Timeline:
Root cause(s):
Did we engage the right people at the right time?
Could we have avoided this?
Could we have resolved this incident faster?
Can we alert on this faster?
Identified issues for future prevention:

In this template are prompts for the pieces of information one needs to understand what happened, what was learned, and why it won’t happen again. There are many great examples of RCAs out there:

https://blog.github.com/2012-12-26-downtime-last-saturday/
https://medium.com/netflix-techblog/lessons-netflix-learned-from-the-aws-outage-deefe5fd0c04
http://www.information-age.com/3-lessons-learned-amazons-4-hour-outage-123464916/
https://slackhq.com/this-was-not-normal-really-230c2fd23bdc

Kaizen, the Japanese term for ‘continuous improvement,’ is an ethos often cited in industry. Building technology is hard; humans are imperfect and therefore technology often is as well. That’s expected. The only way we get past our imperfect ways is to continuously work to get better: to own up to our mistakes, learn from them, and ensure we (and our technology) don’t make the same mistake twice.

The Commandments of Good Code according to Zach

    1. Treat your code the way you want others’ code to treat you
    2. All (ok most) programming languages are simultaneously good and bad
    3. Good code is easily read and understood, in part and in whole
    4. Good code has a well thought out layout & architecture to make managing state obvious
    5. Good code doesn’t reinvent the wheel, it stands on the shoulders of giants
    6. Don’t cross the streams!

Treat your code the way you want others’ code to treat you

I’m far from the first person to write that the primary audience for your code is not the compiler/computer, but whoever next has to read the code (which could be you six months from now!). Any engineer can produce code that ‘works’; what distinguishes a programmer who churns out crap from one capable of efficiently writing maintainable code that supports a business long term is an understanding of design patterns, and the experience to know how to solve problems simply, clearly, and maintainably. The rest of these commandments are all supporting lemmas of this thesis.

All (ok most) programming languages are simultaneously good and bad

In (almost) any programming language it is possible to write good code or bad code. Ergo, assuming we judge a programming language by how easy it is to write good code in it (it should at least be one of the top criteria, anyway), nearly any programming language can be ‘good’ or ‘bad’ depending on how it is used (or abused).

An example of a language that many consider ‘clean’ and readable is Python. Many organizations enforce a universal coding standard (i.e. PEP8), the language itself enforces some level of whitespace discipline, and the built-in APIs are plentiful and fairly consistent. That said, it’s possible to create unspeakable monsters. For example, one can define a class and define/redefine/undefine any and every method on that class at runtime. This naturally leads to, at best, an inconsistent API and, at worst, an impossible-to-debug monster. ‘But nobody does that!’ one might naively think. Unfortunately that is untrue, and it doesn’t take long browsing PyPI before you run into substantial (and popular!) libraries that (ab)use monkeypatching extensively as the core of their APIs. I recently used a networking library (xmppy) whose entire API changes depending on the network state of an object. Imagine calling ‘client.connect()’ and getting a MethodDoesNotExist error instead of HostNotFound or NetworkUnavailable.

An example of a language that many consider ‘dirty’ but that can be quite pleasant to work with is Perl. Now, I know the previous sentence will cause a lot of controversy, so rather than fend off the pitchforks myself, I’ll refer you to Dave Cross, a proper Perl expert who very eloquently discusses this very topic.

Good code is easily read and understood, in part and in whole

Good code is easily read and understood, in part and in whole, by others as well as by the author in the future (trying to avoid the ‘Did I really write that?’ syndrome). By ‘in part’ I mean that if I open up some module or function in the code, I should be able to understand what it does without having to read the entire rest of the codebase. Code that constantly references minute details that affect behavior from other (seemingly irrelevant) portions of the codebase is like reading a book where you have to reference footnotes or an appendix at the end of every sentence. You’d never get through the first page! Some other thoughts on ‘local’ readability:

    • Well encapsulated code tends to be more readable, separating concerns at every level.
    • Names matter. Activate System 2 and put some actual thought into names; the few extra seconds will pay dividends.
    • Cleverness is the enemy. When using fancy syntax such as list comprehensions or ternary operators, be careful to use them in a way that makes your code more readable, not just shorter.
    • Consistency in style, both in terms of how you place braces but also in terms of operations improves readability greatly.
    • Separation of concerns. A given project manages an innumerable number of locally important assumptions at various points in the codebase. Expose each part of the codebase to as few of those concerns as possible. Say you had some kind of people-management system where a person object may sometimes have a null last name. To somebody writing code for a page that displays person objects, that could be really awkward! And unless you maintain a handbook of ‘awkward and non-obvious assumptions our codebase has’ (I know I don’t), your display-page programmer is not going to know that last names can be null and is probably going to write code with a null pointer exception in it when the last-name-being-null case shows up. Instead, handle these cases with well thought out APIs that the different pieces of your codebase use to interact with each other.

Good code has a well thought out layout & architecture to make managing state obvious

Sublemma: state is the enemy. It is the single most complex part of any application and needs to be dealt with very intentionally. Common problems include database inconsistencies, partial UI updates where new data isn’t reflected everywhere, out-of-order operations, or just mind-numbingly complex code with if statements and branches everywhere, leading to code that is difficult to read and even harder to maintain. Putting state on a pedestal and being extremely consistent and deliberate in how it is accessed and modified dramatically simplifies your codebase. Some languages, Haskell for example, enforce this at a programmatic level. You’d be amazed how much the clarity of your codebase can improve if you have libraries of pure functions that access no external state, and then a small surface area of stateful code which calls into that pure functionality.

Good code doesn’t reinvent the wheel, it stands on the shoulders of giants

In 2015 we have a wealth of tools available to solve common problems; depend on them as much as possible so you can focus on solving the interesting parts of your application. Think ahead of time about whether somebody has already solved some portion of the problem you’re trying to solve. Is there something on the interwebs/GitHub (with an appropriate license) that you can reuse? Yes, expect to make heavy modifications, but more often than not this is a time saver. Things you should not be reinventing in 2015 in your project (unless these ARE your project):

Databases.

I don’t care what your requirements are; it exists. Figure out which parts of CAP you need for your project, then choose the database with the right properties. “Database” doesn’t just mean relational databases (e.g. MySQL) anymore; you can choose from any one of a huge number of data storage models:

    • Key Value Stores e.g. Redis, Memcache
    • “No-SQL” e.g. MongoDB, Cassandra
    • Hosted DBs: AWS RDS / DynamoDB / AppEngine Datastore
    • Map Reduce engines: Amazon EMR / Hadoop (Hive/Pig) / Google Big Query
    • Even less traditional: Erlang’s Mnesia, iOS’s Core Data

Data abstraction layers

You should, in most circumstances, not be writing raw queries to whatever database you happen to choose. There exists a library to sit between the DB and your application code, separating the concerns of managing concurrent database sessions and the details of the schema from your main code. At the very least, you should never have raw queries or SQL inline in the middle of your application code; please wrap it in a function and centralize all the functions in a file called ‘queries.py’ or something else equally obvious. A line like users = load_users() is infinitely easier to read than users = db.query(‘SELECT username, foo, bar FROM users ORDER BY id LIMIT 10’). Centralization also makes it much easier to have consistent style in your queries, and limits the number of places to go to change the queries should the schema change.

Other

    • Between S3, EBS, HDFS, Dropbox, Google Drive, etc., you really shouldn’t spend much effort or mental energy on storage nowadays.
    • Take your pick of queueing services provider — ZeroMQ/RabbitMQ/Amazon SQS

Don’t cross the streams!

There are many good models for programming design, pub/sub, actors, MVC etc. Choose whichever you like best, and stick to it. Different kinds of logic dealing with different kinds of data should be physically isolated in the codebase (again, this separation of concerns concept and reducing cognitive load on the future-reader). The code which updates your UI should be physically distinct from the code that calculates what goes into the UI.

Conclusion

This is by no means an exhaustive or perfect list of Good Coding Commandments. That said, if every codebase I have to pick up in the future followed even half of the concepts in this list, I would have many fewer gray hairs, I’d add an extra five years to the end of my life, and the world would be a better place.

Credit to Scott Kyle (appden) for assistance reviewing the material in this post. Have you gotten Current For Mac yet?

The Product Market Fit Flow Chart (or PMFFC for short)

The goal of every startup, or any company producing a product that has some kind of ‘customer’, should be to find that magical ‘product market fit’ early on (herein PMF, because I’m lazy). According to Wikipedia:

Marc Andreessen was the first person that used the term: “Product/market fit means being in a good market with a product that can satisfy that market.”

I was recently being recruited by a company, but I turned them down because of what I saw as a big business mistake. They had PMF, but they weren’t doing anything with it.

Allow me to introduce what I believe should be on a laminated pamphlet and given to every entrepreneur in the valley:

(The PMF flow chart)

The bit that was missing at the company I was talking to was the Run With It stage. By that I mean: make your customers happy and, as quickly as possible, get as many of them as possible! This particular founder told me about the customers he already had, and about his 8-9 month roadmap to hire engineers and build product. I asked how many sales and support people he would hire in that time; he said zero.

On the roadmap he was suggesting, by 2016 they would be taking in maybe ~$10k monthly, with an engineering team of ~10, a sales team of 1 (the founder), and an annual burn of >$1MM. An alternative reality would be, by 2016, to have an engineering team of 5, a sales and support team of 3-5, be taking in $100K monthly, and be on the verge of breaking even and able to raise a kickass Series A.

I’m a technical person by training, and 5 years ago I’d probably have been making the same mistake as this founder.  It’s really unnatural to think about sales, and it’s easy to think of salespeople as unskilled workers you can pick up by the dozen at a moment’s notice.  In reality, this couldn’t be further from the truth.  Sales is hard, very hard, and you need to be working on it as early as possible.  I consider it one of the most important lessons I’ve learned as an entrepreneur: never undervalue or under-prioritize your sales and distribution strategy.  Once you’ve achieved PMF, or are even within long-distance sonar range of PMF, you should be thinking really hard about how to sell the bananas out of what you’re building.