
A New Beginning

Background

For a number of reasons, I became obsessed with an MMORPG called Old School RuneScape in the latter half of 2023. OSRS is a game that brings a sense of achievement, but it also consumed so much time that my life was becoming unbalanced.

At the same time, I was spearheading the implementation of a data orchestration platform called dagster at work. This project has been incredibly exciting and has already had a profound impact on the way we move and track data within our organization. Several challenges arose while implementing dagster, many of them related to standing up a hybrid-hosted cloud deployment. An engineer at heart, I found myself wanting to understand, at a finer level, how everything I was building actually works.

A week ago today, I officially dropped OSRS and began a new chapter in my life: a journey into the Cloud.

Before you read further - some things to note:

  • I have a tendency to blabber on when I'm excited. I appreciate you bearing with me, and I welcome any feedback you have about my writing style or approach to development and growth
  • Traditionally I've been the type of perfectionist who will commit to doing something the correct way all the way through, or not at all. With this project I'm trying to focus on making something decent, and I'll either circle back to refine it later or I'll find that it just wasn't important to me in the first place
  • Be prepared for a ton of technical jargon in here. Future posts may not be quite as in-depth, but I personally would want to know about each of the elements I've covered below if I were starting from scratch. I'll include links for everything I believe is worth taking a deeper look into.

Where I'm Going


Perhaps I'm trapped in an echo chamber and succumbing to confirmation bias, but I regularly see technical articles in my LinkedIn feed and inbox where developers at startups are implementing the same data architecture approach my company has been using over the last couple of years (example here).

Our emerging tech stack is reliable, performant, and just plain fun to work with. Seeing upcoming tech firms adopt a similar data architecture leads me to believe that this combination of tools can help us build something truly incredible.

Ultimately, I'd love to implement an inexpensive, fully-scalable data platform for various initiatives:

  • crowd-funded, open-source environmental, socio-economic, and educational data for public use such as Data Science for Social Good.
  • subscription-model data for commercial use cases like real-estate arbitrage, social media insights, and consumer analysis
  • ad-funded data for "fun" use cases like my BrawlStars dashboard, Music Bingo game, or maybe a hiking analytics application

The Tools

  • Python, the tried-and-true 'one-size-fits-all' programming language, is going to fill several needs in the strohb tech stack
  • dbt, a Python "data build tool", will serve as the fundamental package for maintaining and testing data models
  • dagster, strohb's data orchestration tool of choice, will execute dbt models and tests, as well as software-defined assets upstream and downstream of our database (see the sketch after this list)
  • ECS or EKS, scalable container services for self-hosting the compute and logging sides of the strohb platform
  • ChatGPT & GitHub Copilot - although not directly involved in serving solutions, these tools will be critical in helping accelerate the development of all other aspects of the strohb platform
  • git, I plan to version control as much of the strohb platform as I can
  • GitHub Actions, which I'll use to deploy updates to the platform as well as run code validation and quality-checking tests
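
Since dagster and dbt are the core of the stack, here's a minimal sketch of loading dbt models as dagster software-defined assets. The project path and asset names are placeholders, and the exact API depends on the dagster-dbt version, so treat this as a rough shape rather than the final implementation:

```python
from dagster import AssetExecutionContext, Definitions
from dagster_dbt import DbtCliResource, dbt_assets

# Hypothetical project layout -- point these at the real dbt project once it exists.
DBT_PROJECT_DIR = "strohb_dbt"
DBT_MANIFEST_PATH = f"{DBT_PROJECT_DIR}/target/manifest.json"


@dbt_assets(manifest=DBT_MANIFEST_PATH)
def strohb_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    # `dbt build` runs models and tests; streaming the results back lets
    # dagster materialize each model as a software-defined asset.
    yield from dbt.cli(["build"], context=context).stream()


defs = Definitions(
    assets=[strohb_dbt_assets],
    resources={"dbt": DbtCliResource(project_dir=DBT_PROJECT_DIR)},
)
```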

What's Missing

  • data ingestion tool. Writing custom integration functions isn't challenging, but maintaining these integrations over time will grow tiresome. Something like airbyte would make it easier to ingest data from all different types of sources, but more research is needed.
  • database selection. We're thinking big data here, so an OLAP solution is a better fit than OLTP. I'll likely use a cheaper OLTP option like PostgreSQL until there's a need (and funding) to migrate to a more performant solution like Snowflake or Redshift.
  • BI dashboard tool. I'm a huge fan of streamlit, and self-hosting streamlit apps shouldn't be an issue after developing a solid understanding of AWS solutions.
  • Auto ML solution. Less than three years after earning my Master's degree in Data Science, the irony of being directionless on an ML solution is comical. The data science ecosystem evolves more rapidly than any other part of this tech stack, and any front-runner I declare now will have changed by the time I'm actually looking to leverage ML to enhance the data products. Keras, sklearn, and (py)caret were big when I was taking classes, but I'm sure even better tools will emerge as the strohb data platform matures.

Progress Update

After starting the A Cloud Guru Solutions Architect - Associate course last week, I was immediately given a deeper dive into several resource types I thought I was already familiar with. Boy, was I wrong!

The Course

This week I covered IAM, S3, EC2, EBS, EFS, Databases, and VPC.

IAM seems pretty straightforward. It involves users, roles, and groups, each of which can have access policies applied.
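
As a toy example of how those pieces fit together, here's a hedged boto3 sketch that creates a group, attaches a managed policy, and adds a user to it. The names are made up, and in practice I'd rather manage this through IaC:

```python
import boto3

iam = boto3.client("iam")

# Hypothetical group/user names, for illustration only.
iam.create_group(GroupName="strohb-readers")

# Policies attach to users, groups, or roles; here the whole group gets read-only S3 access.
iam.attach_group_policy(
    GroupName="strohb-readers",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)

iam.create_user(UserName="example-analyst")
iam.add_user_to_group(GroupName="strohb-readers", UserName="example-analyst")
```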

S3 is so much more capable than I knew. Not only is S3 used for general-purpose file storage, it also integrates with many other AWS resource types: CloudFront can publish logs to an S3 bucket, S3 can serve as a static web host, databases can be configured to store point-in-time snapshots in S3, and so on. S3 buckets even support version control for the objects they hold.
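
Two of those features, versioning and static website hosting, are only a couple of API calls away. A hedged boto3 sketch, with a placeholder bucket name:

```python
import boto3

s3 = boto3.client("s3")
bucket = "www.example.com"  # placeholder bucket name

# Object versioning keeps prior copies around when objects are overwritten or deleted.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Static website hosting with the same index/error pages I mention later in this post.
s3.put_bucket_website(
    Bucket=bucket,
    WebsiteConfiguration={
        "IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "error.html"},
    },
)
```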

EC2 also has more depth than I previously thought. Creating AMIs from the EBS volumes of existing instances and using them to launch new EC2s is a fascinating concept. Most of the security around allowed inbound and outbound ports is still way over my head, and the thought of architecting an effective VPC from scratch both terrifies and bores me. ECS was one of my first exposures to AWS, but as I understand it, networking many EC2s together was the go-to method for distributed computing in the cloud before ECS. Being a privileged Python enjoyer, I'm not used to having to clean up after myself, and I know I would forget and leave my network of EC2 instances running for weeks after I was done using them. Yikes!
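
For my own future reference, those inbound/outbound port rules live on security groups, and adding a single rule looks roughly like this hedged boto3 sketch (the group ID is a placeholder):

```python
import boto3

ec2 = boto3.client("ec2")

# Allow inbound HTTPS from anywhere; anything not explicitly allowed stays blocked.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder security group ID
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "HTTPS from anywhere"}],
        }
    ],
)
```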

In the database chapter of the course, it was interesting to see how much more performant AWS Aurora is compared to other OLTP solutions. Aside from a hacky, from-scratch JSON database system I built and quickly scrapped over a year ago, I haven't had much exposure to NoSQL databases like MongoDB and DynamoDB. Amazon Neptune and Timestream might also prove useful for future data products on strohb.com.

The Platform

AWS

Shortly into the course, I was inspired to host my own website to document my Cloud adventures, use as a sandbox environment, and eventually to serve as a central interface to the data hub. Intrigued by the cost-effectiveness of S3 static sites, I set out to create my own publicly-facing s3 bucket behind a domain.

Having purchased my last domain (expired) through Google Domains, I thought to check there first. Boom. Gone. Google Domains is no more, and Google now recommends purchasing a domain through Squarespace. I go through the motions, and now I'm the ~~proud~~ owner of bstroh.com! Super! After a little guidance from ChatGPT, I've set a CNAME record in my DNS settings to point to my bstroh-website S3 bucket, which holds my index.html and error.html pages. After several attempts to bring my site online, I'm still getting a "no such bucket" 404 error when trying to access bstroh.com.

After more reading (but not enough), I'm seeing that "Alias" records are more effective than CNAME records for routing traffic to my S3 bucket on AWS, but Alias records are a Route 53-only feature. Here I go for the third time in my life purchasing a domain name from one registrar and almost immediately regretting that I didn't register with a different host. So I go to transfer my bstroh.com domain to AWS's Route 53 service... denied. I need to wait 60 days after registering before I can transfer a domain! This is getting ridiculous. Thinking I can leverage Infrastructure as Code (IaC) to easily transfer what I build over to my preferred domain in two months, I decide to spend another $13 on a new domain, strohb.com. Back to trying to set up an Alias record for my domain to route traffic to my S3 bucket, I stumble into AWS Route 53 Policy Records. AWS warns me that each record is going to cost me $50/month, and my internal spark flickered intensely. I was not looking to spend that kind of money on what felt like a very simple configuration option, and I may soon give up if AWS services are going to cost that much.

More searching, chatting with AI, and I finally find a poorly-named service called "hosted zones". A few clicks later, and I'm all set! Or so I thought. I'm still seeing that same "no such bucket" error on my new site! Turns out that unless I'm using CloudFront, my S3 bucket name must match my domain name exactly. I create a new bucket named www.strohb.com, mirror all permissions from my bstroh-website bucket, and my site finally comes online! The performance is dreadfully slow, my site is insecure, and it only works if I use "www" at the beginning of the URL - which is lame. So I jump right into AWS CloudFront, which AWS has been pushing me towards this whole time.
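
Before the CloudFront detour: for anyone fighting the same DNS battle, the Alias record itself boils down to a single Route 53 API call. A hedged boto3 sketch -- the hosted zone ID is a placeholder, and Z3AQBSTGFYJSTF is my understanding of the S3 website endpoint zone for us-east-1, so verify it for your region:

```python
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="ZHOSTEDZONEID",  # placeholder: the hosted zone created for strohb.com
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "www.strohb.com",
                    "Type": "A",
                    # Alias to the S3 website endpoint; the bucket name must match the record name.
                    "AliasTarget": {
                        "HostedZoneId": "Z3AQBSTGFYJSTF",  # S3 website zone for us-east-1 (region-specific)
                        "DNSName": "s3-website-us-east-1.amazonaws.com",
                        "EvaluateTargetHealth": False,
                    },
                },
            }
        ]
    },
)
```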

Initially, I was hesitant about CloudFront because the multitude of enhancements it offers for deploying websites seemed like they would be expensive. For my static website, I decided that a Web Application Firewall isn't worth the $5/month right now, which brings the cost of using CloudFront down to, uh, $0. Issuing a security certificate so my browser stops trying to protect me from my own website takes no time at all, and I'm feeling good. I do end up having to revisit my CloudFront distribution and security certificate so they work for both www.strohb.com and just plain strohb.com, but then my AWS site infrastructure is complete! That took way more effort than A Cloud Guru originally made it out to be, but I learned a lot along the way and only paid an extra $13 for my ignorance. My monthly costs are estimated to be less than $0.10 now, which feels great compared to other website hosts like wix.com and wordpress, where I'm paying over $100 for a complicated website I don't understand.
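
The certificate piece, at least, is pleasantly small. A hedged sketch of requesting a cert that covers both the apex and www domains (for CloudFront, the certificate has to live in us-east-1):

```python
import boto3

# CloudFront only accepts ACM certificates issued in us-east-1.
acm = boto3.client("acm", region_name="us-east-1")

response = acm.request_certificate(
    DomainName="strohb.com",
    SubjectAlternativeNames=["www.strohb.com"],
    ValidationMethod="DNS",  # ACM hands back a CNAME to add in Route 53 to prove ownership
)
print(response["CertificateArn"])
```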

The Blog

There's no chance I'm maintaining a web of from-scratch HTML, CSS, and JavaScript files for this website. That's where mkdocs comes in. With just one YAML file for configuration, I can write all my blog content in markdown, and mkdocs builds all of the styling and structured web files I need to host an easy-to-maintain-but-still-pretty website.

Github

After archiving or deleting each of my 33 stale repositories on GitHub, I made a new repository for this project. I committed my mkdocs project as well as a configuration file that controls my pre-commit code quality checks. I also added a pyproject.toml, which will make it easy to manage my Python dependencies using poetry when I'm ready to work on the dagster and dbt portions of this platform. Lastly, I asked ChatGPT to generate a GitHub Actions template that redeploys my mkdocs site to my S3 bucket each time I push to my repository's main branch on GitHub.

Having never used GitHub Actions before, I was amazed to find that the generated template worked for me after only one minor tweak. I've used Azure DevOps pipelines (the Actions counterpart) before, and I typically find myself iterating on those templates dozens of times before they perform as I intend. The last step here was to create a new AWS user with PutObject permissions on my S3 bucket, generate a set of credentials for that user, and store those credentials in GitHub secrets.
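
Here's a hedged sketch of that last step: a dedicated deploy user whose inline policy only allows touching the site bucket. The names are placeholders, and the secret key goes straight into GitHub secrets, never into the repo:

```python
import json

import boto3

iam = boto3.client("iam")
deploy_user = "strohb-site-deployer"  # placeholder user name

iam.create_user(UserName=deploy_user)

# Inline policy scoped to writing (and pruning) objects in the site bucket only.
iam.put_user_policy(
    UserName=deploy_user,
    PolicyName="strohb-site-deploy",
    PolicyDocument=json.dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": ["s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
                    "Resource": [
                        "arn:aws:s3:::www.strohb.com",
                        "arn:aws:s3:::www.strohb.com/*",
                    ],
                }
            ],
        }
    ),
)

# The access key ID and secret go into GitHub secrets for the Actions workflow to use.
keys = iam.create_access_key(UserName=deploy_user)["AccessKey"]
print(keys["AccessKeyId"])
```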

Website Cache

My GitHub Action successfully built my site and deployed my code to the strohb.com S3 bucket, but my website didn't look any different in my browser. Turns out my CloudFront distribution was caching my website too effectively. Within my CloudFront distribution, I was able to issue an invalidation to clear my website's cache, and the browser finally showed the results of my mkdocs deployment!
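
The invalidation can also be scripted, which will come in handy if I bolt it onto the deploy workflow later. A hedged boto3 sketch with a placeholder distribution ID:

```python
import time

import boto3

cloudfront = boto3.client("cloudfront")

cloudfront.create_invalidation(
    DistributionId="E1234567890ABC",  # placeholder: the strohb.com distribution ID
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/*"]},  # clear the whole cache after a deploy
        "CallerReference": str(time.time()),  # any unique string, used to de-duplicate requests
    },
)
```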

What's Next

In two months, I'll be porting this site over to bstroh.com, which will become my permanent internet home. You can expect weekly blog updates here for the foreseeable future! Before trying to build out more useful tools, I'm going to cruise along with A Cloud Guru's course for the Solutions Architect Associate Exam, and then move on to the Developer Associate Exam. Upcoming chapters include Route 53, Elastic Load Balancing, Decoupling, Scaling, Big Data, and Serverless Architecture. I can't wait to learn more!