Week 2: Into the Clouds
This blog is much shorter than last week's! My college roommate was in town for 3 days, and we spent our evenings recreating an automated version of a Halo 3 "trash compactor" mode I used to love playing many years ago.
Product Exploration
With a data ecosystem being the target product, I spent some of this week checking out new data tools and products at a high level. There's still a lot of market research to do here, but here are some of my findings:
-
Snowflake Marketplace has over 2000 data sources available, the majority of which are free. This will be a good place to quickly grab data to experiment with, or maybe even to refine and re-serve. What fascinates me is the paid datasets in this marketplace. Definitely an outlier, but Good Boy Studios serves a single data table about Dogs in the USA for a mere $12,500/month. Yes, you read that right. Several providers charge on a per-query basis, such as the much-more-practical SEC Financial Analytics Kit at $1/query.
-
Hex to me until now has been a cloud-hosted environment where developers can collaborate on and share Jupyter notebooks. I watched a recently released video from Hex this last week and I wasn't immediately blown away by the features demoed. Earlier today, I connected with my buddy Nick Stanzione*, who encouraged me to give Hex another chance. After chatting with Nick, I found an awesome video about Hex magic. If this short video doesn't result in a jaw drop, you can go ahead and unfollow me. Seriously - this is some incredible stuff, and it's exactly the kind of auto-BI tool I'm looking for. I'll definitely be taking Hex for a spin in the coming months.
-
qdrant is a tool I haven't given the attention it deserves yet, but vector databases are an interesting concept I'll be keeping in my back pocket for future recommendation algorithm use-cases.
*By the way, Nick coincidentally is jumping into AWS, dbt, and data orchestration right now too! I'm excited to hear his takeaways on what has worked well for him, and what we should both be looking to steer clear of. Check out what he's working on in his SeattleWeather GitHub repo.
Cloud Guru AWS Solutions Architect Course
Since last week, I covered Route 53, Elastic Load Balancing, Monitoring, High Availability and Scaling, Decoupling Workflows, Big Data, and part of the Serverless Architecture chapter.
I've heard of SQS and SNS being used at my workplace, so it was neat to get a better understanding of the value those services offer. SNS could prove to be useful in case I want to enable job alerts in a self-hosted version of dagster.
After the overview on alerts and CloudWatch, I was briefly inspired to see if I could easily collect data about access patterns to my website. I was quickly met with a wall of logs from CloudFront which I did not care to interpret. To me, standing up this functionality is not more important than learning about the AWS big picture, so I instead carried on with the course.
The big data section of this course solidified for me that I absolutely do not want to be using AWS as my database hosting platform. One of the demos showed the instructor configuring a ra3.16xlarge Redshift instance with an estimated monthly charge of $1,201,766.40. I thought the Good Boy Studios pricing was outlandish, but Redshift is on a whole other level. This OMG moment sent me straight back to my comfort zone with Snowflake, where I confirmed that I'll be able to support all of my storage and compute needs for only a few dollars a month. Snowflake allows my warehouse to hibernate for free when I'm not running queries, so I shouldn't need to consume more than $5/week until I need to run queries several times per day.
Similarly, the "orchestration" offered by AWS Step Functions is wayyyyy more involved than I want to get in building out a generalized data platform. With self-managed EC2 instances, Elastic Load Balancers, Elastic MapReduce, and AWS Data Pipelines, problems are being solved which just don't exist in the serverless ECS environment I've grown accustomed to. Data storage only continues to become cheaper, and Snowflake's many data extraction capabilities lend themselves well to an ELT-first approach. I have no need to clean up or reduce my source data before it lands in my warehouse. Do I need to parse out a field full of 8MB JSON into a tabular view? Not a problem for Snowflake.
Lastly, Amazon AppFlow was new to me, but it's just a less-built-out version of Fivetran or Airbyte. It may be cheaper than subscribing to one of those more fully-featured services, so I'll still consider it if I need to ingest data from one of the supported integrations.
What's Next?
Some really exciting lessons are coming up in the Cloud Guru course. I'll be taking a deeper dive into serverless architecture(<3), front-end web, CloudFormation for Infrastructure-as-Code, cost optimization, and migration. Thanks for following along, and I'll see you in next week's post!