How Snap rebuilt its backend: microservices, multicloud and tens of millions in savings


“Google App Engine wasn’t really built to support very large implementations,” Hunter, who joined the company in late 2016 from AWS, told Protocol. “We were finding bugs or scaling issues when we were in our large scale times like New Year’s Eve. We were working really hard with Google to make sure we were scaling it appropriately, and sometimes it was having issues they hadn’t seen before as we were scaling beyond what they had seen other clients using.”

Today, less than 1.5% of Snap’s infrastructure relies on GAE, a serverless platform for developing and hosting web applications, after the company split its backend into microservices supported by other services inside Google Cloud Platform (GCP) and added AWS as its second cloud computing provider. Snap now picks and chooses workloads to place on AWS or GCP as part of its multicloud model, playing the two providers against each other for competitive leverage.

Project Annihilate FSN came with the recognition that microservices would provide much more reliability and control, especially from a cost and performance perspective.

“[We] basically tried to make the services as narrow as possible and then backed up by one cloud service or multiple cloud services, depending on what service we were providing,” Hunter said.

Snapchat now has 347 million daily active users who send billions of photos and short videos, called Snaps, or use its augmented reality lenses.

Its new architecture resulted in a 65% reduction in compute costs, and Hunter said he came to deeply understand the importance of having competitors in Snap’s supply chain.

“I just think vendors perform better when they have real competition,” said Hunter, who left AWS as vice president of infrastructure. “You just get better prices, better features, better service. We’re cloud native, and we intend to stay that way, and it’s a big expense for us. We save a lot of money by having two clouds.”

The Annihilate FSN process was not without at least one failed hypothesis. Hunter mistakenly believed that Snap could write its apps on a single layer and that layer would use the cloud provider best suited for a workload. It proved far too difficult, he said.

“Clouds are different enough in most of their services and change fast enough that it would have taken a giant team to build something like this,” he said. “And none of the cloud providers were at all interested in us doing that, which makes sense.”

Instead, Hunter said, there are three types of services he examines from the cloud.

“There’s one that’s cloud-agnostic,” he said. “It’s pretty much the same no matter where you go, like blob storage or [content-delivery networks] or raw compute on EC2 or GCP. There’s a bit of fine-tuning if you’re doing raw computing but, overall, these services are all about equal. Then there’s kind of a mixed thing where it’s essentially the same thing, but it really takes some engineering work to change a service to run on one provider versus the other. And then there are things that are very cloud-specific, where…one cloud has it and the other doesn’t. We need to do this process to understand where we are going to spend our engineering resources to make our services work on any cloud.”
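The "cloud-agnostic" category can be sketched as a thin interface that hides the provider behind it. The sketch below is hypothetical — Snap has not published its code — and the names `BlobStore`, `InMemoryBlobStore` and `store_snap` are illustrative, with an in-memory backend standing in for a real S3 or GCS client:

```python
from abc import ABC, abstractmethod

class BlobStore(ABC):
    """Cloud-agnostic blob interface (hypothetical; illustrative names)."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryBlobStore(BlobStore):
    """Stand-in backend; a real implementation would wrap an S3 or GCS client."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]

def store_snap(store: BlobStore, snap_id: str, payload: bytes) -> None:
    # The caller depends only on BlobStore, so the backing cloud can
    # change without touching service code.
    store.put(f"snaps/{snap_id}", payload)
```

Services in the "mixed" and "cloud-specific" categories are exactly the ones where such an abstraction stops being cheap, which is where Hunter says the engineering resources go.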


Snap’s current architecture has also resulted in reduced latency for Snapchatters.

In its early days, Snap had its back-end monolith hosted in a single region in the middle of the United States – Oklahoma – which impacted performance and users’ ability to communicate instantly. If two people living a mile apart in Sydney, Australia, were sending Snaps to each other, for example, the video would have to travel through the Australian terrestrial network and an undersea cable to the United States, be processed on a server in Oklahoma and then travel back to Australia.

“If you and I are in conversation, and it takes a few seconds or half a minute for that to happen, you’re out of the conversation,” Hunter said. “You may come back to this later, but you missed this opportunity to communicate with a friend. Alternatively, if I only have the messaging stack inside the Sydney data center… you now run two miles of landline to a data center that’s practically right next to you, and the whole transaction is so much faster.”
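A back-of-envelope calculation shows why the distance matters. Both path lengths below are rough assumptions for illustration, and the result is a physical floor: real round trips add routing, queuing and server time on top.

```python
FIBER_SPEED_KM_S = 200_000  # light in optical fiber travels at roughly 2/3 of c

def min_rtt_seconds(distance_km: float) -> float:
    """Best-case round-trip time over fiber, ignoring routing and queuing delays."""
    return 2 * distance_km / FIBER_SPEED_KM_S

# Assumed path lengths, for illustration only:
via_oklahoma_km = 13_500  # Sydney to Oklahoma via undersea cable and terrestrial hops
via_sydney_km = 3.2       # "two miles of landline" to a nearby data center

print(f"via Oklahoma: {min_rtt_seconds(via_oklahoma_km) * 1000:.0f} ms minimum")
print(f"via Sydney:   {min_rtt_seconds(via_sydney_km) * 1_000_000:.0f} µs minimum")
```

Under these assumptions the trans-Pacific round trip costs well over 100 ms before any processing happens, while the local hop is measured in microseconds.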


Snap wanted to regionalize its services where it made sense. The only way to do that was to use microservices and understand which services were useful close to the customer and which weren’t, Hunter said.

“Customers benefit from being physically close to data centers because performance is better,” he said. “CDNs can cover a lot of the content that’s being streamed, but when communicating with people one-on-one — people send Snaps and Snap videos — those are big chunks of data to move across the network.”

This ability to switch regions is one of the benefits of using cloud providers, Hunter said.

“If I want to experiment and move something to Sydney or Singapore or Tokyo, I can do that,” he said. “I’m just going to call them and say, ‘OK, we’re going to put our messaging stack in Tokyo,’ and the systems are all there, and we’re trying. If it turns out that it doesn’t really make a difference, we deactivate this service and move it to a less expensive place.”

Delta Force

Snap has built over 100 services for very specific functions, including Delta Force.

Back in 2016, every time a user opened the Snapchat app, they downloaded or re-downloaded everything, including stories that a user had already viewed but hadn’t yet expired in the app.

“It was… a naive deployment of just ‘downloading everything so you don’t miss anything,'” Hunter said. “Delta Force goes and looks at the client…finds out all the stuff you’ve ever downloaded that’s still on your phone, then only downloads stuff that’s new on the network.”
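Reduced to its essence, the idea behind Delta Force is a set difference between what the server has and what the client already holds. This is a minimal sketch; the names and data shapes are assumptions, not Snap's actual protocol:

```python
def delta_sync(server_items: dict[str, bytes], cached_ids: set[str]) -> dict[str, bytes]:
    """Return only the items the client hasn't already downloaded.

    A minimal sketch of the Delta Force idea; names and data shapes
    are illustrative, not Snap's actual protocol.
    """
    return {item_id: data for item_id, data in server_items.items()
            if item_id not in cached_ids}

# The client already holds story "a", so only "b" crosses the network.
server = {"a": b"viewed story", "b": b"new story"}
assert delta_sync(server, {"a"}) == {"b": b"new story"}
```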

This approach had other advantages.

“Of course, that turns out to make the app faster,” Hunter said. “It also costs us much less, so we have reduced our costs significantly by implementing this one service.”


Snap uses open source software to build its infrastructure, including Kubernetes for service deployment, Spinnaker for its application teams to deploy software, Spark for data processing, and memcached/KeyDB for caching. “We have a process to review open source and make sure it’s safe and not something we wouldn’t want to deploy in our infrastructure,” Hunter said.

Snap also uses Envoy, an edge and service proxy and universal data plane designed for large microservices service-mesh architectures.

“Actually, I feel like…the way of the future is to use a service mesh on top of your cloud to basically deploy all your security protocols and make sure you have the right connections and people don’t have access to things they shouldn’t,” Hunter said. “I’m happy with the Envoy implementations that give us a great way to handle the load as we move between clouds.”
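The policy Hunter describes amounts to an allowlist of which services may talk to which. The sketch below is illustrative only — real meshes like Envoy enforce this in the sidecar proxy via configuration (e.g. RBAC filters), not in application code, and the service names here are invented:

```python
# Hypothetical service-to-service allowlist (invented names).
ALLOWED_ROUTES = {
    ("messaging", "user-store"),
    ("stories", "blob-cache"),
}

def authorize(source: str, destination: str) -> bool:
    """Permit a call only if the (source, destination) pair is explicitly allowed."""
    return (source, destination) in ALLOWED_ROUTES
```

Centralizing this decision in the mesh layer means the same policy follows a workload whichever cloud it lands on.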

Cloud Primitives, “Fast Travel” and Cost Camp

Hunter prefers to use primitive, or simple, services from AWS and Google Cloud rather than managed services. One Snap philosophy that has served the company well is the ability to move very quickly, Hunter said.

“I don’t expect my engineers to come back with perfectly efficient systems when we release a new feature that has a service in the background,” he said, noting that many of his team members previously worked at Google or Amazon. “Do what you have to do to get it out; let’s move quickly. Be smart, but don’t spend a lot of time tuning and optimizing. If this service isn’t taking off and it’s not being used much, leave it as is. If this service takes off and we start using it a lot, let’s go back and start fixing it.”


Through this tuning process of understanding how a service behaves, cloud duty cycles can be reduced, which translates directly into cost savings, according to Hunter.

“Our total computational cost is so large that small tweaks can give us very big savings,” he said. “If you’re not making the kind of constant changes that we do, I think it’s fine to use managed services provided by Google or Amazon. But if you’re in a world where we’re constantly making changes – like daily changes, changes multiple times a day – I think you want to have that technical expertise in-house so you can really be on top of things.”
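The scale effect Hunter describes is simple arithmetic: a small fractional reduction, applied to a very large recurring bill, is itself a large number. The figures below are purely illustrative, not Snap's actual costs:

```python
def annual_savings(annual_compute_cost: float, duty_cycle_reduction: float) -> float:
    """Savings from shaving a small fraction off a large recurring bill."""
    return annual_compute_cost * duty_cycle_reduction

# Purely illustrative figures: a 2% duty-cycle improvement
# on a hypothetical $500M annual cloud bill.
print(f"${annual_savings(500_000_000, 0.02):,.0f} per year")
```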

Three factors contribute to Snap’s ability to save money: competition between AWS and Google Cloud, Snap’s ability to cut costs through its own engineering work, and its practice of going back to cloud providers to review their new products and services.

“We are in a state of doing these three things all the time, and between these three, [we save] several tens of millions of dollars,” Hunter said.

Every year, Snap holds a “cost camp” where it asks its engineers to find all the places where costs could possibly be reduced.

“We take that list and prioritize that list, and then I release people to go work on those things,” he said. “On an annual basis, depending on the year, that’s tens of millions of dollars in cost savings.”

Adding a Third Cloud Provider and Tips for Moving to Multicloud

Snap considered adding a third cloud provider, and it could still happen one day, although the process is quite difficult, according to Hunter.

“It’s a big lift to get into another cloud, because you have those three layers,” he said. “Things that are agnostic are pretty straightforward, but once you get to the mixed and cloud-specific layers, you have to hire engineers who are good at that cloud, or you have to train your team on…the nuances of that cloud.”

Companies considering adding another cloud provider should ensure they have the engineering staff to do so: 20 to 30 dedicated cloud people as a starting point, Hunter said.

“It’s not cheap, and secondly, this team has to be quite sophisticated and technical,” he said. “If you don’t have a big deployment, it’s probably not worth it. I think of a lot of customers that I used to serve when I was at AWS, and for the vast majority of them, their implementations…served the internals of their business, and it wasn’t huge. If you’re in that boat, it’s probably not worth the extra work it takes to do multicloud.”

