Durable computing: What is it and why now?

Summary

This episode of the Thoughtworks Technology Podcast explores the concept of durable computing, a set of platforms and patterns focused on building resilient distributed systems. Host Alexei is joined by guests Brandon Cook and John Coleman to discuss why this topic is gaining relevance now, particularly as architectures become more distributed and complex.

The conversation begins by defining durable computing as the ability for a program to recover its state and continue execution from where it left off after a failure. The guests explain how these platforms handle coordination, guarantee process completion, and manage complex event-driven patterns like event replay and graceful failure recovery. They trace the origins of these concepts to foundational database principles like ACID transactions and note how companies like Uber, Airbnb, and Netflix have driven recent platform development.

Key considerations for teams evaluating durable computing platforms are discussed, including hosting models (SaaS vs. self-hosted), language support, workflow complexity, and the trade-offs between vendor lock-in and platform capabilities. The guests highlight specific platforms like Temporal, Restate, AWS Step Functions, Azure Durable Functions, and emerging options like Golem, noting differences in granularity and approach.

Practical challenges are explored, including testing strategies that must account for asynchronous execution, latency implications during recovery, and the critical importance of idempotency in consumer applications. The episode concludes with a look at emerging applications in AI agent orchestration (“durable agents”) and recommendations for getting started with these technologies through cloud provider offerings or open-source platforms.

Recommendations

Concepts

Idempotency — Critical consideration for consumer-side processing in durable systems to ensure operations can be safely retried without duplicate side effects, especially important during failure recovery.
Durable Agents — An emerging pattern combining durable computing with AI agent orchestration, allowing agents to handle failures in LLM calls, RAG operations, and human-in-the-loop interactions.
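
The idempotency concept above can be sketched in a few lines. This is a minimal illustration, assuming each message carries a unique ID; the `PaymentConsumer` class and its in-memory set of processed IDs are hypothetical and not taken from any platform discussed in the episode. A real consumer would persist the processed IDs together with the state change.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentConsumer:
    """Hypothetical consumer that applies each message at most once."""

    balance: int = 0
    # IDs of messages already applied; a real system would persist this
    # set atomically with the balance update.
    processed_ids: set = field(default_factory=set)

    def handle(self, message_id: str, amount: int) -> bool:
        """Apply a credit once; return False when the message is a duplicate."""
        if message_id in self.processed_ids:
            return False  # redelivery after a retry or replay: skip the side effect
        self.balance += amount
        self.processed_ids.add(message_id)
        return True


consumer = PaymentConsumer()
consumer.handle("msg-1", 100)
consumer.handle("msg-1", 100)  # duplicate delivery during recovery
consumer.handle("msg-2", 50)
print(consumer.balance)  # 150
```

Because the duplicate delivery of `msg-1` is ignored, a platform that redelivers or replays messages during recovery does not credit the account twice.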

Platforms

Temporal — A durable computing platform that originated from Uber’s Cadence system, noted for its growing testing support and use in building resilient distributed workflows.
Restate — A platform mentioned as being easier to deploy as a single binary, positioned as a more lightweight option compared to some alternatives.
AWS Step Functions / Durable Lambdas — Cloud provider offerings that provide durable execution capabilities, mentioned as accessible starting points for experimentation despite potential vendor lock-in.
Azure Durable Functions — Microsoft’s cloud-based durable computing solution, suggested as an accessible option for teams already using Azure services.
Golem — A newer, experimental platform taking a radical and different approach to durable computing, though noted as not yet production-ready.
Akka — An actor framework with replay capabilities that predates the term ‘durable computing’, mentioned as a lighter-weight alternative to more opinionated platforms.

Topic Timeline

00:00:00 — Introduction to durable computing and guests — Host Alexei introduces the topic of durable computing and welcomes guests Brandon Cook and John Coleman. Brandon shares how he encountered the concept during a client assessment focused on event-driven architecture patterns, noting that teams often handle happy paths well but struggle with failure recovery in distributed systems.
00:03:08 — Defining durable computing and its guarantees — John Coleman explains durable computing as a program’s ability to recover state and continue execution from where it left off, guaranteeing process completion. He describes how platforms vary in granularity—some recover fine-grained memory state while others replay effects—but all aim to coordinate workflows and handle failures in distributed contexts. The discussion contrasts this with traditional orchestration and saga patterns.
00:08:26 — Industry adoption and platform origins — The guests discuss how durable computing emerged from companies like Uber (Temporal/Cadence), Airbnb, and Netflix facing distributed system challenges. Brandon notes that organizations built internal platforms for resiliency and are now democratizing them. John adds historical context with Akka actors and database ACID transactions, highlighting how these ideas have evolved into cohesive platforms.
00:12:43 — Platform trade-offs and vendor lock-in — The conversation turns to trade-offs between lightweight frameworks (like Akka) and more opinionated platforms (like Kalix or Axon). Brandon explains that newer platforms offer more flexibility in language and design while providing durable execution backends. Cloud provider offerings like AWS Step Functions and Azure Durable Functions are mentioned as options that may lead to lock-in but offer deep integration.
00:14:26 — Key evaluation criteria for teams — Brandon outlines important considerations when choosing a durable computing platform: hosting model (SaaS vs. self-hosted), language support and SDK idioms, and understanding business workflows (long-running processes, fan-out/fan-in patterns). John adds that platforms differ significantly in capabilities, so teams must assess which fits their specific use cases.
00:17:54 — When not to use durable computing — Brandon advises that durable computing may be overkill for systems without high scalability or recoverability requirements. The decision should be business-driven, considering whether failures can be tolerated without significant impact. Teams should weigh the cost and complexity against actual needs.
00:19:00 — Testing challenges and strategies — Testing durable systems requires different approaches because code appears synchronous but is inherently asynchronous and distributed. Brandon notes that platforms like Temporal are improving testing support to avoid spinning up full infrastructure. John shares his experience using Docker containers for proof-of-concept testing but acknowledges the complexity of integration testing.
00:22:39 — Latency, idempotency, and versioning considerations — John explains latency implications: some platforms replay from the beginning (adding latency) while others recover memory state more granularly. Brandon emphasizes the critical importance of idempotency, especially on the consumer side, to prevent duplicate processing during replays. Both discuss versioning challenges for long-running workflows and compensating for business logic errors.
00:29:42 — Developer mindset shift and debugging — Developers must shift from request-response thinking to event-driven, long-running process models. Debugging changes from examining stack traces to analyzing event logs and workflow histories. Brandon notes this requires understanding workflow interactions and accepting delayed feedback compared to immediate responses.
00:31:16 — Connection to AI agents and durable agents — The guests explore how durable computing enables “durable agents” for AI orchestration. Platforms like Temporal, Restate, and Vercel’s offerings help manage failures in LLM calls, RAG operations, and human-in-the-loop interactions. These systems allow agents to tear down and restart when resources become available, making AI agent development more robust.
00:34:09 — Getting started and parting thoughts — Brandon suggests starting with cloud provider offerings (AWS Durable Lambdas, Azure Functions) for accessibility, then exploring more feature-rich platforms like Temporal or Restate. John mentions his GitHub POC with Temporal and experimentation with Golem. Both encourage hands-on exploration to understand how these platforms fit specific distributed system needs.

Episode Info

Podcast: Thoughtworks Technology Podcast
Author: Thoughtworks
Category: Technology Business Careers
Published: 2026-03-05T07:00:00Z
Duration: 00:37:42

References

URL PocketCasts: https://pocketcasts.com/podcast/65805520-c875-0131-2aed-723c91aeae46/episode/393efb7c-95c6-443c-aa3a-290b8e7c01bb/
Episode UUID: 393efb7c-95c6-443c-aa3a-290b8e7c01bb

Podcast Info

Name: Thoughtworks Technology Podcast
Type: episodic
Site: https://www.thoughtworks.com/podcasts
UUID: 65805520-c875-0131-2aed-723c91aeae46

Transcript

[00:00:00] Hello and welcome to the Thoughtworks Technology Podcast.

[00:00:11] My name is Alexei, I am one of your regular hosts and I’m speaking to you from Sao Paulo in Brazil.

[00:00:18] And this time around, we’re here to talk about durable computing.

[00:00:22] And I am thrilled to have with us Brandon Cook and John Coleman to help us navigate through this very interesting topic.

[00:00:30] Hello to both of you.

[00:00:31] Brandon, maybe you, would you mind introducing yourself?

[00:00:35] Yeah, I’m Brandon Cook, Principal Software Engineer at Thoughtworks based out of New York.

[00:00:41] Excited to chat today.

[00:00:43] It’s amazing to have you with us. Thank you so much for joining.

[00:00:46] And how about you, John?

[00:00:47] Hi, yeah, my name is John Coleman. I’m from the Bangkok Thoughtworks office and I’m a lead consultant.

[00:00:55] Amazing. Thank you so much for being here with us to talk about this.

[00:00:59] And so maybe we can get started by talking about some motivation, if you don’t mind.

[00:01:05] So, I mean, the foundations of the topic go back to the 70s, right?

[00:01:10] So ACID properties, two-phase commits.

[00:01:13] Why are we discussing these things now in the context of distributed systems?

[00:01:17] So what has changed or what is relevant to the topic at the moment?

[00:01:22] I got started in this sort of durable computing space after an assessment with a client.

[00:01:30] They were focused on, oh, are we building the right event-driven architecture patterns in our system?

[00:01:38] Right. And what we were finding is that they had all the good patterns and principles nailed down for more or less the happy path.

[00:01:46] But they were really lacking in sort of the sad paths or the failure paths.

[00:01:52] And then that’s kind of where a lot of the durable computing comes into play, sort of building in those sort of complex event-driven patterns like event replay, recovering from different failures gracefully, being able to sort of continue things in your distributed system if something goes down.

[00:02:13] That’s kind of like the main key outcomes that you get from durable computing, offloading all of that operational.

[00:02:22] Yeah, that’s cool.

[00:02:37] That makes a lot of sense.

[00:02:38] Yeah.

[00:02:38] So when the architectures are becoming more and more distributed in a way, so more microservices and the scale of those kinds of things.

[00:02:49] So, but maybe before we go any further.

[00:02:52] Maybe we can try to explain to the audience or define the concept itself.

[00:02:59] If we go back to the question, what is durable computing?

[00:03:03] How can we explain that in a simple way, maybe?

[00:03:08] I think there’s some slightly different angles on it.

[00:03:11] We had some discussion before about this.

[00:03:14] For me, I tend to focus on the state part of it.

[00:03:18] So it’s this ability for a program to recover its state and continue from where it left off.

[00:03:29] And it depends on the implementation.

[00:03:32] There’s different ways that that’s achieved.

[00:03:35] But essentially, it’s that ability to recover the process and continue from where it left off.

[00:03:43] And usually to guarantee the process completes.

[00:03:48] And the typical sort of application, if you think of a workflow, is that you might want to call various systems.

[00:03:58] So you might have two or three other APIs that you want to call something like that in a sequence or chain.

[00:04:07] And you want to guarantee you get to the end and that also the chain of events is correctly handled at each step.

[00:04:16] So it’s a lot about coordination and making sure that something either happened or doesn’t do any harm, and that you can have those kinds of guarantees, right, in the distributed context.

[00:04:29] Yeah, and it can be more fine-grained, in that some of the platforms have more sort of superpowers.

[00:04:37] So when we talk about the state, it can be that there’s levels of granularity that you can have around that.

[00:04:45] So some of the platforms are much more granular and can really recover the program and its internal memory state more precisely.

[00:04:56] So they’re much more fine-grained.

[00:04:58] Some of them are a bit more basic and they just record things like the effects and what were the results of the calls.

[00:05:07] And then they just start the whole process again, but give you the effects of the calls, something like that.

[00:05:14] So there’s a broad array of how it’s done.

[00:05:19] But it’s all aimed, in the end, towards the same kind of outcome.
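
The more basic approach described here, recording the results of calls and then restarting the process while handing back those recorded results, can be sketched as follows. This is a toy illustration under stated assumptions: the `Journal`, `Context`, `call_api`, and `workflow` names are hypothetical, the journal lives in memory rather than in durable storage, and none of this is the API of any platform named in the episode.

```python
from typing import Any, Callable


class Journal:
    """Hypothetical in-memory journal; a real platform would persist it durably."""

    def __init__(self) -> None:
        self.results: list[Any] = []


class Context:
    """Runs side effects once and returns recorded results on replay."""

    def __init__(self, journal: Journal) -> None:
        self.journal = journal
        self.position = 0  # index of the next step in this execution

    def run(self, effect: Callable[[], Any]) -> Any:
        if self.position < len(self.journal.results):
            result = self.journal.results[self.position]  # replay: skip the call
        else:
            result = effect()  # first execution: perform the call
            self.journal.results.append(result)  # record the result
        self.position += 1
        return result


calls: list[str] = []  # tracks which external calls were really made


def call_api(name: str, fail: bool = False) -> str:
    calls.append(name)
    if fail:
        raise ConnectionError(f"{name} unavailable")
    return f"{name}-ok"


def workflow(ctx: Context, payment_down: bool) -> list[str]:
    reserved = ctx.run(lambda: call_api("inventory"))
    charged = ctx.run(lambda: call_api("payment", fail=payment_down))
    return [reserved, charged]


journal = Journal()
try:
    workflow(Context(journal), payment_down=True)  # fails at the second step
except ConnectionError:
    pass

# Restart from the top: step one comes from the journal, step two runs for real.
result = workflow(Context(journal), payment_down=False)
print(result)  # ['inventory-ok', 'payment-ok']
print(calls)   # ['inventory', 'payment', 'payment']
```

The inventory call is made only once even though the workflow function runs twice, which is the effect-replay behaviour described above. The workflow code has to issue the same steps in the same order on every run for the recorded results to line up.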

[00:05:24] Yeah, that’s great.

[00:05:25] Thanks, John.

[00:05:26] And I mean, just for more for the sake of clarity, what kinds of guarantees are we talking about?

[00:05:33] How is this different from a simple orchestration, for example, when we think about a workflow and or when we compare to sagas, for example, how is it different from those kinds of patterns?

[00:05:46] Well, I think there’s some overlap; you’ll find similar terminology.

[00:05:50] Like you get assured delivery and once-only; you’ll find similar terms bounced about.

[00:06:00] So I think there’s nothing conceptually particularly new about this.

[00:06:04] I think it’s just the power of the platform to take away the pains of how you handle failure and retries and recovery.

[00:06:14] It’s basically taking away the pain of solving those problems programmatically.

[00:06:23] It’s essentially like all these teams and orgs that have started these sort of durable computing platforms, having the realization that in all distributed systems, we have to solve these types of problems.

[00:06:37] And they’re essentially just focused on abstracting that away for users.

[00:06:41] Yeah, nothing new, right?

[00:06:43] Like these are known problems, known things that you have to solve in distributed systems.

[00:06:48] It’s not a matter of if you’re going to face that problem or not, it’s more of a when, and that’s where these platforms are really focused on saying, okay, this is going to happen.

[00:07:00] This failure is going to occur, but we’re here to help you recover from that.

[00:07:05] Yeah.

[00:07:05] If you think about the history of like how people are doing these kinds of distributed systems, quite often, if you’re, if you’re writing the code, you actually don’t really know what to do when something fails at a particular point.

[00:07:18] So you, you might just cancel the whole process.

[00:07:20] And the user gets a 500 error on their API or whatever.

[00:07:26] It just, it just breaks.

[00:07:28] And it’s quite challenging as a programmer to figure out what exactly you have to do.

[00:07:32] You may not necessarily know exactly how to solve a failure at a particular point, but if you could sort of have more assurance of the delivery, that makes your life easier as a coder as well.

[00:07:45] So if we’re developing these kinds of fallible systems.

[00:07:49] Yeah.

[00:07:50] Yeah.

[00:07:50] You’re, you’re, you’re taking a lot of the coding headache away potentially.

[00:07:55] Yeah.

[00:07:55] And it’s interesting.

[00:07:55] So Brandon, you were mentioning that you came across it when you were doing an architectural assessment and looking at platforms, and John, you were talking about some of the origins.

[00:08:05] It looks like, I mean, it was an emerging challenge in the industry, based on the technologies we’re using these days.

[00:08:13] So we saw platforms coming out of Uber, Airbnb, Netflix, and then projects by Apache.

[00:08:20] So it’s interesting to see, you know, that, uh, conjunction of factors that led to that.

[00:08:24] Isn’t that right?

[00:08:25] So.

[00:08:26] Yeah.

[00:08:26] Uh, and funny enough, after that assessment, we were able to sort of move on and sort of create a platform team that was focused on building out a lot of these common capabilities.

[00:08:38] Yeah.

[00:08:38] One of the first things that we did was basically try to assess a variety of these, these platforms.

[00:08:44] Um, and there’s quite a few, and a lot of considerations that need to be taken into account.

[00:08:50] But yeah, like it is essentially a way for you to not need to have a set.

[00:08:57] Maybe you still have a platform team that knows how to host these tools or, um, help enable teams on how to use those tools in the most effective way, but then they don’t also have to go build all of the platform and orchestration components that are in these tools.

[00:09:14] So as before, you can see like all these organizations who like, who came out with these platforms.

[00:09:20] If you take like Temporal, for example, I think that came out of Uber’s sort of Cadence, um, platform, right?

[00:09:27] These are all startups or organizations that realized that built-in resiliency in distributed systems was needed.

[00:09:34] And so they built it and then now they had to maintain it and run it.

[00:09:38] And that in and of itself, it’s its own product and it becomes its own beast of itself just to say, okay, we’re building resiliency into, into the system.

[00:09:48] So now it’s like, okay, how do we take these platforms and, I guess in a way, democratize it and share it with the rest of the industry.

[00:09:56] I worked on, uh, Akka actors quite a few years ago.

[00:10:02] They have this sort of replay capability, so you can persist the state as a sort of a stream and a log, and then they can replay that to recover.

[00:10:12] So I think, I think that’s something that’s been around quite a long time now.
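
The replay capability mentioned here, persisting state changes as a log and replaying them to recover, can be sketched in plain Python. This is a generic event-sourcing illustration and not Akka’s API; the `Account`, `Deposited`, and `Withdrawn` names are hypothetical, and the Python list stands in for a durable, append-only store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deposited:
    amount: int


@dataclass(frozen=True)
class Withdrawn:
    amount: int


class Account:
    """Hypothetical entity whose state is derived only from its event log."""

    def __init__(self, log: list) -> None:
        self.log = log  # stands in for a durable, append-only store
        self.balance = 0
        for event in log:  # recovery: rebuild state by replaying every event
            self._apply(event)

    def _apply(self, event) -> None:
        if isinstance(event, Deposited):
            self.balance += event.amount
        elif isinstance(event, Withdrawn):
            self.balance -= event.amount

    def deposit(self, amount: int) -> None:
        self._persist(Deposited(amount))

    def withdraw(self, amount: int) -> None:
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self._persist(Withdrawn(amount))

    def _persist(self, event) -> None:
        self.log.append(event)  # persist first, then update in-memory state
        self._apply(event)


durable_log: list = []
account = Account(durable_log)
account.deposit(100)
account.withdraw(30)

recovered = Account(durable_log)  # simulate a restart: new object, same log
print(recovered.balance)  # 70
```

The recovered object never saw the original method calls; it reaches the same balance purely by replaying the two logged events.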

[00:10:16] Um, but I don’t think anyone called it durable computing at the time, because it isn’t quite that perhaps.

[00:10:24] Um, but yeah, there’s been these ideas and, like we mentioned the database before as well, you know, your ACID transaction, you can commit and roll back, all of that.

[00:10:36] So I think there’s been a progression, uh, to some extent of these developments and now we’ve kind of arrived at the point where it’s pulled together.

[00:10:48] You don’t have to sort of conform to some sort of model. Before it was like various pieces, but now it’s just like, we can do the computation, we can just write some business application code and we get the guarantees.

[00:11:02] We don’t have to be thinking about these different bits and pieces and making them work together sort of incoherent before.

[00:11:09] And now it’s, it’s all formed into one easier to use, uh, way of working.

[00:11:14] Yeah.

[00:11:15] It’s funny that you mentioned Akka, because I think it highlights a lot of key context around some of these platforms, whether they are more heavily opinionated and heavyweight or more lightweight.

[00:11:28] So I would say Akka probably is more in the lighter-weight area of the world.

[00:11:32] Cause it is more of a framework that you can build into your code base.

[00:11:37] But I think the organization that builds Akka is called Lightbend.

[00:11:41] They also had like this more heavily opinionated system called Kalix as well.

[00:11:47] They don’t brand it as a durable computing platform, but it does kind of the same principles, but it was like, once you’re locked in, you’re building in Kalix, cause that’s the way you need to do it.

[00:12:00] Similar with other kinds of event-driven platform type experiences that are trying to build resiliency, like Axon or anything like that.

[00:12:10] Like some of those are like, once you start building in those, you’re kind of stuck in that way of working and that way of designing the system.

[00:12:19] So you essentially got to become expert in that, in that framework, whereas the newer ones are, are starting to become a little more flexible in terms of how you want to work and what language you want to work in and, and being a little more lighter, lighter weight in terms of that, but still providing kind of that key durable execution backend for you.

[00:12:40] Yeah, the lock-in aspect is, is interesting.

[00:12:43] I mean, even cloud providers also have some of their platforms; AWS has Step Functions.

[00:12:49] And Azure also has some flavor of that.

[00:12:52] So that is definitely one key trade-off to, to keep in mind.

[00:12:56] Right.

[00:12:57] Yeah.

[00:12:58] Like for the org that I was working with on the assessment, they were really itching for AWS to kind of build like durable lambdas.

[00:13:09] So like, they were like, we really want this.

[00:13:11] It wasn’t out yet, but like, I think it was like last year they announced those durable lambdas at, probably, re:Invent or something like that.

[00:13:22] So I imagine that that org is probably looking at those now because they were very much embedded and locked into AWS, but that was a decision they’d made from a technology strategy.

[00:13:35] And they were quite effective at, at deploying and, and working with AWS.

[00:13:40] So it was kind of, they were like, they saw AWS as their platform of choice.

[00:13:44] Yeah.

[00:13:44] And I mean, that’s always a trade-off, right?

[00:13:46] If you’re deeper into a platform, you can leverage more of the resources the platform will offer you and be more efficient.

[00:13:49] But maybe I’ll ask you this, Brandon, and John, feel free to add, building on the experience you had doing an assessment and looking at this from, you know, making those architectural decisions and considering trade-offs and those kinds of things.

[00:14:09] So what should teams evaluate when looking at those technologies?

[00:14:15] So what are some of the, you know, the key dimensions that are relevant?

[00:14:18] Some of the questions you need to ask yourself when thinking about these kinds of things and making those architectural decisions.

[00:14:26] First, I always look at the hosting model.

[00:14:30] So like, where are you essentially going to host this, this durable computing platform?

[00:14:36] Cause it is literally going to store everything that you execute.

[00:14:41] Like some of them will literally store everything that you’re executing. Enterprises like, I don’t know, financial systems and things like that,

[00:14:48] they’re not going to want to use the SaaS version of these tools.

[00:14:53] They’re going to want to self-host.

[00:14:55] So the operational burden of, yes, you don’t have to build it, but now you still have to host it and make sure that it’s running.

[00:15:02] There’s some operational costs to running that infrastructure.

[00:15:06] So looking at the, the, the hosting model and understanding what you need to do.

[00:15:12] So if you’re a startup, you’re trying to get something going, you want to build a resilient system.

[00:15:17] Maybe that SaaS product is something you’d reach for quickly, but if you’re trying to maybe internalize, looking at the hosting models, so like looking at, I guess, a Temporal versus a Restate, um, from a platform perspective, maybe those folks are kind of reaching more for Restate because it’s like a single binary.

[00:15:37] It’s easier to deploy.

[00:15:38] Now, I think Temporal is making strides to make, I guess, their self-hosted choice a little easier to host, but I’m not sure what the developments are there.

[00:15:47] But it is part of the decision criteria that you would think through.

[00:15:51] Another thing I would sort of assess is like what languages do your teams know?

[00:15:58] Like how do they actually develop on the day to day and what’s supported by the various platforms?

[00:16:03] Cause at the end of the day, you’re still going to need to understand, like, what do these SDKs look like that actually integrate with those platforms, and are your teams going to be able to understand them?

[00:16:14] Like, are they going to be able to sort of understand, I guess, the workflows and all these different, um, facets of the idiomatic aspects of all these different SDKs and languages.

[00:16:25] So I think that’s a probably key, key aspect to consider.

[00:16:30] And then, um, yeah, like what workflows do you have?

[00:16:35] I think really understanding the business domain, obviously as, as, as a first principle around, like sort of just still doing domain driven design and saying, okay, what workflows do we have?

[00:16:45] Do we have a bunch of long running processes?

[00:16:48] Do we have a clear, distinct workflow?

[00:16:51] Do we have workflows that kind of fan out or fan into certain things?

[00:16:55] I think all of those have to be considered when sort of choosing which one of these durable computing platforms to reach for.

[00:17:03] Yeah.

[00:17:04] And then there’s, you can dig into the like details of, uh, you mentioned the idioms.

[00:17:10] So there’s a lot of overlap between them.

[00:17:13] Uh, they can have slightly different idioms and slightly different applications.

[00:17:18] Some of the platforms cross over a lot.

[00:17:22] So you’ve definitely got like things you can choose from.

[00:17:25] And some of them are a little bit more distinct and have particular use cases that they might be better for, uh, versus other ones.

[00:17:33] So you have to dig in and understand, uh, what it is that the platforms are offering and how, how that fits your use case.

[00:17:41] They’re not all going to be sort of equal.

[00:17:44] In fact, they’re all quite different on the whole.

[00:17:46] Yeah, that’s great.

[00:17:47] That’s great.

[00:17:48] And are there, uh, obvious scenarios in which you should not be using durable computing?

[00:17:54] Are there a couple of factors to consider to say, Hey, not needed in this case?

[00:17:59] So, yeah, I think like, well, you got to think of what your scale looks like.

[00:18:03] What’s your uptime that you need to deal with?

[00:18:06] Can you deal with some failures and recover without hurting the business?

[00:18:11] Right.

[00:18:12] I know we always put these in like the heavy technical bucket, but at the end of the day, they’re very much business-driven decisions.

[00:18:20] So if you need high scalability, high recoverability, all these different aspects, maybe a durable computing platform is something you need to reach for.

[00:18:31] Um, if not, maybe it’s a little overkill for your system.

[00:18:35] Right.

[00:18:36] Um, and you’re, you’re spending a bunch of money that you don’t need to spend.

[00:18:40] Yeah.

[00:18:40] Cool.

[00:18:41] And any considerations to testing?

[00:18:43] So, I mean, obviously we’re leveraging the platform to bring a lot of resilience to the workflow and those kinds of things, but is there anything that we still need to be mindful of, uh, or other scenarios we need to consider?

[00:19:00] So how does that change, uh, the testing strategy overall?

[00:19:07] Yeah.

[00:19:07] I mean, the testing strategy, I think, changes significantly, particularly if you’re just trying to think about, oh, what does a normal unit test look like?

[00:19:16] Or what does a full test of the whole program look like? Because are you going to spin up all the infrastructure for your tests, right, to get the feedback?

[00:19:26] Um, what testing support these platforms provide you is also probably a key decision criterion.

[00:19:37] Uh, last thing I remember is Temporal’s testing support was growing significantly.

[00:19:44] Um, so like, yeah, getting that fast feedback, so you don’t have to spin up the whole infrastructure to really make sure something is working.

[00:19:51] Another key aspect is these systems may look like, when you’re writing the code, it may feel like you’re writing synchronous code, but at the end of the day, it’s still distributed.

[00:20:02] It’s all async.

[00:20:05] It’s all event driven underneath the hood.

[00:20:07] Maybe you don’t need to deal with the event-driven pieces a lot.

[00:20:11] So that also has an effect on how you want to test or how you approach testing as well.

[00:20:17] Cause it is definitely a mental shift to test synchronous code versus, um, more event-based or asynchronous code.

[00:20:26] Yeah.

[00:20:27] I’m just thinking back.

[00:20:28] I did a sort of proof of concept with, uh, Temporal, and the way I did that is just using Docker containers.

[00:20:37] So I ended up having to have a container full of the external services that I was, um, linking together in the workflow.

[00:20:47] Um, and you can, you can imagine that quite easily becoming unmanageable if you had a substantial or you had an external provider.

[00:20:57] Um, I mean, those are problems you’re going to face anyway, if you’re doing integration testing.

[00:21:02] Um, but now you’ve added a bit more complexity to that story as well.

[00:21:07] Yeah.

[00:21:08] Good.

[00:21:09] Yeah.

[00:21:09] Well, there’s the accidental complexity, but there’s always the essential complexity as well that you can’t remove.

[00:21:16] Right.

[00:21:16] So.

[00:21:17] And that’s, that’s part of the, the problem itself and, uh, non-solid technologies we’re using.

[00:21:22] Once you start testing locally and faking things, you are in a sort of false world as well.

[00:21:29] You, you can’t guarantee that, you know, at some point you really want to get to the production thing.

[00:21:36] Um, but you know, you always have to be a bit progressive with your tests, and you start with something local just to check the contracts and the behaviors.

[00:21:46] And then you sort of wishfully think it’s going to work like that in production as well.

[00:21:51] You know, you hope for it, it’ll, it’ll be the same, but at some point you’re going to need those other services sandboxed or something like that.

[00:21:59] Um, so yeah, you face a lot of the same challenges as regular end-to-end testing.

[00:22:06] Yeah.

[00:22:06] Good.

[00:22:07] And, I mean, moving beyond testing, I’m also sure that, uh, you know, that resilience, those things the platforms bring, they don’t come free, so there are probably, you know, things you need to be able to implement in the code and some gotchas that you need to be mindful of.

[00:22:28] What are some of those things, in your experience, that you need to pay close attention to when developing on those platforms?

[00:22:39] One of the ones that I’ve been looking at and considering is like latency.

[00:22:44] So if you.

[00:22:46] There’s various points at which it might fail.

[00:22:49] So some of the ways that the durable compute platforms work are like a full recovery.

[00:22:55] So your process is going to start from the beginning again.

[00:23:00] Now, in theory, all the effects, the async stuff, should happen quickly when the process is being replayed, because they’re being replayed out of some durable state in memory or on disk somewhere.

[00:23:15] So.

[00:23:16] It should progress a lot faster because it’s not doing the I/O anymore, but that latency is still there.

[00:23:24] So if you potentially had a very long or complex process or something with a lot of computation inside it that had to be replayed, you could still have like an additive effect to your latency.

[00:23:38] That could be quite significant.
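
Editor’s sketch: the replay behavior described above can be illustrated with a small Python example. This is a minimal sketch under stated assumptions, not how any particular platform is implemented: `ReplayContext`, `order_workflow`, and the in-memory `journal` list are hypothetical names, and the list stands in for whatever durable log a real platform keeps.

```python
class ReplayContext:
    """Runs workflow steps, recording each result in a journal.

    The journal (a plain list here) is a hypothetical stand-in for the
    durable log a real platform would keep on disk or in a database.
    """

    def __init__(self, journal):
        self.journal = journal      # results of steps that already completed
        self.position = 0           # index of the next step in this execution
        self.effects_executed = 0   # how many steps did real work this time

    def step(self, effect):
        if self.position < len(self.journal):
            # Replay: reuse the recorded result instead of redoing the I/O.
            result = self.journal[self.position]
        else:
            # First time at this step: run the effect and record its result.
            result = effect()
            self.journal.append(result)
            self.effects_executed += 1
        self.position += 1
        return result


def order_workflow(ctx, fail_before_shipping=False):
    reserved = ctx.step(lambda: "stock reserved")
    charged = ctx.step(lambda: "card charged")
    if fail_before_shipping:
        raise RuntimeError("simulated node failure")
    shipped = ctx.step(lambda: "order shipped")
    return [reserved, charged, shipped]


journal = []

# First attempt crashes after two steps have been journaled.
try:
    order_workflow(ReplayContext(journal), fail_before_shipping=True)
except RuntimeError:
    pass

# Recovery restarts the function from the top, but only the third step
# executes for real; the first two are read back from the journal.
recovery = ReplayContext(journal)
result = order_workflow(recovery)
print(result, recovery.effects_executed)
```

In this sketch the journaled steps are cheap on recovery, but any code between the steps still runs again from the top, which is where the additive replay latency for long or computation-heavy processes would come from.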

[00:23:40] Some of the platforms, the more granular ones, will recover more of the memory state.

[00:23:46] So they will genuinely continue the process where it left off.

[00:23:52] So you wouldn’t have that sort of replay latency coming into play.

[00:23:57] Um, the other kind of latency you can get is when a process itself fails, or a node of the durable platform fails, and then that node has to be recovered and given its state, and that recovery process itself can take a while.

[00:24:15] So that would depend on, like, the op log or whatever it is behind restoring the state; that can take a while to replay as well.

[00:24:27] And there’s, there’s various strategies you can use, um, to try and tweak that on some of these platforms.

[00:24:35] Um, but you are going to get some latency side effects potentially.

[00:24:39] And of course there’s, um, resource overhead involved with that as well.

[00:24:44] Another key thing is idempotency. All of these platforms rely heavily on determinism.

[00:24:53] So just like John was saying, right?

[00:24:54] If something spins back up and is trying to replay, it’s very important to have that idempotency built into the domain that you’re building, as well as idempotency with any third parties or things of that nature.

[00:25:12] Cause the last thing you want to do is have something fail.

[00:25:15] It spins back up and maybe a financial transaction gets processed twice, because you don’t have any idempotency and the platform spun back up and replayed the last step that it was in the middle of doing.

[00:25:31] So, yeah, I think that’s, that’s really key.

[00:25:34] And it’s very similar to like any other event driven architecture, right?

[00:25:38] Having idempotency as a focus.

[00:25:41] But I’ve seen a lot of areas where idempotency isn’t considered, right?

[00:25:48] People kind of just assume that, okay, yeah, it should be fine.

[00:25:52] Happy path is working.

[00:25:54] Oh, I didn’t receive the event twice.

[00:25:56] It’s been working.

[00:25:57] And then, okay, when it does happen, trying to figure it out or debug it becomes a very stressful nightmare.

[00:26:04] So it’s about understanding what idempotency is, and understanding it primarily on the consumer side.

[00:26:11] I’ve seen teams also focus on the producer side, like from events.

[00:26:15] They’d be like, oh yeah, we built idempotency into our producer.

[00:26:20] And then the consumers downstream are just like, oh yeah, well, they’ve built idempotency up there.

[00:26:24] So we don’t have to worry about it.

[00:26:26] Right.

[00:26:26] But we all know that’s probably not going to happen, especially with events being able to fire off multiple times,

[00:26:35] just on the basis of those systems.

[00:26:37] So really focus on consumer idempotency and on guarding yourself against

[00:26:44] multiple events or multiple replays of the same thing happening, and ensure that the same thing happens in the state each time,

[00:26:57] so the state is maintained throughout the system.
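
Editor’s sketch: consumer-side idempotency, as discussed above, is often approached by remembering which events have already been applied. The Python below is a minimal illustration under stated assumptions: `PaymentEvent`, `PaymentConsumer`, and the in-memory set of processed IDs are hypothetical, and it assumes every event carries a stable `event_id` that is unchanged on redelivery or replay.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PaymentEvent:
    event_id: str        # assumed stable across redeliveries and replays
    account: str
    amount_cents: int


@dataclass
class PaymentConsumer:
    balances: dict = field(default_factory=dict)
    processed_ids: set = field(default_factory=set)

    def handle(self, event: PaymentEvent) -> bool:
        """Apply the event once; return False if it was a duplicate."""
        if event.event_id in self.processed_ids:
            return False  # already applied, so a replay changes nothing
        current = self.balances.get(event.account, 0)
        self.balances[event.account] = current + event.amount_cents
        # In a real system the state change and this record would need to be
        # committed together (for example in one database transaction).
        self.processed_ids.add(event.event_id)
        return True


consumer = PaymentConsumer()
event = PaymentEvent(event_id="evt-1", account="acct-9", amount_cents=500)

first = consumer.handle(event)    # applied
second = consumer.handle(event)   # duplicate delivery or replay, ignored
print(first, second, consumer.balances["acct-9"])
```

The check lives in the consumer on purpose: it holds whether or not the producer also deduplicates, which matches the point made above about not relying on upstream idempotency.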

[00:27:01] Yeah.

[00:27:01] I mean, as far as I know, you can also have long-running workflows with state, like months long.

[00:27:10] And what do you need specifically regarding, you know, perhaps backward compatibility?

[00:27:14] Yeah.

[00:27:14] You know, I think it’s, it’s a good question.

[00:27:17] I think what you’re saying is pretty interesting: how can you build a system that’s able to run a service in a single day without having to worry about, you know,

[00:27:21] the same thing happening at the same time with different services and

[00:27:23] with different versions of a service available, and those kinds of things?

[00:27:24] So how, how, how does that kind of thing work?

[00:27:26] Yeah.

[00:27:27] I think this is where the platforms haven’t really built that out for you.

[00:27:31] So yeah.

[00:27:32] Really thinking about what that versioning looks like, because once you start deploying a new version out there, maybe some long-running process, like you’re saying, is still running.

[00:27:43] Okay.

[00:27:43] And now you have a new version, and then there’s obviously some technical failure that occurs there; that can be a problem.

[00:27:52] Another thing where these durable computing platforms aren’t going to save you is if you also have like a business failure as well.

[00:27:59] Like if maybe how you perceived you were doing something is incorrect.

[00:28:06] And now you need to compensate for those down the line; that goes hand in hand with the versioning aspect.

[00:28:12] Right. Like you have a bug in the system that we need to fix or someone inputted data that was incorrect that we need to fix that affected downstream systems.

[00:28:22] Yeah, the resiliency of the platforms isn’t going to save you from that.

[00:28:26] So really considering that versioning strategy and how you’re going to migrate over and understanding what processes are running in that workflow and how long they’ve been running to ensure that everything sort of continues and works as expected.
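
Editor’s sketch: one possible versioning strategy for the situation described above is to pin each workflow run to the version it started on, so that a run already in flight is resumed against the code that produced its history. The Python below is a simplified illustration under stated assumptions; `WORKFLOW_STEPS`, `WorkflowRun`, `start_run`, and `resume` are hypothetical names, and real platforms offer their own mechanisms that may work differently.

```python
from dataclasses import dataclass, field

# Two versions of the same workflow, expressed as ordered step names.
# Version 2 inserts a fraud check, which would not line up with the
# recorded history of a run that started on version 1.
WORKFLOW_STEPS = {
    1: ["reserve_stock", "charge_card", "ship_order"],
    2: ["reserve_stock", "fraud_check", "charge_card", "ship_order"],
}
LATEST_VERSION = 2


@dataclass
class WorkflowRun:
    run_id: str
    version: int                                   # pinned when the run starts
    completed: list = field(default_factory=list)  # recorded step history


def start_run(run_id):
    """New runs always start on the latest deployed version."""
    return WorkflowRun(run_id=run_id, version=LATEST_VERSION)


def resume(run):
    """Finish a run using the step list of the version it started on."""
    steps = WORKFLOW_STEPS[run.version]
    # Guard: the recorded history must be a prefix of the steps we resume with.
    if run.completed != steps[:len(run.completed)]:
        raise ValueError(f"history of {run.run_id} does not match v{run.version}")
    for name in steps[len(run.completed):]:
        run.completed.append(name)  # a real step would do its work here
    return run


# A run that started before the deployment keeps following version 1.
old_run = WorkflowRun("order-17", version=1,
                      completed=["reserve_stock", "charge_card"])
resume(old_run)

# A run started after the deployment follows version 2.
new_run = resume(start_run("order-18"))
print(old_run.completed)
print(new_run.completed)
```

Pinning keeps old runs consistent, but it also means old versions have to stay deployable until their last run finishes, which is why knowing what is running and for how long matters when planning a migration.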

[00:28:42] And just curious, from more of a developer mindset perspective: I know many, maybe most, developers are used to a request-response style of developing. You get a request, you produce a response, and that’s it; everything that’s related to the transaction happens within the production of that response.

[00:29:10] And you see that as a unit.

[00:29:12] You don’t have to worry about, you know, other things happening in parallel.

[00:29:17] But when we’re talking about these long running processes, stateful, with retries and those kinds of things, the way of approaching and thinking about that shifts, doesn’t it?

[00:29:31] So what have you seen? How has that shift been for you?

[00:29:37] And what have you seen in teams and the way developers approach it?

[00:29:42] Yeah.

[00:29:42] Definitely, it’s a mental model shift, obviously, from that more request and response type function to understanding, okay, this is event based: this event triggered this, maybe now it’s been long running for this long, then it kicks off something else.

[00:30:00] So it’s really focused, rather than on looking at a stack trace, more on, okay, what’s the event log or the history?

[00:30:12] And that will translate into how you design the system, right?

[00:30:20] So really fundamentally trying to understand sort of the workflow interactions, understanding, okay, this is, we can kick this off, let it run for this time period and know that we’re going to still continue on at some point once it completes, rather than always having to just wait for that instant feedback, getting into debugging as well.

[00:30:41] Because the debugging does feel quite different, right?

[00:30:44] Trying to understand, when a failure does occur or when something goes wrong, how do I actually debug it?

[00:30:52] Debugging the system is quite different as well.

[00:30:54] Great.

[00:30:55] Great.

[00:30:56] And maybe the last topic I’m quite curious about and wanted to hear your thoughts on: people have been talking about durable computing, some of these platforms, connected to agentic

[00:31:12] development and the use of agents. What’s the connection there?

[00:31:16] Why, why are people talking about these platforms as, uh, you know, enabling AI orchestration and those kinds of things?

[00:31:24] So what, what’s the connection?

[00:31:26] What have you seen related to that?

[00:31:28] Yeah, there’s, I guess, a new term, a new technique. Obviously with AI there’s a million new terms and techniques popping up every second, but I guess they’re calling it durable agents.

[00:31:42] So you can see, like, Temporal, Restate.

[00:31:46] I think even Vercel has their own durable workflow thing that is focused on, on durable agents that they’ve released.

[00:31:55] I think it’s a convergence of the two technologies coming together: while people are building agentic architectures, and making them distributed too, potentially with multi-agent architectures and orchestration,

[00:32:11] they’re starting to realize that, okay, what if I can’t reach that LLM provider, or what if I can’t search the database for some RAG operation? How do I actually recover when those things are available again?

[00:32:29] I think any of the human-in-the-loop interactions that you have in those systems as well, because maybe it’s in the middle of a workflow

[00:32:42] but they’re waiting on a human response. That thing could sit for days.

[00:32:46] Someone doesn’t respond for days, but you don’t want to have like that agent up and running, just waiting and listening.

[00:32:53] So with these platforms, you can just have it tear down.

[00:32:56] And then once someone responds, it will kick off that workflow again and spin everything back up.
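
Editor’s sketch: the tear-down-and-resume pattern for human-in-the-loop steps can be illustrated by persisting the workflow state before waiting and reloading it when the reply arrives. This Python is a minimal sketch under stated assumptions: `pause_for_approval`, `on_human_response`, and the dictionary `store` are hypothetical, with the dictionary standing in for a database or the platform’s own state store.

```python
import json


def pause_for_approval(store, workflow_id, draft):
    """Persist the workflow state so the agent process can shut down."""
    store[workflow_id] = json.dumps(
        {"status": "waiting_for_human", "draft": draft}
    )


def on_human_response(store, workflow_id, approved):
    """Called when the reply arrives, possibly days later, in a fresh process."""
    state = json.loads(store[workflow_id])
    if state["status"] != "waiting_for_human":
        return state  # duplicate reply: the workflow already moved on
    state["status"] = "sent" if approved else "discarded"
    store[workflow_id] = json.dumps(state)
    return state


store = {}  # stand-in for a database table or a platform's state store

pause_for_approval(store, "refund-42", draft="Refund of $20 approved?")
# ... the agent process exits here; nothing is running while we wait ...

first = on_human_response(store, "refund-42", approved=True)
second = on_human_response(store, "refund-42", approved=False)  # late duplicate
print(first["status"], second["status"])
```

Because everything needed to continue lives in the stored state, no process has to stay up listening while the human takes their time, and a duplicate reply leaves the outcome unchanged.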

[00:33:02] So, yeah, it’s a very interesting space, and it’s definitely emerging.

[00:33:09] So yeah, it’s exciting to see where it goes.

[00:33:12] But it should make the life of a developer, uh, a lot easier around these kinds of, uh, agent based solutions.

[00:33:20] You know, you want to interact with those different systems.

[00:33:24] Maybe you want to call a Lambda function and then you want to call the database, or whatever.

[00:33:31] And then you’ve got some APIs you need to call. Durable is going to make developing those kinds of agents a lot easier.

[00:33:38] I thought it fits very nicely.

[00:33:42] And, uh, you know, it’s come along in a timely fashion for, for the AI solutions.

[00:33:48] Yeah.

[00:33:48] Amazing.

[00:33:49] Let’s, let’s keep an eye on that.

[00:33:50] It’s definitely an exciting field.

[00:33:52] So let’s see, let’s see how it evolves, but coming to, to the end of the episode, any, any parting thoughts, uh, you want to share?

[00:34:00] I mean, any, uh, you know, ideas for the future, where’s this headed, or if someone wants to learn more about these platforms, where to start.

[00:34:09] So any, anything you want to share before we.

[00:34:12] Yeah, I guess the best way to get started is just to look up one of these platforms and start playing with it, seeing how it might fit into your systems, especially if you’re building a distributed system and your team is struggling with a bunch of failures and recoveries, and seeing if this is a viable option to start incorporating into your system.

[00:34:39] So, I mean, we listed off a bunch of them.

[00:34:42] There’s Restate, Temporal, Golem. I think it’s quite overwhelming at the moment, definitely, but the explosion of these platforms is warranted, because they do fit a need in the industry, particularly around building distributed systems.

[00:35:05] And is, is, is there any one of those platforms, Brandon, that would be, you know, if you want to start, start here?

[00:35:12] Or they just, they just fit different needs and they have, each of them has their, you know, applicability and, uh.

[00:35:19] I think maybe the easiest ones to get your head around, and maybe the most accessible, may be the cloud platform ones, like Azure Durable Functions or, obviously, the AWS durable Lambdas that we mentioned before.

[00:35:36] Those may be the most accessible for you to play with and spin up and test out.

[00:35:42] I think the other platforms, like the ones that we listed, I listed before, provide a lot more of the bells and whistles that you, you might need in terms of like observability and all these other things that, um, the cloud providers haven’t really focused on.

[00:35:58] I mean, there’s some stuff there, but there’s a lot more tooling with, with these other platforms.

[00:36:03] So maybe it’s, it’s worthwhile starting with those to get your, get your feet wet and understand them.

[00:36:08] But then once you want to start adopting, start considering some of these.

[00:36:12] Yeah, you can join us on the durable computing space as well, if you have questions or want to talk about these things. We’re trying to gather info together to help with these kinds of processes: how you might select the right technology, what the features are.

[00:36:31] And I also put together a, uh, GitHub, uh, project, which you can check out.

[00:36:36] I did a, like a POC with Temporal.

[00:36:39] So you can pull that and play with that; it’s quite simple.

[00:36:44] Um, I also experimented with Golem.

[00:36:48] Uh, Golem is quite fresh, and I wouldn’t say it’s particularly production ready yet, but it’s a very interesting new player on the scene.

[00:36:57] It takes quite a radical and different approach to durable computing than the more established solutions.

[00:37:05] Um, yeah, so you can have a look at, have a look at the code and play with it.

[00:37:11] Um, but yeah.

[00:37:12] Of course you can always check the, the websites and have a look at the platforms themselves on the websites.

[00:37:18] They’re going to tell you everything you need to know.

[00:37:20] All right.

[00:37:21] Then I guess this brings us to the end of this episode, Brandon, John, thank you very much for joining.

[00:37:28] It’s been an amazing conversation.

[00:37:29] Lots of fun.

[00:37:31] Thank you very much.

[00:37:32] Bye.
