Add ResumableJobMixin with SparkSubmitOperator as a case study for surviving worker failures (standalone)#67118
Conversation
| if self._hook is None: | ||
| self._hook = self._get_hook() | ||
| if self._hook._should_track_driver_status: | ||
| return self.execute_resumable(context) |
There was a problem hiding this comment.
This should really be self.defer() at this point/soon after it. It's precisely the use case Triggers are ideal for (that where a job is submitted to a remote service and you have an ID and want to efficiently poll for it to be completed.
There was a problem hiding this comment.
Yes, this is a textbook deferrable case. This PR is specifically addressing the synchronous operator case, which is a different problem from the deferrable case.
For deferrable operators, the triggerer already handles continuity. The failure mode I am attempting fixing here (worker dies mid-poll, driver ID lost, retry resubmits a duplicate) does not apply to deferrable operators once they have deferred.
The reason i am not using defer() here is that the deferrable path has its own adoption barriers — it requires a separate Triggerer component, introduces an async programming model that is a different mental model from standard sync model, and has seen very low adoption in practice despite being available for several years. Teams that write custom operators are known to write synchronous ones. These are reasons to solve the sync case on its own terms rather than requiring users to migrate to deferrable.
The two approaches are complementary, not competing. The spark_job_id that we persist to task_state is intentionally compatible with a future deferrable implementation — a SparkSubmitTrigger could read the same key from task_state to reconnect before deferring. But thats a diff problem as I see it
There was a problem hiding this comment.
+1 on the approach. We've used custom deferrable sensors extensively, but run GlueJobOperator synchronously in production. We explored deferrable for compute operators but didn't fully adopt it (triggerer bottlenecks, didn't dive deep enough into reliability concerns). Either way, the worker crash recovery problem is orthogonal to deferrable, as you point out.
Glue seems like a great case to test this as well, since it recently added resume_glue_job_on_retry which aims to solve the same problem via XCom + scanning all job runs as fallback. Haven't tried it yet but plan to shortly. Would the intent be for operators like Glue to eventually adopt ResumableJobMixin and replace that ad-hoc mechanism ?
There was a problem hiding this comment.
@Alexhans thanks for chiming in. Yes, exactly thats the intent. GlueJobOperator``resume_glue_job_on_retry is solving the same problem but with two rough edges that task_state addresses cleanly:
- XCom is shared across all TIs in a DAG run — using it for retry continuity is a workaround.
task_stateis scoped to a single TI across its retries, which is exactly the right scope for "what job did I submit last time."
"Scan all job runs as fallback" exists because there's no reliable place to store the ID before the worker dies. Withtask_state, the ID is persisted before polling starts, ie: no scan needed, no ambiguity about which run belongs to this retry. - Migrating Glue to
ResumableJobMixincan be a probably nice follow-up — drop the XCom write + scan logic, implement the six abstract methods for resumable, and the retry behaviour becomes standard.
I'm not sure that is always true -- it depends on what mode you submit the job as. (My spark knowledge is rusty though) |
@ashb This is true specifically for The PR targets cluster mode only. Will tighten the PR description. |
jscheffl
left a comment
There was a problem hiding this comment.
Sorry for delayed review.
As expressed in devcall I am not fully convinced that SparkSubmit is taken as an example. If there are real issues with Deferred execution then I think we should also ensure this is running reliable.
I understand concerns and for bis spark jobs running a dedicated worker might be neglegible but in my view we should not "advertise" this as the new superior solution for long running tasks. Still should be Triggerer in my view and using this MixIn for cases where Triggerer is not reliable (as a workaround until improved) or in cases where knowledge or complexity is not matching implementing a Deferred.
Some non-blocking comments on the interface. Not blocking but just some thoughts.
Note that KPO is not a candidate for this interface because it already imeplements a resume option on both deferrable=true/false. Before a new Pod is started a query is made to K8s if a Pod can be found via labels as labels get task-id, dag-id, run-id and map-index attached. So state is in K8s already. Operators inheriting from KPO inherit this feature as well.
Especially in KPO a lot of fixes were made the last ~9months (e.g. by @AutomationDev85 and @wolfdn) so before assuming a resume option is added we should recommend using Triggerer. This includes log tailing as well.
Otherwise: Looks great! Hope to get this in 3.3.0!
| def submit_job(self, context) -> str: | ||
| return self.hook.submit(...) | ||
|
|
||
| def get_job_status(self, external_id: str) -> str: |
There was a problem hiding this comment.
How about exposing this as property and not as a get_*() method?
There was a problem hiding this comment.
get_job_result(self, external_id, context) takes a parameter so it cannot be a property. It takes args and submits a job
There was a problem hiding this comment.
That is what I was wondering about from interface perspective as well. If it is common then the external ID could be in side the MixIn.
There was a problem hiding this comment.
It already is, external_id_key is already defined in the mixin (with a default of "remote_job_id"). Subclasses override it with their specific key name, e.g. external_id_key = "spark_job_id" in SparkSubmitOperator. The storage and retrieval in execute_resumable() reads self.external_id_key directly, ie: the operator only declares the name, the mixin owns the read/write logic.
| return self.hook.get_status(external_id) | ||
|
|
||
| def is_job_active(self, status: str) -> bool: | ||
| return status in ("RUNNING", "PENDING") |
There was a problem hiding this comment.
Would it make sense instead of adding new strings using the TaskInstance.state enum for this?
There was a problem hiding this comment.
Alternatively if you think this should not be leveraged adding a new enum for the state?
There was a problem hiding this comment.
The status strings ("RUNNING", "PENDING", etc.) are external system states not Airflow's TaskInstanceState. TaskInstanceState has values like running, success, failed which are Airflow task states, not remote job states.
An enum in the mixin would force every provider to translate their native states into it, adding a layer with no real benefit.
The current contract — two boolean methods is_job_active() and is_job_succeeded() — is deliberately looser. The mixin does not need to know what states are possible, it just asks "is this still running?" and the subclass answers with whatever logic fits that system. Stays composable across providers.
There was a problem hiding this comment.
Okay, then I understand it after reading your comments. Maybe this should be expressed in interface contract then. The status this is any kind of string whatever the respective backend has, could be also a "ContainerCreating" or "Approval pending" or "Pizza was ordered, delivery will arrive in 20min"... very much based on specific backend.
Can you express this in the docs?
There was a problem hiding this comment.
I just added it: 355f485, thanks, good call.
|
Post-Approve comment: Would it be possible to extend the MixIn as well with partner functions in async that the MixIn can also be used to transfer the state to Triggerer? Would be a cool convenience. |
|
@jscheffl On the KPO front: i think even there, task state could make it more efficient. Rather than having to search by label, we can store the actual launched pod name/id in state. I know there's been some search performance issues at the K8s API in the past |
|
@jscheffl thanks for your review. I agree that deferrable should remain the recommended path for long-running tasks — that is not what this is trying to change. The goal here is much narrower: making the existing sync path "survivable" for teams already on it. The framing I would probably use here is: And tbh, it's not a question of sync vs async, but making sync better for teams who can't or won't do async right now. I will make this explicit in the mixin's docstring so it's clear to anyone reading the code. |
|
And regarding KPO, it is worth noting that |
791de5d to
d1fa89d
Compare
| This is the **synchronous path** — the worker holds a slot for the duration of polling. This is | ||
| intentional for teams that prefer sync operators for log observability, org constraints, or | ||
| because a Triggerer is not available. It is not a replacement for deferrable operators; the two | ||
| approaches are complementary. |
There was a problem hiding this comment.
| This is the **synchronous path** — the worker holds a slot for the duration of polling. This is | |
| intentional for teams that prefer sync operators for log observability, org constraints, or | |
| because a Triggerer is not available. It is not a replacement for deferrable operators; the two | |
| approaches are complementary. | |
| This is the **synchronous path** — the worker holds a slot for the duration of polling. This is | |
| a crash-safety net for teams running sync operators for log observability, org constraints, or | |
| because a Triggerer is not available. If a Triggerer is available, deferrable | |
| operators are the better choice for long-running tasks. |
The original "complementary" implies parity that doesn't exist - deferrable wins on resource cost whenever it's an option. Worth updating the docs to say so explicitly.
There was a problem hiding this comment.
@shahar1 good callout, this is how I handled it: e7701de, some small edits to what you mention:
This is the synchronous path — the worker holds a slot for the duration of polling. This is
a crash-safety net for teams running sync operators for log observability, org constraints, or
because a Triggerer is not available. Teams with a Triggerer available may also consider
deferrable operators, which free the worker slot but may come with added complexity.
WDYT?
XD-DENG
left a comment
There was a problem hiding this comment.
Another question I would have is: earlier when I was talking to @kaxil , seems to me the direction to "build a triggerer for this purpose and make this operator deferrable" is preferred?
Do we need to build two modes? Would like to understand @kaxil 's thoughts on this.
No strong opinion on this though.
|
Alright I am back on this one now. |
@XD-DENG The mixin should not be thought of a second mode, rather it makes the existing sync path crash-safe. If a team is/can already using deferrable operators, nothing here changes for them. @kaxil can add more if I missed anything. |
| self._hook = self._get_hook() | ||
| self._hook.submit(self.application) | ||
| hook = self._hook | ||
| if hook._should_track_driver_status and self.reconnect_on_retry: |
There was a problem hiding this comment.
reconnect_on_retry=False currently skips execute_resumable() entirely, but polling was removed from hook.submit() for cluster mode. Result: standalone cluster + reconnect_on_retry=False submits and returns immediately — task succeeds while the driver is still running.
I think here we need poll_until_complete call ?
There was a problem hiding this comment.
Good catch — you're right. The opt-out path fell through to bare hook.submit() which no longer polls in cluster mode (polling was extracted into poll_until_complete). Fixed: the reconnect_on_retry=False + cluster-mode path now calls submit_job + poll_until_complete directly, giving it the same submit-and-wait semantics as before without the crash recovery.
Background
Long-running Spark jobs (and similar remote execution patterns) have historically had two options in Airflow:
Deferrable operators — the task defers, hands control to the Triggerer, and the worker slot is freed. In theory this handles the "worker dies" case because the Triggerer owns continuity. In practice, deferrable adoption has been low: the async programming model is a different mental model from standard data engineering, requires a separate Triggerer component to operate and scale, and the
defer()boundary means a killed task duringdeferral(e.g. user clearing while deferred) still loses the job handle. Zero custom deferrable operators have been written by teams that do write custom regular operators, for these reasons.Custom operators with manual application ID tracking — some teams build their own operators that capture the YARN application ID or Spark driver ID immediately after submission, persist it to some storage, and on retry look it up before deciding whether to resubmit. It works but requires significant per-team investment and the storage layer is bespoke — no standard interface, no UI visibility, no composability with Airflow's retry policies.
This PR takes a third path: synchronous execution with checkpointing. The operator runs in a worker slot (same mental model as today, no Triggerer needed), persists the job handle to the AIP-103
task_statebefore polling begins, and on retry reads it back. From the data engineers perspective the operator looks and behaves identically to before — it just survives worker disruptions without losing its place.What problem are we solving?
This is the side benefit of AIP 103 on a test subject.
This PR handles
clusterdeploy mode where the driver runs on the cluster independently of the submitter. In client mode the driver runs inside thespark-submitprocess, so killing the worker also kills the driver. The PR targets cluster mode only.When an Airflow worker running a
SparkSubmitOperatortask dies mid-execution, the spark driver keeps running on the spark cluster independently. Airflow has no recollection of the previous submission, ie: on retry it creates a fresh operator instance and submits a new Spark job irrespective of what happened with the earlier submitted spark job. The original driver's work is wasted, and both jobs may run concurrently consuming duplicate compute.This was reproduced empirically: triggered a
SparkPijob in spark cluster mode, killed the Airflow worker while the driver was running, observed and waited for the driver complete on the spark cluster, then watched the retry submit a second driver with a different ID.Current behaviour
SparkSubmitOperator.execute()callsself._hook.submit(), a single blocking call that submits the job, extracts the driver ID, and polls — all in one function. The driver ID is stored onself._hook._driver_id, in memory on the worker side. When the worker dies, that memory is gone. Retry submits a new job with no knowledge of the previous one.Proposed change
ResumableJobMixin: a new mixin for operators that submit one long-running job to an external system and poll for completion. It provides a single method,execute_resumable(context), that operators call from their ownexecute():submit_job(), persists the returned external ID totask_statebefore polling starts, then callspoll_until_complete()task_state, checks its status, skips resubmission and reconnects directly to the running jobThe mixin is generic — it knows nothing about spark and that is intention. Six abstract methods (
submit_job,get_job_status,is_job_active,is_job_succeeded,poll_until_complete,get_job_result) are implemented by each operator for its external system.SparkSubmitOperator is wired to the mixin for Spark Standalone cluster mode, and this is how its going to operate:
execute()routes toexecute_resumable()whenhook._should_track_driver_statusis True (standalone cluster mode only — the existing flag that YARN and K8s set to False). Adding YARN/K8s resumable support in a follow-up only requires updating that flag;execute()routing is stable.get_job_status()queries the Spark REST API, replacing the existingspark-submit --statussubprocess approach which is a blocking sync call that ran a binaryexternal_id_key = "spark_job_id"is intentionally generic — works for standalone (driver-xxx), YARN (application-xxx), and K8s (pod name)hook.submit()now returns the driver ID instead of None, with theinternal _start_driver_status_tracking()call removed — polling is now owned bypoll_until_complete()in the operator.For flow, see this:
Before:
After:
Opt-out
Teams who prefer the previous behaviour (always resubmit on retry) can set
reconnect_on_retry=Falseon the operator. The default is True — reconnect when a prior job is found. Reasoning: almost no one wants a Spark job to resubmit when the worker dies; Spark jobs are expensive and a duplicate submission wastes compute. The opt-out exists for teams who have specific reasons to always resubmit.User implications / backcompat
task_statebefore polling. On retry, Airflow reconnects instead of resubmitting. No DAG changes required.try/except— always submits fresh, identical to todays behaviour.hook.submit()return type changed fromNonetostr | None— not a breaking change; no existing caller used the return value.reconnect_on_retry=Falseallows opting out of reconnecting to a running spark job and submitting fresh always.Notes worth knowing
submit_job()returning andtask_state.set()completing. If the worker dies in this window, the retry submits a duplicate.Testing
Setup
Created a spark standalone setup
services: spark-master: image: spark:python3 container_name: spark-master user: root command: - bash - -c - | mkdir -p /usr/lib/jvm && ln -sf /opt/java/openjdk /usr/lib/jvm/java-17-openjdk-arm64 exec /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master -h spark-master ports: - "7077:7077" - "8080:8080" - "6066:6066" volumes: - ./:/opt/spark/apps networks: - spark-net spark-worker: image: spark:python3 container_name: spark-worker user: root command: - bash - -c - | mkdir -p /usr/lib/jvm && ln -sf /opt/java/openjdk /usr/lib/jvm/java-17-openjdk-arm64 exec /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077 environment: - SPARK_WORKER_MEMORY=4G - SPARK_WORKER_CORES=2 volumes: - ./:/opt/spark/apps depends_on: - spark-master networks: - spark-net networks: spark-net: driver: bridgeStart with breeze, configure breeze to run spark jobs
Start Breeze, then run the one-time setup to install java:
Reproducing the need for checkpointing
Spark Standalone Mode
Use this DAG:
Before applying fix (from main)
airflow dags trigger spark_cluster_mode_reproidentified spark driver id: driver-XXXXXlogkill -9 $(pgrep -f "airflow task run")spark-submit ...with a different driver ID, And spark UI at http://localhost:8080 shows 2 drivers running.This basically means, when Airflow worker came back up, it wasn't aware that spark driver ran through and it submitted another one, doing duplicate work.
First Run:
Killed worker mid run:
Wait until this stage:
Airflow worker should be brought back up:
Resubmit happens from airflow:
Spark ends up doing more work / duplicate work:
After applying the fix
Same steps
First Run:
Kill worker mid run
Wait until:
Airflow worker should be brought back up:
No rework submitted on spark:
What is next
poll_until_completepolling loop.ResumableJobMixinis ready for Flink, EMR, Databricks, KPOtask_statecan storespark_progress(percentage complete) by querying the Spark master stage API.Was generative AI tooling used to co-author this PR?
{pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.