Worker process aborts (SIGABRT / glibc heap corruption, also SIGSEGV) during concurrent session + BackgroundAudio teardown on participant disconnect

# Worker process aborts (SIGABRT / glibc heap corruption, also SIGSEGV) during concurrent session + BackgroundAudio teardown on participant disconnect

## Bug Description

On `livekit-agents 1.4.6` / `livekit 1.1.2` (Python, SIP telephony on LiveKit Cloud), when a SIP caller hangs up, the worker job process intermittently **crashes during teardown** instead of shutting down cleanly. The process is killed by a native signal — we see both **`-6` (SIGABRT)** and **`-11` (SIGSEGV)** from the same teardown path.

In the SIGABRT case, glibc reports heap corruption immediately before aborting:

```
malloc(): unsorted double linked list corrupted
```

The crash happens at the moment two teardown paths overlap: the framework auto-closes the `AgentSession` on participant disconnect (`close_on_disconnect`) **while** our own teardown closes a `BackgroundAudioPlayer` and the session. Timeline from one affected call:

```
08:13:18.731  End Call (disconnect_reason: user_hangup / CLIENT_INITIATED)
08:13:18.733  closing agent session due to participant disconnect (disable via RoomInputOptions.close_on_disconnect=False)
08:13:18.736  (app) Closing Background Audio
08:13:18.790  (app) Closing agent session  +  Background Audio - Closed
08:13:18.834  malloc(): unsorted double linked list corrupted
08:13:18.866  process exited with non-zero exit code -6
```
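
For readers unfamiliar with the negative exit codes: Python's `subprocess` and `multiprocessing` report a child killed by signal N as exit code `-N`, and we assume the worker's `process exited with non-zero exit code -6` line follows that convention. A minimal sketch of the mapping (the helper name `describe_exit_code` is ours, for illustration only):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Return a readable description of a child process exit code."""
    if code < 0:
        # A negative code means the process was terminated by signal number -code.
        return f"killed by {signal.Signals(-code).name}"
    return f"exited with status {code}"


if __name__ == "__main__":
    for code in (-6, -11, 0):
        print(code, "->", describe_exit_code(code))
```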

The crash occurs **after** the call has completed (the caller has already hung up), so the call audio itself is unaffected. The problems are: (1) the job's post-call shutdown logic is skipped, and (2) the worker dies abnormally. Because the same teardown path produces both a glibc heap-corruption abort and a segfault depending on timing, this looks like a **non-deterministic use-after-free / double-free in the native/FFI layer** surfacing during concurrent teardown.

## Expected Behavior

Teardown on participant disconnect should complete cleanly — no native heap corruption, no abnormal process termination — even when the framework's `close_on_disconnect` auto-teardown overlaps with application-initiated teardown of the session and a `BackgroundAudioPlayer`. The job process should run its shutdown hooks and exit 0.

## Reproduction Steps

This is production-only and non-deterministic (timing-dependent); we do not have a minimal repro yet. Conditions under which we observe it:

1. `AgentSession` started with a `BackgroundAudioPlayer` active and the RoomIO default `close_on_disconnect=True`.
2. Inbound SIP call; the agent converses normally for the duration of the call.
3. The remote (SIP) participant hangs up, so the framework begins auto-closing the session (`close_on_disconnect`) at the same time the app closes the `BackgroundAudioPlayer` and the session.
4. Intermittently the process aborts during this teardown:
   - SIGABRT (`-6`) with `malloc(): unsorted double linked list corrupted`
   - or SIGSEGV (`-11`) at the same teardown point

## Operating System

Linux (containerized on AWS EKS / Kubernetes, x86_64, Python 3.11)

## Models Used

STT: Deepgram · TTS: ElevenLabs · VAD: Silero · LLM: Anthropic

## Package Versions

```bash
livekit==1.1.2
livekit-agents==1.4.6
livekit-api==1.0.7
livekit-blingfire==1.1.0
livekit-plugins-anthropic==1.4.6
livekit-plugins-cartesia==1.4.6
livekit-plugins-deepgram==1.4.6
livekit-plugins-elevenlabs==1.4.6
livekit-plugins-google==1.4.6
livekit-plugins-noise-cancellation==0.2.5
livekit-plugins-openai==1.4.6
livekit-plugins-silero==1.4.6
livekit-plugins-turn-detector==1.4.6
livekit-protocol==1.1.1
```

## Session/Room/Call IDs

```
LiveKit Cloud Project: p_3tqm7ro6kbs

SIGABRT (-6) / "malloc(): unsorted double linked list corrupted":
  roomID: RM_WG8TCQJXrJTG
  jobID:  AJ_dJtwgP7UhRAj
  time:   2026-06-18 ~08:13:18 UTC

(We have also observed SIGSEGV (-11) from the same BackgroundAudio/session teardown path on
 other calls; happy to provide those room IDs on request.)
```

## Proposed Solution

Tentative: the underlying fault appears to be in the native/FFI layer, not a simple double-close at the Python level. Two directions that may help:

1. Make the `BackgroundAudioPlayer` / `AgentSession` close path idempotent and serialized with the framework's `close_on_disconnect` auto-teardown, so the two cannot run concurrently over the same native objects.
2. Guard the native FFI close against concurrent invocation / use-after-free during teardown.

As a workaround we are evaluating `RoomInputOptions.close_on_disconnect=False` so the app owns teardown exclusively, but that does not address the underlying native memory-safety issue.
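
To illustrate what "idempotent and serialized" could look like on the application side, here is a minimal sketch using only `asyncio`. `TeardownGuard` and `FakeResource` are hypothetical names and not part of livekit-agents; the fake resources stand in for the real session and audio player. The guard only covers close calls routed through it, so it would not by itself serialize against the framework's internal auto-close, which is why we would pair it with `close_on_disconnect=False`.

```python
import asyncio
from typing import Awaitable, Callable


class TeardownGuard:
    """Runs a fixed list of async close callbacks at most once, one at a time."""

    def __init__(self, *closers: Callable[[], Awaitable[None]]) -> None:
        self._closers = closers
        self._lock = asyncio.Lock()
        self._closed = False

    async def aclose(self) -> None:
        async with self._lock:
            if self._closed:
                return  # an earlier caller already ran the teardown
            self._closed = True
            for close in self._closers:
                # Strictly sequential: the next close starts only after this one ends.
                await close()


class FakeResource:
    """Hypothetical stand-in for a session or audio player; counts close calls."""

    def __init__(self) -> None:
        self.close_calls = 0

    async def aclose(self) -> None:
        await asyncio.sleep(0.01)  # simulate teardown work that yields to the loop
        self.close_calls += 1


async def demo() -> tuple[int, int]:
    audio, session = FakeResource(), FakeResource()
    guard = TeardownGuard(audio.aclose, session.aclose)
    # Two teardown paths racing, e.g. a disconnect handler and a shutdown hook.
    await asyncio.gather(guard.aclose(), guard.aclose())
    return audio.close_calls, session.close_calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the second caller waits on the lock and then returns without touching the resources again. Error handling is deliberately omitted: if one close callback raises, the remaining ones are skipped.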

## Additional Context

- The crash is **post-hangup**, so caller audio is unaffected — but the worker dies before completing post-call shutdown. In our observed case, LiveKit re-dispatched the job ~20s later and a second process completed the post-call work; that recovery is **not guaranteed** (if the room is already gone, post-call processing for the crashed call is lost).
- Both `-6` (SIGABRT, glibc `malloc(): unsorted double linked list corrupted`) and `-11` (SIGSEGV) originate from the same teardown sequence (`Closing Background Audio` + `Closing agent session`).
- Likely related to `BackgroundAudioPlayer` / `AgentSession` close ordering relative to `close_on_disconnect`.
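
To narrow down which Python-level call is active when the signal arrives, one option we are considering is enabling CPython's standard `faulthandler` module early in the job process. This is a diagnostic sketch only: it prints Python tracebacks (not native stack frames) to stderr, the helper name `enable_crash_tracebacks` is ours, and where it would need to be called inside the worker's job process is an assumption we have not verified. Setting the `PYTHONFAULTHANDLER=1` environment variable is an equivalent way to turn it on.

```python
import faulthandler
import sys


def enable_crash_tracebacks() -> bool:
    """Ask CPython to dump the Python tracebacks of all threads on fatal signals."""
    if not faulthandler.is_enabled():
        # Installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
        faulthandler.enable(file=sys.stderr, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    print("faulthandler enabled:", enable_crash_tracebacks())
```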

## Screenshots and Recordings

_Full agent logs for the room/job above available on request._


Worker process aborts (SIGABRT / glibc heap corruption, also SIGSEGV) during concurrent session + BackgroundAudio teardown on participant disconnect #6149
