Worker process aborts (SIGABRT / glibc heap corruption, also SIGSEGV) during concurrent session + BackgroundAudio teardown on participant disconnect
Bug Description
On livekit-agents 1.4.6 / livekit 1.1.2 (Python, SIP telephony on LiveKit Cloud), when a SIP caller hangs up, the worker job process intermittently crashes during teardown instead of shutting down cleanly. The process is killed by a native signal — we see both -6 (SIGABRT) and -11 (SIGSEGV) from the same teardown path.
In the SIGABRT case, glibc reports heap corruption immediately before aborting:
malloc(): unsorted double linked list corrupted
The crash occurs at the moment two teardown paths overlap: the framework auto-closing the AgentSession on participant disconnect (close_on_disconnect) runs concurrently with our own teardown (closing a BackgroundAudioPlayer and the session). Timeline from one affected call:
08:13:18.731 End Call (disconnect_reason: user_hangup / CLIENT_INITIATED)
08:13:18.733 closing agent session due to participant disconnect (disable via RoomInputOptions.close_on_disconnect=False)
08:13:18.736 (app) Closing Background Audio
08:13:18.790 (app) Closing agent session + Background Audio - Closed
08:13:18.834 malloc(): unsorted double linked list corrupted
08:13:18.866 process exited with non-zero exit code -6
The crash occurs after the call has completed (the caller has already hung up), so the call audio itself is unaffected. The problems are: (1) the job's post-call shutdown logic is skipped, and (2) the worker dies abnormally. Because the same teardown path produces both a glibc heap-corruption abort and a segfault depending on timing, this looks like a non-deterministic use-after-free / double-free in the native/FFI layer surfacing during concurrent teardown.
Expected Behavior
Teardown on participant disconnect should complete cleanly — no native heap corruption, no abnormal process termination — even when the framework's close_on_disconnect auto-teardown overlaps with application-initiated teardown of the session and a BackgroundAudioPlayer. The job process should run its shutdown hooks and exit 0.
Reproduction Steps
# Production-only and non-deterministic (timing-dependent); no minimal repro yet.
# Conditions under which we observe it:
1. AgentSession started with a BackgroundAudioPlayer active, RoomIO default close_on_disconnect=True.
2. Inbound SIP call; agent converses normally for the duration of the call.
3. Remote (SIP) participant hangs up -> framework begins auto-closing the session
(close_on_disconnect) at the same time the app closes the BackgroundAudioPlayer and the session.
4. Intermittently the process aborts during this teardown:
- SIGABRT (-6) with: malloc(): unsorted double linked list corrupted
- or SIGSEGV (-11) at the same teardown point
Operating System
Linux (containerized on AWS EKS / Kubernetes, x86_64, Python 3.11)
Models Used
STT: Deepgram · TTS: ElevenLabs · VAD: Silero · LLM: Anthropic
Package Versions
livekit==1.1.2
livekit-agents==1.4.6
livekit-api==1.0.7
livekit-blingfire==1.1.0
livekit-plugins-anthropic==1.4.6
livekit-plugins-cartesia==1.4.6
livekit-plugins-deepgram==1.4.6
livekit-plugins-elevenlabs==1.4.6
livekit-plugins-google==1.4.6
livekit-plugins-noise-cancellation==0.2.5
livekit-plugins-openai==1.4.6
livekit-plugins-silero==1.4.6
livekit-plugins-turn-detector==1.4.6
livekit-protocol==1.1.1
Session/Room/Call IDs
LiveKit Cloud Project: p_3tqm7ro6kbs
SIGABRT (-6) / "malloc(): unsorted double linked list corrupted":
roomID: RM_WG8TCQJXrJTG
jobID: AJ_dJtwgP7UhRAj
time: 2026-06-18 ~08:13:18 UTC
(We have also observed SIGSEGV (-11) from the same BackgroundAudio/session teardown path on
other calls; happy to provide those room IDs on request.)
Proposed Solution
# Tentative — the underlying fault appears to be in the native/FFI layer, not a simple
# double-close at the Python level. Two directions that may help:
#
# 1. Make the BackgroundAudioPlayer / AgentSession close path idempotent and serialized with
# the framework's close_on_disconnect auto-teardown, so the two cannot run concurrently
# over the same native objects.
# 2. Guard the native FFI close against concurrent invocation / use-after-free during teardown.
#
# As a workaround we are evaluating RoomInputOptions.close_on_disconnect=False so the app owns
# teardown exclusively, but that does not address the underlying native memory-safety issue.
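To illustrate direction 1, here is a minimal sketch of an idempotent, serialized close at the Python level. `TeardownGuard` and the callable passed to it are hypothetical names for illustration only; they are not part of livekit-agents, and this is not how the framework currently implements teardown. The sketch only serializes callers on a single asyncio event loop, so it would not by itself protect against a fault in native threads or the FFI layer.

```python
import asyncio
from typing import Awaitable, Callable


class TeardownGuard:
    """Hypothetical helper: run an async teardown at most once.

    Concurrent callers wait on the same lock; only the first one
    actually runs the teardown, later ones return immediately.
    """

    def __init__(self, teardown: Callable[[], Awaitable[None]]) -> None:
        self._teardown = teardown
        self._lock = asyncio.Lock()
        self._done = False

    async def close(self) -> bool:
        """Return True if this call ran the teardown, False if it was a no-op."""
        async with self._lock:
            if self._done:
                return False
            try:
                await self._teardown()
            finally:
                # Mark as done even if teardown raised, so the same
                # objects are never closed a second time.
                self._done = True
            return True
```

In this sketch, both the application's teardown and any disconnect-triggered teardown would go through the same `guard.close()`, so closing the BackgroundAudioPlayer and the session could not interleave or run twice. Marking the guard as done even when the teardown raises is a deliberate choice here: retrying a partially completed close seems riskier than skipping it.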
Additional Context
- The crash is post-hangup, so caller audio is unaffected — but the worker dies before completing post-call shutdown. In our observed case, LiveKit re-dispatched the job ~20s later and a second process completed the post-call work; that recovery is not guaranteed (if the room is already gone, post-call processing for the crashed call is lost).
- Both -6 (SIGABRT, glibc malloc(): unsorted double linked list corrupted) and -11 (SIGSEGV) originate from the same teardown sequence (Closing Background Audio + Closing agent session).
- Likely related to BackgroundAudioPlayer / AgentSession close ordering relative to close_on_disconnect.
Screenshots and Recordings
Full agent logs for the room/job above available on request.