Worker process aborts (SIGABRT / glibc heap corruption, also SIGSEGV) during concurrent session + BackgroundAudio teardown on participant disconnect #6149

@zaheerabbas-prodigal

Bug Description

On livekit-agents 1.4.6 / livekit 1.1.2 (Python, SIP telephony on LiveKit Cloud), when a SIP caller hangs up, the worker job process intermittently crashes during teardown instead of shutting down cleanly. The process is killed by a native signal: we see exit codes -6 (SIGABRT) and -11 (SIGSEGV) from the same teardown path.

In the SIGABRT case, glibc reports heap corruption immediately before aborting:

malloc(): unsorted double linked list corrupted

The crash occurs when two teardown paths overlap: the framework auto-closes the AgentSession on participant disconnect (close_on_disconnect) while our own teardown (closing a BackgroundAudioPlayer and the session) is running. Timeline from one affected call:

08:13:18.731  End Call (disconnect_reason: user_hangup / CLIENT_INITIATED)
08:13:18.733  closing agent session due to participant disconnect (disable via RoomInputOptions.close_on_disconnect=False)
08:13:18.736  (app) Closing Background Audio
08:13:18.790  (app) Closing agent session  +  Background Audio - Closed
08:13:18.834  malloc(): unsorted double linked list corrupted
08:13:18.866  process exited with non-zero exit code -6

The crash occurs after the call has completed (the caller has already hung up), so the call audio itself is unaffected. The problems are: (1) the job's post-call shutdown logic is skipped, and (2) the worker dies abnormally. Because the same teardown path produces both a glibc heap-corruption abort and a segfault depending on timing, this looks like a non-deterministic use-after-free / double-free in the native/FFI layer surfacing during concurrent teardown.
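
For readers mapping the exit codes above to signals: Python's process APIs report a child killed by signal N as return code -N, so -6 and -11 correspond to SIGABRT and SIGSEGV on Linux. A minimal sketch of that mapping follows; the helper name describe_exit_code is hypothetical and is not part of livekit-agents.

```python
import signal


def describe_exit_code(returncode: int) -> str:
    """Return the signal name for a negative exit code, else the plain status."""
    if returncode < 0:
        try:
            # A negative return code -N means the process was killed by signal N.
            return signal.Signals(-returncode).name
        except ValueError:
            return f"unknown signal {-returncode}"
    return f"exit status {returncode}"


if __name__ == "__main__":
    for code in (-6, -11, 0):
        print(code, "->", describe_exit_code(code))
```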

Expected Behavior

Teardown on participant disconnect should complete cleanly — no native heap corruption, no abnormal process termination — even when the framework's close_on_disconnect auto-teardown overlaps with application-initiated teardown of the session and a BackgroundAudioPlayer. The job process should run its shutdown hooks and exit 0.

Reproduction Steps

Production-only and non-deterministic (timing-dependent); no minimal repro yet.
Conditions under which we observe it:
1. AgentSession started with a BackgroundAudioPlayer active, RoomIO default close_on_disconnect=True.
2. Inbound SIP call; agent converses normally for the duration of the call.
3. Remote (SIP) participant hangs up -> framework begins auto-closing the session
   (close_on_disconnect) at the same time the app closes the BackgroundAudioPlayer and the session.
4. Intermittently the process aborts during this teardown:
     - SIGABRT (-6) with: malloc(): unsorted double linked list corrupted
     - or SIGSEGV (-11) at the same teardown point
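
Since there is no minimal repro, one way to narrow down the crashing call site may be Python's standard faulthandler module, which installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL and dumps the Python-level traceback of every thread when one fires. The sketch below is an assumption about how it could be wired into the job process, not something we have run in production; the function name and log path are hypothetical. It shows Python frames only, not native frames, and with a corrupted heap the dump may itself be incomplete.

```python
import faulthandler
from typing import TextIO


def enable_crash_tracebacks(log_path: str) -> TextIO:
    """Dump Python tracebacks of all threads to log_path on a fatal signal."""
    # The file must stay open for the life of the process: faulthandler
    # writes to its file descriptor from inside the signal handler.
    log_file = open(log_path, "a", buffering=1)
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    # Hypothetical path; keep the returned handle referenced until exit.
    crash_log = enable_crash_tracebacks("/tmp/agent-crash.log")
    print("faulthandler enabled:", faulthandler.is_enabled())
```

Setting the environment variable PYTHONFAULTHANDLER=1 enables the same handlers with output on stderr, which may be simpler in a containerized deployment.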

Operating System

Linux (containerized on AWS EKS / Kubernetes, x86_64, Python 3.11)

Models Used

STT: Deepgram · TTS: ElevenLabs · VAD: Silero · LLM: Anthropic

Package Versions

livekit==1.1.2
livekit-agents==1.4.6
livekit-api==1.0.7
livekit-blingfire==1.1.0
livekit-plugins-anthropic==1.4.6
livekit-plugins-cartesia==1.4.6
livekit-plugins-deepgram==1.4.6
livekit-plugins-elevenlabs==1.4.6
livekit-plugins-google==1.4.6
livekit-plugins-noise-cancellation==0.2.5
livekit-plugins-openai==1.4.6
livekit-plugins-silero==1.4.6
livekit-plugins-turn-detector==1.4.6
livekit-protocol==1.1.1

Session/Room/Call IDs

LiveKit Cloud Project: p_3tqm7ro6kbs

SIGABRT (-6) / "malloc(): unsorted double linked list corrupted":
  roomID: RM_WG8TCQJXrJTG
  jobID:  AJ_dJtwgP7UhRAj
  time:   2026-06-18 ~08:13:18 UTC

(We have also observed SIGSEGV (-11) from the same BackgroundAudio/session teardown path on
 other calls; happy to provide those room IDs on request.)

Proposed Solution

Tentative: the underlying fault appears to be in the native/FFI layer, not a simple double-close at the Python level. Two directions that may help:

1. Make the BackgroundAudioPlayer / AgentSession close path idempotent and serialized with the framework's close_on_disconnect auto-teardown, so the two cannot run concurrently over the same native objects.
2. Guard the native FFI close against concurrent invocation / use-after-free during teardown.

As a workaround we are evaluating RoomInputOptions.close_on_disconnect=False so the app owns teardown exclusively, but that does not address the underlying native memory-safety issue.

Additional Context

  • The crash is post-hangup, so caller audio is unaffected — but the worker dies before completing post-call shutdown. In our observed case, LiveKit re-dispatched the job ~20s later and a second process completed the post-call work; that recovery is not guaranteed (if the room is already gone, post-call processing for the crashed call is lost).
  • Both -6 (SIGABRT, glibc malloc(): unsorted double linked list corrupted) and -11 (SIGSEGV) originate from the same teardown sequence (Closing Background Audio + Closing agent session).
  • Likely related to BackgroundAudioPlayer / AgentSession close ordering relative to close_on_disconnect.

Screenshots and Recordings

Full agent logs for the room/job above available on request.
