Add litellm integration #296


Closed · wants to merge 5 commits

Conversation

@rishi763 (Contributor) commented Jun 1, 2025

No description provided.

@gemini-code-assist (bot) left a comment

Hello @rishi763, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

As gemini-code-assist, I've reviewed this pull request which introduces integration with the LiteLLM library. The core change is the addition of a custom LiteLLM callback handler (JudgevalLitellmCallbackHandler) that hooks into LiteLLM's API call lifecycle. This handler creates and updates judgeval trace spans for each LiteLLM interaction, capturing inputs, outputs, duration, and errors.
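
For orientation, here is a minimal sketch of that callback pattern; it is not the PR's actual handler. The CustomLogger base class and hook names come from LiteLLM's public custom-logger interface, while the method bodies are placeholders.

```python
from litellm.integrations.custom_logger import CustomLogger

class MinimalTracingHandler(CustomLogger):
    def log_pre_api_call(self, model, messages, kwargs):
        # Fires before the provider request: open an 'llm' span here.
        print(f"open span for model={model}")

    def log_success_event(self, kwargs, response_obj, start_time, end_time):
        # Fires on success: record output and duration, then close the span.
        print(f"close span, duration={end_time - start_time}")

    def log_failure_event(self, kwargs, response_obj, start_time, end_time):
        # Fires on failure: record the error on the span.
        print("close span with error")
```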

A significant part of this integration involved implementing a save coordination mechanism within the judgeval tracer. Because LiteLLM operations can be asynchronous and might complete after the main trace context manager exits, the tracer now includes logic to defer trace saving until all pending LiteLLM operations associated with that trace have completed. This ensures that LLM call details are fully captured before the trace is finalized and saved.
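
To make the ordering problem concrete, here is a hypothetical scenario; the `tracer.trace(...)` context-manager API shown is an assumption for illustration, not code from this PR.

```python
import asyncio
import litellm

async def main(tracer):  # `tracer` is a hypothetical judgeval Tracer instance
    with tracer.trace("workflow"):  # assumed context-manager API
        task = asyncio.create_task(litellm.acompletion(
            model="gpt-4o",
            messages=[{"role": "user", "content": "hi"}],
        ))
    # The trace context has exited, but LiteLLM's async success callback may
    # not have fired yet; without deferral, save() could run before the llm
    # span is finalized.
    await task
```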

The pull request also includes a comprehensive test suite (test_litellm.py) to validate the callback handler's functionality, span creation, error handling, save coordination, and end-to-end tracing with mock LiteLLM calls.

Highlights

  • LiteLLM Integration: Adds a new JudgevalLitellmCallbackHandler that integrates with LiteLLM's custom logger mechanism to automatically trace LLM calls made via LiteLLM.
  • LLM Span Creation: The callback handler creates 'llm' type spans for each LiteLLM API call, recording the model, messages (inputs), response content (output), duration, and any errors.
  • Save Coordination: Introduces a mechanism in the Tracer and TraceClient to defer trace saving when LiteLLM operations are in progress, ensuring that asynchronous LLM call results are included in the final trace data.
  • Comprehensive Testing: A new test file (test_litellm.py) is added with multiple tests covering the callback handler's lifecycle, span updates, error handling, save deferral, and end-to-end integration.

Changelog

  • src/e2etests/test_litellm.py
    • Adds new end-to-end tests for the LiteLLM integration.
    • Includes tests for callback handler creation and registration.
    • Tests span creation, updates, and error handling for LiteLLM calls.
    • Validates the save coordination mechanism to ensure traces are saved after LiteLLM operations complete.
    • Tests tracing multiple LiteLLM calls within a single trace.
  • src/judgeval/common/tracer.py
    • Adds _safe_to_save, _deferred_save_pending, _deferred_save_args, and _save_lock attributes to Tracer for save coordination.
    • Introduces litellm_operation_started and litellm_operation_completed methods in Tracer to manage the save state.
    • Modifies TraceClient.save to check the _safe_to_save flag and defer saving if necessary.
    • Extracts the actual save logic into a new private method _perform_actual_save.
    • Adds an optional span_id parameter to TraceClient.async_evaluate (line 404) to allow external handlers to link evaluations to specific spans.
    • Includes minor formatting and comment adjustments.
  • src/judgeval/integrations/litellm_integration.py
    • Adds a new file containing the JudgevalLitellmCallbackHandler class.
    • Implements the litellm.integrations.custom_logger.CustomLogger interface.
    • Hooks into log_pre_api_call, log_success_event, log_failure_event, and their async counterparts.
    • Creates and updates TraceSpan objects within the current judgeval trace for each LiteLLM call.
    • Signals the Tracer when a LiteLLM operation starts and completes to manage save coordination.
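
For reviewers trying the branch, a hypothetical end-to-end setup: the import path matches the new file listed above, but the handler's constructor arguments (none shown here) are an assumption.

```python
import litellm
from judgeval.integrations.litellm_integration import JudgevalLitellmCallbackHandler

# Register the handler through LiteLLM's standard callback list.
litellm.callbacks = [JudgevalLitellmCallbackHandler()]

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
```
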
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., commenting on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


LLM speaks, a stream flows,
Tracer watches, where thought goes.
Span captures the light,
Through day and through night,
Till the final token shows.

Footnotes

[1] Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

Overall, this pull request introduces the LiteLLM integration and adds corresponding tests. The tests cover various aspects of the integration, which is good. The changes in tracer.py implement the necessary context propagation and save coordination logic. The core idea of using a LiteLLM callback handler to integrate with the tracer is a sound approach.

I've reviewed the code focusing on correctness, efficiency, maintainability, and security, keeping in mind the request to comment on medium, high, and critical severity issues. I've also considered common Python best practices, defaulting to principles aligned with PEP 8 where applicable, as no specific style guide was provided.

While the functionality seems to be implemented, there are a few areas related to maintainability and code structure that could be improved before merging. Please see the specific comments below.

Summary of Findings

  • Broad Exception Handling: The LiteLLM callback handler uses a bare except: block, which is too broad and can hide unexpected errors (a sketch of narrower handling follows this list).
  • Misplaced Function Definition: The log function in tracer.py is defined outside a class but uses self, indicating it's intended as a method but is incorrectly placed.
  • Coupling of Tracer to Integration: The core Tracer class includes LiteLLM-specific methods and state for save coordination, creating tight coupling.
  • Design of Save Coordination: The logic for deferring saves based on LiteLLM operations is embedded in TraceClient.save, raising questions about the best place for this coordination logic.
  • Modification of Frame Locals: The deep tracing mechanism modifies frame local variables to manage context, which can be less predictable than other methods.
  • Complexity of API Client Wrapping: The logic for wrapping various API clients and extracting data, particularly from streams, is complex and potentially fragile to changes in client libraries.
  • Unusual Test File Structure: The test file test_litellm.py includes an if __name__ == "__main__": block, which is not standard for test files. (Not commented on due to severity settings).
  • Debug Print Statements: The LiteLLM callback handler includes numerous print statements for debugging that should be replaced with proper logging. (Not commented on due to severity settings).
  • Fragile Type Check: The _finish_span method uses hasattr to check if start_time is a datetime object, which is less explicit than isinstance. (Not commented on due to severity settings).
  • Commented Out Code/Warnings: There is commented out code and warnings in tracer.py that should either be removed or uncommented/addressed. (Not commented on due to severity settings).
  • Minor Formatting: There are minor formatting changes (blank lines, comment alignment) that don't affect functionality. (Not commented on due to severity settings).
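
On the first finding, here is a sketch of the narrower handling intended. The handler body is hypothetical; `finish_span` is a stand-in for the handler's real span-finishing logic.

```python
import logging

logger = logging.getLogger(__name__)

def finish_span(response_obj, start_time, end_time):
    ...  # stand-in for the handler's real span-finishing logic

def on_success(response_obj, start_time, end_time):
    try:
        finish_span(response_obj, start_time, end_time)
    except Exception:
        # Never let tracing errors break the user's LLM call, but do log them
        # instead of swallowing them with a bare `except:`.
        logger.exception("judgeval LiteLLM handler failed to finish span")
```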

Merge Readiness

This pull request introduces significant new functionality and includes good test coverage. However, there are several high and medium severity issues identified related to error handling, code structure, and maintainability, particularly concerning the coupling between the core tracer and the LiteLLM integration, and the complexity of the API client wrapping. These issues should be addressed before merging to ensure the stability and maintainability of the codebase. I am unable to approve this pull request; please have other reviewers review and approve this code after the suggested changes are made.

Comment on lines +596 to +602

```python
# Check if it's safe to save (no pending LiteLLM operations)
with self.tracer._save_lock:
    if not self.tracer._safe_to_save:
        # Defer the save
        self.tracer._deferred_save_pending = True
        self.tracer._deferred_save_args = (self, overwrite)
        return self.trace_id, {}  # Return placeholder
```
@gemini-code-assist (bot) commented (severity: medium):

The TraceClient.save method now includes logic specifically for coordinating with LiteLLM operations (_safe_to_save, _deferred_save_pending, _deferred_save_args, _save_lock). While necessary for the integration, placing this specific coordination logic directly within the core TraceClient couples it tightly to the LiteLLM integration. Could this save coordination mechanism be abstracted into a more generic pattern that other integrations could also use, or perhaps managed more directly by the Tracer instance based on signals from integrations?

Comment on lines 835 to 836

```python
token = current_span_var.set(span_id)
frame.f_locals["_judgment_span_token"] = token
```
@gemini-code-assist (bot) commented (severity: medium):

The deep tracing mechanism stores the context variable token (_judgment_span_token) directly in the frame's local variables. While this is a known technique for context propagation in some scenarios, modifying frame locals can sometimes be less predictable or have unintended interactions compared to other methods. Consider if there's an alternative approach to managing the context variable token within the deep tracing logic that avoids direct frame local modification, potentially leveraging contextvars features more directly or using a different mechanism for passing state.
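
One possible shape of that alternative, sketched under the assumption that span entry and exit events are strictly paired:

```python
import contextvars

current_span_var: contextvars.ContextVar[str] = contextvars.ContextVar("current_span")
_token_stack: list[contextvars.Token] = []

def enter_span(span_id: str) -> None:
    # Keep the reset token in an explicit stack instead of frame.f_locals.
    _token_stack.append(current_span_var.set(span_id))

def exit_span() -> None:
    current_span_var.reset(_token_stack.pop())
```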

Collaborator commented:
Unrelated to your PR - we can check this out later

Comment on lines +1070 to +1090

```python
def litellm_operation_started(self):
    """Called by LiteLLM handler when an operation starts"""
    with self._save_lock:
        self._safe_to_save = False

def litellm_operation_completed(self):
    """Called by LiteLLM handler when an operation completes"""
    deferred_save_info = None

    with self._save_lock:
        self._safe_to_save = True
        if self._deferred_save_pending:
            # Get the deferred save info
            deferred_save_info = self._deferred_save_args
            self._deferred_save_pending = False
            self._deferred_save_args = None

    # Execute deferred save outside lock if needed
    if deferred_save_info:
        trace_client, overwrite = deferred_save_info
        trace_client._perform_actual_save(overwrite)
```
@gemini-code-assist (bot) commented (severity: medium):

The Tracer class now includes methods (litellm_operation_started, litellm_operation_completed) and state (_safe_to_save, _deferred_save_pending, _deferred_save_args, _save_lock) specifically for coordinating saves with the LiteLLM integration. This introduces coupling between the core Tracer and a specific integration. Ideally, the core tracing mechanism should be integration-agnostic. Could the LiteLLM integration interact with the Tracer using a more generic interface (e.g., tracer.defer_save(), tracer.allow_save()) rather than the Tracer having explicit LiteLLM methods?
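
A sketch of what that generic interface could look like, using the reviewer's suggested names; defer_save/allow_save are proposals here, not existing judgeval API.

```python
import threading

class SaveCoordination:
    """Integration-agnostic save gate; any async integration may use it."""

    def __init__(self):
        self._save_lock = threading.Lock()
        self._safe_to_save = True
        self._deferred_save_args = None  # (trace_client, overwrite) or None

    def defer_save(self):
        with self._save_lock:
            self._safe_to_save = False

    def allow_save(self):
        deferred = None
        with self._save_lock:
            self._safe_to_save = True
            deferred, self._deferred_save_args = self._deferred_save_args, None
        if deferred:  # flush a save that arrived while deferred
            trace_client, overwrite = deferred
            trace_client._perform_actual_save(overwrite)
```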

@rishi763 (Author) replied:
@alanzhang25 I was thinking about doing this but I wasn't sure about the tradeoff of generality vs making it easy to know why we have it.

Collaborator replied:
I think switch it to generality, but leave a comment somewhere for internal documentation that this is used by litellm.

@rishi763 (Author) commented Jun 1, 2025

Can the person reviewing fix the e2e test? I can't get it to work on my setup, but I'm certain the integration works; it's just the test that is messed up.
EDIT: Resolved

@JCamyre added the Blocked label Jun 2, 2025
@JCamyre self-requested a review June 4, 2025 19:01
@JCamyre added the In Review label and removed the Blocked label Jun 4, 2025
@alanzhang25 (Collaborator) left a comment

Nice work! Some questions, and also wondering if I can get more info on the locking logic

```python
self.tracer.litellm_operation_started()

span_id = str(uuid.uuid4())
self._current_span_id = span_id
```
Collaborator commented:
Isn't there a method for this?

@rishi763 (Author) replied:
I'm not getting the current span id of the tracer? Not sure what you mean here.

Collaborator replied:
I meant for setting the span_id

```python
    created_at=time.time(),
    span_type="llm"
)
span.inputs = {"model": model, "messages": messages}
```
Collaborator commented:
Do we need to include model as part of the input? Maybe take a look at how we save usage in tracer.py.

@rishi763 (Author) replied:
I think we should; I'm not sure how the tracer would be able to get the model from LiteLLM since model used away.

Collaborator replied:
Should be related to how we track usage and token cost? What do you mean by " LiteLLM since model used away"

@JCamyre (Collaborator) left a comment

Some requests and questions!

```python
mock_response.choices[0].message.content = content
mock_response.model = "gpt-4o"
mock_response.usage = Mock()
mock_response.usage.prompt_tokens = 10
```
Collaborator commented:
Where do these token numbers come from?

```python
updated_span = trace_client.span_id_to_span[span_id]
assert updated_span.duration == 1.0
assert "API Error" in str(updated_span.output)
```
Collaborator commented:
Can you add a check for the error field on the TraceSpan?
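
Presumably something like the following, assuming TraceSpan exposes the error attribute named in this comment:

```python
assert updated_span.error is not None
assert "API Error" in str(updated_span.error)
```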

@rishi763 (Author) replied:
Will do

```python
assert isinstance(result, str)
assert len(result) > 0
print(f"LiteLLM Response: {result}")
```
Collaborator commented:
Add checks to see if we are tracking correctly.
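
For example (attribute names taken from snippets elsewhere in this PR, so treat them as assumptions):

```python
llm_spans = [s for s in trace_client.span_id_to_span.values() if s.span_type == "llm"]
assert len(llm_spans) == 1
assert llm_spans[0].inputs["model"] == "gpt-4o"
assert llm_spans[0].output is not None
```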

```python
def mock_save(overwrite=False):
    save_calls.append(("save", overwrite))
    # Check if we should defer
    with tracer._save_lock:
```
Collaborator commented:
Where are these private methods from?

"""Post-processing - no action needed"""
pass

def _finish_span(self, response_obj, start_time, end_time, error=None):
Collaborator commented:
How are we tracking token usage and cost?

@rishi763 (Author) replied:
I didn't realise we tracked this; will do!
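
One way _finish_span could record this, sketched under two assumptions: LiteLLM responses carry an OpenAI-compatible usage object, and litellm.completion_cost() is available for cost estimation.

```python
import litellm

def extract_usage(response_obj):
    usage = getattr(response_obj, "usage", None)
    if usage is None:
        return None
    return {
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
        # completion_cost can raise for unrecognized models; callers should guard.
        "cost_usd": litellm.completion_cost(completion_response=response_obj),
    }
```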


```python
self.class_identifiers: Dict[str, str] = {}

# LiteLLM save coordination
self._safe_to_save = True
```
Collaborator commented:
If we are running calls in parallel, how will this affect the locking?
Will the locking prevent saving the spans at the same time?

@rishi763 (Author) replied:
I'm not sure. From my discussions with Minh, he said there should be a single Tracer object. I think we can handle parallel calls by tracking how many calls are pending versus how many have resolved.
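
A minimal sketch of that pending/resolved counting (hypothetical, not code from this PR). Unlike the current boolean flag, which flips back to safe as soon as any one call completes, a counter only allows saving once every started operation has resolved:

```python
import threading

class OperationCounter:
    def __init__(self):
        self._lock = threading.Lock()
        self._pending = 0

    def started(self):
        with self._lock:
            self._pending += 1

    def completed(self):
        with self._lock:
            self._pending -= 1

    def safe_to_save(self) -> bool:
        with self._lock:
            return self._pending == 0
```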

@rishi763 added the Blocked label and removed the In Review label Jun 5, 2025