Add litellm integration #296


Closed · wants to merge 5 commits

Conversation

@rishi763 (Contributor) commented Jun 1, 2025

No description provided.

@gemini-code-assist (bot) left a comment

Hello @rishi763, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

As gemini-code-assist, I've reviewed this pull request which introduces integration with the LiteLLM library. The core change is the addition of a custom LiteLLM callback handler (JudgevalLitellmCallbackHandler) that hooks into LiteLLM's API call lifecycle. This handler creates and updates judgeval trace spans for each LiteLLM interaction, capturing inputs, outputs, duration, and errors.
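
For orientation, here is a minimal sketch of that callback pattern; it is not the PR's actual handler. The CustomLogger base class and hook names come from LiteLLM's public custom-logger interface, while the method bodies are placeholders.

```python
from litellm.integrations.custom_logger import CustomLogger

class MinimalTracingHandler(CustomLogger):
    def log_pre_api_call(self, model, messages, kwargs):
        # Fires before the provider request: open an 'llm' span here.
        print(f"open span for model={model}")

    def log_success_event(self, kwargs, response_obj, start_time, end_time):
        # Fires on success: record output and duration, then close the span.
        print(f"close span, duration={end_time - start_time}")

    def log_failure_event(self, kwargs, response_obj, start_time, end_time):
        # Fires on failure: record the error on the span.
        print("close span with error")
```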

A significant part of this integration involved implementing a save coordination mechanism within the judgeval tracer. Because LiteLLM operations can be asynchronous and might complete after the main trace context manager exits, the tracer now includes logic to defer trace saving until all pending LiteLLM operations associated with that trace have completed. This ensures that LLM call details are fully captured before the trace is finalized and saved.
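
To make the ordering problem concrete, here is a hypothetical scenario; the `tracer.trace(...)` context-manager API shown is an assumption for illustration, not code from this PR.

```python
import asyncio
import litellm

async def main(tracer):  # `tracer` is a hypothetical judgeval Tracer instance
    with tracer.trace("workflow"):  # assumed context-manager API
        task = asyncio.create_task(litellm.acompletion(
            model="gpt-4o",
            messages=[{"role": "user", "content": "hi"}],
        ))
    # The trace context has exited, but LiteLLM's async success callback may
    # not have fired yet; without deferral, save() could run before the llm
    # span is finalized.
    await task
```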

The pull request also includes a comprehensive test suite (test_litellm.py) to validate the callback handler's functionality, span creation, error handling, save coordination, and end-to-end tracing with mock LiteLLM calls.

Highlights

  • LiteLLM Integration: Adds a new JudgevalLitellmCallbackHandler that integrates with LiteLLM's custom logger mechanism to automatically trace LLM calls made via LiteLLM.
  • LLM Span Creation: The callback handler creates 'llm' type spans for each LiteLLM API call, recording the model, messages (inputs), response content (output), duration, and any errors.
  • Save Coordination: Introduces a mechanism in the Tracer and TraceClient to defer trace saving when LiteLLM operations are in progress, ensuring that asynchronous LLM call results are included in the final trace data.
  • Comprehensive Testing: A new test file (test_litellm.py) is added with multiple tests covering the callback handler's lifecycle, span updates, error handling, save deferral, and end-to-end integration.

Changelog

  • src/e2etests/test_litellm.py
    • Adds new end-to-end tests for the LiteLLM integration.
    • Includes tests for callback handler creation and registration.
    • Tests span creation, updates, and error handling for LiteLLM calls.
    • Validates the save coordination mechanism to ensure traces are saved after LiteLLM operations complete.
    • Tests tracing multiple LiteLLM calls within a single trace.
  • src/judgeval/common/tracer.py
    • Adds _safe_to_save, _deferred_save_pending, _deferred_save_args, and _save_lock attributes to Tracer for save coordination.
    • Introduces litellm_operation_started and litellm_operation_completed methods in Tracer to manage the save state.
    • Modifies TraceClient.save to check the _safe_to_save flag and defer saving if necessary.
    • Extracts the actual save logic into a new private method _perform_actual_save.
    • Adds an optional span_id parameter to TraceClient.async_evaluate (line 404) to allow external handlers to link evaluations to specific spans.
    • Includes minor formatting and comment adjustments.
  • src/judgeval/integrations/litellm_integration.py
    • Adds a new file containing the JudgevalLitellmCallbackHandler class.
    • Implements the litellm.integrations.custom_logger.CustomLogger interface.
    • Hooks into log_pre_api_call, log_success_event, log_failure_event, and their async counterparts.
    • Creates and updates TraceSpan objects within the current judgeval trace for each LiteLLM call.
    • Signals the Tracer when a LiteLLM operation starts and completes to manage save coordination.
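
For reviewers trying the branch, a hypothetical end-to-end setup: the import path matches the new file listed above, but the handler's constructor arguments (none shown here) are an assumption.

```python
import litellm
from judgeval.integrations.litellm_integration import JudgevalLitellmCallbackHandler

# Register the handler through LiteLLM's standard callback list.
litellm.callbacks = [JudgevalLitellmCallbackHandler()]

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
```
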
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., commenting on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


LLM speaks, a stream flows,
Tracer watches, where thought goes.
Span captures the light,
Through day and through night,
Till the final token shows.

Footnotes

[1] Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

Overall, this pull request introduces the LiteLLM integration and adds corresponding tests. The tests cover various aspects of the integration, which is good. The changes in tracer.py implement the necessary context propagation and save coordination logic. The core idea of using a LiteLLM callback handler to integrate with the tracer is a sound approach.

I've reviewed the code focusing on correctness, efficiency, maintainability, and security, keeping in mind the request to comment on medium, high, and critical severity issues. I've also considered common Python best practices, defaulting to principles aligned with PEP 8 where applicable, as no specific style guide was provided.

While the functionality seems to be implemented, there are a few areas related to maintainability and code structure that could be improved before merging. Please see the specific comments below.

Summary of Findings

  • Broad Exception Handling: The LiteLLM callback handler uses a bare except: block, which is too broad and can hide unexpected errors (a sketch of narrower handling follows this list).
  • Misplaced Function Definition: The log function in tracer.py is defined outside a class but uses self, indicating it's intended as a method but is incorrectly placed.
  • Coupling of Tracer to Integration: The core Tracer class includes LiteLLM-specific methods and state for save coordination, creating tight coupling.
  • Design of Save Coordination: The logic for deferring saves based on LiteLLM operations is embedded in TraceClient.save, raising questions about the best place for this coordination logic.
  • Modification of Frame Locals: The deep tracing mechanism modifies frame local variables to manage context, which can be less predictable than other methods.
  • Complexity of API Client Wrapping: The logic for wrapping various API clients and extracting data, particularly from streams, is complex and potentially fragile to changes in client libraries.
  • Unusual Test File Structure: The test file test_litellm.py includes an if __name__ == "__main__": block, which is not standard for test files. (Not commented on due to severity settings).
  • Debug Print Statements: The LiteLLM callback handler includes numerous print statements for debugging that should be replaced with proper logging. (Not commented on due to severity settings).
  • Fragile Type Check: The _finish_span method uses hasattr to check if start_time is a datetime object, which is less explicit than isinstance. (Not commented on due to severity settings).
  • Commented Out Code/Warnings: There is commented out code and warnings in tracer.py that should either be removed or uncommented/addressed. (Not commented on due to severity settings).
  • Minor Formatting: There are minor formatting changes (blank lines, comment alignment) that don't affect functionality. (Not commented on due to severity settings).
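
On the first finding, here is a sketch of the narrower handling intended. The handler body is hypothetical; `finish_span` is a stand-in for the handler's real span-finishing logic.

```python
import logging

logger = logging.getLogger(__name__)

def finish_span(response_obj, start_time, end_time):
    ...  # stand-in for the handler's real span-finishing logic

def on_success(response_obj, start_time, end_time):
    try:
        finish_span(response_obj, start_time, end_time)
    except Exception:
        # Never let tracing errors break the user's LLM call, but do log them
        # instead of swallowing them with a bare `except:`.
        logger.exception("judgeval LiteLLM handler failed to finish span")
```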

Merge Readiness

This pull request introduces significant new functionality and includes good test coverage. However, there are several high and medium severity issues identified related to error handling, code structure, and maintainability, particularly concerning the coupling between the core tracer and the LiteLLM integration, and the complexity of the API client wrapping. These issues should be addressed before merging to ensure the stability and maintainability of the codebase. I am unable to approve this pull request; please have other reviewers review and approve this code after the suggested changes are made.

Comment on lines +596 to +602

```python
# Check if it's safe to save (no pending LiteLLM operations)
with self.tracer._save_lock:
    if not self.tracer._safe_to_save:
        # Defer the save
        self.tracer._deferred_save_pending = True
        self.tracer._deferred_save_args = (self, overwrite)
        return self.trace_id, {}  # Return placeholder
```
@gemini-code-assist (bot) commented (severity: medium):

The TraceClient.save method now includes logic specifically for coordinating with LiteLLM operations (_safe_to_save, _deferred_save_pending, _deferred_save_args, _save_lock). While necessary for the integration, placing this specific coordination logic directly within the core TraceClient couples it tightly to the LiteLLM integration. Could this save coordination mechanism be abstracted into a more generic pattern that other integrations could also use, or perhaps managed more directly by the Tracer instance based on signals from integrations?

Comment on lines 835 to 836

```python
token = current_span_var.set(span_id)
frame.f_locals["_judgment_span_token"] = token
```
@gemini-code-assist (bot) commented (severity: medium):

The deep tracing mechanism stores the context variable token (_judgment_span_token) directly in the frame's local variables. While this is a known technique for context propagation in some scenarios, modifying frame locals can sometimes be less predictable or have unintended interactions compared to other methods. Consider if there's an alternative approach to managing the context variable token within the deep tracing logic that avoids direct frame local modification, potentially leveraging contextvars features more directly or using a different mechanism for passing state.
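
One possible shape of that alternative, sketched under the assumption that span entry and exit events are strictly paired:

```python
import contextvars

current_span_var: contextvars.ContextVar[str] = contextvars.ContextVar("current_span")
_token_stack: list[contextvars.Token] = []

def enter_span(span_id: str) -> None:
    # Keep the reset token in an explicit stack instead of frame.f_locals.
    _token_stack.append(current_span_var.set(span_id))

def exit_span() -> None:
    current_span_var.reset(_token_stack.pop())
```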

Collaborator commented:
Unrelated to your PR - we can check this out later

Comment on lines +1070 to +1090

```python
def litellm_operation_started(self):
    """Called by LiteLLM handler when an operation starts"""
    with self._save_lock:
        self._safe_to_save = False

def litellm_operation_completed(self):
    """Called by LiteLLM handler when an operation completes"""
    deferred_save_info = None

    with self._save_lock:
        self._safe_to_save = True
        if self._deferred_save_pending:
            # Get the deferred save info
            deferred_save_info = self._deferred_save_args
            self._deferred_save_pending = False
            self._deferred_save_args = None

    # Execute deferred save outside lock if needed
    if deferred_save_info:
        trace_client, overwrite = deferred_save_info
        trace_client._perform_actual_save(overwrite)
```
@gemini-code-assist (bot) commented (severity: medium):

The Tracer class now includes methods (litellm_operation_started, litellm_operation_completed) and state (_safe_to_save, _deferred_save_pending, _deferred_save_args, _save_lock) specifically for coordinating saves with the LiteLLM integration. This introduces coupling between the core Tracer and a specific integration. Ideally, the core tracing mechanism should be integration-agnostic. Could the LiteLLM integration interact with the Tracer using a more generic interface (e.g., tracer.defer_save(), tracer.allow_save()) rather than the Tracer having explicit LiteLLM methods?
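
A sketch of what that generic interface could look like, using the reviewer's suggested names; defer_save/allow_save are proposals here, not existing judgeval API.

```python
import threading

class SaveCoordination:
    """Integration-agnostic save gate; any async integration may use it."""

    def __init__(self):
        self._save_lock = threading.Lock()
        self._safe_to_save = True
        self._deferred_save_args = None  # (trace_client, overwrite) or None

    def defer_save(self):
        with self._save_lock:
            self._safe_to_save = False

    def allow_save(self):
        deferred = None
        with self._save_lock:
            self._safe_to_save = True
            deferred, self._deferred_save_args = self._deferred_save_args, None
        if deferred:  # flush a save that arrived while deferred
            trace_client, overwrite = deferred
            trace_client._perform_actual_save(overwrite)
```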

@rishi763 (Author) replied:
@alanzhang25 I was thinking about doing this but I wasn't sure about the tradeoff of generality vs making it easy to know why we have it.

Collaborator replied:
I think switch it to generality, but leave a comment somewhere for internal documentation that this is used by litellm.

@rishi763 (Author) commented Jun 1, 2025

Can the person reviewing fix the e2e test? I can't get it to work on my setup, but I'm certain the integration works; it's just the test that is messed up.
EDIT: Resolved

@JCamyre added the Blocked label Jun 2, 2025
@JCamyre self-requested a review June 4, 2025 19:01
@JCamyre added the In Review label and removed the Blocked label Jun 4, 2025
@alanzhang25 (Collaborator) left a comment

Nice work! Some questions, and also wondering if I can get more info on the locking logic

```python
self.tracer.litellm_operation_started()

span_id = str(uuid.uuid4())
self._current_span_id = span_id
```
Collaborator commented:
Isn't there a method for this?

@rishi763 (Author) replied:
I'm not getting the current span id of the tracer? Not sure what you mean here.

Collaborator replied:
I meant for setting the span_id

```python
    created_at=time.time(),
    span_type="llm"
)
span.inputs = {"model": model, "messages": messages}
```
Collaborator commented:
Do we need to include model as part of the input? Maybe take a look at how we save usage in tracer.py.

@rishi763 (Author) replied:
I think we should; I'm not sure how the tracer would be able to get the model from LiteLLM since model used away.

Collaborator replied:
Should be related to how we track usage and token cost? What do you mean by " LiteLLM since model used away"

@JCamyre (Collaborator) left a comment

Some requests and questions!

```python
mock_response.choices[0].message.content = content
mock_response.model = "gpt-4o"
mock_response.usage = Mock()
mock_response.usage.prompt_tokens = 10
```
Collaborator commented:
Where do these token numbers come from?

```python
updated_span = trace_client.span_id_to_span[span_id]
assert updated_span.duration == 1.0
assert "API Error" in str(updated_span.output)
```
Collaborator commented:
Can you add a check for the error field on the TraceSpan?
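
Presumably something like the following, assuming TraceSpan exposes the error attribute named in this comment:

```python
assert updated_span.error is not None
assert "API Error" in str(updated_span.error)
```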

@rishi763 (Author) replied:
Will do

```python
assert isinstance(result, str)
assert len(result) > 0
print(f"LiteLLM Response: {result}")
```
Collaborator commented:
Add checks to see if we are tracking correctly.
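
For example (attribute names taken from snippets elsewhere in this PR, so treat them as assumptions):

```python
llm_spans = [s for s in trace_client.span_id_to_span.values() if s.span_type == "llm"]
assert len(llm_spans) == 1
assert llm_spans[0].inputs["model"] == "gpt-4o"
assert llm_spans[0].output is not None
```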

```python
def mock_save(overwrite=False):
    save_calls.append(("save", overwrite))
    # Check if we should defer
    with tracer._save_lock:
```
Collaborator commented:
Where are these private methods from?

"""Post-processing - no action needed"""
pass

def _finish_span(self, response_obj, start_time, end_time, error=None):
Collaborator commented:
How are we tracking token usage and cost?

@rishi763 (Author) replied:
I didn't realise we tracked this; will do!
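
One way _finish_span could record this, sketched under two assumptions: LiteLLM responses carry an OpenAI-compatible usage object, and litellm.completion_cost() is available for cost estimation.

```python
import litellm

def extract_usage(response_obj):
    usage = getattr(response_obj, "usage", None)
    if usage is None:
        return None
    return {
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
        # completion_cost can raise for unrecognized models; callers should guard.
        "cost_usd": litellm.completion_cost(completion_response=response_obj),
    }
```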


```python
self.class_identifiers: Dict[str, str] = {}

# LiteLLM save coordination
self._safe_to_save = True
```
Collaborator commented:
If we are running calls in parallel, how will this affect the locking?
Will the locking prevent saving the spans at the same time?

@rishi763 (Author) replied:
I'm not sure. From my discussions with Minh, he said there should be a single Tracer object. I think we can handle parallel calls by tracking how many calls are pending versus how many have resolved.
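
A minimal sketch of that pending/resolved counting (hypothetical, not code from this PR). Unlike the current boolean flag, which flips back to safe as soon as any one call completes, a counter only allows saving once every started operation has resolved:

```python
import threading

class OperationCounter:
    def __init__(self):
        self._lock = threading.Lock()
        self._pending = 0

    def started(self):
        with self._lock:
            self._pending += 1

    def completed(self):
        with self._lock:
            self._pending -= 1

    def safe_to_save(self) -> bool:
        with self._lock:
            return self._pending == 0
```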

@rishi763 added the Blocked label and removed the In Review label Jun 5, 2025