> Basically, the LoCoMo benchmark offers some long conversations and prepares questions about them. The LLM Judge Score uses an LLM (*e.g.* OpenAI `gpt-4o`) to judge whether the answer generated by the memory method matches the ground truth; the score is 1 if it does, else 0.
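The judge step above can be sketched as follows. This is a minimal illustration, not the benchmark's actual implementation: the judge prompt wording and the `call_llm` wrapper are assumptions, and `call_llm` stands in for any function that sends a prompt to the judge model (e.g. `gpt-4o`) and returns its text reply.

```python
# Hypothetical judge prompt -- the benchmark's real prompt may differ.
JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Ground truth: {ground_truth}
Predicted answer: {prediction}
Reply with exactly "CORRECT" if the predicted answer matches the
ground truth, otherwise reply with exactly "WRONG"."""


def judge_answer(question: str, ground_truth: str, prediction: str,
                 call_llm) -> int:
    """Return 1 if the judge LLM deems the prediction correct, else 0.

    `call_llm` is any callable mapping a prompt string to the model's
    text reply (e.g. a thin wrapper around an OpenAI gpt-4o call).
    """
    reply = call_llm(JUDGE_PROMPT.format(
        question=question, ground_truth=ground_truth, prediction=prediction))
    return 1 if reply.strip().upper().startswith("CORRECT") else 0


# Stub judge for demonstration: grades by exact string comparison
# instead of calling a real model.
def stub_judge(prompt: str) -> str:
    gt = prompt.split("Ground truth: ")[1].split("\n")[0]
    pred = prompt.split("Predicted answer: ")[1].split("\n")[0]
    return "CORRECT" if gt == pred else "WRONG"


print(judge_answer("Where does Alice work?", "Google", "Google", stub_judge))  # 1
print(judge_answer("Where does Alice work?", "Google", "Meta", stub_judge))    # 0
```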
We attached the artifacts of Memobase under `fixture/memobase/`:

- v0.0.32
  - `fixture/memobase/results_0503_3000.json`: predicted answers from Memobase Memory
  - `fixture/memobase/memobase_eval_0503_3000.json`: LLM Judge results of predicted answers
- v0.0.37-a1
  - `fixture/memobase/results_0709_3000.json`: predicted answers from Memobase Memory
  - `fixture/memobase/memobase_eval_0709_3000.json`: LLM Judge results of predicted answers

To generate the scorings, run: