Commit 7991dd5

docs: update locomo scoring
1 parent 66b4863 commit 7991dd5

4 files changed: +36920 −21 lines changed

docs/experiments/locomo-benchmark/README.md

Lines changed: 26 additions & 19 deletions
@@ -17,44 +17,51 @@ This project contains the code of running benchmark results on [Locomo dataset](
 
 - We mainly report the LLM Judge Score (higher is better).
 
-| Method | Single-Hop(%) | Multi-Hop(%) | Open Domain(%) | Temporal(%) | Overall(%) |
-| ---------- | ------------- | ------------ | -------------- | ----------- | ---------- |
-| Mem0 | **67.13** | 51.15 | 72.93 | 55.51 | 66.88 |
-| Mem0-Graph | 65.71 | 47.19 | 75.71 | 58.13 | 68.44 |
-| LangMem | 62.23 | 47.92 | 71.12 | 23.43 | 58.10 |
-| Zep | 61.70 | 41.35 | **76.60** | 49.31 | 65.99 |
-| OpenAI | 63.79 | 42.92 | 62.29 | 21.71 | 52.90 |
-| Memobase | 63.83 | **52.08** | 71.82 | **80.37** | **70.91** |
+| Method | Single-Hop(%) | Multi-Hop(%) | Open Domain(%) | Temporal(%) | Overall(%) |
+| ---------------------- | ------------- | ------------ | -------------- | ----------- | ---------- |
+| Mem0 | **67.13** | 51.15 | 72.93 | 55.51 | 66.88 |
+| Mem0-Graph | 65.71 | 47.19 | 75.71 | 58.13 | 68.44 |
+| LangMem | 62.23 | 47.92 | 71.12 | 23.43 | 58.10 |
+| Zep | 61.70 | 41.35 | **76.60** | 49.31 | 65.99 |
+| OpenAI | 63.79 | 42.92 | 62.29 | 21.71 | 52.90 |
+| Memobase(*v0.0.32*) | 63.83 | **52.08** | 71.82 | **80.37** | **70.91** |
+| Memobase(*v0.0.37-a1*) | **67.60** | 39.76 | **76.65** | 78.87 | **73.27** |
 
 > **What is LLM Judge Score?**
 >
 > The Locomo benchmark provides long conversations and prepares questions about them. The LLM Judge Score uses an LLM (*e.g.* OpenAI `gpt-4o`) to judge whether the answer generated by a memory method matches the ground truth; the score is 1 if it does, else 0.
 
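For illustration only, here is a minimal sketch of what such a judge call could look like in Python, assuming the OpenAI SDK. The exact prompt, model settings, and parsing used by this benchmark live in the repo's eval scripts, so the function name, prompt wording, and CORRECT/WRONG parsing below are assumptions:

```python
# Illustrative LLM-judge check; prompt wording and parsing are assumptions,
# not the exact implementation used by this benchmark.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def llm_judge_score(question: str, ground_truth: str, predicted: str, model: str = "gpt-4o") -> int:
    """Return 1 if the judge model says the predicted answer matches the ground truth, else 0."""
    prompt = (
        "You are grading a question-answering system.\n"
        f"Question: {question}\n"
        f"Ground-truth answer: {ground_truth}\n"
        f"Predicted answer: {predicted}\n"
        "Reply with exactly one word: CORRECT if the predicted answer conveys the same fact "
        "as the ground truth, otherwise WRONG."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = resp.choices[0].message.content.strip().upper()
    return 1 if verdict.startswith("CORRECT") else 0
```

Averaging these 0/1 verdicts over the questions in each category gives the percentages reported in the table above.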
 We attach the Memobase artifacts under `fixture/memobase/`:
 
-- `fixture/memobase/results_0503_3000.json`: predicted answers from Memobase Memory
-- `fixture/memobase/memobase_eval_0503_3000.json`: LLM Judge results of predicted answers
+- v0.0.32
+  - `fixture/memobase/results_0503_3000.json`: predicted answers from Memobase Memory
+  - `fixture/memobase/memobase_eval_0503_3000.json`: LLM Judge results of predicted answers
 
-To generate the scorings, run:
+- v0.0.37-a1
+  - `fixture/memobase/results_0709_3000.json`: predicted answers from Memobase Memory
+  - `fixture/memobase/memobase_eval_0709_3000.json`: LLM Judge results of predicted answers
+
+To generate the latest scorings, run:
 
 ```bash
-python generate_scores.py --input_path="fixture/memobase/memobase_eval_0503_3000.json"
+python generate_scores.py --input_path="fixture/memobase/memobase_eval_0709_3000.json"
 ```
 
 Output:
 
 ```
+Mean Scores Per Category:
 bleu_score f1_score llm_score count type
 category
-1 0.3045 0.4283 0.6383 282 single_hop
-2 0.4582 0.6438 0.8037 321 temporal
-3 0.2078 0.3085 0.5208 96 multi_hop
-4 0.3429 0.4698 0.7182 841 open_domain
+1 0.3048 0.4254 0.6760 250 single_hop
+2 0.4323 0.6052 0.7887 284 temporal
+3 0.1943 0.2616 0.3976 83 multi_hop
+4 0.4121 0.5207 0.7665 771 open_domain
 
 Overall Mean Scores:
-bleu_score 0.3515
-f1_score 0.4884
-llm_score 0.7091
+bleu_score 0.3839
+f1_score 0.5053
+llm_score 0.7327
 dtype: float64
 ```
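The blocks above are plain means of the per-question scores. As a rough sketch of that aggregation, assuming the eval JSON can be loaded as a flat list of per-question records with `category`, `bleu_score`, `f1_score`, and `llm_score` fields (the field names are assumptions; `generate_scores.py` is the authoritative implementation):

```python
# Rough sketch of per-category and overall mean scoring, assuming a flat list of
# per-question records; the real eval JSON layout may differ.
import json

import pandas as pd

with open("fixture/memobase/memobase_eval_0709_3000.json") as f:
    records = json.load(f)

df = pd.DataFrame(records)

# Mean of each metric within each question category, plus the question count.
per_category = df.groupby("category")[["bleu_score", "f1_score", "llm_score"]].mean()
per_category["count"] = df.groupby("category").size()
print("Mean Scores Per Category:")
print(per_category)

# Overall means across all questions.
print("\nOverall Mean Scores:")
print(df[["bleu_score", "f1_score", "llm_score"]].mean())
```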