Discussing the state of LLM-as-a-Judge - is it good enough to use? (human edition)

This is about connection - both with a fellow human who is interested in and articulate about Artificial Intelligence, and with the information we input, process, and produce.

The fellow human is Akhil Theerthala, also a member of Cohere Labs and working on AI for Finance. We meet at the monthly lightning talks that the Cohere Labs community managers host (thanks Madeline and Brittawnya).

The information input was the paper A Survey on LLM-as-a-Judge, along with an agreement to read it, present our thoughts, and then discuss further.

Now, I was going to summarise everything in a blog post, and I recorded the meeting so that it could be used, by me, as a reference. However, I have uploaded the transcript, the core paper, and the papers we touched on into NotebookLM, and now I am thinking that the way we share information really has changed - significantly.

For those who wish to have a human summary, here it is (if you would rather play with the information in NotebookLM, skip this for now):

  • Akhil and I approached the paper from the point of view of “how can I apply LLM-as-a-Judge?”, Akhil for his work on LLMs in Finance and me for my work on Agentic tasks (there is a minimal sketch of the basic pattern just after this list).
  • We both found some areas stronger than others, and noted that the weaker areas are exactly the ones that matter when actually using this technique.
  • The paper does a good job of formalising the approach; however, it feels like a toy formalisation, with room for more precision and robustness.
  • There are contradictions, both with previous studies (Akhil highlighted “Let Me Speak Freely”) and within itself (I had noted the discussion of Structured Output causing problems in Section 2, while Akhil had seen it say that Structured Output helped reasoning in Section 3).
  • The evaluations are interesting and the bias taxonomy opens the door to learning more, though it is difficult to map these biases onto the biases already researched and documented elsewhere (possibly because the subject is vast).
  • The best open-source model (only a 7B-parameter model) had surprisingly good evals. Questions about the positives and negatives of that finding arise!
  • The conclusion is generic and doesn’t really speak to the detail covered in the paper. It’s more a broad call to action, lacking connection to the needs or prioritisation of the processes the paper describes.
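
Since our framing was “how would we apply this?”, here is a minimal sketch of the basic LLM-as-a-Judge pattern we were talking about. It is only an illustration: `call_llm` is a hypothetical stand-in for whichever chat-completion client you use, and the prompt and 1-5 scale are my own assumptions, not the paper’s.

```python
# A minimal sketch of the LLM-as-a-Judge pattern. `call_llm(prompt) -> str`
# is a hypothetical helper wrapping whatever chat-completion API you use.
from typing import Callable

JUDGE_PROMPT = """You are an impartial judge. Rate the response below for
helpfulness and factual accuracy on a scale of 1 to 5. Reply with the number only.

Question: {question}
Response: {response}
Score:"""

def judge_response(question: str, response: str,
                   call_llm: Callable[[str], str]) -> int:
    """Ask the judge model for a 1-5 score and parse the first number it returns."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    for token in raw.split():
        token = token.strip(".,")
        if token.isdigit():
            return int(token)
    raise ValueError(f"Could not parse a score from judge output: {raw!r}")
```

Even in this toy version you can see why the Structured Output question matters: the parsing step is exactly where free-form judge output gets brittle.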

To be clear, I found the taxonomy and formalisation excellent for a novel field; the paper really does set out a clear way of thinking about the challenges and potential of using LLMs as Judges.

For both of us, the paper doesn’t answer all the questions we had. It asks some pertinent questions as well, mainly about biases and how to manage them. The key area that is missing is Evaluation by Domain. There is a clear need for this sort of information to be made available to engineers, companies, etc… that are thinking about using this technique to evaluate LLM outputs. We also asked how it compares to RAG evaluations (like RAGAS) - that’s an open question for us.

I’m left wondering if inspiration can be taken from the CoT Self-Consistency paper (Akhil pointed me at the Refine n Judge work by Meta). The survey clearly talks about Majority Vote, but the details aren’t clear. This is an area where the formalisation could also be extended: presently it looks at a high-level input/output - is there scope for a reasoning-based Judge? A sketch of what a majority-vote judge might look like follows.
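
Here is a rough sketch of what borrowing self-consistency might look like for judging: sample the judge several times (at non-zero temperature) and take the majority verdict. Again, `judge_once` is hypothetical - plug in your own judge call, e.g. one that returns “A” or “B” for a pairwise comparison, or “pass”/“fail”.

```python
# A sketch of self-consistency-style judging: collect several independent
# verdicts from the judge and keep the most common one (majority vote).
from collections import Counter
from typing import Callable, List

def majority_vote_judge(question: str, response: str,
                        judge_once: Callable[[str, str], str],
                        n_samples: int = 5) -> str:
    """Collect n independent verdicts and return the most common one."""
    verdicts: List[str] = [judge_once(question, response) for _ in range(n_samples)]
    winner, _count = Counter(verdicts).most_common(1)[0]
    return winner
```

A reasoning-based Judge could slot into the same shape: have `judge_once` ask for a chain of thought first and vote only on the final verdict it extracts.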

Hope that’s given you food for thought. Reach out if you would like to discuss further - we are planning another chat in a couple of weeks, and it’s Akhil’s turn to pick the paper (hey Akhil - wanna do the RAGAS paper, yeah?? ;-) )

Whatever the paper, I know it’ll be a good conversation, so yeah, join us on the Cohere Labs Discord server or other social media.

Peace, Matt

Here are the links if you skipped to the end:

NotebookLM: LLM as Judge: Evaluation, Improvement, and Refinement
YouTube:

Tags: Responsible AI, Learning, Agentic AI, Research