a single move by the discriminator in LLM self-play is “create a really good RL environment”