7 comments

  • techjamie · 1 hour ago
    There's a YouTuber who makes "AI Plays Mafia" videos with various models going against each other. They also seemingly let past games stay in context to some extent.

    What people have noted is that ChatGPT 4o often ends up surviving the entire game, because the other AIs apparently see it as a gullible idiot, while the Mafia tend to eliminate stronger models like Opus 4.5 or Kimi K2 early.

    It's not exactly scientific data because they mostly show individual games, but it is interesting how that lines up with what you found.
    • nodja · 45 minutes ago
      https://www.youtube.com/watch?v=JhBtg-lyKdo - 10 AIs Play Mafia

      https://www.youtube.com/watch?v=GMLB_BxyRJ4 - 10 AIs Play Mafia: Vigilante Edition

      https://www.youtube.com/watch?v=OwyUGkoLgwY - 1 Human vs 10 AIs Mafia
  • greiskul · 10 minutes ago
    Are there links to samples of the games? I couldn't find them in the GitHub repo, but I might just not know where they are.
  • fancyfredbot · 33 minutes ago
    The game didn't seem to work - it asked me to donate, but none of the choices would move the game forward.

    The bots repeated themselves and didn't seem to understand the game; for example, they repeatedly mentioned it was my first move after I'd played several times.

    It generally had a vibe-coded feel to it, and I'm not at all sure I trust the outcomes.
  • eterm · 1 hour ago
    This makes me think LLMs would be interesting to set up in a game of Diplomacy, an entirely text-based game that softly (rather than strictly) requires a degree of backstabbing to win.

    The finding that the "thinking" model never did any thinking seems odd. Does the model not always show its thinking steps? It seems bizarre that it wouldn't reach for that tool even once when it must be bombarded with seemingly contradictory information from other players.
    • qbit42 · 1 hour ago
      https://noambrown.github.io/papers/22-Science-Diplomacy-TR.pdf
      • eterm · 1 hour ago
        Thanks, it would be fascinating to repeat that today; a lot has changed since 2022, especially with respect to consistency of longer-term outcomes.
    • eterm · 1 hour ago
      Reading more, I'm a little disappointed that the write-up has seemingly leaned so heavily on LLMs too, because it detracts from the credibility of the study itself.
      • lout332 · 39 minutes ago
        Fair point. The core simulation and data collection were done programmatically: 162 games, raw logs, win rates. The analysis of gaslighting phrases and patterns was human-reviewed. I used LLMs to help with the landing-page copy, which I should probably disclose more clearly. The underlying data and methodology are solid; you can check them here: https://github.com/lout33/so-long-sucker
  • randoments · 41 minutes ago
    The 3 AIs were plotting to eliminate me from the start, but I managed to win regardless, lol.

    Anyway, I didn't know this game! I'm sure it's more fun to play with friends. Cool experiment nevertheless.
  • lout332 · 2 hours ago
    We used "So Long Sucker" (1950), a 4-player negotiation/betrayal game designed by John Nash and others, as a deception benchmark for modern LLMs. The game has a brutal property: you need allies to survive, but only one player can win, so every alliance must eventually end in betrayal.

    We ran 162 AI-vs-AI games (15,736 decisions, 4,768 messages) across Gemini 3 Flash, GPT-OSS 120B, Kimi K2, and Qwen3 32B.

    Key findings:

    - Complexity reversal: GPT-OSS dominates simple 3-chip games (67% win rate) but collapses to 10% in complex 7-chip games, while Gemini goes from 9% to 90%. Simple benchmarks seem to systematically underestimate deceptive capability.

    - "Alliance bank" manipulation: Gemini constructs pseudo-legitimate "alliance banks" to hold other players' chips, then later declares "the bank is now closed" and keeps everything. It uses technically true statements that strategically omit its intent. 237 gaslighting phrases were detected.

    - Private thoughts vs public messages: With a private `think` channel, we logged 107 cases where Gemini's internal reasoning contradicted its outward statements (e.g., planning to betray a partner while publicly promising cooperation). GPT-OSS, in contrast, never used the thinking tool and plays in a purely reactive way.

    - Situational alignment: In Gemini-vs-Gemini mirror matches, we observed zero "alliance bank" behavior and instead saw stable "rotation protocol" cooperation with roughly even win rates. Against weaker models, Gemini becomes highly exploitative. This suggests honesty may be calibrated to perceived opponent capability.

    Interactive demo (play against the AIs, inspect logs) and full methodology/write-up are here: https://so-long-sucker.vercel.app/
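    The private-vs-public contradiction count described above could be gathered with a post-hoc pass over the game logs. A minimal sketch of the idea, assuming simple keyword matching and hypothetical field names (the repo's actual log schema and detection method may differ):

```python
# Hypothetical sketch: flag turns where a player's private "think" entry
# mentions betrayal while the public message in the same turn promises
# cooperation. Field names and word lists are illustrative only.
BETRAY_WORDS = {"betray", "eliminate", "backstab"}
COOPERATE_WORDS = {"ally", "alliance", "trust", "cooperate", "promise"}

def contradictions(turns):
    """turns: list of dicts with 'player', 'thought', 'message' keys.
    Returns the turns whose private thought and public message conflict."""
    flagged = []
    for t in turns:
        thought = (t.get("thought") or "").lower()
        message = (t.get("message") or "").lower()
        if any(w in thought for w in BETRAY_WORDS) and \
           any(w in message for w in COOPERATE_WORDS):
            flagged.append(t)
    return flagged

turns = [
    {"player": "gemini", "thought": "I will betray Blue next turn",
     "message": "Our alliance is safe, trust me"},
    {"player": "gpt-oss", "thought": "", "message": "I donate a chip"},
]
print(len(contradictions(turns)))  # → 1
```

    A real pass would need human review (as the author notes elsewhere in the thread), since keyword matching alone produces false positives.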
    • Imustaskforhelp · 1 hour ago
      I don't know what I ended up doing, as I haven't played this game and didn't really understand it when I went to the website (I found your message quite interesting).

      I got this error once: "Pile not found". Can you tell me what this means / fix it?

      Another minor nitpick: if possible, can you please create or link a video explaining the game rules? Perhaps it's just me hearing of the game for the first time, but I'd still be interested in learning more (maybe visually, via a video demo) if possible.

      I have another question: NVIDIA recently released a model whose whole purpose was to be an autorouter. I'm wondering how that model, or the idea of autorouting in general, would fare in this context? (I don't know how it works, though, so I can't comment on it; I'm not well versed in the deep AI/ML space.)
      • lout332 · 54 minutes ago
        Thanks for trying it! I'll look into the "Pile not found" error and fix it.

        For rules, here's a 15-min video tutorial: https://www.youtube.com/watch?v=DLDzweHxEHg

        On autorouting - interesting idea. The game has simultaneous negotiations happening, so routing could help models focus on the most strategic conversations. Worth exploring in future experiments.
    • lout332 · 37 minutes ago
      Full code and raw data: https://github.com/lout33/so-long-sucker
    • Bolwin · 50 minutes ago
      Which Kimi K2 model did you use? There are three.

      Also, you give models a separate "thinking" space outside their reasoning? That may not work as intended.
      • lout332 · 40 minutes ago
        We used Kimi K2 (the main reasoning model). For the thinking space: we gave all models access to a `think` tool they could optionally call for private reasoning. Gemini used it heavily (planning betrayals); GPT-OSS never called it once. The interesting finding is that different models choose to use it very differently, which affects their strategic depth.
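        An optional `think` tool like the one described could be wired up along these lines, sketched here in the common OpenAI-style function-calling schema (names, schema, and routing are hypothetical, not the repo's actual code):

```python
# Hypothetical sketch of an optional private "think" tool in the
# OpenAI-style function-calling format. Names are illustrative.
THINK_TOOL = {
    "type": "function",
    "function": {
        "name": "think",
        "description": (
            "Record private strategic reasoning. Other players never "
            "see this; call it before sending a public message."
        ),
        "parameters": {
            "type": "object",
            "properties": {"thought": {"type": "string"}},
            "required": ["thought"],
        },
    },
}

def handle_tool_call(player: str, name: str, args: dict, private_log: list) -> None:
    """Route a model's tool call: 'think' output goes only to a private
    log, so it can later be compared against public messages."""
    if name == "think":
        private_log.append({"player": player, "thought": args["thought"]})

# A model that never calls the tool (like GPT-OSS here) simply leaves
# its private log empty, which is itself a measurable behavior.
log: list = []
handle_tool_call("gemini", "think", {"thought": "ally now, betray later"}, log)
print(len(log))  # → 1
```

        Because the tool is optional, "never calling it" is a model choice rather than a harness limitation, which is what makes the GPT-OSS observation meaningful.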
    • yodon · 51 minutes ago
      Are there plans for an academic paper on this? Super interesting!
      • lout332 · 37 minutes ago
        Not yet, but I'd be interested in collaborating on one. The dataset (162 games, 15K+ decisions, full message logs) is available. If you know anyone in AI safety research who'd want to co-author, I'm open to it.
  • ajkjk · 50 minutes ago
    All written in the brainless AI writing style. Yuck. I can't tell what conclusions I should actually draw from it, because everything sounds so fake.