I assume OP watched this excellent Veritasium video about Markov and his chains [0] and posted this article which was referenced in that video.<p>[0] <a href="https://youtu.be/KZeIEiBrT_w" rel="nofollow">https://youtu.be/KZeIEiBrT_w</a>
You nailed it. I wanted to post the actual article but found that it was behind a paywall, and I don't know if sharing pirated content is allowed here, but it's definitely worth a read!
Top comment on the video gave me a chuckle.<p>> I wonder if Markov chains could predict how many times Veritasium changes the thumbnail and title of this video.
Honestly a pretty bad video by Veritasium standards.<p>Consider the nuclear reaction metaphor. It's clearly not memoryless. Eventually you'll run out of fissile material.<p>The diagram for that example is bad as well. Do arrows correspond to state transitions? Or do they correspond to forks in the process where one neutron results in two?
3.4 million views in 4 days. Wow…
I assumed he read the Illuminatus! Trilogy and wondered where Markoff Chaney's name originated... There might be something wrong with me, now that I think about it. /s
Markov Chains are a gateway drug to more advanced probabilistic graphical models, which are worth exploring. I still remember working through Koller & Friedman cover to cover as one of the best learning experiences I’ve ever had.
PGMs are a great topic to explore (and Koller+Friedman is a great book), but, as a word of caution to anyone interested: implementation of any of these more advanced models remains a <i>major</i> challenge. For anyone building production-facing models, even if your problem is a pretty good match for the more interesting PGMs, the engineering requirements alone are a good reason not to go too far down that path.<p>The PGM book is also structured very clearly for researchers in PGMs. The book is laid out in 3 major sections: the models, inference techniques (the bulk of the book), and learning. Which means, if you follow the logic of the book, you basically have to work through 1000+ pages of content before you can actually start running even toy versions of these models. If you do need to get into the nitty-gritty of particular inference algorithms, I don't believe there is another textbook with nearly that level of scope and detail.<p>Bishop's section on PGMs from <i>Pattern Recognition and Machine Learning</i> is probably a better place to start learning about these more advanced models, and if you become very interested then Koller+Friedman will be an invaluable text.<p>It's worth noting that the PGM course taught by Koller was one of the initial, and still very excellent, Coursera courses. I'm not sure if it's still free, but it was a nice way to get a deep dive into the topic in a reasonably short time frame (I do remember those homeworks as brutal though!)[0].<p>0. <a href="https://www.coursera.org/specializations/probabilistic-graphical-models" rel="nofollow">https://www.coursera.org/specializations/probabilistic-graph...</a>
Played with Bayesian nets a bit in grad school—Pearl’s causality stuff is still mind-blowing—but I’ve almost never bumped into a PGM in production.
A couple things kept biting us:
Inference pain. Exact inference is NP-hard, and the usual hacks (loopy BP, variational methods, MCMC) need a ton of hand-tuning before they run fast enough.<p>The data never fits the graph. Real-world tables are messy and full of hidden junk, so you either spend weeks arguing over structure or give up the nice causal story.<p>DL stole the mind-share. A transformer is a one-liner with a mature tooling stack; hard to argue with that when deadlines loom.<p>That said, they’re not completely dead - reportedly Microsoft’s TrueSkill (Xbox ranking), a bunch of Google ops/diagnosis pipelines, and some healthcare diagnosis tools by IBM Watson built on Infer.NET.<p>Anyone here actually shipped a PGM that beat a neural baseline? Would really love to hear your war stories.
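For concreteness, here's a minimal sketch of what "exact" means here (pure Python, with a hypothetical three-node network A → B → C and made-up CPTs). Brute-force enumeration sums over every joint state, which is exactly the part that blows up exponentially as you add nodes:<p><pre><code>from itertools import product

# Made-up conditional probability tables for A -> B -> C.
p_a = {True: 0.3, False: 0.7}              # P(A)
p_b = {True: {True: 0.8, False: 0.1},      # P(B=b | A=a), keyed [b][a]
       False: {True: 0.2, False: 0.9}}
p_c = {True: {True: 0.9, False: 0.4},      # P(C=c | B=b), keyed [c][b]
       False: {True: 0.1, False: 0.6}}

def joint(a, b, c):
    return p_a[a] * p_b[b][a] * p_c[c][b]

# P(A=True | C=True): sum the full joint -- 2^n terms for n binary nodes.
num = sum(joint(True, b, True) for b in (True, False))
den = sum(joint(a, b, True) for a, b in product((True, False), repeat=2))
print(num / den)  # ~0.43 with these made-up numbers
</code></pre>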
Me neither. I have heard stories of it happening, but never personally seen one live. It's really a tooling issue. I think the causal story is super important and will only become more so in the future, but it would be basically impossible to implement and maintain longer-term with today's software.<p>Kind of like flow-based programming. I don't think there's any fundamental reason why it can't work; it just hasn't yet.
> Pearl’s causality stuff is still mind-blowing<p>Could you link me to where I could learn more about this?
From recollection, I believe "The Book of Why", <a href="https://a.co/d/aYehsnO" rel="nofollow">https://a.co/d/aYehsnO</a>, is the general-audience introduction to Pearl's approach to causality.<p>"Causality: Models, Reasoning and Inference", <a href="https://a.co/d/6b3TKhQ" rel="nofollow">https://a.co/d/6b3TKhQ</a>, is the book for a technical and researcher audience.
Just bought it last week. This book just makes me happy.<p>It feels like Bishop's <i>Pattern Recognition</i> but with a clearer tone (and a different field, of course)
I feel like we need a video on Dynamic Markov Chains. It's a method to create a markov chain from data. It's used in all the highest compression winners in the Hutter Prize (a competition to compress data the most).
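For anyone who hasn't seen it, here's a minimal sketch of the general "learn a Markov chain from data" idea in Python (note: this is just a fixed-order model built by counting; actual DMC works on bits and adaptively grows/clones states in its transition graph, which this toy version doesn't do):<p><pre><code>from collections import Counter, defaultdict

def train(text, order=2):
    # Count which character follows each length-`order` context.
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        counts[text[i:i+order]][text[i+order]] += 1
    return counts

def predict(counts, context):
    # Distribution over the next character given the context.
    c = counts[context]
    total = sum(c.values())
    return {ch: n / total for ch, n in c.items()} if total else {}

model = train("the theme then these the")
print(predict(model, "th"))  # {'e': 1.0} for this tiny sample
</code></pre>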
You mean the algorithm used in hook[0]? These are not really top performers anymore. PPM has generally performed better and nowadays it's LLMs and context mixers that are at the top of text compression[1]<p>[0]: <a href="https://mattmahoney.net/dc/dce.html#Section_421" rel="nofollow">https://mattmahoney.net/dc/dce.html#Section_421</a>
[1]: <a href="https://mattmahoney.net/dc/text.html" rel="nofollow">https://mattmahoney.net/dc/text.html</a>
DMC is used in pretty much all of these, just not alone (that alg column is too small to capture this).<p>As in: go to the first open-source entry (#2 on that list, cmix), unzip the files, open paq8.cpp, and search for DMC. See "Model using DMC (Dynamic Markov Compression)" and the associated code. In these cases DMC is one model mixed in with others, and the best model for the current context is used.<p>Hook exclusively uses DMC, with outstanding results; the others use DMC as one of several prediction models.
Make your own video then :)
Heh, somebody watched that same Veritasium video about Markov Chains and Monte Carlo methods. Great video; lots of interesting historical stuff in there that I didn't know (like the feud between Markov and Nekrasov).
For a while, early in my career, I was working on Monte Carlo sims at my job in finance. Just re-building an existing archaic Excel monster in Python to be more flexible for new investment models and to allow for more levers. Since I was already working with them daily, I began applying Monte Carlo models to a lot more problems I was thinking about. It truly is a fun tool to play with, especially when you're in the thick of designing them.
If you're holding a hammer every problem looks like a nail ...
Yea, if it were true, then Markov chains on steroids (LLMs) would be super at problem solving, but they can't even reason. And they get worse if you add random facts about cats: <a href="https://news.ycombinator.com/item?id=44724238">https://news.ycombinator.com/item?id=44724238</a>
That doesn’t follow at all, but I guess don’t pass up any opportunity to bash LLMs.
LLMs are Markov Chains on steroids?<p>By the way, does anyone know which model or type of model was used in winning gold in IMO?
> LLMs are Markov Chains on steroids?<p>It's not an unreasonable view, at least for decoder-only LLMs (which is what most popular LLMs are). While it may seem they violate the Markov property since they clearly do make use of their history, in practice that entire history is summarized in an embedding passed into the decoder. I.e. just like a Markov chain, their entire history is compressed into a single point, which leaves them conditionally independent of their past given their present state.<p>It's worth noting that this claim is NOT <i>generally applicable</i> to LLMs, since both encoder/decoder and encoder-only LLMs <i>do violate</i> the Markov property and therefore cannot be properly considered Markov chains in a meaningful way.<p>But running inference on a decoder-only model is, at a high enough level of abstraction, conceptually the same as running a Markov chain (on steroids).
A Markov process is any process where if you have perfect information on the current state, you cannot gain more information about the next state by looking at any previous state.<p>Physics models of closed systems moving under classical mechanics are deterministic, continuous Markov processes. Random walks on a graph are non deterministic, discrete Markov processes.<p>You may further generalize that if a process has state X, and the prior N states contribute to predicting the next state, you can make a new process whose state is an N-vector of Xs, and the graph connecting those states reduces the evolution of the system to a random walk on a graph, and thus a Markov process.<p>Thus any system where the best possible model of its evolution requires you to examine at most finitely many consecutive states immediately preceding the current state is a Markov process.<p>For example, an LLM that will process a finite context window of tokens and then emit a weighted random token is most definitely a Markov process.
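To make the "N-vector of Xs" trick concrete, here's a toy sketch (made-up corpus): an order-2 word chain is just an order-1 chain whose states are word pairs, so each step still depends only on the current state:<p><pre><code>import random
from collections import defaultdict

words = "the cat sat on the mat and the cat ran".split()
order = 2

# State = tuple of the last `order` words; the walk over these
# pair-states is memoryless even though it "uses" history.
chain = defaultdict(list)
for i in range(len(words) - order):
    chain[tuple(words[i:i+order])].append(words[i+order])

state = ("the", "cat")
out = list(state)
for _ in range(5):
    options = chain.get(state)
    if not options:                # dead end in this tiny corpus
        break
    nxt = random.choice(options)   # depends only on the current state
    out.append(nxt)
    state = (state[1], nxt)        # shift the window: the new state
print(" ".join(out))
</code></pre>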
> LLMs are Markov Chains on steroids?<p>Might be a reference to this[1] blog post which was posted here[2] a year ago.<p>There has also been some academic work linking the two, like this[3] paper.<p>[1]: <a href="https://elijahpotter.dev/articles/markov_chains_are_the_original_language_models" rel="nofollow">https://elijahpotter.dev/articles/markov_chains_are_the_orig...</a><p>[2]: <a href="https://news.ycombinator.com/item?id=39213410">https://news.ycombinator.com/item?id=39213410</a><p>[3]: <a href="https://arxiv.org/abs/2410.02724" rel="nofollow">https://arxiv.org/abs/2410.02724</a>
LLMs aren't Markov chains unless they have a context window of 1.<p>>In probability theory and statistics, a Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event
The tricky thing is you get to define the state. So if the "state" is the current word _and_ the previous 10, it is still "memoryless". So an LLM's context window is the state. It doesn't matter that _we_ think of part of the state as history; the Markov chain doesn't care (they are all just different features).<p>Edit: I could be missing important nuance that other people are pointing out in this thread!
I mean, often, sure. Also problem solving is often a matter of making a spreadsheet full of +, -, *, / operations. Problem solving is often a matter of counting permutations and combinations. It's often a matter of setting up the right system of ordinary differential equations. It's often a matter of setting up the right linear algebra problem.<p>It's often a matter of asking the right person what technique works. It's often a matter of making a measurement before getting lost in the math. It's often a matter of finding the right paper in the literature.
Note to file: No one is complaining about the absence of [PDF] from the title. Possible that time of submission is a factor.<p>For example, <a href="https://news.ycombinator.com/item?id=44574033">https://news.ycombinator.com/item?id=44574033</a>
The Veritasium video brought up an interesting point about how LLMs, if trained too much on their own content, will fall victim to the Markov Chain and just repeat the same thing over and over.<p>Is this still possible with the latest models being trained on synthetic data? And if it is possible, what would that one phrase be?
That original model collapse paper has largely been misunderstood; in practice, this is only true if you're not curating the generated data at all. The original paper even specifies (emphasis mine):<p>> We find that <i>indiscriminate</i> use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear. [0]<p>In practice <i>nobody</i> is "indiscriminately" using model output to fine-tune models, since that doesn't even make sense. Even if you're harvesting web data generated by LLMs, that data has in fact been curated: its acceptance on whatever platform you found it on is a form of curation.<p>There was a very recent paper, <i>Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)</i> [1], whose content is pretty well summarized by the title. So long as the data is curated in some way, you are providing more information to the model and the results should improve somewhat.<p>0. <a href="https://www.nature.com/articles/s41586-024-07566-y" rel="nofollow">https://www.nature.com/articles/s41586-024-07566-y</a><p>1. <a href="https://www.arxiv.org/pdf/2507.12856" rel="nofollow">https://www.arxiv.org/pdf/2507.12856</a><p>edit: updated based on cooksnoot's comment
I am on the move so I cannot check the video (but I did skim the pdf). Is there any chance to see an example of this technique? Just a toy/trivial example would be great, TIA!
For the Monte Carlo Method stuff in particular[1], I get the sense that the most iconic "Hello, World" example is using MC to calculate the value of pi. I can't explain it in detail from memory, but it's approximately something like this:<p>Define a square of some known size (1x1 should be fine, I think)<p>Inscribe a circle inside the square<p>Generate random points inside the square<p>Look at how many fall inside the square but not the circle, versus the ones that do fall in the circle.<p>From that, using what you know about the area of the square and circle respectively, the ratio of "inside square but not in circle" and "inside circle" points can be used to set up an equation for the value of pi.<p>Somebody who's more familiar with this than me can probably fix the details I got wrong, but I think that's the general spirit of it; there's a rough sketch below.<p>For Markov Chains in general, the only thing that jumps to mind for me is generating text for old school IRC bots. :-)<p>[1]: which is probably not the point of this essay. Sorry for muddying the waters, I have both concepts kinda 'top of mind' in my head right now after watching the Veritasium video.
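In code, a rough sketch of that estimate (using the quarter-circle shortcut, which gives the same pi/4 area ratio as the inscribed circle):<p><pre><code>import random

def estimate_pi(n=1_000_000):
    # Throw darts at the unit square; a dart is a "hit" if it lands
    # inside the quarter circle of radius 1. hits/total ~ pi/4.
    hits = sum(1 for _ in range(n)
               if random.random() ** 2 + random.random() ** 2 <= 1.0)
    return 4 * hits / n

print(estimate_pi())  # ~3.14, converging slowly (error ~ 1/sqrt(n))
</code></pre>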
Sorry, I should have been more specific maybe: I do know about Monte Carlo, and yeah, the circle stuff is a more or less canonical example - but I wanted to know more about the Markov Chains, because, again, I only know these in terms of sequence generators, and I have some problems imagining how this could "solve problems" unless your problem is <i>"generate words that sorta sound like a specific language but are mostly gibberish"</i>.
> From that, using what you know about the area of the square and circle respectively, the ratio of "inside square but not in circle" and "inside circle" points can be used to set up an equation for the value of pi.<p>Back in like 9th grade, when Wikipedia did not yet exist (but MathWorld and IRC did), I did this with TI-Basic instead of paying attention in geometry class. It's cool, but converges hilariously slowly. The in-versus-out test is basically distance from origin > 1, but you end up double-sampling a lot using randomness.<p>I told a college roommate about it and he basically invented a calculus approach, summing pixels in columns or something, as an optimization. You could probably further optimize by finding upper and lower bounds of the "frontier" of the circle, or iteratively splitting rectangle slices ad infinitum, but that's probably just reinventing the state of the art. And all this skips the cool random sampling the Monte Carlo algorithm uses.
I’ve always loved this example. I implemented the Monte Carlo pi estimation on a LEGO Mindstorms NXT back in high school. Totally sparked my interest in programming, simulations, etc. Also the NXT’s drag-and-drop, flowchart programming interface was actually a great intro to programming logic. Made it really easy to get into real programming later on.
A Pseudorandom Number Sequence Test Program - <a href="https://www.fourmilab.ch/random/" rel="nofollow">https://www.fourmilab.ch/random/</a><p><pre><code> Monte Carlo Value for Pi
Each successive sequence of six bytes is used as 24 bit X and Y co-ordinates within a square. If the distance of the randomly-generated point is less than the radius of a circle inscribed within the square, the six-byte sequence is considered a “hit”. The percentage of hits can be used to calculate the value of Pi. For very large streams (this approximation converges very slowly), the value will approach the correct value of Pi if the sequence is close to random. A 500000 byte file created by radioactive decay yielded:
Monte Carlo value for Pi is 3.143580574 (error 0.06 percent).</code></pre>
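A rough Python sketch of that test, assuming a file of random bytes (details may differ from ent's actual C implementation; the quarter-circle check gives the same pi/4 ratio as an inscribed circle):<p><pre><code>import sys

SCALE = 2 ** 24                      # 24-bit coordinate range
data = open(sys.argv[1], "rb").read()
hits = points = 0
for i in range(0, len(data) - 5, 6):
    # Each 6-byte group becomes 24-bit X and Y in [0, 2^24).
    x = int.from_bytes(data[i:i+3], "big")
    y = int.from_bytes(data[i+3:i+6], "big")
    points += 1
    if x * x + y * y <= SCALE * SCALE:
        hits += 1
print("Monte Carlo value for Pi is", 4.0 * hits / points)
</code></pre>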
TIL, thanks! I asked Claude to generate a simulator [1] based on your comment. I think it came out well.<p>[1] <a href="https://claude.ai/public/artifacts/1b921a50-897e-4d9e-8cfa-0b1c32a35704" rel="nofollow">https://claude.ai/public/artifacts/1b921a50-897e-4d9e-8cfa-0...</a>
Unless the problem is "quantum mechanics", then that reduces to non-Markovian processes with unistochastic laws, and ironically, this makes for a causally local QM:<p><a href="https://arxiv.org/abs/2402.16935v1" rel="nofollow">https://arxiv.org/abs/2402.16935v1</a>
Here's a direct link to a PDF.<p><a href="http://math.uchicago.edu/~shmuel/Network-course-readings/Markov_chain_tricks.pdf" rel="nofollow">http://math.uchicago.edu/~shmuel/Network-course-readings/Mar...</a>
(2007)