15 comments

  • Twirrim 1 day ago
    CXL is going to be really interesting.
    On the positive side, you can scale out memory quite a lot, fill up PCI slots, even have memory external to your chassis. Memory tiering has a lot of potential.
    On the negative side, you've got latency costs to swallow. You don't get distance from the CPU for free (there's a reason the memory on your motherboard is as close to the CPU as practical): https://www.nextplatform.com/2022/12/05/just-how-bad-is-cxl-memory-latency/. The CXL 2.0 spec adds roughly 200ns of latency to every access to CXL-attached memory, so you've got to think carefully about how you use it, or you'll cripple yourself.
    There's been work on the OS side around data locality, but CXL hardware hasn't been widely available, so there's an element of "Well, we'll have to see".
    Azure has some interesting whitepapers out as they've been investigating ways to use CXL with VMs: https://www.microsoft.com/en-us/research/wp-content/uploads/2023/06/2023-CXL-DesignTradeoffs-IEEE-Micro.pdf
    • tanelpoder 1 day ago
      Yup, for best results you wouldn't just dump your existing pointer-chasing and linked-list data structures onto CXL (like Optane's transparent mode did, whatever it was called).
      But CXL-backed memory can use your CPU caches as usual and the PCIe 5.0 lane throughput is still good, assuming that the CXL controller/DRAM side doesn't become a bottleneck. So you could design your engines and data structures to account for these tradeoffs: fetching/scanning columnar data structures, prefetching to hide latency, etc. You *probably* don't want global shared locks and frequent atomic operations on CXL-backed shared memory (once that becomes possible, in theory, with CXL 3.0).
      Edit: I'll plug my own article here - if you've wondered whether there were actual large-scale commercial products that used Intel's Optane as intended, the Oracle database took good advantage of it (both the Exadata and plain database engines). One use was low-latency durable (local) commits on Optane:
      https://tanelpoder.com/posts/testing-oracles-use-of-optane-persistent-memory/
      VMware supports it as well, but as a simpler layer for tiered memory.
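      For illustration, a minimal C sketch of the "prefetch to hide latency" idea for a columnar scan over a buffer that happens to sit in CXL-backed memory; the prefetch distance is a hypothetical tuning knob, not a recommendation:

      ```c
      /* Sketch: sequential scan of a column that lives in CXL-backed memory.
       * Sequential access plus software prefetch (GCC/Clang builtin) keeps
       * most of the extra ~200ns device latency off the critical path. */
      #include <stddef.h>
      #include <stdint.h>

      #define PREFETCH_LINES 16                 /* cache lines ahead; tune per platform */

      uint64_t sum_column(const uint64_t *col, size_t n)
      {
          uint64_t sum = 0;
          for (size_t i = 0; i < n; i++) {
              /* 8 uint64_t values per 64-byte cache line */
              if (i + PREFETCH_LINES * 8 < n)
                  __builtin_prefetch(&col[i + PREFETCH_LINES * 8], 0, 0);
              sum += col[i];
          }
          return sum;
      }
      ```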
      • packetlost 22 hours ago
        > You probably don't want to have global shared locks and frequent atomic operations on CXL-backed shared memory (once that becomes possible in theory with CXL 3.0).
        I'd bet contested locks spend more time in cache than most other lines of memory, so in practice a global lock might not be too bad.
        • tanelpoder 22 hours ago
          Yep, agreed, for single-host CXL scenarios. I wrote this comment thinking about a hypothetical future CXL 3.x+ scenario with multi-host fabric coherence, where one could in theory put the locks and control structures that protect shared access to CXL memory pools into the same shared CXL memory (so no need for coordination over the regular network, at least).
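          Purely as a toy sketch of what "the lock lives in the shared memory itself" could look like: the /dev/dax0.0 path and 2 MiB mapping below are assumptions about how such a pool might be exposed, and cross-host use would additionally require CXL 3.x-style hardware coherence.

          ```c
          /* Sketch: a spinlock placed directly in a shared memory region.
           * Assumes the pool shows up as a devdax device and that one node
           * initialized ctl->lock with atomic_flag_clear() at setup time. */
          #include <fcntl.h>
          #include <stdatomic.h>
          #include <sys/mman.h>
          #include <unistd.h>

          #define MAP_SZ (2UL << 20)                  /* 2 MiB, devdax-aligned */

          typedef struct {
              atomic_flag lock;
              /* ... shared control structures would follow ... */
          } shared_ctl;

          int main(void)
          {
              int fd = open("/dev/dax0.0", O_RDWR);   /* hypothetical device */
              if (fd < 0)
                  return 1;
              shared_ctl *ctl = mmap(NULL, MAP_SZ, PROT_READ | PROT_WRITE,
                                     MAP_SHARED, fd, 0);
              if (ctl == MAP_FAILED)
                  return 1;

              while (atomic_flag_test_and_set_explicit(&ctl->lock,
                                                       memory_order_acquire))
                  ;                                   /* spin */
              /* ... touch shared state here ... */
              atomic_flag_clear_explicit(&ctl->lock, memory_order_release);

              munmap(ctl, MAP_SZ);
              close(fd);
              return 0;
          }
          ```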
      • samus 16 hours ago
        DBMSs have been managing storage tiers with different access times for decades, so it should be pretty easy to adapt an existing engine. Or you could use it as a gigantic swap space; no clue whether additional kernel patches would be required for that.
    • GordonS 1 day ago
      Huh, 200ns is less than I imagined; even if that's a few times slower than regular RAM, it's still around 100x faster than NVMe storage.
    • temp0826 16 hours ago
      I have never had to go deep into NUMA configuration personally, but couldn't it be leveraged here?
      • wmf 16 hours ago
        Yes, if you want your app to be aware of CXL you can configure it as a separate NUMA node.
        • tanelpoder 16 hours ago
          Optane memory modules also present themselves as separate (memory-only) NUMA nodes. They’ve given me a chance to play with Linux tiered memory without having to emulate the hardware for a VM.
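          For anyone wanting to try the NUMA-node route, a small libnuma sketch; node 1 standing in for the CXL/Optane-backed node is an assumption (check numactl --hardware on the actual machine):

          ```c
          /* Sketch: explicitly allocating from a CPU-less "memory only" NUMA
           * node, which is how CXL expanders (and Optane modules) typically
           * appear. Build with: gcc alloc_far.c -lnuma */
          #include <numa.h>
          #include <stdio.h>
          #include <string.h>

          #define FAR_NODE 1                    /* hypothetical CXL/Optane node id */

          int main(void)
          {
              if (numa_available() < 0) {
                  fprintf(stderr, "no NUMA support on this system\n");
                  return 1;
              }
              size_t sz = 1UL << 30;            /* 1 GiB */
              void *buf = numa_alloc_onnode(sz, FAR_NODE);
              if (!buf) {
                  fprintf(stderr, "allocation on node %d failed\n", FAR_NODE);
                  return 1;
              }
              memset(buf, 0, sz);               /* touch pages so they are really placed */
              numa_free(buf, sz);
              return 0;
          }
          ```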
    • immibis 22 hours ago
      What kind of motherboard, CPU, cables, switches, and end devices would I need to buy to have a CXL network?
      • afr0ck 20 hours ago
        CXL uses the PCIe physical layer, so you just need to buy hardware that understands the protocol, namely the CPU and the expansion boards. AMD Genoa (e.g. EPYC 9004) supports CXL 1.1, as do Intel Sapphire Rapids and all subsequent models. For CXL memory expansion boards, you can get them from Samsung or Marvell. I got a 128 GB model from Samsung with 25 GB/s read throughput.
      • wmf 22 hours ago
        CXL networking is still in the R&D stage.
    • imtringued 13 hours ago
      The latency concern is completely overblown because CXL has cache coherence. The moment you do a second request to the same page, it will be a cache hit.
      I would be more worried about memory bandwidth. You can now add so much memory to your servers that it might take minutes to do a full in-memory table scan.
      • justincormack 12 hours ago
        Cache lines are 64 bytes, not page size.
  • mdaniel 1 day ago
    > Buy From One of the Regions Below > Egypt
    :-/
    But, because I'm a good sport, I actually chased a couple of those links figuring that I could convert Egyptian Pounds into USD, but <https://www.sigma-computer.com/en/search?q=CXL%20R5X4> is "No results", and similar for the other ones that I could get to even load.
    • tanelpoder 1 day ago
      Yeah, I saw the same. I've been keeping an eye on the CXL world for ~5 years and so far it's 99% announcements, unveilings and great predictions. The only CXL cards a consumer/small business can actually buy today are some experimental-ish 64GB/128GB ones. I haven't seen any of my larger clients use it either. Both the Intel Optane and DSSD storage efforts got discontinued after years of fanfare; from a technical point of view, I hope the same doesn't happen to CXL.
      • afr0ck 20 hours ago
        I think Meta has already rolled out some CXL hardware for memory tiering. Marvell, Samsung, Xconn and many others have built various memory chips and switching hardware up to CXL 3.0. All recent Intel and AMD CPUs support CXL.
    • sheepscreek 1 day ago
      That is pretty hilarious. I wonder what’s the reason behind this. Maybe they wanted plausible deniability in case someone tried to buy it (“oh the phone lines were down, you’ll have to go there to buy one”).
      • eqvinox 15 hours ago
        I think someone just forgot to delete an option somewhere and it "crept in"; it really isn't supposed to have a "buy" link at all at this point.
      • antonvs 15 hours ago
        Ok, I rented a camel and went to the specified location, but there was nothing there but some scorpions and an asp. What gives?
  • bri3d 1 day ago
    CXL is a standard for compute and I/O extension over PCIe signaling that has been around for a few years, with a couple of RAM boards available (from SMART and others).
    I think the main bridge chipsets come from Microchip (this one) and Montage.
    This Gigabyte product is interesting since it's a little lower-end than most CXL solutions - so far CXL memory expansion has mostly appeared in esoteric racked designs like the particularly wild https://www.servethehome.com/cxl-paradigm-shift-asus-rs520qa-e13-rs8u-2u-4-node-amd-epyc-server-review/
    • bobmcnamara 22 hours ago
      CXL seems so much cleaner than the old AMD way of plumbing an FPGA through the second CPU socket.
  • eqvinox 15 hours ago
    The "AI" marketing on this is positively silly (and a good reflection of how weird everything has gotten in this industry).
    I do like the card though; I was waiting for someone to make an affordable version (or rather: this *looks* affordable, I hope it will be both that and actually obtainable. CXL was kinda locked away so far…)
  • pella 21 hours ago
    I’m really looking forward to GPU-CXL integration.
    "CXL-GPU: Pushing GPU Memory Boundaries with the Integration of CXL Technologies" - https://arxiv.org/abs/2506.15601
  • trebligdivad 1 day ago
    My god - a CXL product! It's really surprising anything got that far. I'd been expecting external CXL boxes, not internal stuff.
  • nmstoker 11 hours ago
    Assuming you have the requisite CPU and motherboard for this card, does the memory just appear as normal under Linux/Windows/whatever OS is installed? Or do you need special drivers or other particular software to make use of it?
  • alberth 19 hours ago
    As someone not well versed in GPUs and CXL, would someone mind explaining the significance of this?
    • wmf 16 hours ago
      This looks like the first CXL card you could actually buy; it's been "coming soon" for years. It also confirms that both Intel and AMD workstation CPUs support CXL.
  • roscas 1 day ago
    That is amazing. Most consumer boards will only take 32 or 64 GB. To have 512 GB is great!
    • justincormack 1 day ago
      You haven't seen the price of 128GB DDR5 RDIMMs - they are maybe $1300 each.
      A lot of the initial use cases for CXL seem to be about using up lots of older DDR4 RDIMMs in newer systems to expand memory; cloud providers, for example, have a lot of them.
      • kvemkon 1 day ago
        Micron DDR5-5600 for 900 Euro (without VAT, business).
    • tanelpoder 1 day ago
      ... and if you have the money, you can use 3 out of 4 PCIe 5.0 slots for CXL expansion. So that could be 2TB DRAM + 1.5TB DRAM-over-CXL, all cache coherent thanks to CXL.mem.
      I guess there are some use cases for this for local users, but I think the biggest wins could come from *shared* CXL memory arrays in smaller clusters. You could, for example, cache the entire build side of a big hash join in the shared CXL memory and let all the other nodes performing the join see the single shared dataset. Or build a "coherent global buffer cache" using CPU+PCIe+CXL hardware, like Oracle Real Application Clusters has been doing with software+NICs for the last 30 years.
      Edit: One example of the CXL shared memory pool devices is Samsung CMM-B. Still just an announcement - I haven't seen it in the wild. So CXL arrays might become something like the SAN arrays of the future, but byte-addressable and with direct loading into the CPU cache (with cache coherence).
      https://semiconductor.samsung.com/news-events/tech-blog/cxl-memory-module-box-cmm-b/
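      As a toy illustration of that layout idea (not anything Oracle or Samsung ships): the build side kept as a flat, pointer-free open-addressing table, so every host that maps the same region can probe it no matter where the mapping lands in its own address space. The region itself is assumed to come from the shared pool.

      ```c
      /* Sketch: a build-side hash table with no embedded pointers, only
       * array indices, so it stays valid across hosts that map the same
       * CXL region at different virtual addresses. */
      #include <stddef.h>
      #include <stdint.h>

      typedef struct {
          uint64_t key;      /* 0 = empty slot (keys assumed non-zero) */
          uint64_t value;
      } slot_t;

      /* nslots must be a power of two */
      static size_t slot_of(uint64_t key, size_t nslots)
      {
          return (key * 0x9E3779B97F4A7C15ULL) & (nslots - 1);   /* Fibonacci hash */
      }

      void ht_insert(slot_t *table, size_t nslots, uint64_t key, uint64_t value)
      {
          size_t i = slot_of(key, nslots);
          while (table[i].key != 0 && table[i].key != key)
              i = (i + 1) & (nslots - 1);          /* linear probing */
          table[i].key = key;
          table[i].value = value;
      }

      int ht_lookup(const slot_t *table, size_t nslots, uint64_t key, uint64_t *out)
      {
          size_t i = slot_of(key, nslots);
          while (table[i].key != 0) {
              if (table[i].key == key) {
                  *out = table[i].value;
                  return 1;
              }
              i = (i + 1) & (nslots - 1);
          }
          return 0;
      }
      ```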
    • cjensen 1 day ago
      Both of the supported motherboards support installation of 2TB of DRAM.
      • reilly3000 23 hours ago
        Presumably this is about adding more memory channels via PCIe lanes. I’m very curious to know what kind of bandwidth one could expect with such a setup, as that is the primary bottleneck for inference speed.
        • Dylan16807 23 hours ago
          The raw speed of PCIe 5.0 x16 is 63 billion bytes per second each way. Assuming we transfer several cache lines at a time, the overhead should be pretty small, so expect 50-60 GB/s - which is on par with a single high-clocked channel of DRAM.
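          The 63 GB/s figure follows directly from the link parameters; a quick back-of-the-envelope check (CXL protocol overhead on top of this is ignored here):

          ```c
          /* Back-of-the-envelope: usable PCIe 5.0 x16 bandwidth per direction. */
          #include <stdio.h>

          int main(void)
          {
              double gt_per_lane = 32.0;          /* PCIe 5.0: 32 GT/s per lane */
              int lanes = 16;
              double encoding = 128.0 / 130.0;    /* 128b/130b line encoding */
              double gbit = gt_per_lane * lanes * encoding;
              printf("%.1f Gbit/s = %.1f GB/s per direction\n", gbit, gbit / 8.0);
              /* prints roughly 504.1 Gbit/s = 63.0 GB/s, matching the figure above */
              return 0;
          }
          ```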
  • jonhohle 1 day ago
    Why did something like this take so long to exist? I’ve always wanted swap or tmpfs available on old RAM I have lying around.
    • gertrunde 1 day ago
      Such things have existed for quite a long time...
      For example: https://en.wikipedia.org/wiki/I-RAM (not a unique thing, merely the first one I found).
      And then there are the more exotic options, like the stuff that these folks used to make: https://en.wikipedia.org/wiki/Texas_Memory_Systems - IIRC, Eve Online used the RamSan product line (apparently starting in 2005: https://www.eveonline.com/news/view/a-history-of-eve-database-server-hardware )
    • numpad0 11 hours ago
      Yeah. I can't count how many times I've seen descriptions of northbridge links that smell like the author knows it's PCIe under the hood. I've also seen someone explain that it can't be done on most CPUs unless all caching is turned off, because the (IO?)MMU doesn't allow caching of MMIO addresses outside the DRAM range.
      The technical explanations for why you can't have extra DRAM controllers on PCIe increasingly sound like market-segmentation reasons rather than purely technical ones. x86 is a memory-mapped I/O platform; why can't we just have RAM sticks at RAM addresses?
      The reverse of this works, by the way: NVMe drives can use the Host Memory Buffer feature to cache reads and writes in system RAM - the feature that jammed and caught fire in the recently rumored bad ntfs.sys incident in Windows 11.
    • kvemkon 1 day ago
      I'd rather ask why we had single-core (or already dual-core) CPUs with dual-channel memory controllers, and now we have 16-core CPUs but still only dual-channel RAM.
      • justincormack 12 hours ago
        AMD EPYC has 12 channels, 24 on a dual socket. AMD sells machines with 2 (consumer), 4 (Threadripper), 6 (dense edge), 8 (Threadripper Pro) and 12 memory channels (high-end EPYC); the next generation of EPYC will have 16 channels. Roughly, if you look at the AMD options, they give you 2 memory channels per 16 cores. CPUs tend to be somewhat limited in what bandwidth they can actually use; e.g. on Apple Silicon you can't consume all the memory bandwidth of the wider options from the CPUs alone - it's mainly useful for the GPU. DDR5 was double the speed of DDR4, and clock speeds have been ramping up too, so there have been improvements there.
      • Dylan16807 23 hours ago
        DDR1 and DDR2 were clocked 20x and 10x slower than DDR5. The CPU cores we have now are faster, but not *that much* faster, and with the typical user having 8 or fewer performance cores, 128 bits of memory width has stayed a good balance.
        If you need a lot of memory bandwidth, workstation boards have DDR5 at 256-512 bits wide. Apple Silicon supports that range on Pro and Max, and Ultra is 1024.
        (I'm using bits instead of channels because channels/subchannels can be 16, 32 or 64 bits wide.)
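        To put rough numbers on the bits-vs-channels framing, a small calculation using DDR5-6400 purely as an example transfer rate:

        ```c
        /* Back-of-the-envelope: peak DRAM bandwidth scales with total bus width.
         * DDR5-6400 (6400 MT/s) is used as an example rate. */
        #include <stdio.h>

        int main(void)
        {
            double mts = 6400.0;                       /* mega-transfers per second */
            int widths[] = { 128, 256, 512, 1024 };    /* total bus width in bits */
            for (int i = 0; i < 4; i++) {
                double gbs = mts * 1e6 * widths[i] / 8.0 / 1e9;
                printf("%4d-bit bus: %6.1f GB/s peak\n", widths[i], gbs);
            }
            /* ~102, ~205, ~410 and ~819 GB/s respectively */
            return 0;
        }
        ```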
      • bobmcnamara 22 hours ago
        Intel and AMD, I'd reckon. Apple went wide with their buses.
        • to11mtm 21 hours ago
          Well, each channel needs a lot of pins. I don't *think* all 288/262 pins need to go to the CPU, but I'd wager a large number of them do; the old LGA 1366 (triple-channel) and LGA 1151 (dual-channel) are probably as close as we can get to a simple reference point [0].
          Apple, FBOW, based on a quick and sloppy count of a reballing jig [1], has something on the order of 2500-2700 balls on an M2 package.
          I think AMD's FP11 "socket" (it's really just a standard ball grid array) pinout is something on the order of 2000-2100 balls, and that gets you four 64-bit DDR channels (I think Apple works a bit differently and uses 16-bit channels, hence the higher "channel count" for an M2).
          Which is a roundabout way of saying that AMD and Intel probably *can* match the bandwidth, but doing so would likely require moving to soldered CPUs, which would be a huge paradigm shift for all the existing board makers etc.
          [0] - They do have other tradeoffs; namely, 1151 has built-in PCIe, while on the other hand its link to the PCH is, as far as I remember, a good bit thinner than the QPI link on the 1366.
          [1] - https://www.masterliuonline.com/products/a2179-a1932-cpu-reballing-jig?VariantsId=10441 . I counted ~55 rows along the top and ~48 rows on the side...
          • bobmcnamara 5 hours ago
            Completely agree, and this is a bit of a ramble...
            I think part of it might be that Apple recognized that integrated GPUs require a lot of bulk memory bandwidth. I noticed this with their tablet-derivative cores having memory bandwidth that tended to scale with screen size, while Samsung and Qualcomm didn't bother for ages - and it sucked doing high-speed vision systems on their chips because of it.
            For years Intel had been slowly beefing up the L2/L3/L4.
            The M1 Max is somewhere between an Nvidia 1080 and 1080 Ti in bulk bandwidth. The lowest-end M chips aren't competitive, but nearly everything above that overlaps even current-gen NVIDIA 4050+ offerings.
      • christkv 23 hours ago
        Check out the Strix Halo 395+: it’s got 8 memory channels, up to 128 GB, and 16 cores.
        • Dylan16807 23 hours ago
          That's a true but misleading number - it's the equivalent of "quad channel" in normal terms.
    • aidenn0 23 hours ago
      (S)ATA or PCI to DRAM adapters were widely available until NAND became cheaper per bit than DRAM, at which point the use for them kind of went away.
      IIRC Intel even made a DRAM card that was drum-memory compatible.
    • Dylan16807 23 hours ago
      RAM controllers are expensive enough that it's rarely worth pairing them with old RAM lying around.
  • nottorp 14 hours ago
    For every gold rush, make and sell shovels.
  • JonChesterfield 23 hours ago
    I don't get it. The point of (DDRn) memory is latency. If it's on the far side of PCIe, latency is much worse than system memory. In what sense is this better than an SSD on the far side of PCIe?
    • wmf 23 hours ago
      It's only ~2x worse latency than main memory, but ~100x lower than SSD.
      • JonChesterfield 22 hours ago
        I'm finding ~50ns best case for PCIe, ~10ns for system. Which is a lot closer than I expected.
        • adgjlsfhk1 19 hours ago
          No system RAM is 10ns; that's closer to L2 cache.
      • Cr8 17 hours ago
        PCIe devices can also do direct transfers to each other - if you have one of these and a GPU, it's relatively quick to move data between them without bouncing through main RAM.
  • amirhirsch 1 day ago
    The i in that logo seems like it’s hurting the A
  • jauntywundrkind 22 hours ago
    I wonder whose controller they are using.
    For a memory controller, that thing looks hot!
    • marcopolis 22 hours ago
      From the manual, it looks like a Microchip PM8712 [1].
      [1] PDF data sheet: https://ww1.microchip.com/downloads/aemDocuments/documents/DCS/ProductDocuments/Brochures/SMC-2100-Smart-Memory-Controllers-00005431.pdf
  • fithisux 14 hours ago
    The Amiga approach resurrected.