48 comments

  • simonw19 hours ago
    I was intrigued to see how the demo GIF in the README was generated: <a href="https:&#x2F;&#x2F;github.com&#x2F;tamnd&#x2F;kage&#x2F;blob&#x2F;01e75b87ecc893bbba7943c630610112b211964d&#x2F;docs&#x2F;static&#x2F;demo.gif" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;tamnd&#x2F;kage&#x2F;blob&#x2F;01e75b87ecc893bbba7943c63...</a><p>Turns out it&#x27;s using another project by the same author: <a href="https:&#x2F;&#x2F;github.com&#x2F;tamnd&#x2F;ascii-gif" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;tamnd&#x2F;ascii-gif</a><p>The script used for the demo is at <a href="https:&#x2F;&#x2F;github.com&#x2F;tamnd&#x2F;kage&#x2F;blob&#x2F;01e75b87ecc893bbba7943c630610112b211964d&#x2F;docs&#x2F;demo&#x2F;kage.tape" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;tamnd&#x2F;kage&#x2F;blob&#x2F;01e75b87ecc893bbba7943c63...</a> and has a comment showing how to run it:<p><pre><code> ascii-gif render docs&#x2F;demo&#x2F;kage.tape -o docs&#x2F;static&#x2F;demo.gif </code></pre> Looks like it&#x27;s an opinionated wrapper around <a href="https:&#x2F;&#x2F;github.com&#x2F;charmbracelet&#x2F;vhs" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;charmbracelet&#x2F;vhs</a>
    • vqtska16 hours ago
      You can also do an animated svg which is way smaller than a gif because it&#x27;s just text keyframes (<a href="https:&#x2F;&#x2F;github.com&#x2F;vytskalt&#x2F;pseudoc&#x2F;blob&#x2F;main&#x2F;assets&#x2F;factorial.svg" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;vytskalt&#x2F;pseudoc&#x2F;blob&#x2F;main&#x2F;assets&#x2F;factori...</a>)
      • embedding-shape14 hours ago
        Very cool, never thought of that! &quot;way smaller&quot; is almost an understatement, when it&#x27;s 50kb :P Neat that it loads in GitHub READMEs as well, which is probably a large reason people use .gif today.
      • Noumenon7211 hours ago
        How can you do it? I don&#x27;t see an SVG output from ascii-gif.
        • vqtska5 hours ago
          I used a different project, <a href="https:&#x2F;&#x2F;github.com&#x2F;marionebl&#x2F;svg-term-cli" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;marionebl&#x2F;svg-term-cli</a>
        • LocoPadre7 hours ago
          it might be this: <a href="https:&#x2F;&#x2F;github.com&#x2F;mrmarble&#x2F;termsvg" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;mrmarble&#x2F;termsvg</a>
    • jubilanti17 hours ago
      Have you heard the good news about the terminal savior asciinema -- <a href="https:&#x2F;&#x2F;asciinema.org&#x2F;" rel="nofollow">https:&#x2F;&#x2F;asciinema.org&#x2F;</a>
      • embedding-shape15 hours ago
        It&#x27;s a cool tool&#x2F;platform, but very different. Asciinema tries to make the &quot;multimedia&quot; itself better by making it actual text instead of being video&#x2F;images, while the CLI command above turns actual text into multimedia supported by platforms already. Both are useful, both have their use cases :)
    • tamnd16 hours ago
      I have a bunch of opinionated&#x2F;personal-use binaries like this in my $HOME&#x2F;bin&#x2F;, like delete-all-npm, clean-rust-cache, download-youtube-playlist, and get-markdown &lt;url&gt;. It feels good, and I don&#x27;t need to remember any commands. Sometimes my coding agent can figure out how to call some of those tools too ;))
    • stavros16 hours ago
      VHS is fantastic for scripting cli video generation.
    • alterom18 hours ago
      FYI, on other platforms (Windows&#x2F;macOS), LiceCAP is a fantastic tool for recording the screen into compact GIFs, by the author of Winamp and Reaper DAW:<p><a href="https:&#x2F;&#x2F;www.cockos.com&#x2F;licecap&#x2F;" rel="nofollow">https:&#x2F;&#x2F;www.cockos.com&#x2F;licecap&#x2F;</a>
    • stellamariesays43 minutes ago
      [flagged]
  • wolttam20 hours ago
    One use I&#x27;d have for this is company wikis that you want to give folks easy offline access to (maybe the wiki has documentation that&#x27;s useful at sites that don&#x27;t have cellular coverage).<p>Cool!<p>It <i>would</i> be especially cool to have a version that didn&#x27;t require the separate serving process - even though it&#x27;s nifty you can package up a whole site as a single binary.<p>Maybe a single HTML entrypoint shim with a bit of javascript that could index into an archive (potentially embedded) of the site&#x27;s content?
    • tamnd20 hours ago
      Submitting this to Hacker News is the right place! Thanks for your idea. I will consider implementing that :)<p>Also, in my mind, I already have a script&#x2F;program to convert HTML to Markdown, so it could actually store everything on disk as a folder of Markdown files, and then commit them to a Git repo.
      • d3Xt3r8 hours ago
        I&#x27;d like to request something between what GP suggested and what your program is doing currently - basically I still want a single binary, but instead of embedding a full browser in it, I would like the binary to be just a self-extracting archive that calls the user&#x27;s default browser, maybe in a new window&#x2F;frame.<p>Basically I&#x27;m looking for something like the old-school .chm files on Windows, where you could pack a bunch of HTML documents into a single archive and open it without needing to embed a full browser engine.<p>This would have the advantage of keeping the file sizes really small. And you don&#x27;t have to worry about the browser engine becoming outdated and potentially turning into an attack vector.
        • Bad_CRC3 hours ago
          I instantly searched for chm on the comments and yours was the only one :o
          • samat2 hours ago
            You are not alone<p>For the younger generation <a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Microsoft_Compiled_HTML_Help" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Microsoft_Compiled_HTML_Help</a>
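The .chm-style idea above (one archive, unpacked and handed to the default browser) can be sketched with the Python standard library. This is only an illustration, not how Kage works: the archive layout (a zip with `index.html` at its root) and the function names `extract_site` and `open_site` are assumptions made for the example.

```python
import tempfile
import webbrowser
import zipfile
from pathlib import Path


def extract_site(archive_path, dest=None):
    """Unpack a zipped site and return the path to its entry page.

    Assumes (hypothetically) that the archive has index.html at its root.
    """
    target = Path(dest or tempfile.mkdtemp(prefix="site-")).resolve()
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(target)
    return target / "index.html"


def open_site(archive_path):
    """Extract the archive, then hand the entry page to the default browser."""
    index = extract_site(archive_path)
    webbrowser.open(index.as_uri())  # file:// URL; fine for script-free pages
    return index
```

Calling `open_site("docs.zip")` would open the extracted `index.html` through a `file://` URL, so the file:// restrictions discussed elsewhere in this thread apply to any page that still depends on JavaScript.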
      • mgiampapa17 hours ago
        I think the zim flow was perfect for offline use. I know I will be making use of it as soon as I can figure out how to pass chrome the cookies so I can be signed into the site. Didn&#x27;t see it in the page, but I didn&#x27;t look closely yet.
        • tamnd15 hours ago
          Not yet supporting cookies, since I created this tool for shadowing public websites first. I will add options to pass cookies later. It will pass them to the underlying Chrome&#x2F;Chromium process, so it should not be hard to do.
      • mcdonje12 hours ago
        Not to load you up with too many ideas, but a markdown folder sounds a lot like obsidian, which has a plugin system now.<p>Epub would also be a great target.
      • smeej13 hours ago
        I would use the shit out of this. I&#x27;m a heavy user of Logseq (OG, the md file-based version). Would LOVE to save my favorite web resources this way.
    • gwern13 hours ago
      &gt; Maybe a single HTML entrypoint shim with a bit of javascript that could index into an archive (potentially embedded) of the site&#x27;s content?<p>So something like SingleFileZ <a href="https:&#x2F;&#x2F;github.com&#x2F;gildas-lormeau&#x2F;SingleFileZ" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;gildas-lormeau&#x2F;SingleFileZ</a> or Gwtar <a href="https:&#x2F;&#x2F;gwern.net&#x2F;gwtar" rel="nofollow">https:&#x2F;&#x2F;gwern.net&#x2F;gwtar</a> ?
  • xlii7 hours ago
    &gt; No tracking, no network calls, no surprises.<p>Won&#x27;t comment on a project (though idea seems interesting) but this in README is a tell for me ;)
    • xd193636 minutes ago
      It&#x27;s not just no tracking — It&#x27;s no surprises.
  • ninalanyon19 hours ago
    &gt; kage serve $HOME&#x2F;data&#x2F;kage&#x2F;paulgraham.com<p>If the result is static why does it need a server? Isn&#x27;t it possible to make it so that it can simply be opened by the browser? Like:<p>$ firefox $HOME&#x2F;data&#x2F;kage&#x2F;paulgraham.com<p>Then the result would be usable on machines without kage installed.
    • tamnd16 hours ago
      You could use python -m http.server instead. I haven&#x27;t tried it yet, but it should work.<p>Actually, Kage has two parts: a crawler that crawls pages and converts them to clean HTML by capturing the DOM after rendering in Chrome&#x2F;Chromium, and a pack&#x2F;serve component that packages the result as either a ZIM file for Kiwix or an executable file.
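For anyone who wants the `python -m http.server` route from a script, here is a minimal sketch using the same standard-library module. It assumes the mirror is a plain folder of static HTML (which the comment above has not verified), and the helper names `make_server` and `serve_in_background` are made up for this example.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory, port=8000, host="127.0.0.1"):
    """Build a static file server rooted at `directory` (not yet running)."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=str(directory))
    return ThreadingHTTPServer((host, port), handler)


def serve_in_background(directory, port=8000):
    """Start serving in a daemon thread and return the server object."""
    server = make_server(directory, port)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

With a mirror at a path such as `$HOME/data/kage/paulgraham.com`, `serve_in_background(path)` would make it browsable at `http://127.0.0.1:8000/`; call `server.shutdown()` when done. Binding to 127.0.0.1 keeps the mirror off the local network.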
    • doctoboggan19 hours ago
      Usually JavaScript is blocked when you load pages that way.
      • dmazzoni18 hours ago
        Not all JavaScript, but a lot of APIs are restricted
      • pixelatedindex18 hours ago
        I thought all the JS was stripped?
      • embedding-shape18 hours ago
        Since when? You won&#x27;t be able to make HTTP requests to localhost, as it&#x27;d be a different Origin, but I don&#x27;t think any mainstream browser blocks JS outright when you use file:&#x2F;&#x2F; to load and view HTML files.
        • rzzzt18 hours ago
          Somewhere around 2019, each document loaded from file:&#x2F;&#x2F; became its own origin in Firefox: <a href="https:&#x2F;&#x2F;bugzilla.mozilla.org&#x2F;show_bug.cgi?id=1500453" rel="nofollow">https:&#x2F;&#x2F;bugzilla.mozilla.org&#x2F;show_bug.cgi?id=1500453</a> (I didn&#x27;t check when this happened in Chromium)<p>Related WHATWG discussion: <a href="https:&#x2F;&#x2F;github.com&#x2F;whatwg&#x2F;html&#x2F;issues&#x2F;3099" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;whatwg&#x2F;html&#x2F;issues&#x2F;3099</a>
          • embedding-shape15 hours ago
            Yeah, but that&#x27;s fine, the document is .html, and it can load .&#x2F;app.js or .&#x2F;style.css just fine even if loaded by file:&#x2F;&#x2F; (as long as it isn&#x27;t initiated by JS itself, then Origin starts to matter a lot more), otherwise basically every single local HTML file would suddenly be broken, I don&#x27;t think anyone would have accepted that even with the origin changes.
            • dncornholio8 hours ago
              React and Angular are completely broken through file:&#x2F;&#x2F;
              • embedding-shape5 hours ago
                I don&#x27;t know about Angular, but React works perfectly fine through file:&#x2F;&#x2F;. I&#x27;d think the bundler&#x2F;packager matters more than which JS libraries you use. Are you sure you&#x27;re not actually thinking of something else not handling file:&#x2F;&#x2F; properly?
      • recursive18 hours ago
        I am quite familiar with this and it is factually false
        • danielheath17 hours ago
          Js modules don’t work on file urls (classic js does).
          • recursive12 hours ago
            They can be made to work with blob urls. I have done this.
    • afavour18 hours ago
      You’ll likely run into a ton of CORS issues doing that.
      • embedding-shape18 hours ago
        I don&#x27;t think so. There are no HTTP requests being made from JS, as it&#x27;s stripped away, and all the other resources are pulled down (and I assume their references are made relative), so there really shouldn&#x27;t be any issues because of CORS at all.
  • maxloh21 hours ago
    I find SingleFile [0] to be a much more robust version of this.<p>It strips out all the JavaScript too, but also packs everything into a single HTML file that is easy to transfer. Binary assets (like web fonts and images) are packed as base64 strings.<p>They also offer a CLI powered by Puppeteer. [1]<p>[0]: <a href="https:&#x2F;&#x2F;github.com&#x2F;gildas-lormeau&#x2F;singlefile" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;gildas-lormeau&#x2F;singlefile</a><p>[1]: <a href="https:&#x2F;&#x2F;github.com&#x2F;gildas-lormeau&#x2F;single-file-cli" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;gildas-lormeau&#x2F;single-file-cli</a>
    • tamnd21 hours ago
      It seems this repo only saves one web page?<p>What I&#x27;m implementing here is mirroring a whole website, with all its subpages, so you can browse it all offline. For example, all essays from paulgraham.com.
      • maxloh19 hours ago
        Oh, I see. In that case, feature-wise, it is actually a modern alternative to HTTrack.<p>I think the misunderstanding stems from the browser&#x27;s &quot;Save As&quot; reference in the description. It is misleading. You use &quot;Save As&quot; to save a single page, not an entire website.<p>Also, the description lacks a clear explanation of the project&#x27;s purpose. It would be helpful to include a sentence explaining that the program downloads an entire website, not just a single page.
      • nikisweeting14 hours ago
        Singlefile supports scoped recursive crawls too: <a href="https:&#x2F;&#x2F;github.com&#x2F;gildas-lormeau&#x2F;single-file-cli#:~:text=and%20crawl%20its-,internal%20links,-with%20the%20query" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;gildas-lormeau&#x2F;single-file-cli#:~:text=an...</a><p>I highly recommend reading the singlefile source or <a href="https:&#x2F;&#x2F;archiveweb.page&#x2F;" rel="nofollow">https:&#x2F;&#x2F;archiveweb.page&#x2F;</a> to see how they handle closed shadow DOMs, cross-origin iframes, websockets, media urls, deduping large assets, etc.
      • sillysaurusx14 hours ago
        &gt; For example, all essays from paulgraham.com<p>Not the same thing, but I made a clone of pg’s website which can be used for exactly that: <a href="https:&#x2F;&#x2F;github.com&#x2F;shawwn&#x2F;pg" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;shawwn&#x2F;pg</a><p><a href="https:&#x2F;&#x2F;shawwn.github.io&#x2F;pg&#x2F;" rel="nofollow">https:&#x2F;&#x2F;shawwn.github.io&#x2F;pg&#x2F;</a><p>If you want to read all essays, just clone the repo and open any of the .html files. Or any of the .page files which generated them.
      • sdevonoes20 hours ago
        [flagged]
        • sermah20 hours ago
          Um. Whose website are you on right now?
          • ivangelion20 hours ago
            Don&#x27;t come here to laugh but always great when it happens anyways.
    • wamatt18 hours ago
      Love love love SingleFile too. The FF extension works pretty well for a clean save.<p>That said, Kage looks promising if OP can combine SingleFile reproduction quality with the HTTrack spidering approach. SPAs are kinda tricky to archive, and I do wonder how well Kage would handle them.
      • initramfs17 hours ago
        I&#x27;ve seen this option in IE: .mhtml.<p>For some reason it displays better in IE, but I don&#x27;t recall seeing this option in Chrome or Firefox recently.
    • tamnd21 hours ago
      And thanks for the link. Let me implement this single HTML feature, it looks nice to have!
      • maxloh19 hours ago
        Yeah. An idea on top of that is to bundle an entire website into a single HTML page, with vendored JavaScript to enable client-side routing (all of the original pages&#x27; JS is still stripped out).<p>That way, the page is self-contained as it is, but requires no bundled binary code to serve the site. It is actually safer security-wise.<p>The vendored script can be as simple as this:<p><pre><code>const site = {
  &quot;path-1&quot;: &quot;&lt;!DOCTYPE html&gt;&lt;html&gt; ... &lt;&#x2F;html&gt;&quot;,
  &quot;path-2&quot;: &quot;&lt;!DOCTYPE html&gt;&lt;html&gt; ... &lt;&#x2F;html&gt;&quot;,
  &#x2F;&#x2F; More paths
}

function attachListeners() {
  for (const [path, html] of Object.entries(site)) {
    &#x2F;&#x2F; A path may be linked several times, or not at all, on the current page
    for (const link of document.querySelectorAll(`a[href=&quot;${path}&quot;]`)) {
      link.onclick = (event) =&gt; {
        event.preventDefault() &#x2F;&#x2F; stay in this file instead of navigating away
        &#x2F;&#x2F; outerHTML cannot be assigned on the root element, so swap its innerHTML
        document.documentElement.innerHTML = html
        attachListeners() &#x2F;&#x2F; the new page&#x27;s links need handlers too
      }
    }
  }
}

document.addEventListener(&quot;DOMContentLoaded&quot;, attachListeners)</code></pre>
    • HelloUsername20 hours ago
      What&#x27;s the difference with, any webbrowser on a computer, File -&gt; Save as ?
      • nmstoker20 hours ago
        That&#x27;s for a single page, this handles the whole site. Also the browser Save As options often work poorly.
      • dmazzoni18 hours ago
        Save As works fine for simple websites with static content.<p>Let&#x27;s say you have a site that fetches content from a database. If you Save As, then at best you&#x27;ll get a local copy of an HTML page with JS that loads the content from the same remote database. It might not work (since the local copy has a different origin), or if it does, it requires you to be online, which defeats half of the purpose.<p>What this project, and SingleFile, both do is save a snapshot of what the rendered page actually looks like at that moment in time. The scripts are stripped out so it runs locally and has no external dependencies.
    • arikrahman18 hours ago
      This is what I first thought and it&#x27;s a very elegant solution, and not needlessly overcomplicated.
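The base64 packing mentioned at the top of this subthread (binary assets embedded as strings inside one HTML file) can be illustrated in a few lines. This is a simplified sketch, not SingleFile's actual code: the `assets` mapping, the function names, and the regex that only handles double-quoted `src` attributes are all assumptions made for the example.

```python
import base64
import re


def to_data_uri(data: bytes, mime: str) -> str:
    """Encode raw bytes as a data: URI that can replace a file reference."""
    return f"data:{mime};base64,{base64.b64encode(data).decode('ascii')}"


def inline_assets(html: str, assets: dict) -> str:
    """Replace src="..." references found in `assets` with data: URIs.

    `assets` maps a URL as written in the HTML to a (mime, bytes) pair.
    References that are not in the mapping are left untouched.
    """
    def swap(match):
        url = match.group(2)
        if url not in assets:
            return match.group(0)
        mime, data = assets[url]
        return f"{match.group(1)}{to_data_uri(data, mime)}{match.group(3)}"

    return re.sub(r'(src=")([^"]+)(")', swap, html)
```

Base64 inflates each asset by roughly a third, which is the price of ending up with a single file that is easy to transfer.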
  • telesilla20 hours ago
    I&#x27;ve been using httrack (<a href="https:&#x2F;&#x2F;www.httrack.com" rel="nofollow">https:&#x2F;&#x2F;www.httrack.com</a>) to download wikis to read on flights, which isn&#x27;t perfect but better than I&#x27;d found previously. I&#x27;ll try this out, I&#x27;d be delighted to have good results. Thanks for the post.
    • throwaway21945017 hours ago
      Specifically for wikis, is there a reason you wouldn&#x27;t use Kiwix? For non &quot;official&quot; releases it&#x27;s more complicated, but there are some services to generate the ZIM files. The desktop reader app is pretty good in my experience.<p><a href="https:&#x2F;&#x2F;wiki.openzim.org&#x2F;wiki&#x2F;Build_your_ZIM_file" rel="nofollow">https:&#x2F;&#x2F;wiki.openzim.org&#x2F;wiki&#x2F;Build_your_ZIM_file</a><p>EDIT: <a href="https:&#x2F;&#x2F;get.kiwix.org&#x2F;en&#x2F;solutions&#x2F;applications&#x2F;kiwix-reader&#x2F;" rel="nofollow">https:&#x2F;&#x2F;get.kiwix.org&#x2F;en&#x2F;solutions&#x2F;applications&#x2F;kiwix-reader...</a>
      • tamnd15 hours ago
        Kiwix has readers for almost every platform, Android, desktop, iPhone. That&#x27;s why I made Kage produce ZIM file.<p>The executable file is mostly for people who don&#x27;t have Kiwix installed yet, or just want to run the archive directly.
      • telesilla16 hours ago
        Thanks, never knew about this and great to hear about it.
    • tamnd15 hours ago
      This brings back memories. Around twenty years ago, internet was still expensive dial-up, so I used to go to an internet cafe, run HTTrack to download websites and manga, copy everything onto my tiny 128MB USB stick (felt very large at that time), then bring it home and read offline ;))
    • nikisweeting16 hours ago
      <a href="https:&#x2F;&#x2F;github.com&#x2F;archiveteam&#x2F;grab-site" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;archiveteam&#x2F;grab-site</a> or browsertrix may be easier to use for some, it&#x27;s what was used to save a lot of the data.gov stuff before it got taken down.
  • gregwebs21 hours ago
    This seems like it has potential to create a lot of load on a site- are there settings to set how fast it clones or avoid images&#x2F;videos? Is there a way to only get a subset of a website?
    • tamnd20 hours ago
      Could you help create a new issue for that? I will do it later. It is already 1:00 AM my time, but I am happy that anyone is interested in it. : )
    • ares62317 hours ago
      Just pretend you&#x27;re an AI crawler problem solved
  • dimiprasakis20 hours ago
    Neat project, I like the idea. One thing from a quick read: you launch Chrome with --no-sandbox. Is there a good reason for that? Security wise it&#x27;s probably not a good idea. If there is no reason, I&#x27;d suggest leaving the sandbox on!<p>In any case, cool stuff :)
    • nikisweeting16 hours ago
      --no-sandbox is needed in docker, maybe they assume it will mostly run in docker?
      • tamnd15 hours ago
        Exactly. For downloading, Kage requires Chrome or Chromium. Running it inside Docker makes setup easier and keeps cleanup simple:<p><a href="https:&#x2F;&#x2F;github.com&#x2F;tamnd&#x2F;kage&#x2F;blob&#x2F;main&#x2F;Dockerfile" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;tamnd&#x2F;kage&#x2F;blob&#x2F;main&#x2F;Dockerfile</a><p>Btw, let me think of a way to enable this only when running inside Docker.
        • nikisweeting14 hours ago
          Docker is designed to be undetectable by default. The best way I have found is to set env IN_DOCKER=True manually in your Dockerfile + check that there is no $DISPLAY configured + that you&#x27;re on Linux. Usually if all&#x2F;most of those are true you can safely add --no-sandbox --disable-setuid-sandbox --disable-dev-shm-usage and the other Docker-specific flags. That&#x27;s what we do in <a href="https:&#x2F;&#x2F;github.com&#x2F;ArchiveBox&#x2F;ArchiveBox&#x2F;blob&#x2F;dev&#x2F;Dockerfile#L46" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;ArchiveBox&#x2F;ArchiveBox&#x2F;blob&#x2F;dev&#x2F;Dockerfile...</a>
          • dimiprasakis3 hours ago
            Cool approach.<p>But a compromise still lands on the host&#x27;s kernel; Docker doesn&#x27;t provide kernel isolation (well, it does on macOS because it runs in a Docker machine, but that&#x27;s a side effect).<p>I wonder if a better solution would be to play with seccomp or Linux capabilities so that Chrome is sandboxed even in Docker. Not sure how this would work tbh.<p>Answering here to get ideas. I saw your fix on Git and the request for feedback (will try to review and give it some thought once I find some time).
          • tamnd8 hours ago
            It should be fixed by <a href="https:&#x2F;&#x2F;github.com&#x2F;tamnd&#x2F;kage&#x2F;pull&#x2F;12" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;tamnd&#x2F;kage&#x2F;pull&#x2F;12</a><p>Thanks for the nice trick.
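The Docker heuristic described in this subthread (an `IN_DOCKER` variable set by hand in the Dockerfile, no `$DISPLAY`, and a Linux host) can be sketched as a small pure function. This is not the code from the Kage pull request or from ArchiveBox: the function name `chrome_flags` is hypothetical, and requiring all three signals (rather than "all/most") is a deliberately conservative assumption.

```python
import os
import sys

# Flags named in the thread as safe to add only inside a container.
DOCKER_ONLY_FLAGS = [
    "--no-sandbox",
    "--disable-setuid-sandbox",
    "--disable-dev-shm-usage",
]


def chrome_flags(env=None, platform=None):
    """Return the Docker-only Chrome flags, or [] when the sandbox should stay on.

    Assumption: all three signals must agree before the sandbox is disabled.
    """
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    in_docker = env.get("IN_DOCKER", "").strip().lower() in {"1", "true", "yes"}
    headless = not env.get("DISPLAY")
    on_linux = platform.startswith("linux")
    return list(DOCKER_ONLY_FLAGS) if in_docker and headless and on_linux else []
```

Defaulting to an empty list means a desktop run keeps Chrome's sandbox, which addresses the concern that opened this subthread. It does not change the later point that a container still shares the host kernel.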
  • coffeecoders18 hours ago
    I&#x27;ve accumulated a bunch of old website archives over the years. The funny thing is the ugly HTML dumps have been more useful than the &quot;perfect&quot; archive.<p>It&#x27;s one of the reasons I&#x27;ve become a bigger fan of RSS over time. A feed from 10-ish years ago is often more usable today than a carefully preserved (application) website.
    • couscouspie2 hours ago
      Maybe it is just me, but by far most of the time, when I want to archive something from the internet, it is information and information is best served in an absolutely minimal text format like html or md.
    • tamnd15 hours ago
      I have a project for creating and archiving RSS feeds, keeping the full history from the time the crawler starts. I need to clean up a bit, then will open source it soon.
  • shinryuu18 hours ago
    Reminds me of this. <a href="https:&#x2F;&#x2F;gwern.net&#x2F;gwtar" rel="nofollow">https:&#x2F;&#x2F;gwern.net&#x2F;gwtar</a><p>Compared to that is there anything kage does better?
  • kadhirvelm15 hours ago
    This is awesome, we wanted an offline copy of someone’s prototype (as built on Lovable, etc) so we could do version control and sharing in an easier format. Wrote our approach here: <a href="https:&#x2F;&#x2F;productnow.ai&#x2F;blogs&#x2F;extracting-html-from-ai-prototyping-tools" rel="nofollow">https:&#x2F;&#x2F;productnow.ai&#x2F;blogs&#x2F;extracting-html-from-ai-prototyp...</a><p>But will look into this now, see if we can swap some stuff out. We’ve really liked the idea of an offline mirror, makes a lot of collaboration use cases simpler
  • rahimnathwani20 hours ago
    So this is like using wget --mirror except that it works on pages that require javascript, right?
    • tamnd20 hours ago
      Yeah, it is. For example, openai.com is rendered with Next.js, so I will try to mirror it tomorrow.
  • sails9 hours ago
    What is the best way to give coding agent a full website so that it can see what I see? With animation and design I’m never sure what it gets when I save the website in the browser. Maybe this is suitable?
  • lolpython20 hours ago
    This is cool. I could see myself downloading the articles behind the first couple pages of hacker news with this, for viewing on a flight or long distance train ride with spotty internet
  • Sathwickp3 hours ago
    I&#x27;m still trying to cope with your github profile, 68k commits a year is crazyy
  • sanqui20 hours ago
    Cool concept. I would like to see this combined with mitmproxy for archive grade fidelity. You could be saving exactly the data served and at the same time a representation by a modern (contemporary) browser, with all JS having run. This combination would be my perfect replacement for the WARC format.
    • tamnd20 hours ago
      I&#x27;m working on WARC too, with format from Common Crawl!<p>By converting it to Markdown, we save a lot of space, but it is for a different purpose and a different project: <a href="https:&#x2F;&#x2F;github.com&#x2F;tamnd&#x2F;ccrawl-cli" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;tamnd&#x2F;ccrawl-cli</a>
      • sanqui20 hours ago
        That&#x27;s neat! In my opinion, the WARC format is quite tricky and underspecified especially since HTTP2 introduced new semantics. It encodes too much in-band and requires rewriting of the server data. A mitmproxy capture is higher fidelity and supports capturing modern features such as WebSockets. I think if we could wrap Kage&#x27;s crawler interactions by it and store its capture (the intercepted traffic), we could make a potentially nice new archival format.
        • tamnd20 hours ago
          I tried to follow well-known formats first, such as WARC and ZIM from Kiwix, so we could benefit from existing tooling support.<p>For my own custom data format, I have a lot of private code that I plan to release soon. It is optimized for compression, fast lookups, and more. I have been working on it for two years. This is part of a larger, ambitious umbrella project: I am building Google from scratch (all open source), something that anyone can host, including the crawler, indexer, storage, and serving layers. Stay tuned!
          • sanqui20 hours ago
            I&#x27;m a fan of compatibility with established formats!<p>Sounds awesome. There is a lot of untapped potential with respect to efficiently archiving and indexing websites. I saw the impressive things Marginalia Search is doing in this area (the blog is great when it gets technical). There is also a lot of very complete archives of websites out there which are not being indexed at all, and I would love to make them available for researchers. In any case, I&#x27;m interested in your project!
          • Prime_Axiom19 hours ago
            Looking forward to the next project! I love these kinds of archiving tools.
          • threecheese16 hours ago
            OK, sounds fascinating; following! (your GH)
            • tamnd15 hours ago
              Thanks ;)
    • Dhavidh20 hours ago
      sound interesting
  • amatecha14 hours ago
    Suddenly remembering the days of dialup and your browser serving a fully-functional cached copy of a webpage when you try to access it and you&#x27;re not online...
  • c7b4 hours ago
    Probably a stupid question, but could this archive embedded videos as well?
  • ekianjo1 hour ago
    Curious about &quot;keep it for a decade&quot; claim. Can something possibly break down the road?
  • Igor_Wiwi20 hours ago
    This is quite a useful tool, especially for cases where internet access is limited (flights, for example). I implemented something similar as a separate feature in mdview.io: you can export a document as an HTML file for offline usage, with all the presentation features like rich tables, mermaid, etc. built in. Example: <a href="https:&#x2F;&#x2F;mdview.io&#x2F;s&#x2F;why-markdown-became-default-format-for-ai" rel="nofollow">https:&#x2F;&#x2F;mdview.io&#x2F;s&#x2F;why-markdown-became-default-format-for-a...</a> then try Export - Export HTML
  • latexr19 hours ago
    For those with an eReader, one thing that works really well is using pandoc to download and convert a webpage to EPUB that you can then load to your reader.<p><pre><code> pandoc --from html --to epub --output &#x2F;PATH&#x2F;TO&#x2F;FILE.epub https:&#x2F;&#x2F;example.com</code></pre>
    • arikrahman18 hours ago
      Thanks, will try this out on the Kobo later.
  • jyscao14 hours ago
    I tried to clone an HTTP (not HTTPS) site, and it&#x27;s giving me `navigation failed: net::ERR_NAME_NOT_RESOLVED`, even when I explicitly include the protocol with `http:&#x2F;&#x2F;&lt;FQDN&gt;`.
  • snowflaxxx2 hours ago
    Meet Teleport Pro
  • carsonye11 hours ago
    This is interesting. Is the intended use case mostly read-only websites like blogs&#x2F;docs&#x2F;essays? How well does it handle sites where navigation, search, dropdowns, or other UI interactions depend on JavaScript?
  • godot14 hours ago
    the readme uses paulgraham.com as an example (which is text articles mostly) and I never use &quot;Save As&quot; for a web page (for the reasons the author states), I always just print as PDF and save the PDF file.<p>for an entire website though of many pages I can see this can be useful.
  • smusamashah11 hours ago
    What if I wanted to download all Confluence docs at work?
  • endorphine8 hours ago
    Anyone remember Teleport Pro?
  • calrizien17 hours ago
    Does this work for the Apple Docs website? Really tricky to get those offline.
    • tamnd15 hours ago
      Making docs available offline was one of my main motivations for building this tool. I will try Apple Docs too.<p>I previously downloaded the Snowflake docs, and it was something like tens or even hundreds of thousands of pages, I do not remember exactly. The output ended up being very large.<p>By the way, I forgot to add zstd compression support to my ZIM reader&#x2F;writer. I will implement that in the next version.
  • rickylin11 hours ago
    It seems like <a href="https:&#x2F;&#x2F;github.com&#x2F;tw93&#x2F;pake" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;tw93&#x2F;pake</a> is better.
    • italiancheese11 hours ago
      Both of these projects have completely different purposes and use cases.<p>Have you even read the first line of the readme of the project you&#x27;re commenting on?
  • nitotm17 hours ago
    I was looking for something like this the other day, it can be very helpful.
  • G_o_D12 hours ago
    How is it different from MHTML?
  • daviding20 hours ago
    Nice idea! fwiw, false positives and all, but the Windows 11 default Windows Security doesn&#x27;t like it: `leakless.exe: Operation did not complete successfully because the file contains a virus or potentially unwanted software.`
  • KellyCriterion18 hours ago
    Sounds like .MCH-files re-invented? (-:
  • chinnyys19 hours ago
    The readme is AI slop, and incredibly grating to read. The disgust I felt while reading it almost put me off trying the project.<p>Is the code also AI slop?
  • chfritz16 hours ago
    how is this different from using puppeteer to load the page and save the DOM as HTML?
  • cynicalsecurity17 hours ago
    Binary app is a really bad way of storing data. No one would ever want to run a binary shared with them or found online.
    • tamnd15 hours ago
      For sharing, better use the html folder or zim format, Kage supports both of them.
  • jokethrowaway9 hours ago
    Amazing stuff!<p>I would recommend an add-on or new feature to detect and remove cookie banners &#x2F; annoying popups that open on load (e.g. sign up to my mailing list).<p>Listing a few examples from fastText could help you.<p>You might also have the opposite problem though: some websites have content in the base HTML (so it&#x27;s searchable by Google and they get views) and remove it on load (so you have to pay).<p>Capturing the initial HTML and comparing it to the final version could give you some hints and allow you to repair the removed content.<p>Best of luck with the project!
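The "compare the initial HTML to the final version" idea can be sketched with Python's built-in HTML parser. This is a rough illustration rather than anything Kage does today: it compares text chunks only, ignores structure, and the names `text_chunks` and `removed_text` are invented for the example.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # nesting depth inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and not self._skip:
            self.chunks.append(text)


def text_chunks(html):
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


def removed_text(initial_html, final_html):
    """Text present in the served HTML but missing from the rendered DOM."""
    final = set(text_chunks(final_html))
    return [chunk for chunk in text_chunks(initial_html) if chunk not in final]
```

Anything returned by `removed_text` is a candidate for content that a script removed after load; a real implementation would likely need fuzzier matching, since rendering can also rewrap or reorder text.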
  • soulofmischief18 hours ago
    Cool project! I know it&#x27;s written in go, but it would be cool to see something like this which uses Cosmopolitan Libc + redbean or something similar to create a binary which runs anywhere. Would be fun to be able to pass around self-executable website archives.<p><a href="https:&#x2F;&#x2F;github.com&#x2F;jart&#x2F;cosmopolitan" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;jart&#x2F;cosmopolitan</a><p><a href="https:&#x2F;&#x2F;justine.lol&#x2F;cosmopolitan&#x2F;index.html" rel="nofollow">https:&#x2F;&#x2F;justine.lol&#x2F;cosmopolitan&#x2F;index.html</a><p><a href="https:&#x2F;&#x2F;redbean.dev" rel="nofollow">https:&#x2F;&#x2F;redbean.dev</a><p>(Certificates just expired for justine&#x27;s website, just ignore the warning.)
    • tamnd15 hours ago
      This could be a nice code golf project. It only needs a webview, a ZIM reader, and a way to append data to an existing binary and read it back.<p>I did something like that a very long time ago (Of course, I have forgotten)
    • jokethrowaway9 hours ago
      I never understood the appeal for cosmopolitan.<p>I&#x27;d rather have platform specific minimal binaries than a single binary with hacks.<p>Installing packages is a solved problem
      • soulofmischief4 hours ago
        Installing packages is a completely different activity than passing around self-executable archives among friends. Not everything needs to go through a CI pipeline and distribution platform before you can share it with others. On top of that, I really enjoy being able to write quick little utilities and then pass them around without worrying about what operating system anyone who stumbles upon it has.<p>It&#x27;s fine if you don&#x27;t personally find it useful for your workflow, but I think it&#x27;s mad cool, especially since you can zip together multiple binaries into one, along with data.
  • aa-jv3 hours ago
    I&#x27;ve been using &quot;Print to PDF&quot; as my principal bookmark management tool since 1998, and I have over 90,000 such PDFs sitting on my system, easily re-read and discovered.<p>So I don&#x27;t quite get what the point of kage is. What does it do that print-to-PDF won&#x27;t already do? The resulting PDFs contain all the content, and also include the original URL, creation date, etc. How is kage an improvement?
  • delduca20 hours ago
    curl can do this
  • sneak17 hours ago
    The README is LLM slop. This makes me assume the code is the same.
  • Onavo18 hours ago
    How does it handle websites with client side paywalls? Can you run it with extensions like bypass paywalls and ublock origin?
  • grahamstanes1720 hours ago
    nice