26 comments

  • dginev21 hours ago
    Hi, an arXiv HTML Papers developer here.
    As a very brief update: a larger update is pending.
    You will spot many (many) issues with our current coverage and fidelity of the paper rendering. When they jump out at you, please report them to us. All reports from the last 2 years have landed on GitHub. We have made a bit of progress since, but there is (a lot of) low-hanging fruit left to pick.
    Project issues:
    https://github.com/arXiv/html_feedback/issues/
    The main bottleneck at the moment is developer time. And the main vehicle for improvements on the LaTeX side of things continues to be LaTeXML. Happy to field any questions.
    • istillwritecode54 minutes ago
      I would like to write code for latexml to translate a package but I found the documentation to be hard to understand. That might be what is holding developers back. I looked at this a year ago and gave up.
  • RandyOrion17 hours ago
    For arXiv papers, I much prefer the HTML format to PDF.
    Compared to PDF, HTML is much more accessible because it lives in the browser. I can reuse my browser extensions to do anything I like without hassle: translation, note taking, sending text to LLMs, and so on.
    For now, arXiv offers two HTML services: the default one at https://arxiv.org/html/xxxx.xxxxx , and the alternative one at https://ar5iv.labs.arxiv.org/html/xxxx.xxxxx , where each 'x' is a placeholder for a digit.
    The most glaring problem with the default HTML service is its coverage of papers. Sometimes it just doesn't work, e.g., https://arxiv.org/html/2505.06708 . The workaround may be to switch to the alternative HTML service, e.g., https://ar5iv.labs.arxiv.org/html/2505.06708 .
    Note that the alternative HTML service also has coverage problems. Sometimes both HTML services fail, e.g., https://arxiv.org/abs/2511.22625 .
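For anyone scripting the fallback described above, here is a rough Python sketch. It only builds the candidate URLs from the patterns quoted in this comment; the function name, the identifier regex, and the choice of the abstract page as a last resort are assumptions for illustration, not anything arXiv documents as an API.

```python
import re

# Assumption: new-style arXiv identifiers only, e.g. "2505.06708" (no version suffix).
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}$")


def html_candidates(arxiv_id: str) -> list[str]:
    """Return the URLs to try for a paper, in order of preference."""
    if not ARXIV_ID.match(arxiv_id):
        raise ValueError(f"not a new-style arXiv identifier: {arxiv_id!r}")
    return [
        f"https://arxiv.org/html/{arxiv_id}",            # default HTML service
        f"https://ar5iv.labs.arxiv.org/html/{arxiv_id}",  # alternative HTML service
        f"https://arxiv.org/abs/{arxiv_id}",              # abstract page as a last resort
    ]


for url in html_candidates("2505.06708"):
    print(url)
```

Whether a given URL actually serves a rendered paper still has to be checked by requesting it, since either service can be missing a paper.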
  • ComputerGuru1 day ago
    If the Unicode consortium would spend less time and effort on emoji and more on making the most common&#x2F;important mathematical symbols and notations available&#x2F;renderable in plain text, maybe we could move past the (LA)TeX&#x2F;PDF marriage. OpenType and TrueType now (edit: for well over a decade, actually) support the necessary conditional rendering required to perform complicated rendering operations to get sequences of Unicode code points to display in the way needed (theoretically, anyway) and with fallback missing-glyph-only font family substitution support available pretty much everywhere allowing you to seamlessly display symbols not in your primary font from a fallback asset (something like Noto, with every Unicode symbol supported by design, or math-specific fonts like Cambria Math or TeX Gyre, etc), there are no technical restrictions.<p>I’ve actually dug into this in the past and it was never lack of technical ability that prevented them from even adding just proper superscript&#x2F;subscript support before, but rather their opinion that this didn’t belong in the symbolic layer. 
    But since emoji abuse/rely on ZWJ and modifiers left and right to display in one of a myriad of variations, there’s really no good reason not to allow the same, because 2 and the squared symbol are not semantically the same (so it’s not a design choice).
    An interesting (complete) tangent is that Gemini 3 Pro is the only model I’ve tested (I do a lot of math-related stuff with LLMs) that absolutely will not under any circumstances respect (system/user) prompt requests to avoid inline math mode (aka LaTeX) in the output, regardless of whether I asked for a blanket ban on TeX/MathJax/etc or insisted that it use extended Unicode code points to substitute for all math formula rendering (I primarily use LLMs via the TUI where I don’t have MathJax support, and as familiar as I once was with raw TeX mathematical notations and symbols, it’s still quite easy to misread unrendered raw output by missing something if you’re not careful). I shared my experiment and results here – Gemini 3 Pro would insist on even rendering single letter constants or variables as $k$ instead of just k (or k in markdown italics, etc) no matter how hard I asked it not to (which makes me think it may have been overfit against raw LaTeX papers, and is also an interesting argument in favor of the “VL LLMs are the more natural construct”): https://x.com/NeoSmart/status/1995582721327071367?s=20
    • crazygringo23 hours ago
      I don&#x27;t understand. No matter what fancy things you do with superscripts and subscripts, you&#x27;re not going to be able to do even basic things you need for equations like use a fraction bar, or parentheses that grow in height to match the content inside them.<p>At a fundamental level, Unicode is for characters, not layout. Unicode may abuse the ZWJ for emoji, but it still ultimately results in a single emoji character, not a layout of characters. So I don&#x27;t really understand what you&#x27;re asking for.
      • lukan21 hours ago
        Agreed. I think MathML is intended for the layout of formulas and is integrated into browsers nowadays, but I've never used it, so I don't know whether any essentials are missing.
      • bsder19 hours ago
        &gt; No matter what fancy things you do with superscripts and subscripts, you&#x27;re not going to be able to do even basic things you need for equations like use a fraction bar, or parentheses that grow in height to match the content inside them.<p>Why not? Things like Arabic ligatures already do that, no?
        • bruce3434343 hours ago
          Arabic ligatures? Do you mean the unicode point for the basmala for instance? That&#x27;s pretty &quot;hardcoded&quot;, I think math requires more composability
        • austinjp19 hours ago
          This is interesting to me, but I am very naive about this. Can you explain, or point to where I could learn more?
          • bsder13 hours ago
            I&#x27;d start with HarfBuzz: <a href="https:&#x2F;&#x2F;github.com&#x2F;harfbuzz&#x2F;harfbuzz" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;harfbuzz&#x2F;harfbuzz</a><p>That&#x27;s <i>the</i> open source font shaping engine. It does a lot of work to handle font shaping and rendering for languages that can&#x27;t really be reduced to characters.
    • raincole18 hours ago
      Math formulas are far far far more complex than unicode emojis. I don&#x27;t even know how to start comparing them.
    • SOTGO22 hours ago
      I&#x27;m almost surprised that Gemini 3 uniquely has this problem. I would have expected that responses from any LLM that require complex math notation would almost certainly be LaTeX heavy, given the abundance of LaTeX source material in the training data. I suppose it is a flaw if a model can&#x27;t avoid LaTeX, but given that it is the standard (and for the foreseeable future too) I don&#x27;t know what appropriate output would look like. For &quot;pure&quot; mathematics or similar topics I think LaTeX (or system that represents a superset of LaTeX) is the only acceptable option.
    • franga20005 hours ago
      The whole "we need latex because of math" thing has been nothing more than a bad excuse for a very long time. Math notation is too varied to include in Unicode (some papers have to invent new notation!), but even if we had it, authors would still insist on latex. You can already make responsive and largely accessible papers that render to HTML, with latex-familiar syntax for equations, bibtex for references and all the footnotes/figures/tables/captions you might want.
      But authors still refuse. It's not real science if the layout isn't two-column, written in an old serif font, tables and figures float randomly disconnected from their reference points, code isn't syntax highlighted and has completely nonsensical line breaks... If the reader wants to read it on a phone, or needs to change the font to be larger or more legible, they're not a real scientist and don't deserve to read real papers.
      Seriously, what the fuck?? Even the economists are laughing at us with their MS Word and third-party cloud-based bibliography plugin subscription.
      • gus_massa48 minutes ago
        Authors just follow whatever format the journals mandate.
        In unofficial notes for classes, most authors use a single column, and try to remember the magic spell to keep the figures in place. Something like [H!] ???
        Also, most books are single column.
    • hannahnowxyz1 day ago
      Have you tried a two-pass approach? For example, where prompt #1 is "Which elliptic curves have rational parameterizations?", and then prompt #2 (perhaps to a smaller/faster model like Gemma) is "In the following text, replace all LaTeX-escaped notation with Markdown code blocks and Unicode characters. For example, $F_n = F_{n - 1} + F_{n - 2}$ should be replaced with `Fₙ = Fₙ₋₁ + Fₙ₋₂`. <Response from prompt #1>". Although it's not clear how you would want more complex things to be converted.
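For the simple cases, the second pass does not have to be a model at all. Below is a minimal Python sketch of the substitution described above, assuming only `_x` and `_{...}` subscripts inside `$...$` need handling. The names `SUB`, `to_subscript`, and `delatex` are made up for illustration, and the table covers only a few of the letters that have Unicode subscript forms. Anything it cannot map is left as raw LaTeX, which is exactly the "more complex things" problem.

```python
import re

# Unicode subscript forms exist for digits, a few operators, and only some letters.
SUB = dict(zip("0123456789+-=()aeiknx", "₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎ₐₑᵢₖₙₓ"))

# Matches "_{...}" (group 1) or "_x" with a single word character (group 2).
SUBSCRIPT = re.compile(r"_(?:\{([^{}]*)\}|(\w))")


def to_subscript(body):
    """Map a subscript body to Unicode, or return None if any character has no mapping."""
    body = body.replace(" ", "")
    if not body or any(ch not in SUB for ch in body):
        return None
    return "".join(SUB[ch] for ch in body)


def _replace_subscript(match):
    body = match.group(1) if match.group(1) is not None else match.group(2)
    converted = to_subscript(body)
    # Leave the original LaTeX untouched when it cannot be expressed in Unicode.
    return converted if converted is not None else match.group(0)


def delatex(text):
    """Rewrite each $...$ span as a Markdown code span with Unicode subscripts."""
    def inline(match):
        return "`" + SUBSCRIPT.sub(_replace_subscript, match.group(1)) + "`"
    return re.sub(r"\$([^$]+)\$", inline, text)


print(delatex("$F_n = F_{n - 1} + F_{n - 2}$"))  # prints `Fₙ = Fₙ₋₁ + Fₙ₋₂`
```

Fractions, stretchy delimiters, and anything with a backslash command fall straight through, so this only helps with the kind of single-letter and index notation mentioned in the parent comment.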
      • toastal11 hours ago
        reStructuredText supports :math: roles. AsciiDoc has stem blocks. Why do folks keep trying to shoehorn Markdown into everything, creating yet another fork, when there are other lightweight markup languages that support actual features for technical blogs/documentation?
      • baby23 hours ago
        I&#x27;ve done latex -&gt; mathml -&gt; markdown and it works quite well
      • yannis23 hours ago
        It is actually quicker to ask using LaTeX markup!
    • moelf23 hours ago
      <a href="https:&#x2F;&#x2F;github.com&#x2F;stevengj&#x2F;subsuper-proposal" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;stevengj&#x2F;subsuper-proposal</a>
  • ForceBru1 day ago
    Is this new or somehow updated? HTML versions of papers have been available for several years now.<p>EDIT: indeed, it was introduced in 2023: <a href="https:&#x2F;&#x2F;blog.arxiv.org&#x2F;2023&#x2F;12&#x2F;21&#x2F;accessibility-update-arxiv-now-offers-papers-in-html-format&#x2F;" rel="nofollow">https:&#x2F;&#x2F;blog.arxiv.org&#x2F;2023&#x2F;12&#x2F;21&#x2F;accessibility-update-arxiv...</a>
    • Tagbert1 day ago
      From the paper...<p>Why &quot;experimental&quot; HTML?<p>Did you know that 90% of submissions to arXiv are in TeX format, mostly LaTeX? That poses a unique accessibility challenge: to accurately convert from TeX—a very extensible language used in myriad unique ways by authors—to HTML, a language that is much more accessible to screen readers and text-to-speech software, screen magnifiers, and mobile devices. In addition to the technical challenges, the conversion must be both rapid and automated in order to maintain arXiv’s core service of free and fast dissemination.
      • ForceBru1 day ago
        No I mean _arXiv_ has had experimental support for generating HTML versions of papers for years now. If you visit arXiv, you&#x27;ll see a lot of papers have generated HTML alongside the usual PDF, so I&#x27;m trying to understand whether the article discussed any new developments. It seems like it&#x27;s not new at all
      • fooofw23 hours ago
        It&#x27;s kind of fun to compare this formulation with the seemingly contradictory official arXiv argument for submitting the TeX source [1]:<p>&gt; 1. TeX has many advantages that make it ideal as a format for the archives: It is plain text, it is compact, it is freely available for all platforms, it produces extremely high-quality output, and it retains contextual information.<p>&gt; 2. It is thus more likely to be a good source from which to generate newer formats, e.g., HTML, MathML, various ePub formats, etc. [...]<p>Not that I disagree with the effort and it surely is a unique challenge to, at scale, convert the Turing complete macro language TeX to something other than PDF. And, at the same time, the task would be monumentally more difficult if only the generated PDFs were available. So both are right at the same time.<p>[1] <a href="https:&#x2F;&#x2F;info.arxiv.org&#x2F;help&#x2F;faq&#x2F;whytex.html#contextual" rel="nofollow">https:&#x2F;&#x2F;info.arxiv.org&#x2F;help&#x2F;faq&#x2F;whytex.html#contextual</a>
        • tosti8 hours ago
          Working with both at the same time makes their strengths and pitfalls shine. It&#x27;s like that dual-boot computer where you&#x27;re constantly in the wrong OS.<p>HTML has better separation of concerns than latex. Latex does typesetting a lot better than html. HTML layout can differ wildly in the same document. Latex documents are easier to layout in the first place.<p>...etc...
      • daemonologist1 day ago
        There are pretty often problems with figure size and with sections being too narrow or wide (for comfortable reading). The PDF versions are more consistently well-laid-out.
    • inglor1 day ago
      You&#x27;re right <a href="https:&#x2F;&#x2F;github.com&#x2F;arXiv&#x2F;arxiv-docs&#x2F;blob&#x2F;develop&#x2F;source&#x2F;about&#x2F;accessible_HTML.md" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;arXiv&#x2F;arxiv-docs&#x2F;blob&#x2F;develop&#x2F;source&#x2F;abou...</a> this needs a 2023 tag @dang
  • DominikPeters1 day ago
    As an arXiv author who likes using complicated TeX constructions, the introduction of HTML conversion has increased my workload a lot trying to write fallback macros that render okay after conversion. The conversion is super slow and there is no way to faithfully simulate it locally. Still I think it&#x27;s a great thing to do.
    • xworld211 day ago
      I believe dginev&#x27;s Docker image <a href="https:&#x2F;&#x2F;github.com&#x2F;dginev&#x2F;ar5ivist" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;dginev&#x2F;ar5ivist</a> is very close to what runs on arXiv and can be run locally. It uses a recent LaTeXML snapshot from September.
  • ekjhgkejhgk1 day ago
    I wish epub was more common for papers. I have no idea if there&#x27;s any real difficulties with that, or just not enough demand.
    • mmooss1 day ago
      epub is html, under the hood<p>Is there an epub reader that can format text approximately as usably and beautifully as pdf? What I&#x27;ve seen makes it noticeably harder to read longer texts, though I haven&#x27;t looked around much.<p>epub also lacks annotation, or at least annotation that will be readable across platforms and time.
    • hombre_fatal1 day ago
      Because what makes epub a format on top of html is just that someone QA&#x27;ed it and wrote the html&#x2F;css with it in mind. Especially considering things like diagrams and tables.<p>Not really what you want researchers to waste their time doing.<p>But you can use any of the numerous html-&gt;epub packagers yourself.
    • pspeter31 day ago
      Why epub? Isn’t it just HTML under the hood?
      • silon422 hours ago
        I think it should also have JS disabled (I hope!)
      • ekjhgkejhgk1 day ago
        Because I can open it on my ereader.
  • el3ctron1 day ago
    Accessibility barriers in research are not new, but they are urgent. The message we have heard from our community is that arXiv can have the most impact in the shortest time by offering HTML papers alongside the existing PDF.
    • lalithaar1 day ago
      Hello, I was going through the HTML versions of my preprints on arXiv. Thank you for all that you guys do. Please do let me know if the community could contribute in any way.
      • dginev21 hours ago
        You can help make LaTeXML better, or you can simply report issues when you spot them during reading. Some we have collected automatically (any errors and missing packages), but others we can't: wrong colors, broken aspect ratios of figures, weirdly laid out author lists, etc.
  • zipy1247 hours ago
    The biggest issue with papers for me today is that they don't allow videos as anything other than supplemental materials to be downloaded, or a link to a web page that has them. I want to embed GIFs or videos in my papers directly!
  • Barbing1 day ago
    &gt;Did you know that 90% of submissions to arXiv are in TeX format, mostly LaTeX? That poses a unique accessibility challenge: to accurately convert from TeX—a very extensible language used in myriad unique ways by authors—to HTML, a language that is much more accessible to screen readers and text-to-speech software, screen magnifiers, and mobile devices.<p>Challenging. Good work!
  • leobg1 day ago
    It must have been around 1998. I was editor of our school’s newspaper. We were using Corel Draw. At some point, I proposed that we start using HTML instead. In the end, we decided against it, and the reasons were the same that you can read here in the comments now.
  • percentcer1 day ago
    Dumb question but what stops browsers from rendering TeX directly (aside from the work to implement it)? I assume it&#x27;s more than just the rendering
    • bo102423 hours ago
      You mean a display engine that works like an HTML renderer, except starting from TeX source instead of HTML source? I think you could get something that mostly works, but it would be a pain and at the end you wouldn&#x27;t have CSS or javascript, so I don&#x27;t think browser makers are interested.
    • pwdisswordfishy23 hours ago
      For starters, TeX is Turing-complete, and the tokenizer is arbitrarily reprogrammable at runtime.
      • fph5 hours ago
        As far as I know the Tex team has been working hard lately on supporting accessible &quot;tagged pdfs&quot;. Hopefully one day Tex&#x2F;Latex output will be accessible by default and conversion to HTML will not be needed.
      • gbear60523 hours ago
        Browsers already support JavaScript anyway, so why not add another Turing-complete language into the mix? (Not even accounting for CSS technically being Turing-complete, or WASM, or …)
      • ErroneousBosh23 hours ago
        Okay then, what would stop you rendering TeX to SVG and embedding that?<p>Edit: Genuine question, not rhetorical - I don&#x27;t know how well it would work but it sounds like it should.
        • fooofw23 hours ago
          That would (mostly if not always) work in the sense of reproducing the layout of the pages, but would defeat the purpose of preserving the semantic information present in the TeX file (what is a heading, a reference and to what, a specific math environment, etc.) which is AFAIK already mostly dropped on conversion to PDF by the latex compiler.
  • notorandit11 hours ago
    The problem is the viewer, not the format. We are talking about accessibility and scientific papers, where fancy animations and transitions are not core features.
    LaTeX and TeX are the de facto standard in this context, and converting all existing documents is a lot of work and energy spent for little gain, if any.
  • [Sept 2023] as per the wayback machine.
  • sega_sai1 day ago
    Unfortunately I didn&#x27;t see the recommendation there on what can be done for old papers. I checked, and only my papers after 2022 have an HTML version. I wish they&#x27;d make some kind of &#x27;try html&#x27; button for those.
    • Do the older papers work via [Ar5iv](<a href="https:&#x2F;&#x2F;ar5iv.labs.arxiv.org&#x2F;" rel="nofollow">https:&#x2F;&#x2F;ar5iv.labs.arxiv.org&#x2F;</a>) ?<p>&gt; View any arXiv article URL [in HTML] by changing the X to a 5<p>The line<p>&gt; Sources upto the end of November 2025.<p>sounds to me like this is indeed intended for older articles.
      • dginev21 hours ago
        ar5iv tracks the arXiv collection with a one month lag. Exactly as to signal that this is not the &quot;official&quot; arXiv rendering. It is also a showcase predating the arXiv &#x2F;html&#x2F; route, but largely using the same technology. Nowadays maintained by the same people (hi!)<p>There used to be another showcase, called arxiv-vanity. They captured what happened pretty well with their farewell post on their homepage:<p><a href="https:&#x2F;&#x2F;www.arxiv-vanity.com&#x2F;" rel="nofollow">https:&#x2F;&#x2F;www.arxiv-vanity.com&#x2F;</a>
  • jas391 day ago
    Pandoc can convert to svg. It can then be inlined in html. Looks just like latex, though copy&#x2F;paste isn&#x27;t very useful
    • stephenlf1 day ago
      That doesn’t solve the accessibility issue, though. You need semantic tags.
  • constantcrying10 hours ago
    Reading this thread, many people do not seem to understand what the problem even is. What researchers writing papers want is a low-effort, high-flexibility way to write documents (nobody wants to write their paper in HTML). For a paper to be printed it needs to be in some printable format, like PDF. To provide accessibility and accommodate the changing ways papers are read, which is increasingly online, HTML is also a desirable output.
    What is really needed is a markup language that can natively target both PDF and HTML. This is something typst is working on, but I am not aware of any other project that either comes close to the features of LaTeX or supports both target formats.
    To me this is the only reasonable way to address the accessibility and usability issues around papers. Have one markup, with sufficient accessibility features, which simultaneously targets HTML and PDF.
  • nateroling1 day ago
    Seeing the Gemini 3 capabilities, I can imagine a near future where file formats are effectively irrelevant.
    • qart1 day ago
      I have family members with health conditions that require periodic monitoring. For some tests, a phlebotomist comes home. For some tests, we go to a hospital. For some other tests, we go to a specialized testing center. They all give us PDFs in their own formats. I manually enter the data to my spreadsheet, for easy tracking. I use LLMs for some extraction, but they still miss a lot. At least for the foreseeable future, no LLM will ever guarantee that all the data has been extracted correctly. By &quot;guarantee&quot;, I mean someone&#x27;s life may depend on it. For now, doctors take up the responsibility of ensuring the data is correct and complete. But not having to deal with PDFs would make at least a part of their job (and our shared responsibilities) easier.
    • s0rce1 day ago
      Can you elaborate? Are you never reading papers directly but only using Gemini to reformat or combine&#x2F;summarize?
      • nateroling1 day ago
        I mean that when a computer can visually understand a document and reformat and reinterpret it in any imaginable way, who cares how it’s stored? When a PNG or a PDF or a markdown doc can all be read and reinterpreted into an infographic or a database or an audiobook or an interactive infographic, the original format won’t matter.
    • DANmode1 day ago
      Files.<p>Truth in general, if we aren&#x27;t careful.
    • sansseriff1 day ago
      Seriously. More people need to wake up to this. Older generations can keep arguing over display formats if they want. Meanwhile younger undergrad and grad students are getting more and more accustomed to LLMs forming the front end for any knowledge they consume. Why would research papers be any different.
      • JadeNB1 day ago
        &gt; Meanwhile younger undergrad and grad students are getting more and more accustomed to LLMs forming the front end for any knowledge they consume.<p>Well, that&#x27;s terrifying. I mean, I knew it about undergrads, but I sure hoped people going into grad school would be aware of the dangers of making your main contact with research, where subtle details are important, through a known-distorting filter.<p>(I mean, I&#x27;d still be kinda terrified if you said that grad students <i>first</i> encounter papers through LLMs. But if it is the front end for <i>all</i> knowledge they consume? Absolutely dystopian.)
        • sansseriff1 day ago
          I admit it has dystopian elements. It’s worth deciding what specifically is scary though. The potential fallibility or mistakes of the models? Check back in a few months. The fact they’re run by giant corps which will steal and train on your data? Then run local models. Their potential to incorporate bias or persuade via misalignment with the reader’s goals? Trickier to resolve, but various labs and nonprofits are working on it.
          In some ways I’m scared too. But that’s the way things are going because younger people far prefer the interface of chat and question answering to flipping through a textbook.
          Even if AI makes more mistakes or is more misaligned with the reader’s intentions than a random human reviewer (which is debatable in certain fields since the latest models came out), the behavior of young people requires us to improve the reputability of these systems. (Make sure they use citations, make sure they don’t hallucinate, etc). I think the technology is so much more user friendly that fixing the engineering bugs will be easier than forcing new generations to use the older systems.
  • ashleyn1 day ago
    Can&#x27;t help but wonder if this was motivated in part by people feeding papers into LLMs for summary, search, or review. PDF is awful for LLMs. You&#x27;re effectively pigeonholed into using (PAYING for) Adobe&#x27;s proprietary app and models which barely hold a candle to Gemini or Claude. There are PDF-to-text converters, but they often munge up the formatting.
    • jrk1 day ago
      Not sure when you last tried, but Gemini, Claude, and ChatGPT have all supported pretty effective PDF input for quite a while.
  • billconan1 day ago
    I don&#x27;t think HTML is the right approach. HTML is better than PDF, but it is still a format for displaying&#x2F;rendering.<p>the actual paper content format should be separated from its rendering.<p>i.e. it should contain abstract, sections, equations, figures, citations etc. but it shouldn&#x27;t have font sizes, layout etc.<p>the viewer platforms then should be able to style the content differently.
    • cluckindan1 day ago
      HTML alone is in fact not a format for displaying&#x2F;rendering. Done properly, it is a structural representation of the content. (This is often called ”semantic HTML”.)<p>They are converting to HTML to make the content more accessible. Accessibility in this context means a11y, in effect ”more accessible” equates to ”more compatible with screen readers”.<p>While PDF documents can be made accessible, it is way easier to do it in HTML, where browsers build an actual AOM (accessibility object model) tree and expose it to screen readers.<p>&gt;it should contain abstract, sections, equations, figures, citations etc.<p>So &lt;article&gt;, &lt;section&gt;, &lt;math&gt;, &lt;figure&gt;, &lt;cite&gt;, etc.
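To make the point about structure concrete, here is a small Python sketch using only the standard library's `html.parser`. It pulls a section outline out of a document purely from its `<section>` and heading elements, with no class names or layout heuristics. The sample markup and the `OutlineParser` and `outline` names are hypothetical and are not taken from arXiv's actual output.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect (nesting depth, heading text) for headings found inside <section> elements."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.section_depth = 0   # how many <section> elements are currently open
        self.current = None      # tag name of the heading being read, if any
        self.buffer = []         # text fragments of the current heading
        self.outline = []

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            self.section_depth += 1
        elif tag in self.HEADINGS and self.section_depth > 0:
            self.current = tag
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "section":
            self.section_depth = max(0, self.section_depth - 1)
        elif tag == self.current:
            self.outline.append((self.section_depth, "".join(self.buffer).strip()))
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)


def outline(html):
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.outline


# Hypothetical paper skeleton written with semantic elements only.
SAMPLE = """
<article>
  <section><h2>Abstract</h2><p>We study ...</p></section>
  <section><h2>Method</h2>
    <section><h3>Data</h3><p>...</p></section>
  </section>
</article>
"""

print(outline(SAMPLE))  # prints [(1, 'Abstract'), (1, 'Method'), (2, 'Data')]
```

The same walk is roughly what a screen reader's heading navigation relies on, which is why the element choice matters more than the styling.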
      • o11c15 hours ago
        The hope for semantic HTML died the day they said &quot;stop using &lt;i&gt;, use &lt;em&gt;&quot;, regardless of what the actual purpose of the italics was (it&#x27;s usually not emphasis).
        • cluckindan1 hour ago
          Who said that? The semantics are different.<p>The &lt;i&gt; HTML element represents a range of text that is set off from the normal text for some reason, such as idiomatic text, technical terms, taxonomical designations, among others. Historically, these have been presented using italicized type, which is the original source of the &lt;i&gt; naming of this element.<p>The &lt;em&gt; element is for words that have a stressed emphasis compared to surrounding text, which is often limited to a word or words of a sentence and affects the meaning of the sentence itself.<p>Typically this element is displayed in italic type. However, it should not be used to apply italic styling; use the CSS font-style property for that purpose. Use the &lt;cite&gt; element to mark the title of a work (book, play, song, etc.). Use the &lt;i&gt; element to mark text that is in an alternate tone or mood, which covers many common situations for italics such as scientific names or words in other languages.
      • benatkin1 day ago
        Much of it is a structural representation of how to display the content.
        • cluckindan20 hours ago
          In practice, sometimes. But in principle, hard disagree.<p>HTML was explicitly designed to semantically represent scientific documents. [1]<p>”HTML documents represent a media-independent description of interactive content. HTML documents might be rendered to a screen, or through a speech synthesizer, or on a braille display. To influence exactly how such rendering takes place, authors can use a styling language such as CSS.” [2]<p>1: <a href="https:&#x2F;&#x2F;html.spec.whatwg.org&#x2F;multipage&#x2F;introduction.html#background" rel="nofollow">https:&#x2F;&#x2F;html.spec.whatwg.org&#x2F;multipage&#x2F;introduction.html#bac...</a><p>2: <a href="https:&#x2F;&#x2F;html.spec.whatwg.org&#x2F;multipage&#x2F;introduction.html#:~:text=HTML%20documents%20represent%20a%20media%2Dindependent%20description%20of%20interactive%20content." rel="nofollow">https:&#x2F;&#x2F;html.spec.whatwg.org&#x2F;multipage&#x2F;introduction.html#:~:...</a>
      • Theodores23 hours ago
        I like Arxiv and what they are doing, however, do the auto-generated HTML files contain nothing more than a sea of divs dressed with a billion classes?<p>I would be delighted if they could do better than that, with figcaptions as well as figures, and sections &#x27;scoped&#x27; with just one &lt;h2-6&gt; heading per section. They could specify how it really should be done, the HTML way, with a well defined way of doing the abstract and getting the cited sources to be in semantic markup yet not in some massive footer at the back.<p>There should also be a print stylesheet so that the paper prints out elegantly on A4 paper. Yes, I know you can &#x27;print to PDF&#x27; but you can get all the typesetting needed in modern CSS stylesheets.<p>Furthermore, they need to write a whole new HTML editor that discards WYSIWYG in favour of semantic markup. WYSIWYG has held us back by decades as it is useless for creating a semantic document. We haven&#x27;t moved on from typewriters and the conventions needed to get those antiques to work, with word processors just emulating what people were used to at the time. What we really need is a means to evolve the written word, so that our thinking is &#x27;semantic&#x27; when we come to put together documents, with a &#x27;document structure first&#x27; approach.<p>LaTeX is great, however, last time I used it was many decades ago, when the tools were &#x27;vi&#x27; (so not even vim) and GhostScript, running on a Sun workstation with mono screen. Since then I have done a few different jobs and never have I had the need to do anything in LaTex or even open a LaTeX file. In the wild, LaTeX is rarer than hen&#x27;s teeth. Yet we all read scientific papers from time to time, and Arxiv was founded on the availability of Tex files.<p>The lack of widespread adoption of semantic markup has been a huge bonus to Google and other gatekeepers that have the money to develop their own heuristics to make sense of &#x27;seas of divs&#x27;. 
        As it happens, Google have also been somewhat helpful with Chrome and advancing the web, even if it is for their gatekeeping purposes.
        The whole world of gatekeeping is also atrocious in academia. Knowledge wants to be free, but it is also big business to the likes of Springer, who are already losing badly to open publishing.
        As you say, in this instance, accessibility means screen readers, however, I hope that we can do better than that, to get back to the OG Tim Berners-Lee vision of what the web should be like, as far as structuring information is concerned.
        • dginev21 hours ago
          You will be delighted. Feel free to inspect some sources.
    • dimal1 day ago
      Perfect is the enemy of good. HTML is good enough. Let’s get this done.<p>And as another commenter has pointed out, HTML does exactly what you ask for. If it’s done correctly, it doesn’t contain font sizes or layout. Users can style HTML differently with custom CSS.
      • billconan1 day ago
        mixing rendering definitions with content (PDF) is something from the printer era, that is unsuitable for the digital era.<p>HTML was a digital format, but it wanted to be a generic format for all document types, not just papers, so it contains a lot of extras that a paper format doesn&#x27;t need.<p>for research papers, since they share the same structure, we can further separate content from rendering.<p>for example, if you want to later connect a paper with an AI, do you want to send &lt;div class=&quot;abstract&quot;&gt; ... ?<p>or do some nasty heuristic to extract the abstract? like document.getElementsByClassName(&quot;abstract&quot;)[0] ?
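For readers wondering what the class-based heuristic mentioned above amounts to outside a browser, here is a minimal Python sketch using only the standard library. It assumes the abstract lives in an element carrying the class `abstract`, as in the comment's example; the helper name `extract_by_class` and the sample markup are made up for illustration and do not describe arXiv's actual HTML.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside the first element carrying a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # > 0 while we are inside the matching element
        self.done = False   # True once the first match has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            self.depth += 1
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.wanted_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_by_class(html, wanted_class):
    """Return the whitespace-normalised text of the first matching element."""
    parser = ClassTextExtractor(wanted_class)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Hypothetical sample markup, not taken from a real paper.
sample = (
    '<div class="abstract-container"><div class="abstract">'
    "<p>We study <i>X</i>.</p></div>"
    '<div class="author-list">A</div></div>'
)
abstract_text = extract_by_class(sample, "abstract")
```

The sketch works only as long as the class name stays stable, which is exactly the fragility the comment is pointing at.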
        • simonw1 day ago
          All of the interesting LLMs can handle a full paper these days without any trouble at all. I don&#x27;t think it&#x27;s worth spending much time optimizing for that use-case any more - that was much more important two years ago when most models topped out at 4,000 or 8,000 tokens.
    • m-schuetz1 day ago
      That&#x27;s a purist stance that&#x27;s never going to work out in practice. Authors will always want to adjust the presentation of content, and html might be even better suited for that than Latex, which is bad at both.
    • bob10291 day ago
      &gt; HTML is better than PDF<p>I disagree. PDF is the most desirable format for printed media and its analogues. Any time I plan to seriously entertain a paper from Arxiv, I print it out first. I prefer to have the author&#x27;s original intent in hand. Arbitrary page breaks and layout shifts that are a result of my specific hardware&#x2F;software configuration are not desirable to me in this context of use.
      • ACCount371 day ago
        I agree that PDF is best for things that are meant to be printed, no questions. But I wonder how common actually printing those papers is?<p>In research and in embedded hardware both, I&#x27;ve met some people who had entire stacks of papers printed out - research papers or datasheets or application notes - but also people who had 3 monitors and 64GB of RAM and all the papers open as browser tabs.<p>I&#x27;m far closer to the latter myself. Is this a &quot;generational split&quot; thing?
        • pfortuny1 day ago
          Possibly, but then again, when I need to study a paper, I print it, when I need just to skim it and use a result from it, it is more likely that I just read it on a screen (tablet&#x2F;monitor). That is the difference for me.
      • s0rce1 day ago
        I used to print papers, probably stopped about 10 years ago. I now read everything in Zotero where I can highlight and save my annotations and sync my library between devices. You can also seamlessly archive html and pdfs. I don&#x27;t see people printing papers in my workplace that often unless you need to read them in a wet lab where the computer is not convenient.
    • afavour1 day ago
      Wouldn’t that be CSS?
      • billconan1 day ago
        no<p>&lt;div class=&quot;abstract-container&quot;&gt;<p>&lt;div class=&quot;abstract&quot;&gt;<p>&lt;pre&gt;&lt;code&gt; abstract text ... &lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;<p>&lt;&#x2F;div&gt;<p>&lt;div class=&quot;author-list&quot;&gt;<p>&lt;ol&gt;<p>&lt;li&gt;author one&lt;&#x2F;li&gt;<p>&lt;li&gt;author two&lt;&#x2F;li&gt;<p>&lt;&#x2F;ol&gt;<p>&lt;&#x2F;div&gt;<p>&lt;&#x2F;div&gt;<p>should be just:<p>[abstract]<p>abstract text<p>[authors]<p>author one | email | affiliation<p>author two | email | affiliation
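To make the proposed plain-text layout concrete, here is a rough Python sketch of a parser for it. The format is only what the comment outlines (bracketed section names, then content lines, with author fields separated by `|`); the function name `parse_paper`, the dictionary keys, and the example addresses are assumptions made for illustration, not an existing standard.

```python
def parse_paper(text):
    """Parse '[section]' headers followed by content lines into a dict."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # A bracketed line opens a new section, e.g. "[abstract]".
            current = line[1:-1].strip().lower()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)

    authors = []
    for line in sections.get("authors", []):
        fields = [field.strip() for field in line.split("|")]
        fields += [""] * (3 - len(fields))  # tolerate missing trailing fields
        name, email, affiliation = fields[:3]
        authors.append({"name": name, "email": email, "affiliation": affiliation})

    return {
        "abstract": " ".join(sections.get("abstract", [])),
        "authors": authors,
    }


# Hypothetical input following the layout sketched in the comment.
example = """\
[abstract]
abstract text
[authors]
author one | one@example.org | University A
author two | two@example.org | University B
"""
paper = parse_paper(example)
```

A structure like `paper` could then be rendered to HTML, PDF, or anything else, which is the separation of content from rendering the comment argues for.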
        • afavour1 day ago
          Sounds like XML and XSL would be a great fit here. Shame it’s being deprecated.<p>But you could still use HTML. Elements with a dash in are reserved for custom elements (that is, a new standardised element will never take that name) so you could do:<p><pre><code> &lt;paper-author-list&gt; &lt;paper-author &#x2F;&gt; &lt;&#x2F;paper-author-list&gt; </code></pre> And it would be valid HTML. Then you’d style it with CSS, with<p><pre><code> paper-author { display: list-item; } </code></pre> And so on.
          • bawolff1 day ago
            Nothing is stopping you from using server side XSL. I personally don&#x27;t think it&#x27;s a great fit, but people need to stop acting like xsl has been wiped from the face of the earth.
            • afavour1 day ago
              Yes but we’re specifically talking about a display format here. Something requiring a server side transform before being viewable by a user is a clear step backwards.
              • bawolff1 day ago
                How so? I can&#x27;t think of any advantage to having client side xsl over outputting two files, in this context.
                • afavour1 day ago
                  The discussion is about the form in which you share papers. With HTML you just share the HTML file, it opens instantly on basically any device.<p>If you distribute the paper as XML with an XSLT transform you need to run something that’ll perform that transform before you can read the paper. No matter whether that transform happens on the server or on the client it’s still an extra complication in the flow of sharing information.
          • xworld211 day ago
            Indeed, LaTeXML (the software used by arXiv) converts LaTeX to a semantic XML document which is turned to HTML using primarily XSLT!
        • panzi1 day ago
          There is &lt;article&gt; &lt;section&gt; &lt;figure&gt; &lt;figcaption&gt;, but yes, &lt;abstract&gt; and &lt;authors&gt; are missing as such. But there are meta tags for such things. Then there is RDF and Thing. Not quite the same, I know, but it&#x27;s not completely useless.
          • kevindamm1 day ago
            and you could shim these gaps with custom components, hypothetically
  • chr15m19 hours ago
    Wish I could upvote this harder. Thank you arXiv!
  • teddy-smith1 day ago
    It&#x27;s extremely easy to convert HTML&#x2F;CSS to a PDF with the print to PDF feature of the browser.<p>All papers should be in HTML&#x2F;CSS or TeX then just simply converted to PDF.<p>Why are we even talking about this?
    • tefkah1 day ago
      What are you talking about? No one’s writing their paper in HTML.<p>The problem is having the submissions be in TeX and converting that to HTML, when the only output has been PDF for so long.<p>The problem isn’t converting HTML to PDF, it’s making available a giant portion of TeX&#x2F;pdf only papers in HTML.<p>If you’re arguing that maybe TeX then shouldn’t be the source format for papers then I agree, but other than Typst (which also isn’t perfect about HTML output yet) there aren’t that many widely accepted&#x2F;used authoring formats for physics&#x2F;math papers, which is what ArXiV primarily hosts.
      • teddy-smith20 hours ago
        This is what I&#x27;m talking about. HTML&#x2F;CSS is more powerful than PDF or TEX.<p><a href="https:&#x2F;&#x2F;csszengarden.com&#x2F;" rel="nofollow">https:&#x2F;&#x2F;csszengarden.com&#x2F;</a>
    • crazygringo23 hours ago
      Have you ever written a paper for publication?<p>HTML doesn&#x27;t support the necessary features. Citations in various formats, footnotes, references to automatically numbered figures and tables, I could go on and on.<p>HTML could certainly be extended to support those, but it hasn&#x27;t been. That&#x27;s why we&#x27;re talking about this.
      • teddy-smith20 hours ago
        Come on are you serious? HTML&#x2F;CSS is more powerful than TEX or PDF.<p><a href="https:&#x2F;&#x2F;csszengarden.com&#x2F;" rel="nofollow">https:&#x2F;&#x2F;csszengarden.com&#x2F;</a>
        • crazygringo18 hours ago
          Did you fully read my comment? Please point me to where HTML&#x2F;CSS provide the features I listed.<p>It doesn&#x27;t really matter if HTML&#x2F;CSS is more powerful at a hundred other layout things, if it doesn&#x27;t provide the absolute necessary features for papers.
          • teddy-smith3 hours ago
            Citations in various formats,<p>&gt; <a href="https:&#x2F;&#x2F;developer.mozilla.org&#x2F;en-US&#x2F;docs&#x2F;Web&#x2F;HTML&#x2F;Reference&#x2F;Elements&#x2F;cite" rel="nofollow">https:&#x2F;&#x2F;developer.mozilla.org&#x2F;en-US&#x2F;docs&#x2F;Web&#x2F;HTML&#x2F;Reference&#x2F;...</a><p>&gt; <a href="https:&#x2F;&#x2F;codepen.io&#x2F;tag&#x2F;citation" rel="nofollow">https:&#x2F;&#x2F;codepen.io&#x2F;tag&#x2F;citation</a><p>footnotes<p>&gt; <a href="https:&#x2F;&#x2F;codepen.io&#x2F;SitePoint&#x2F;pen&#x2F;QbMgvY" rel="nofollow">https:&#x2F;&#x2F;codepen.io&#x2F;SitePoint&#x2F;pen&#x2F;QbMgvY</a><p>references to automatically numbered figures and tables<p>&gt; <a href="https:&#x2F;&#x2F;stackoverflow.com&#x2F;questions&#x2F;25869906&#x2F;table-auto-numbering-using-css" rel="nofollow">https:&#x2F;&#x2F;stackoverflow.com&#x2F;questions&#x2F;25869906&#x2F;table-auto-numb...</a><p>&gt; <a href="https:&#x2F;&#x2F;codepen.io&#x2F;MikeKelley&#x2F;pen&#x2F;GpXmEd" rel="nofollow">https:&#x2F;&#x2F;codepen.io&#x2F;MikeKelley&#x2F;pen&#x2F;GpXmEd</a>
    • ekjhgkejhgk1 day ago
      LOL what. You&#x27;re either trolling, or you&#x27;ve never written a paper in your life.
      • teddy-smith20 hours ago
        It sounds like you might not understand the power of modern HTML&#x2F;CSS.
    • nkrisc1 day ago
      So, uh, where do the HTML versions of the papers come from?
    • benatkin1 day ago
      It&#x27;s easy to convert PDF to HTML&#x2F;CSS, with similar results.<p>Either way it gets shoehorned.
    • carlosjobim1 day ago
      Except you can&#x27;t have page breaks, three links in a row, anchor links.
      • teddy-smith20 hours ago
        @media print { .page, .page-break { break-after: page; } }
        • carlosjobim19 hours ago
          It doesn&#x27;t function in real use, it&#x27;s just theoretical.
          • teddy-smith3 hours ago
            <a href="https:&#x2F;&#x2F;developer.mozilla.org&#x2F;en-US&#x2F;docs&#x2F;Web&#x2F;CSS&#x2F;Reference&#x2F;Properties&#x2F;break-after" rel="nofollow">https:&#x2F;&#x2F;developer.mozilla.org&#x2F;en-US&#x2F;docs&#x2F;Web&#x2F;CSS&#x2F;Reference&#x2F;P...</a><p>Literally part of Mozilla&#x27;s docs.
  • _dain_1 day ago
    Wasn&#x27;t the World Wide Web invented at CERN <i>specifically</i> for sharing scientific papers? Why are we still using PDFs at all?
    • fsh1 day ago
      No, it wasn&#x27;t. Scientists at CERN used DVI and later PDF like everyone else. HTML has no provisions for typesetting equations and is therefore not suitable for physics papers (without much newer hacks such as MathML).
      • teddy-smith3 hours ago
        Why not typeset in something else and import the image into html&#x2F;css?
  • cubefox1 day ago
    This is not new, the title should say (2023). They have shipped the HTML feature with &quot;experimental&quot; flag for two years now, but I don&#x27;t know whether there is even any plan to move out of the experimental phase.<p>It&#x27;s not much of an &quot;experiment&quot; if you don&#x27;t plan to use some experimental data to improve things somehow.
  • lalithaar1 day ago
    I was reading through this article too, glad to have found it on here
  • rootnod31 day ago
    Maybe unpopular, but papers should be in a markdown flavor to be determined. Just to have them more machine readable.
    • xigoi1 day ago
      Compared to HTML, Markdown is very bad at being machine-readable.
  • vatsachak1 day ago
    Why do we like HTML more than pdfs?<p>HTML rendering requires you to be connected to the internet, or setting up the images and mathJax locally. A PDF just works.<p>HTML obviously supports dynamic embedding, such as programs, much better but people just usually post a github.io page with the paper.
    • devnull31 day ago
      &gt; HTML rendering requires you to be connected to the internet<p>Not really. One can always generate a self-contained html. Both CSS and JS (if needed) can be inline.
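As a rough illustration of the self-contained approach described above, the Python sketch below replaces `<link rel="stylesheet">` tags with inline `<style>` blocks. It is deliberately simplistic: it uses regular expressions rather than a real HTML parser, handles stylesheets only (not scripts, images, or fonts), and the names `inline_stylesheets` and `read_css` are invented for this example.

```python
import re

LINK_RE = re.compile(r"<link\b[^>]*>", re.IGNORECASE)
HREF_RE = re.compile(r"href\s*=\s*[\"']([^\"']+)[\"']", re.IGNORECASE)
REL_RE = re.compile(r"rel\s*=\s*[\"']stylesheet[\"']", re.IGNORECASE)


def inline_stylesheets(html, read_css):
    """Replace stylesheet <link> tags with <style> blocks.

    read_css is a caller-supplied function mapping an href to CSS text;
    links it cannot resolve are left untouched.
    """

    def replace(match):
        tag = match.group(0)
        href = HREF_RE.search(tag)
        if href is None or not REL_RE.search(tag):
            return tag  # not a stylesheet link, e.g. rel="icon"
        try:
            css = read_css(href.group(1))
        except (OSError, KeyError):
            return tag  # stylesheet unavailable, keep the original link
        return "<style>\n" + css + "\n</style>"

    return LINK_RE.sub(replace, html)


# Hypothetical page and stylesheet, held in memory to keep the example runnable.
page = (
    '<html><head><link rel="stylesheet" href="paper.css">'
    '<link rel="icon" href="x.png"></head><body><p>Hi</p></body></html>'
)
files = {"paper.css": "body { margin: 2em; }"}
standalone = inline_stylesheets(page, files.__getitem__)
```

With files on disk, `read_css` could instead be something like `lambda href: pathlib.Path(base_dir, href).read_text()`, where `base_dir` is whatever folder holds the page's assets.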
      • vatsachak1 day ago
        True but the webdev idiom is injecting things such as mathjax from a cdn. I guess one can pre-render the page and save that, but that&#x27;s kind of like a PDF already
    • mmooss1 day ago
      epub &#x27;just works&#x27; locally, and it&#x27;s html under the hood.
    • nine_k1 day ago
      Try opening a PDF on a phone screen.
      • vatsachak1 day ago
        I do it all the time to read papers. It&#x27;s easy
    • recursive1 day ago
      Why would html rendering require a network connection? It doesn&#x27;t seem to on my machine.
      • vatsachak1 day ago
        Things like LaTeX equation rendering are hosted on a cdn
        • krapp1 day ago
          They can be but don&#x27;t need to be. Any javascript can be localized like HTML and CSS.
          • vatsachak1 day ago
            That&#x27;s fair, but imagine trying to get the average reader up to speed with something like npm.
            • krapp22 hours ago
              You don&#x27;t actually need npm either. You can literally just distribute everything - html, css, images and js in a zipped folder and open it locally.