Every GitHub object has two IDs

(greptile.com)

327 points by dakshgupta24 days ago

18 comments

innoying24 days ago
The newer global node IDs (which can be forced via the 'X-Github-Next-Global-ID' header [1]) have a prefix indicating the "type" of object delimited by an underscore, then a base64 encoded msgpack payload. For most objects it contains just a version (starting at 0) followed by the numeric "databaseId" field, but some are more complex.For example, my GitHub user [2] has the node ID "U_kgDOAAhEkg". Users are "U_" and then the remaining data decodes to: [0, 541842] which matches the numeric ID for my user accepted by the REST API [3].You shouldn't rely on any of this implementation of course, instead just directly query the "databaseId" field from the GraphQL API where you need interoperability. And in the other direction the REST API returns the "node_id" field for the GraphQL API.For folks who finds this interesting, you might also like [4] which details GitHub's ETag implementation for the REST API.[1] <a href="https://docs.github.com/en/graphql/guides/migrating-graphql-global-node-ids" rel="nofollow">https://docs.github.com/en/graphql/guides/migrating-graphql-...</a> [2] <a href="https://api.github.com/user/541842" rel="nofollow">https://api.github.com/user/541842</a> [3] <a href="https://gchq.github.io/CyberChef/#recipe=Find_/_Replace(%7B'option':'Regex','string':'%5E%5B%5E_%5D%2B_'%7D,'',true,false,true,false)From_Base64('A-Za-z0-9%2B/%3D',true,false)From_MessagePack()&input=VV9rZ0RPQUFoRWtn" rel="nofollow">https://gchq.github.io/CyberChef/#recipe=Find_/_Replace(%7B'...</a> [4] <a href="https://github.com/bored-engineer/github-conditional-http-transport" rel="nofollow">https://github.com/bored-engineer/github-conditional-http-tr...</a>
siralonso24 days ago
I wouldn't decode them like this, it's fragile, and global node IDs are supposed to be opaque in GraphQL.I see that GitHub exposes a `databaseId` field on many of their types (like PullRequest) - is that what you're looking for? [1]Most GraphQL APIs that serve objects that implement the Node interface just base-64-encode the type name and the database ID, but I definitely wouldn't rely on that always being the case. You can read more about global IDs in GraphQL in the spec in [2].[1] <a href="https://docs.github.com/en/graphql/reference/objects#pullrequest" rel="nofollow">https://docs.github.com/en/graphql/reference/objects#pullreq...</a> [2] <a href="https://graphql.org/learn/global-object-identification/" rel="nofollow">https://graphql.org/learn/global-object-identification/</a>
- siralonso24 days ago
 Also, as pointed out below, Github's GraphQL types also include fields like `permalink` and `url` (and interfaces like `UniformResourceLocatable`) that probably save you from needing to construct it yourself.
- catlifeonmars24 days ago
 If you want to store metadata in identifiers, an easy fix to preventing users from depending on arbitrary characteristics is to encrypt the data. You see this a lot with pagination tokens.
- tracker123 days ago
 +1, these type of behaviors are really likely to change/break in time. There's a reason the API tends to include permalink URLs... new Id's and the link pattern may very well change dramatically over time.
haileys24 days ago
> That repository ID (010:Repository2325298) had a clear structure: 010 is some type enum, followed by a colon, the word Repository, and then the database ID 2325298.It's a classic length prefix. Repository has 10 chars, Tree has 4.
- zahlman24 days ago
 Reminds me of the BitTorrent protocol.
- esafak24 days ago
 Almost a URN.
ezyang24 days ago
I just want to point out that Opus 4.5 actually knows this trick and will write the code to decode the IDs if it is working with GitHub's API lol
csomar24 days ago
This makes no sense. I am developing a product in this space (<a href="https://codeinput.com" rel="nofollow">https://codeinput.com</a>) and GitHub API and GraphQl is a badly entangled mess but you don’t trick your way around ids.There is actually a documented way to do it: <a href="https://docs.github.com/en/graphql/guides/using-global-node-ids" rel="nofollow">https://docs.github.com/en/graphql/guides/using-global-node-...</a>Same for urls, you are supposed to get them directly from GitHub not construct them yourself as format can change and then you find yourself playing a refactor cat-and-mouse game.Best you can do is an hourly/daily cache for the values.
blurayfin24 days ago
Most of what the author discovered is real and technically correct, but it is also undocumented, unsupported, and risky to rely on.GitHub has changed node ID internals before, quietly. If they add a field to the MessagePack array, switch encodings, encrypt payloads, introduce UUID-backed IDs..every system relying on this will break instantly.
phibz24 days ago
In database design typically it recommends giving out opaque natural keys, and keeping your monotonically increasing integer IDs secret and used internally.
- bastawhiz24 days ago
 That is a best practice for two real reasons:1. You don't want third parties to know how many objects you have2. You don't want folks to be able to iterate each object by incrementing the idBut if you have composite IDs like this, that doesn't matter. All objects that belong to a repository have the repository id inside them. Incrementing the id gives you more objects from the same repo. Incrementing the repo id gives you...a random object or nothing at all. And if your IDs include a little entropy or a timestamp, you've effectively kneecapped anyone who's trying to abuse this.
 - maxbond24 days ago
 > You don't want folks to be able to iterate each object by incrementing the idIf you have a lot of public or semi-public data that you don't want people to page through, then I suppose this is true. But it's important to note that separate natural and primary keys are not a replacement for authorization. Random keys may mitigate an IDOR vulnerability but authorization is the correct solution. A sufficiently long and securely generated random token can be used as both as an ID and for authorization, like sharing a Google Doc with "anyone who has a link," but those requirements are important.
 - bastawhiz23 days ago
 I don't disagree. But it's embarrassing when someone is like "your users have only used this feature 150 times?"
 - high_na_euv23 days ago
 What if you used some id that does not allow to count objs like guid?
 - bastawhiz23 days ago
 Uuid4 is extremely random, which makes it bad for most database indexes. You can use uuid7 instead.Uuid7 would not have helped GitHub, though, because it doesn't solve the sharding issue.
- taftster24 days ago
 Maybe. Until your natural key changes. Which happens. A lot.Exposing a surrogate / generated key that is effectively meaningless seems to be wise. Maybe internally Youtube has an index number for all their videos, but they expose a reasonably meaningless coded value to their consumers.
naikrovek24 days ago
Remember this article when you get upset that your own customers have come to rely on behavior that you told them explicitly not to rely on.If it is possible to figure something out, your customers will eventually figure it out and rely on it.
- gavmor24 days ago
 Once a system has a sufficient number of users, it no longer matters what you "explicitly" promised in your documentation or contract.Hyrum’s Law: all observable behaviors of your system will eventually be depended on by someone.Even if you tell users not to rely on a specific side effect, once they discover it exists and find it useful, that behavior becomes an implicit part of your system's interface. As a result, engineers often find that "every change breaks someone’s workflow," even when that change is technically a bug fix or a performance improvement.Reliance on unpromised behavior is something I was also introduced to as Kranz’s Law (or Scrappy's Law*), which asserts that things eventually get used for their inherent properties and effects, without regard for their intended purpose."I insisted SIGUSR1 and SIGUSR2 be invented for BSD. People were grabbing system signals to mean what they needed them to mean for IPC, so that (for example) some programs that segfaulted would not coredump because SIGSEGV had been hijacked. This is a general principle — people will want to hijack any tools you build, so you have to design them to either be un-hijackable or to be hijacked cleanly. Those are your only choices." —Ken Arnold in The Art Of Unix Programming
 - ninju23 days ago
 Obligatory XKCD for this<a href="https://xkcd.com/1172/" rel="nofollow">https://xkcd.com/1172/</a>
bob102924 days ago
The only GitHub identifier Ive ever bothered to store explicitly (I.e., in its own dedicated column) is an immutable URL key like issue/pr # or commit hash. I've stored comment ids but I've never thought about it. They just get sucked up with the rest of the JSON blob.Not everything has to be forced through some normalizing layer. You can maintain coarse rows at the grain of each issue/PR and keep everything else in the blob. JSON is super fast. Unless you're making crosscutting queries along comment dimensions, I don't think this would ever show up on a profiler.
- progval24 days ago
 > immutable URL key like issue/prthey are not immutable because repositories can change URLs (renamed or moved to a different org).
 - bob102923 days ago
 Issue #, commit hashes, etc. are still immutable in this scenario. When you rename or transfer a GitHub repository, all of these keys are preserved.What I do is store 2 tuples:Repository: (Id, Org, Repo)Issue/PR: (Repository.Id, #)Transferring or renaming a repository is an update to 1 row in this schema.
asmor24 days ago
I remember a time when the v3 API didn't even have IDs and whenever someone in your org changed their username or renamed a repo you'd be left guessing who it is.This is also the reason I wrote our current user onboarding / repo management code from scratch, because the terraform provider sucks and without any management you'll have a wave of "x got offboarded but they were the only admin on this repo" requests. Every repo is owned by a team. Access is only ever per-team.
- embedding-shape23 days ago
 > Every repo is owned by a team. Access is only ever per-team.This is indeed the working pattern, and applies not just to GitHub and organizing teams there, it's a useful pattern to use everywhere. Don't think "Give access to this user" but rather "Give access to this team, which this user is current a part of" and it solves a lot of bothersome issues.
 - asmor22 days ago
 The problem is that GitHub gives admin access to the person who clicked the create repository button personally and then calls it a day.And the only real way around this is to make people create repositories elsewhere, on a self-service dashboard.
inopinatus24 days ago
Github staff have racked up hundreds of contributions to Rails in recent years to extend the sharded/multiple database support, now you know why.
rtrgrd24 days ago
Hyrums law at its finest :D (or D: if you deeply care about correctness)
zzo38computer24 days ago
I had seen GitHub node IDs, although I had not used them or tried to decode them (although I could see they seem to be base64), since I only used the REST API, which reports node IDs but does not use them as input.It looks like a good explanation of the node IDs, though. However, like another comment says, you should not rely on the format of node IDs.
- catlifeonmars24 days ago
 GitHub should have encrypted their node ids. Now they risk breaking a ton of users if they decide they want to change anything about the id generation (for example to scale the db horizontally by removing sequential ids).Several companies I’ve worked for have had policies outright blocking the use of nonrandom identifiers in production code.
 - misiek0824 days ago
 Just out of curiosity, because I saw way too many i++ implementations - were things like UUIDv7 allowed or because of timestamp they are not random enough? While having such conversations I assume it’s already good enough, but maybe I’ll learn something here!
 - catlifeonmars24 days ago
 UUIDv7 wasn’t as ubiquitous at the time so it did not come up.Here’s how I would think about it: do I want users to depend on the ordering of UUIDv7? Now I’m tied to that implementation and probably have to start worrying about clock skew.If it’s not a feature you want to support, don’t expose it. Otherwise you’re on the hook for it. If you do explicitly want to provide time ordering as a feature, then UUIDv7 is a great choice and preferable to rolling your own format.
bastawhiz24 days ago
1. The list of "scopes" are the object hierarchy that owns the resource. That lets you figure out which shard a resource should be in. You want all the resources for the same repository on the same shard, otherwise if you simply hash the id, one shard going down takes down much of your service since everything is spread more or less uniformly across shards.2. The object identifier is at the end. That should be strictly increasing, so all the resources for the same scope are ordered in the DB. This is one of the benefits of uuid7.3. The first element is almost certainly a version. If you do a migration like this, you don't want to rule out doing it again. If you're packing bits, it's nearly impossible to know what's in the data without an identifier, so without the version you might not be able know whether the id is new or old.Another commenter mentioned that you should encrypt this data. Hard pass! Decrypting each id is decidedly slower than b64 decode. Moreover, if you're picking apart IDs, you're relying on an interface that was never made for you. There's nothing sensitive in there: you're just setting yourself up for a possible (probable?) world of pain in the future. GitHub doesn't have to stop you from shooting your foot off.Moreover, encrypting the contents of the ID makes them sort randomly. This is to be avoided: it means similar/related objects are not stored near each other, and you can't do simple range scans over your data.You could decrypt the ids on the way in and store both the unencrypted and encrypted versions in the DB, but why? That's a lot of complexity, effort, and resources to stop randos on the Internet from relying on an internal, non-sensitive data format.As for the old IDs that are still appearing, they are almost certainly:1. Sharded by their own id (i.e., users are sharded by user id, not repo id), so you don't need additional information. Use something like rendezvous hashing to choose the shard.2. Got sharded before the new id format was developed, and it's just not worth the trouble to change
- inopinatus24 days ago
 AES is faster than base64 on modern CPUs, especially for small messages.
 - bastawhiz23 days ago
 AES would mean the encrypted parts of the id are ~28+ bytes. That's a long minimum identifier length.What you're suggesting is perhaps true in the sense that the throughout is higher, but AES decryption carries a fairly high fixed overhead. If you're in a language like Ruby (as GitHub is) or Python/Node, you're probably calling out to openssl.I did try to do my diligence and find data to support or refute your claim, but I wasn't able to find anything that does directly. That said, I'm not able to find any sources that support the idea that AES is faster at decryption than base64 in any context (for small plaintext values or in general). With SIMD, b64 often decodes in 0.2 CPU cycles or so per byte, while AES only manages 2.5-10.7 CPU cycles per byte. The numbers for AES get better as the plaintext size grow, though.Do you happen to have data to support your claim?
 - inopinatus23 days ago
 Yes.
 - bastawhiz23 days ago
 Okay.
agwa24 days ago
> GitHub's migration guide tells developers to treat the new IDs as opaque strings and treat them as references. However it was clear that there was some underlying structure to these IDs as we just saw with the bitmaskingGreat, so now GitHub can't change the structure of their IDs without breaking this person's code. The lesson is that if you're designing an API and want an ID to be opaque you have to literally encrypt it. I find it really demoralizing as an API designer that I have to treat my API's consumers as adversaries who will knowingly and intentionally ignore guidance in the documentation like this.
- krisoft24 days ago
 > Great, so now GitHub can't change the structure of their IDs without breaking this person's code.And that is all the fault of the person who treated a documented opaque value as if it has some specific structure.> The lesson is that if you're designing an API and want an ID to be opaque you have to literally encrypt it.The lesson is that you should stop caring about breaking people’s code who go against the documentation this way. When it breaks you shrug. Their code was always buggy and it just happened to be working for them until then. You are not their dad. You are not responsible for their misfortune.> I find it really demoralizing as an API designer that I have to treat my API's consumers as adversaries who will knowingly and intentionally ignore guidance in the documentation like this.You don’t have to.
 - vlovich12324 days ago
 Sounds like you’ve maybe never actually run a service or API library at scale. There’s so many factors that go into a decision like that at a company that it’s never so simple. Is the person impacted influential? You’ve got a reputation hit if they negatively blog about how you screwed them after something was working for years. Is a customer who’s worth 10% of your annual revenue impacted? Bet your ass your management chain won’t let you do a breaking change / revert any you made by declaring an incident.Even in OSS land, you risk alienating the community you’ve built if they’re meaningfully impact. You only do this if the impact is minimal or you don’t care about alienating anyone using your software.
 - irjustin24 days ago
 > Sounds like you’ve maybe never actually run a service or API library at scale.What was the saying? When your scale is big enough, even your bugs have users.
 - raincole24 days ago
 Yeah, but when you are big enough you can afford to not care individual users.VScode once broke a very popular extension that used a private API. Microsoft (righteously) didn't bother to ask if the private API had users.
 whatevaa23 days ago
 VScode is free, so not really money on the line. Easy decision. Things get complicated when money gets involved.
 vlovich12323 days ago
 If you think that just because VSCode is free there's no money on the line, you're not thinking about things the way others do. As I said, reputation alone definitely has a cost and Microsoft has partnerships where VSCode is strategic. They probably just made a calculation that there's not enough users and/or the users using that private API were strategically misaligned with their direction.
 - halestock24 days ago
 > The lesson is that you should stop caring about breaking people’s code who go against the documentation this way. When it breaks you shrug. Their code was always buggy and it just happened to be working for them until then. You are not their dad. You are not responsible for their misfortune.Sure, but good luck running a business with that mindset.
 - Kwpolska24 days ago
 Apple is pretty successful.
- maxbond24 days ago
 You could also say, if I tell you something is an opaque identifier, and you introspect it, it's your problem if your code breaks. I told you not to do that.
 - lelandfe24 days ago
 Once "you" becomes a big enough "them" it becomes a problem again.
 - vlovich12324 days ago
 Exactly. When you owe the bank $10M it’s a you problem. When you owe the bank $100B it’s a them problem.
- bigblind24 days ago
 I think more important than worrying about people treating an opaque value as structured data, is wondering _why_ they're doing so. In the case of this blog post, all they wanted to do was construct a URL, which required the integer database ID. Just make sure you expose what people need, so they don't need to go digging.Other than that, I agree with what others are saying. If people rely on some undocumented aspect of your IDs, it's on them if that breaks.
 - plorkyeran24 days ago
 Exposing what people need doesn’t guarantee that they won’t go digging. It is surprisingly common to discover that someone has come up with a hack that depends on implementation details to do something which you exposed directly and they just didn’t know about it.
 - Macha24 days ago
 GitHub actually do have both the database ID and URL available in the API:<a href="https://docs.github.com/en/graphql/reference/objects#pullrequest" rel="nofollow">https://docs.github.com/en/graphql/reference/objects#pullreq...</a>OP’s requirements changed and they hadn’t stored them during their crawl
- kevin_thibedeau24 days ago
 > Great, so now GitHub can't change the structure of their IDs without breaking this person's codeOP can put the decoded IDs into a new column and ignore the structure in the future. The problem was presumably mass querying the Github API to get those numbers needed for functional URLs.
- vlovich12324 days ago
 Literally how I designed all the public facing R2 tokens like multipart uploads. It’s also a security barrier because forging and stealing of said tokens is harder and any vulnerability has to be done with cooperation of your servers and can be quickly shut down if needed.
- cush24 days ago
 At a big enough scale, even your bugs have users
- haileys24 days ago
 This is well understood - Hyrum's law.You don't need encryption, a global_id database column with a randomly generated ID will do.
 - maxbond24 days ago
 You could but you would lose the performance benefits you were seeking by encoding information into the ID. But you could also use a randomized, proprietary base64 alphabet rather than properly encrypting the ID.
 - pdpi24 days ago
 XOR encryption is cheap and effective. Make the key the static string "IfYouCanReadThisYourCodeWillBreak" or something akin to that. That way, the key itself will serve as a final warning when (not if) the key gets cracked.
 - Retr0id24 days ago
 Any symmetric encryption is ~free compared to the cost of a network request or db query.In this particular instance, Speck would be ideal since it supports a 96-bit block size <a href="https://en.wikipedia.org/wiki/Speck_(cipher)" rel="nofollow">https://en.wikipedia.org/wiki/Speck_(cipher)</a>
 pdpi24 days ago
 Symmetric encryption is computationally ~free, but most of them are conceptually complex. The purpose of encryption here isn't security, it's obfuscation in the service of dissuading people from depending on something they shouldn't, so using the absolutely simplest thing that could possibly work is a positive.
 Retr0id24 days ago
 XOR with fixed key is trivially figure-out-able, defeating the purpose. Speck is simple enough that a working implementation is included within the wikipedia article, and most LLMs can oneshot it.
 - maxbond24 days ago
 A cryptographer may quibble and call that an encoding but I agree.
 pdpi24 days ago
 A cryptographer would say that XOR ciphers are a fundamental cryptography primitive, and e.g. the basic building blocks for one-time pads.
 maxbond24 days ago
 Yes, XOR is a real and fundamental primitive in cryptography, but a cryptographer may view the scheme you described as violating Kerckhoffs's second principle of "secrecy in key only" (sometimes phrased, "if you don't pass in a key, it is encoding and not encryption"). You could view your obscure phrase as a key, or you could view it as a constant in a proprietary, obscure algorithm (which would make it an encoding). There's room for interpretation there.Note that this is not a one-time pad because we are using the same key material many times.But this is somewhat pedantic on my part, it's a distinction without a difference in this specific case where we don't actually need secrecy. (In most other cases there would be an important difference.)
 - haileys24 days ago
 Encoding a type name into an ID is never really something I've viewed as being about performance. Think of it more like an area code, it's an essential part of the identifier that tells you how to interpret the rest of it.
 - maxbond24 days ago
 That's fair, and you could definitely put a prefix and a UUID (or whatever), I failed to consider that.
- nwallin24 days ago
 Hyrum's law is a real sonuvabitch.
- lijok24 days ago
 Can GitHub change their API response rate? Can they increase it? If they do, they’ll break my code ‘cause it expects to receive responses at least after 1200ms. Any faster than that and I get race conditions. I selected the 1200ms number by measuring response rates.No, you would call me a moron and tell me to go pound sand.Weird systems were never supported to begin with.
- perfmode24 days ago
 The API contract doesn’t stipulate the behavior so GitHub is free to change as they please.
- whateveracct24 days ago
 Who cares if their code is broken in this case? Stupid games stupid prizes.
drob51823 days ago
Seems brittle unless GitHub is guaranteeing that they will never change their ID algorithms.
chatmasta24 days ago
> Somewhere in GitHub's codebase, there's an if-statement checking when a repository was created to decide which ID format to return.I doubt it. That's the beauty of GraphQL — each object can store its ID however it wants, and the GraphQL layer encodes it in base64. Then when someone sends a request with a base64-encoded ID, there _might_ be an if-statement (or maybe it just does a lookup on the ID). If anything, the if-statement happens _after_ decoding the ID, not before encoding it.There was never any if-statement that checked the time — before the migration, IDs were created only in the old format. After the migration, they were created in the new format.
Kwpolska24 days ago
Some commenters mentioned that the GraphQL API exposes the database IDs and a URL to the object directly in the schema. But the author did not know that and instead reverse-engineered the ID generation logic.This is one of many reasons why GraphQL sucks. Developers will do anything to avoid reading docs. In the REST API, the developerId and url fields would be easily discoverable by looking at the API response. But in GraphQL, there is no way to get all fields. You need to get a list of fields from the docs and request them explicitly.
- ldjb24 days ago
 The author may well have been aware of this. However, since the author didn't retrieve those database IDs or URLs in the first place, they would have had to make further requests to retrieve them, which they wanted to avoid doing."I was looking at either backfilling millions of records or migrating our entire database, and neither sounded fun."