Banning "length" from the codebase and splitting the concept into count vs size is one of those things that sounds pedantic until you've spent an hour debugging an off-by-one in serialization code where someone mixed up "number of elements" and "number of bytes." After that you become a true believer.<p>The big-endian naming convention (source_index, target_index instead of index_source, index_target) is also interesting. It means related variables sort together lexicographically, which helps with grep and IDE autocomplete. Small thing but it adds up when you're reading unfamiliar code.<p>One thing I'd add: this convention is especially valuable during code review. When every variable that represents a byte quantity ends in _size and every item count ends in _count, a reviewer can spot dimensional mismatches almost mechanically without having to load the full algorithm into their head.
Relatedly, a survey of array nomenclature was performed for the ISO C committee when choosing the name of the new countof operator: <a href="https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3469.htm" rel="nofollow">https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3469.htm</a><p>It was originally proposed as lengthof, but the results of the public poll and the ambiguity convinced the committee to choose countof, instead.
The reason many languages prefer `length` to `count`, I think, is that the former is clearly a noun and the latter could be a verb. `length` feels like a simple property of a container whereas `count` could be an algorithm.<p>`countof` removes the verb possibility - but that means that a preference for `countof` over `lengthof` isn't necessarily a preference for `count` over `length`.
As @SkiFire correctly observes[^1], off-by-1 problems are more fundamental than 0-based or 1-based indices, but the latter still vary enough that some kind of discrimination is needed.<p>For many years (decades?) now, I've been using "index" for 0-based and "number" for 1-based, as in "column index" for a C/Python style [ix] vs. "column number" for a shell/awk/etc. style $1 $2. Not sure this is the best terminology, but it <i>is</i> nice to have something consistent. E.g., "offset" for 0-based indices suggests being "off" from a base, and even the letter "o" in some cases becomes "the zero of some range". So, "offset" might be better than "index" for 0-based.<p>[^1]: <a href="https://news.ycombinator.com/item?id=47100056">https://news.ycombinator.com/item?id=47100056</a>
Using the same length for related variable names is definitely a good thing.<p>Just lining things up neatly helps spot bugs.<p>It’s the one thing I don’t like about strict formatters: I can no longer use spaces to line things up.
I've never yet seen a linter option for assignment alignment, but would definitely use it if it were available
I know prettier can isolate a code section from changes by adding comments. And I think others can too.
Is there any reason to not just switch to 1-based indexing if we could? Seems like 0-based indexing really exacerbates off-by-one errors without much benefit
I'm not sure what that has to do with the article, but anyway: <a href="https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD831.html" rel="nofollow">https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD831...</a><p>That said, I'm not sure how 1-based indexing will solve off-by-1 errors. They naturally come from the fencepost problem, i.e. the fact that sometimes we use indexes to indicate elements and sometimes to indicate boundaries between them. Mixing between them in our reasoning ultimately results in off-by-1 issues.
Fundamentally, CPUs use 0-based addresses. That's unavoidable.<p>We can't choose to switch to 1-based indexing - either we use 0-based everywhere, or a mixture of 0-based and 1-based. Given the prevalence of off-by-one errors, I think the most important thing is to be consistent.
This is a matter of opinion.<p>My opinion is that 1-based indexing really exacerbates off-by-one errors, besides requiring a more complex and bug-prone implementation in compilers. With 1-based indexing, the compiler must create and use, transparently to the programmer, pointers that do not point at the intended object but at an invalid location just before it, which must never be accessed through the pointer. This is why 1-based indexing was easier in languages without pointers, like the original FORTRAN, but would have been harder in languages that allow pointers, like C, the difficulty being in avoiding exposing the internal representation of pointers to the programmer.<p>Off-by-one errors are caused by mixing conventions for expressing indices and ranges.<p>If you always use a consistent convention, e.g. 0-based indexing together with half-open intervals, where the count of elements equals the difference between the interval bounds, there is no chance of ever making an off-by-one error.
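A small sketch of that consistent convention, 0-based indices with half-open intervals [begin, end), where the element count is simply end - begin (the data and range here are just illustrative):

```rust
fn main() {
    let data = [10, 20, 30, 40, 50];
    let (begin, end) = (1, 4); // half-open [1, 4): elements at 1, 2, 3

    // count == end - begin, with no +1/-1 correction anywhere.
    let count = end - begin;
    assert_eq!(count, 3);
    assert_eq!(data[begin..end], [20, 30, 40]);

    // Splitting at any mid point loses nothing and duplicates nothing:
    // adjacent half-open ranges share a boundary without overlap or gap.
    let mid = 2;
    assert_eq!(data[begin..mid].len() + data[mid..end].len(), count);
}
```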
I would bet that in the opposite circumstance you'd say the same thing:<p>"Is there any reason to not just switch to 0-based indexing if we could? Seems like 1-based indexing really exacerbates off-by-one errors without much benefit"<p>The problem is that humans make off-by-one errors and not that we're using the wrong indexing system.
No indexing system is perfect, but one can be better than another. Being able to do array[array.length()] to get the last item is more concise and less error prone than having to add -1 every time.<p>Programming languages are filled with tiny design choices that don’t completely prevent mistakes (that would be impossible) but do make them less likely.
Having to use something like array[length] to get the last element demonstrates a defect of that programming language.<p>There are better programming languages, where you do not need to do what you say.<p>Some languages, like Ada, have special array attributes for accessing the first and the last elements.<p>Other languages, like Icon, allow the use of both non-negative indices and of negative indices, where non-negative indices access the array from its first element towards its last element, while negative indices access the array from its last element towards its first element.<p>I consider that your solution, i.e. using array[length] instead of array[length-1], is much worse. While it scores a point for simplifying this particular expression, it loses points by making other expressions more complex.<p>There are a lot of better programming languages than the few that due to historical accidents happen to be popular today.<p>It is sad that the designers of most of the languages that attempt today to replace C and C++ have not done due diligence by studying the history of programming languages before designing a new programming language. Had they done that, they could have avoided repeating the same mistakes of the languages with which they want to compete.
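For illustration, Rust's slice API shows both styles side by side: explicit index arithmetic versus dedicated first/last accessors in the spirit of Ada's 'First and 'Last attributes (a small example, not an endorsement of either indexing base):

```rust
fn main() {
    let arr = [1, 2, 3];

    // Explicit index arithmetic, where an off-by-one can hide:
    assert_eq!(arr[arr.len() - 1], 3);

    // Dedicated accessors, no arithmetic at all:
    assert_eq!(arr.first(), Some(&1));
    assert_eq!(arr.last(), Some(&3));
}
```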
You say "seems like", can you argue/show/prove this?
I hoped to learn some more excel lookup tactics, alas
With modern IDEs and AI there is no need to save letters in identifiers (unless they get too long). It should be "sizeInBytes" instead of "size". It should be "byteOffset" or "elementOffset" instead of "offset".
When correctness is important I much prefer having strong types for most primitives, such that the name is focused on describing semantics of the use, and the type on how it is represented:<p><pre><code> struct FileNode {
parent: NodeIndex<FileNode>,
content_header_offset: ByteOffset,
file_size: ByteCount,
}
</code></pre>
Where `parent` can then only be used to index a container of `FileNode` values via the `std::ops::Index` trait.<p>Strong typing of primitives also help prevent bugs like mixing up parameter ordering etc.
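A minimal, self-contained sketch of that pattern (the `Arena` container and the `PhantomData`-tagged `NodeIndex` here are one possible way to wire it up, not the only one):

```rust
use std::marker::PhantomData;
use std::ops::Index;

// An index that is only valid for containers of T.
struct NodeIndex<T>(usize, PhantomData<T>);

struct FileNode { name: &'static str }

// An arena of T values that can only be indexed by NodeIndex<T>.
struct Arena<T>(Vec<T>);

impl<T> Arena<T> {
    fn push(&mut self, value: T) -> NodeIndex<T> {
        self.0.push(value);
        NodeIndex(self.0.len() - 1, PhantomData)
    }
}

impl<T> Index<NodeIndex<T>> for Arena<T> {
    type Output = T;
    fn index(&self, i: NodeIndex<T>) -> &T { &self.0[i.0] }
}

fn main() {
    let mut arena = Arena(Vec::new());
    let root = arena.push(FileNode { name: "/" });
    assert_eq!(arena[root].name, "/");
    // arena[3usize] would not compile: a bare usize is not a NodeIndex<FileNode>.
}
```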
Long names become burdensome to read when they are used frequently in the same context
When the same name is used a thousand times in a codebase, shorter names start to make sense. See aviation manuals or business documentation, how abbreviation-dense they are.
Isn't that more tokens though?
Only until they develop some kind of pre-AI minifier and sourcemap tool.
Sure, you get an extra word or two worth of tokens, but you save a lot more compute and time figuring out what exactly this offset is.
Not significantly, it's one word.
The 'same length for complementary names' thing is great.
I can't read the starts of any lines, the entire page is offset about 100 pixels to the left. :) Best viewed in Lynx?
Is there any other example of "length" meaning "byte length", or is it just Rust being confusing? I've never seen this elsewhere.<p>Offset is ordinarily just a difference of two indices. In a <i>container</i> I don't recall seeing it implicitly refer to byte offset.
In general in Rust, “length” refers to “count”. If you view strings as being sequences of Unicode scalar values, then it might seem odd that `str::len` counts bytes, but if you view strings as being a subset of byte slices it makes perfect sense that it gives the number of UTF-8 code units (and it is analogous to, say, how Javascript uses `.length` to return the number of UTF-16 code units). So I think it depends on perspective.
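Concretely (a small illustration using the standard `str` API):

```rust
fn main() {
    let s = "héllo"; // 'é' is two bytes in UTF-8

    assert_eq!(s.len(), 6);           // UTF-8 code units (bytes)
    assert_eq!(s.chars().count(), 5); // Unicode scalar values
}
```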
It's the usual convention for systems programming languages and has been for decades, e.g. strlen() and std::string.length(). Byte length is also just more useful in many cases.
A length could refer to lots of different units - elements, pages, sectors, blocks, N-aligned bytes, kbytes, characters, etc.<p>Always good to qualify your identifiers with units IMO (or types that reflect units).
The invariant of index < count, of course, only works when using Dijkstra's half-open indexing convention, which seems to have a few very vocal detractors.
See <a href="https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD831.html" rel="nofollow">https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD831...</a> for Dijkstra's thoughts on indexing.
Fortunately only a few. Dijkstra's is obviously the most reasonable system.
Or learn an array language and never worry about indexing or naming ;-)<p>Everything else looks disgustingly verbose once you get used to them.