Add cut_on_char and rcut_on_char to String/Bytes/StringLabels #14068

kit-ty-kate · 2025-06-03T14:24:51Z

String.cut is a function I recently had to reimplement in every single one of my personal [1] and work projects.
I feel like this is especially useful when, for example, parsing Unix.environment and similar things, where String.split_on_char is inappropriate at worst and inefficient at best, given that the right side of an environment binding can very well contain multiple = characters.
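
To make the environment-binding case concrete, here is a minimal sketch of a char-separator cut. The name cut_on_char follows the PR title, but the body below is an assumption written for illustration, not the code from the PR.

```ocaml
(* Illustrative sketch only, not the implementation proposed in the PR.
   Cuts [s] at the first occurrence of [sep]; the separator is dropped. *)
let cut_on_char sep s =
  match String.index_opt s sep with
  | None -> None
  | Some i ->
    Some (String.sub s 0 i, String.sub s (i + 1) (String.length s - i - 1))

let () =
  (* The value keeps its own '=' characters intact. *)
  assert (cut_on_char '=' "OPTS=a=b" = Some ("OPTS", "a=b"));
  (* split_on_char breaks the value apart and allocates a list. *)
  assert (String.split_on_char '=' "OPTS=a=b" = ["OPTS"; "a"; "b"]);
  (* No separator at all: nothing to cut. *)
  assert (cut_on_char '=' "NOEQ" = None)
```

With split_on_char the caller would have to re-join the tail with String.concat "=" to recover the value, which is the extra work the cut avoids.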

In terms of naming, looking at sherlocode.com quickly:

opam has OpamStd.String.cut and rcut: https://github.com/ocaml/opam/blob/8c0e68e7eb527223353cca1926b433c664f729c4/src/core/opamStd.ml#L694
patch has Lib.cut: https://github.com/hannesm/patch/blob/fe7077c7e5e55721e77e7dbc2af2c044851cef20/src/lib.mli#L4
base has Base.String.lsplit2: https://github.com/janestreet/base/blob/01857ea5364018edb77460517872164f501d1091/src/string_intf.ml#L334
omod has String.cut and rev_cut: https://github.com/dbuenzli/omod/blob/51430c07bfe46cd46edb785715c8f49e06dec546/src/omod.ml#L14
topkg has Topkg_string.cut ?rev: https://github.com/dbuenzli/topkg/blob/5118b194ad08df97298b206648dd2f521b65d9e3/src/topkg_string.mli#L24
flow used to have String_utils.split2: https://github.com/facebook/flow/blob/cdc8cc8dbe3f3813908d95d9ce02476f87669632/src/hack_forked/utils/string/string_utils.mli#L72
For the sake of completeness: Containers and Batteries only have the version of these functions taking a string as separator instead of a character and Extlib has no such function.

[1]: https://codeberg.org/kit-ty-kate/ocaml-posixutils/src/commit/49ba0e1b3c6ccd0421047bccaf96a6e92031f647/src/env.ml#L1 and 2 other unpublished projects.

dbuenzli · 2025-06-03T19:51:51Z

String separators please. API, Data.

kit-ty-kate · 2025-06-03T21:13:05Z

> String separators please. API, Data.

Why not both? A function working with string separators requires more effort to implement efficiently and in a modular manner. I think it's better done later. I'm not sure I would have the time or interest at the moment to implement that anyway.

dbuenzli · 2025-06-03T21:43:36Z

> Why not both?

API bloat.

> I think it's better done later. I'm not sure i would have the time or interest at the moment to implement that anyway.

I disagree; the fact that you don't have time is not a reason to add bloat to the stdlib.

In fact I have a TODO to upstream the various extremely useful¹ substring functions I have in my extended String module so that it can disappear in most of my programs.

I didn't yet because upstream would likely find naive search distasteful and I have been too lazy to implement something better (that was my candidate) and for my light usages naive string search is entirely "good enough".

But perhaps something I'd be willing to procrastinate on for the next release if I feel some good code vibes from upstream.

¹ Well that one was proposed and rejected, which is extremely sad as it drastically cut down my index out of bounds errors. API usability is difficult to convey in writing.

dra27 · 2025-06-05T05:58:16Z

Given (facetiously), that " " has a much more bloated representation than ' ', I'm not sure (non-facetiously) why having a function to cut on a character separator and (perhaps in the future) another function to cut on a string separator constitutes bloat - they are different types?

Especially as at the moment, if we were to add just a function to cut at a string separator we'd have something of an inconsistency between String.split_on_char and String.cut/String.rcut, right?

(incidentally, it does suggest that perhaps the functions should be cut_on_char / rcut_on_char?)

dra27 · 2025-06-05T06:24:32Z

stdlib/stringLabels.mli

@@ -207,6 +207,18 @@ val split_on_char : sep:char -> string -> string list

    @since 4.04 (4.05 in StringLabels) *)

+val cut : sep:char -> string -> (string * string) option


I'm wondering if this would be a good first function to use a labelled tuple in the result 🫣 (@ccasin, @goldfirere, @OlivierNicole - any JS examples, even potentially against doing this?) e.g. key/value, first/second, prefix/suffix, left/right?

If we were to go with a labelled tuple, I commit to sorting out #11792 (i.e. making syncing the docs work for this...)

gasche · 2025-06-05

For me it's reasonably obvious which element is what: the first element is the first part of the string, the second is the second part. I would maybe reserve high-tech tuple labels for cases where it's easier to be confused about the meaning of each part (in particular bool values must be labelled or named somehow). But I would be tempted to suggest making them records in that case.

dbuenzli · 2025-06-05

Oh my, new hammers. Please don't.

goldfirere · 2025-06-05

I think labeled tuples are a fine idea to consider whenever you have a tuple with multiple components of the same type. (Or not the same type, but less so.) That said, given that there is a pretty clear ordering involved in this particular case, I don't personally think labels are called for here.

dbuenzli · 2025-06-05T08:09:40Z

> Given (facetiously), that " " has a much more bloated representation than ' ', I'm not sure (non-facetiously) why having a function to cut on a character separator and (perhaps in the future) another function to cut on a string separator constitutes bloat - they are different types?

I'm talking about API and conceptual bloat. Adding more than one function that does essentially the same thing is bloat.

Do we want to have to define two naming schemes for each type of separator? Do we want users to have to choose between two functions that do the same thing? Do we want users to have to think about which kind of separator they are going to abstract over (and break all their users when they realize they abstracted over the wrong one)?

> Especially as at the moment, if we were to add just a function to cut at a string separator we'd have something of an inconsistency between String.split_on_char and String.cut/String.rcut, right?

As usual, being consistent with poor previous design choices should not be a priority.

In fact I fought hard for split_on_char to trim the _on_char part because you know, Windows text files, CSV files, UTF-8 encoded characters etc. But I lost. Mind you it was already an epic battle to convince upstream to actually add a string split function…
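
The limitation being described can be seen with the existing String.split_on_char. The inputs below are made up for illustration; only the stdlib functions are real.

```ocaml
(* Illustrative inputs only. A char separator is a single byte, so it
   cannot express a two-byte line ending or a multi-byte UTF-8 character. *)
let () =
  (* Windows (CRLF) line endings: splitting on '\n' leaves a stray '\r'
     at the end of each line, plus a final empty string. *)
  assert (String.split_on_char '\n' "a,b\r\nc,d\r\n" = ["a,b\r"; "c,d\r"; ""]);
  (* U+2192 (rightwards arrow) takes three bytes in UTF-8, so it can
     only be used as a separator if separators are strings. *)
  assert (String.length "\xE2\x86\x92" = 3)
```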

gasche · 2025-06-05T08:14:56Z

Maybe there is a reasonable string-search implementation in OCaml somewhere out there (does not need to be the most elaborate algorithm ever), that could be used to propose a string version right now? I think the discussion would be easier if it was between two existing versions, rather than "accepting what has been proposed today vs. asking for unclear amount of work to be done in the future".

dbuenzli · 2025-06-05T09:22:58Z

> "accepting what has been proposed today vs. asking for unclear amount of work to be done in the future".

I don't understand why this is considered to be the choice here. Stdlib users have worked for the past 29 years without these two functions. Is there an urge to add these limited interfaces versus doing the right thing? It seems to me that the stdlib is being developed on the grounds of doing the right thing, not urgency.

And yes I have been annoyed dozens of times that String.split_on_char only works on char values, it's not the right thing.

> (does not need to be the most elaborate algorithm ever)

Naive string search works perfectly well if your needle is short; in fact, that's what musl's strstr does if it's less than four bytes (with the trick of working with larger words, though). So I can easily give you a more generally useful function than the one being proposed here, with the caveat that it might be slow if the needle is longer than four bytes. At least I'll get to cut these CSV files and HTTP headers into lines.
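
For reference, a string-separator cut built on naive search could look roughly like the sketch below. The name cut, the ~sep label and the behaviour on an empty separator are assumptions made for this example; it is not code from the thread and it does not use the word-sized comparison trick mentioned for musl.

```ocaml
(* Hypothetical sketch: cut [s] at the first occurrence of the string
   [sep], using naive search (worst case O(length s * length sep)). *)
let cut ~sep s =
  let sep_len = String.length sep and s_len = String.length s in
  (* Assumption: an empty separator is treated as a programming error. *)
  if sep_len = 0 then invalid_arg "cut: empty separator";
  (* Does [sep] occur in [s] starting at index [i]? Compare from byte [k]. *)
  let rec matches_at i k =
    k = sep_len || (s.[i + k] = sep.[k] && matches_at i (k + 1))
  in
  (* Try every start position that leaves room for the whole separator. *)
  let rec find i =
    if i > s_len - sep_len then None
    else if matches_at i 0 then Some i
    else find (i + 1)
  in
  match find 0 with
  | None -> None
  | Some i ->
    let rest = i + sep_len in
    Some (String.sub s 0 i, String.sub s rest (s_len - rest))

let () =
  assert (cut ~sep:"\r\n" "Host: x\r\nAccept: y\r\n"
          = Some ("Host: x", "Accept: y\r\n"));
  assert (cut ~sep:": " "Host: example.org" = Some ("Host", "example.org"));
  assert (cut ~sep:"\r\n" "no line ending" = None)
```

A one-byte separator goes through the same code path here, which is the case a benchmark against a dedicated char version would need to look at.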

kit-ty-kate · 2025-06-05T12:29:29Z

> (incidentally, it does suggest that perhaps the functions should be cut_on_char / rcut_on_char?)

I think that's fair. I've renamed the functions to that.
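
As a rough picture of how the renamed pair would differ, here is a sketch that assumes rcut_on_char cuts at the last occurrence of the separator (as the "r" prefix suggests) while cut_on_char cuts at the first. The bodies and the helper cut_at_index are hypothetical and are not taken from the PR.

```ocaml
(* Illustrative only: assumes the "r" variant cuts at the LAST occurrence. *)

(* Hypothetical helper: split [s] around index [i], dropping byte [i]. *)
let cut_at_index s i =
  (String.sub s 0 i, String.sub s (i + 1) (String.length s - i - 1))

let cut_on_char sep s = Option.map (cut_at_index s) (String.index_opt s sep)
let rcut_on_char sep s = Option.map (cut_at_index s) (String.rindex_opt s sep)

let () =
  assert (cut_on_char '.' "archive.tar.gz" = Some ("archive", "tar.gz"));
  assert (rcut_on_char '.' "archive.tar.gz" = Some ("archive.tar", "gz"));
  assert (rcut_on_char '.' "README" = None)
```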

As an addition to the list in the PR description, I had forgotten to check the compiler itself:

Misc.Stdlib.cut_at:

ocaml/utils/misc.mli

Line 376 in f7cf03a

val cut_at : string -> char -> string * string

gasche · 2025-06-10T16:39:51Z

My current thinking:

I agree with @dbuenzli that the string-needle implementation is more general and convenient, and that we should try to offer this.
If there was a proposal going in this direction, I wouldn't see much interest in supporting a char version.
But I'm not shocked either by the idea of a final state where we offer both versions (cut and cut_on_char), so I think that merging the char version only if it is the only one on which we all agree would be a reasonable move.

It is tempting to look online for a string-search implementation in OCaml that is reasonably simple while reasonably fast at the same time, benchmark it against the single-char implementation to make sure we are not imposing an unreasonable cost for this common corner case, and submit it as an alternative to the current proposal. But I haven't had the time to consider doing this recently, and apparently I'm not alone in this predicament.

dra27 reviewed Jun 5, 2025

Add cut_on_char and rcut_on_char to String/Bytes/StringLabels

a5cf183

kit-ty-kate changed the title ~~Add cut and rcut to String/Bytes/StringLabels~~ Add cut_on_char and rcut_on_char to String/Bytes/StringLabels Jun 5, 2025

kit-ty-kate force-pushed the string-cut branch from dfaadbd to a5cf183 Compare June 5, 2025 12:29

gasche self-assigned this Jun 10, 2025
