Skip to content

Add cut_on_char and rcut_on_char to String/Bytes/StringLabels #14068

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: trunk
Choose a base branch
from

Conversation

kit-ty-kate
Copy link
Member

String.cut is a function i recently had to reimplement in every single one of my personal [1] and work projects.
I feel like this is especially useful when for example parsing Unix.environment and similar things, where String.split_on_char is inappropriate at worst and inefficient at best, given that the right side of an environment binding can very well contain multiple = characters.

In terms of naming, looking at sherlocode.com quickly:

[1]: https://codeberg.org/kit-ty-kate/ocaml-posixutils/src/commit/49ba0e1b3c6ccd0421047bccaf96a6e92031f647/src/env.ml#L1 and 2 other unpublished projects.

@dbuenzli
Copy link
Contributor

dbuenzli commented Jun 3, 2025

String separators please. API, Data.

@kit-ty-kate
Copy link
Member Author

String separators please. API, Data.

Why not both? A function working with string separators requires more effort to implement efficiently and in a modular manner. I think it's better done later. I'm not sure i would have the time or interest at the moment to implement that anyway.

@dbuenzli
Copy link
Contributor

dbuenzli commented Jun 3, 2025

Why not both?

API bloat.

I think it's better done later. I'm not sure i would have the time or interest at the moment to implement that anyway.

I disagree, it's not because you don't have time that we should add bloat to the stdlib.

In fact I have a TODO to upstream the various extremely useful1 substring functions I have in my extended String module so that it can disappear in most of my programs.

I didn't yet because upstream would likely find naive search distasteful and I have been too lazy to implement something better (that was my candidate) and for my light usages naive string search is entirely "good enough".

But perhaps something I'd be willing to procrastinate on for the next release if I feel some good code vibes from upstream.

Footnotes

  1. Well that one was proposed and rejected, which is extremely sad as it drastically cut down my index out of bounds errors. API usability is difficult to convey in writing.

@dra27
Copy link
Member

dra27 commented Jun 5, 2025

Given (facetiously), that " " has a much more bloated representation than ' ', I'm not sure (non-facetiously) why having a function to cut on a character separator and (perhaps in the future) another function to cut on a string separator constitutes bloat - they are different types?

Especially as at the moment, if we were to add just a function to cut at a string separator we'd have something of an inconsistency between String.split_on_char and String.cut/String.rcut, right?

(incidentally, it does suggest that perhaps the functions should be cut_on_char / rcut_on_char?)

@@ -207,6 +207,18 @@ val split_on_char : sep:char -> string -> string list

@since 4.04 (4.05 in StringLabels) *)

val cut : sep:char -> string -> (string * string) option
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this would be a good first function to use a labelled tuple in the result 🫣 (@ccasin, @goldfirere, @OlivierNicole - any JS examples, even potentially against doing this?) e.g. key/value, first/second, prefix/suffix, left/right?

If we were to go with a labelled tuple, I commit to sorting out #11792 (i.e. making syncing the docs work for this...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me this it's reasonably obvious which element is what: the first element is the first part of the string, the second is the second part. I would maybe reserved high-tech tuple labels for cases where it's easier to be confused about the meaning of each part (in particular bool values must be labelled or named somehow). But I would be tempted to suggest making them records in that case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh my, new hammers. Please don't.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think labeled tuples are a fine idea to consider whenever you have a tuple with multiple components of the same type. (Or not the same type, but less so.) That said, given that there is a pretty clear ordering involved in this particular case, I don't personally think labels are called for here.

@dbuenzli
Copy link
Contributor

dbuenzli commented Jun 5, 2025

Given (facetiously), that " " has a much more bloated representation than ' ', I'm not sure (non-facetiously) why having a function to cut on a character separator and (perhaps in the future) another function to cut on a string separator constitutes bloat - they are different types?

I'm talking about API and conceptual bloat. Adding more than one function that does essentially the same thing is bloat.

Do we want to have to define two naming schemes for each type of separator ? Do we want users to have to choose between two functions that do the same thing ? Do we want users to have to think about which kind of separator they are going to abstract over (and break all their users when they will realize they abstracted over the wrong one) ?

Especially as at the moment, if we were to add just a function to cut at a string separator we'd have something of an inconsistency between String.split_on_char and String.cut/String.rcut, right?

As usual, being consistent with poor previous design choices should not be a priority.

In fact I fought hard for split_on_char to trim the _on_char part because you know, Windows text files, CSV files, UTF-8 encoded characters etc. But I lost. Mind you it was already an epic battle to convince upstream to actually add a string split function…

@gasche
Copy link
Member

gasche commented Jun 5, 2025

Maybe there is a reasonable string-search implementation in OCaml somewhere out there (does not need to be the most elaborate algorithm ever), that could be used to propose a string version right now? I think the discussion would be easier if it was between two existing versions, rather that "accepting what has been proposed today vs. asking for unclear amount of work to be done in the future".

@dbuenzli
Copy link
Contributor

dbuenzli commented Jun 5, 2025

"accepting what has been proposed today vs. asking for unclear amount of work to be done in the future".

I don't understand why this considered to be the choice here. Stdlib users have worked for the past 29 years without these two functions. Is there is an urge to add these limited interfaces versus doing the right thing? It seems to me that the stdlib is being developped on the ground of doing the right thing, not urgency.

And yes I have been annoyed dozens of times that String.split_on_char only works on char values, it's not the right thing.

(does not need to be the most elaborate algorithm ever)

Naive string search works perfectly well if your needle is short in fact that's what musl's strstr does if it's less than four bytes (with the trick of working with larger words though). So I can easily give you a more generally useful function than the one being proposed here with the caveat that it might be slow if the needle is > 4. At least I'll get to cut these CSV files and HTTP headers into lines.

@kit-ty-kate kit-ty-kate changed the title Add cut and rcut to String/Bytes/StringLabels Add cut_on_char and rcut_on_char to String/Bytes/StringLabels Jun 5, 2025
@kit-ty-kate
Copy link
Member Author

kit-ty-kate commented Jun 5, 2025

(incidentally, it does suggest that perhaps the functions should be cut_on_char / rcut_on_char?)

I think that's fair. I've renamed the functions to that.

As an addition to the list in the PR description, i had forgotten to check the compiler itself:

@gasche gasche self-assigned this Jun 10, 2025
@gasche
Copy link
Member

gasche commented Jun 10, 2025

My current thinking:

  • I agree with @dbuenzli that the string-needle implementation is more general and convenient, and that we should try to offer this.
  • If there was a proposal going in this direction, I wouldn't see much interest in supporting a char version.
  • But I'm not shocked either by the idea of a final state where we offer both versions (cut and cut_on_char), so I think that merging the char version only if it is the only one on which we all agree would be a reasonable move.

It is tempting to look online for a string-search implementation in OCaml that is reasonably simple while reasonably fast at the same time, benchmark it against the single-char implementation to make sure we are not imposing an unreasonable cost for this common corner case, and submit it as an alternative to the current proposal. But I haven't had the time to consider doing this recently, and apparently I'm not alone in this predicament.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants