Skip to content

libutil: Use Boost.URL for URI parsing #13445

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

xokdvium
Copy link
Contributor

@xokdvium xokdvium commented Jul 10, 2025

Motivation

Boost.URL is a significantly more RFC-compliant parser
than what libutil currently has a bundle of incomprehensible
regexes.

One aspect of this change is that RFC4007 ZoneId IPv6 literals
are represented in URIs according to RFC6874 1.

Previously they were represented naively like so: [fe80::818c:da4d:8975:415c\%enp0s25].
This is not entirely correct, because the percent itself has to be pct-encoded:

"%" is always treated as
an escape character in a URI, so, according to the established URI
syntax [RFC3986] any occurrences of literal "%" symbols in a URI MUST
be percent-encoded and represented in the form "%25". Thus, the
scoped address fe80::a%en1 would appear in a URI as
http://[fe80::a%25en1].

Context

Starting to pay off the tech debt #9603.
Fixes #10898.


Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

@xokdvium xokdvium requested a review from edolstra as a code owner July 10, 2025 16:39
@github-actions github-actions bot added the with-tests Issues related to testing. PRs with tests have some priority label Jul 10, 2025
Copy link
Member

@Ericson2314 Ericson2314 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this a lot!

But going to leave for others to review a bit just case there is some issues e.g. with how much Boost or compat I didn't think of

@xokdvium xokdvium force-pushed the simplify-util-url branch 2 times, most recently from ab42ce4 to 77750fd Compare July 11, 2025 22:57
@xokdvium xokdvium force-pushed the simplify-util-url branch from 7140696 to e76b2d6 Compare July 12, 2025 08:36
@xokdvium xokdvium requested review from Ericson2314 and roberth July 12, 2025 10:20
@Mic92 Mic92 force-pushed the simplify-util-url branch from b404e17 to 7d12190 Compare July 17, 2025 14:22
Copy link
Member

@Mic92 Mic92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed some issue in to_string(), let me know what you think.

@Mic92 Mic92 force-pushed the simplify-util-url branch from ec04015 to 7d12190 Compare July 17, 2025 17:32
xokdvium and others added 6 commits July 18, 2025 21:23
These cases do not seem to be covered by the test suite at all.
This matcher is useful for checking error messages, which
always contain ANSI escapes.
The myriad of hand-rolled URL parsing and validation code
is a constant source of problems. Regexes are not a great way
of writing parsers and there's a history of getting them wrong.
Boost.URL is a good library we can outsource most of the heavy
lifting to.
The default comparison operator can be generated
by the compiler since C++20.
Boost.URL is a significantly more RFC-compliant parser
than what libutil currently has a bundle of incomprehensible
regexes.

One aspect of this change is that RFC4007 ZoneId IPv6 literals
are represented in URIs according to RFC6874 [1].

Previously they were represented naively like so: [fe80::818c:da4d:8975:415c\%enp0s25].
This is not entirely correct, because the percent itself has to be pct-encoded:

> "%" is always treated as
   an escape character in a URI, so, according to the established URI
   syntax [RFC3986] any occurrences of literal "%" symbols in a URI MUST
   be percent-encoded and represented in the form "%25".  Thus, the
   scoped address fe80::a%en1 would appear in a URI as
   http://[fe80::a%25en1].

[1]: https://datatracker.ietf.org/doc/html/rfc6874

Co-authored-by: Jörg Thalheim <joerg@thalheim.io>
@xokdvium xokdvium force-pushed the simplify-util-url branch from 7d12190 to a54284c Compare July 18, 2025 18:25
@xokdvium
Copy link
Contributor Author

xokdvium commented Jul 19, 2025

Ran the whole flake-regressions suite and hydraJobs.tests. Best I can tell this doesn't regress anything. What's great is that this also fixed a long-standing issue #10898.

@grahamc
Copy link
Member

grahamc commented Jul 19, 2025 via email

@xokdvium
Copy link
Contributor Author

I suspect some flakes out there will no longer work correctly with new parsing semantics.

I did see that Determinate runs GHA CI with a private flake-regressions-data repository.

work correctly with new parsing semantics.

Any particular issues you can think of? Did some of the prior non-compliance get ossified into lock files?
The several major deviations are 1) non pct-encoded spaces (which are handled explicitly in this patchset) and 2) the nested store URI queries, but this is also seemingly handled and tested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation with-tests Issues related to testing. PRs with tests have some priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nix-copy-closure to ssh:// store tries to log in with empty SSH username
5 participants