libutil: Use Boost.URL for URI parsing #13445

xokdvium · 2025-07-10T16:39:04Z

Motivation

Boost.URL is a significantly more RFC-compliant parser
than what libutil currently has, which is a bundle of
incomprehensible regexes.

One aspect of this change is that RFC4007 ZoneId IPv6 literals
are now represented in URIs according to RFC6874 (https://datatracker.ietf.org/doc/html/rfc6874).

Previously they were represented naively like so: [fe80::818c:da4d:8975:415c%enp0s25].
This is not entirely correct, because the percent sign itself has to be pct-encoded:

"%" is always treated as
an escape character in a URI, so, according to the established URI
syntax [RFC3986] any occurrences of literal "%" symbols in a URI MUST
be percent-encoded and represented in the form "%25". Thus, the
scoped address fe80::a%en1 would appear in a URI as
http://[fe80::a%25en1].
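
The encoding rule quoted above can be illustrated with a small standalone sketch. The helper name `ipv6ToUriHost` is hypothetical: it is not part of libutil or Boost.URL, and it only handles the `%` zone delimiter. It does not validate the address or encode other characters inside the zone identifier.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper for illustration only (not libutil or Boost.URL API).
// Wraps an IPv6 address in brackets for use as a URI host and encodes the
// zone delimiter "%" as "%25", as RFC 6874 requires.
std::string ipv6ToUriHost(std::string_view address)
{
    std::string host = "[";
    auto zonePos = address.find('%');
    if (zonePos == std::string_view::npos) {
        host += address; // no zone identifier, nothing to encode
    } else {
        host += address.substr(0, zonePos);  // the address part
        host += "%25";                       // pct-encoded "%"
        host += address.substr(zonePos + 1); // the zone identifier
    }
    host += "]";
    return host;
}
```

Under these assumptions, `ipv6ToUriHost("fe80::a%en1")` yields `[fe80::a%25en1]`, matching the example in the RFC text.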

Context

Starting to pay off the tech debt #9603.
Fixes #10898.


tests/functional/build-remote-with-mounted-ssh-ng.sh

Ericson2314

I like this a lot!

But I'm going to leave it for others to review a bit, just in case there are some issues I didn't think of, e.g. with how much Boost we pull in or with compat.

src/libutil/url.cc

Mic92

Fixed some issue in to_string(), let me know what you think.


The myriad of hand-rolled URL parsing and validation code is a constant source of problems. Regexes are not a great way of writing parsers and there's a history of getting them wrong. Boost.URL is a good library we can outsource most of the heavy lifting to.

Boost.URL is a significantly more RFC-compliant parser than what libutil currently has, which is a bundle of incomprehensible regexes.

One aspect of this change is that RFC4007 ZoneId IPv6 literals are represented in URIs according to RFC6874 [1].

Previously they were represented naively like so: [fe80::818c:da4d:8975:415c%enp0s25]. This is not entirely correct, because the percent itself has to be pct-encoded:

> "%" is always treated as an escape character in a URI, so, according to the established URI syntax [RFC3986] any occurrences of literal "%" symbols in a URI MUST be percent-encoded and represented in the form "%25". Thus, the scoped address fe80::a%en1 would appear in a URI as http://[fe80::a%25en1].

[1]: https://datatracker.ietf.org/doc/html/rfc6874

Co-authored-by: Jörg Thalheim <joerg@thalheim.io>

xokdvium · 2025-07-19T20:52:02Z

Ran the whole flake-regressions suite and hydraJobs.tests. Best I can tell this doesn't regress anything. What's great is that this also fixed a long-standing issue #10898.

grahamc · 2025-07-19T20:53:52Z

One concern I have is the flake regression suite only tests public flakes, and I suspect some flakes out there will no longer work correctly with new parsing semantics.


xokdvium · 2025-07-19T20:59:19Z

> I suspect some flakes out there will no longer work correctly with new parsing semantics.

I did see that Determinate runs GHA CI with a private flake-regressions-data repository.

> work correctly with new parsing semantics.

Any particular issues you can think of? Did some of the prior non-compliance get ossified into lock files?
The two major deviations are 1) spaces that are not pct-encoded (which are handled explicitly in this patchset) and 2) nested store URI queries, which also seem to be handled and tested.

xokdvium requested a review from edolstra as a code owner July 10, 2025 16:39

github-actions bot added the with-tests label (Issues related to testing. PRs with tests have some priority) Jul 10, 2025

Ericson2314 reviewed Jul 10, 2025

tests/functional/build-remote-with-mounted-ssh-ng.sh (outdated, resolved)

Ericson2314 approved these changes Jul 10, 2025

xokdvium force-pushed the simplify-util-url branch 2 times, most recently from ab42ce4 to 77750fd Compare July 11, 2025 22:57

github-actions bot added the documentation label Jul 11, 2025

xokdvium commented Jul 11, 2025

src/libutil/url.cc (outdated, resolved)

xokdvium commented Jul 11, 2025

src/libutil/url.cc (resolved)

xokdvium force-pushed the simplify-util-url branch from 7140696 to e76b2d6 Compare July 12, 2025 08:36

xokdvium requested review from Ericson2314 and roberth July 12, 2025 10:20

Mic92 force-pushed the simplify-util-url branch from b404e17 to 7d12190 Compare July 17, 2025 14:22

Mic92 approved these changes Jul 17, 2025

xokdvium commented Jul 17, 2025

src/libutil/url.cc (outdated, resolved)

Mic92 force-pushed the simplify-util-url branch from ec04015 to 7d12190 Compare July 17, 2025 17:32

xokdvium and others added 6 commits July 18, 2025 21:23

lib{store,flake}-tests: Add test for spaces in URIs

ffc9bfb

These cases do not seem to be covered by the test suite at all.

libutil-test-support: Add HasSubstrIgnoreANSIMatcher

d905339

This matcher is useful for checking error messages, which always contain ANSI escapes.
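
To illustrate the idea behind such a matcher, here is a minimal standalone sketch that strips ANSI CSI escape sequences before doing a substring check. The function names `stripAnsiEscapes` and `hasSubstrIgnoringAnsi` are hypothetical and this is not the matcher added in the commit; a real GoogleTest matcher would wrap similar logic in the matcher interface.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: removes ANSI CSI sequences of the form
// ESC '[' <parameter bytes> <final byte>, e.g. "\x1b[31;1m".
std::string stripAnsiEscapes(std::string_view text)
{
    std::string out;
    for (std::size_t i = 0; i < text.size();) {
        if (text[i] == '\x1b' && i + 1 < text.size() && text[i + 1] == '[') {
            i += 2;
            // Skip parameter and intermediate bytes up to the final byte (0x40-0x7E).
            while (i < text.size() && !(text[i] >= 0x40 && text[i] <= 0x7e))
                ++i;
            if (i < text.size())
                ++i; // consume the final byte, e.g. 'm'
        } else {
            out += text[i++];
        }
    }
    return out;
}

// Hypothetical helper: substring check that ignores ANSI escapes in the haystack.
bool hasSubstrIgnoringAnsi(std::string_view haystack, std::string_view needle)
{
    return stripAnsiEscapes(haystack).find(needle) != std::string::npos;
}
```

With this sketch, a colored message such as `"\x1b[31;1merror:\x1b[0m bad URL"` matches the plain needle `"error: bad"`, which is the kind of check that is awkward to write against the raw string.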

libutil: Use default operator== for ParsedURL

d020f21

The default comparison operator can be generated by the compiler since C++20.

rl-next: Add release note about IPv6 Scoped Addresses in URIs

a54284c

xokdvium force-pushed the simplify-util-url branch from 7d12190 to a54284c Compare July 18, 2025 18:25
