Skip to content

Allow using temporary staging path in Iceberg for writing sorted files #26172

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

agoyal-sfdc
Copy link
Member

@agoyal-sfdc agoyal-sfdc commented Jul 10, 2025

Description

This change adds support for using temporary staging directory during write operations involving sorted tables. Writes to sorted tables will utilize this path for staging temporary files during sorting operation. When disabled, the target storage will be used for staging while writing sorted tables which can be inefficient when writing to object stores like S3.

Adding session properties for these, as well as for the properties that control the number and target size of buffer files. This is useful because files of different sizes/number can be configured per query, so they can be tuned for the type of data in a given table.

Additional context and related issues

Fixes #24376
Similar to functionality added for Hive in #3434

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Iceberg
* Support using temporary staging path for writing sorted files. ({[issue](https://github.com/trinodb/trino/issues/24376)}`24376`)

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Adds support for a configurable temporary staging directory when writing sorted files in Iceberg, improving performance on object stores.

  • Introduces temporaryStagingDirectoryEnabled and temporaryStagingDirectoryPath in IcebergConfig and exposes them as session properties.
  • Updates IcebergSessionProperties and IcebergPageSink to select between the staging directory and default temp path per session.
  • Expands tests and documentation to cover the new staging directory behavior.

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergConfig.java Add new config properties and their setters/getters
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSessionProperties.java Register new session properties and provide accessors
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSink.java Implement logic to choose staging directory or default temp path
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSinkProvider.java Remove unused SortingFileWriterConfig injection
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/*.java Update tests to include SortingFileWriterConfig and new session props
docs/src/main/sphinx/connector/iceberg.md Document the new temporary staging directory properties
Comments suppressed due to low confidence (2)

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergConfig.java:90

  • [nitpick] The field name includes the 'is' prefix which duplicates the boolean getter style. Consider renaming it to temporaryStagingDirectoryEnabled to avoid confusion between field and getter naming.
    private boolean isTemporaryStagingDirectoryEnabled;

docs/src/main/sphinx/connector/iceberg.md:220

  • [nitpick] The list formatting uses * - which differs from surrounding items. Align these entries with the existing Sphinx list style for consistency and readability.
* - `iceberg.temporary-staging-directory-enabled`

if (location.scheme().isPresent()) {
return location;
}
return Location.of("file:///" + location.path());
Copy link
Preview

Copilot AI Jul 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Concatenating the scheme with the path may produce extra slashes (e.g., file:////tmp/...). Consider using a URI builder or checking if the path already starts with a '/' to normalize to a single 'file://' prefix.

Suggested change
return Location.of("file:///" + location.path());
String normalizedPath = location.path().startsWith("/") ? "file://" + location.path() : "file:///" + location.path();
return Location.of(normalizedPath);

Copilot uses AI. Check for mistakes.

Copy link
Member Author

@agoyal-sfdc agoyal-sfdc Jul 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is identical to the function in HiveWriterFactory:

static Location setSchemeToFileIfAbsent(Location location)
{
if (location.scheme().isPresent()) {
return location;
}
return Location.of("file:///" + location.path());
}

@agoyal-sfdc agoyal-sfdc force-pushed the iceberg_sort_temp_dir branch from 9280531 to 1c37cdf Compare July 16, 2025 23:07
@agoyal-sfdc agoyal-sfdc force-pushed the iceberg_sort_temp_dir branch from 1c37cdf to 80d1253 Compare July 18, 2025 06:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

Large insert into sorted Iceberg table fails
1 participant