
S3 Output does not clean up "empty" gzip fragments during restore #56

@yaauie

Description

  • When the S3 output plugin is started with restore => true and encoding => gzip after a crash,
  • AND the previous run had left behind GZIP files in its temporary_directory whose uncompressed representation is an empty string
  • The restore process leaves behind temporary zero-length files and fails to clean up the segment.
  • Subsequent starts of the plugin observe the previously-left temporary uncompressed representations and log noisily:
    [2025-05-21T14:27:20,179][WARN ][logstash.outputs.s3      ][main] The ${TEMPORARY_DIRECTORY}/${FILE_ID}/${NESTED_FILENAME}-recovered.txt file either under recover process or failed to recover before.
    

    WHERE:

    • TEMPORARY_DIRECTORY is the plugin's temporary_directory
    • FILE_ID is a UUID-based directory only containing components of a single object destined to s3
    • NESTED_FILENAME is the object-id in a nested directory structure

The continued presence of these *-recovered.txt files prevents further attempts to "restore" the gzip file in question (which is empty, and therefore does NOT need to be restored), and results in log noise each time the plugin is started.

Workaround

The orphaned zero-byte *-recovered.txt files and their associated 20-byte *.txt.gz files within a given temporary_directory can be safely deleted when no active pipelines are processing with that temporary_directory.

  1. Ensure no pipelines are actively processing from the given temporary_directory
  2. cd into the temporary_directory
  3. List the empty recovered files and their associated 20-byte, effectively-empty gzips (note: stat -f %z is BSD/macOS syntax; with GNU coreutils, use stat -c %s instead):
    for recovered_tmp in $(find . -name '*-recovered.txt' -empty -print); do
      (
        echo "$recovered_tmp"
        find "$(sed 's/-recovered.txt$/.txt.gz/' <<<"${recovered_tmp}")" \
          -exec bash -c '(($(stat -f %z {}) <= 20))' ';' -print
      ) | xargs ls -la
    done
    Example output:
    -rw-r--r--@ 1 logstash  logstash   0 Jun  6 18:51 ./583d9853-d56e-45ec-bd75-c61d4e3dc57e/path/to/object-recovered.txt
    -rw-r--r--@ 1 logstash  logstash  20 Jun  6 18:48 ./583d9853-d56e-45ec-bd75-c61d4e3dc57e/path/to/object.txt.gz
    -rw-r--r--@ 1 logstash  logstash   0 Jun  6 18:51 ./76f825d8-f05e-4dcd-9e5c-23b84a906052/path/to/object-recovered.txt
    -rw-r--r--@ 1 logstash  logstash  20 Jun  6 18:48 ./76f825d8-f05e-4dcd-9e5c-23b84a906052/path/to/object.txt.gz
    
  4. Delete the offending files.
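Steps 3 and 4 above can be rolled into one small script. The following is a best-effort sketch, not part of the plugin: the fixture directory, file_id, and object path are invented for demonstration, and wc -c is used in place of stat for portability across GNU and BSD. Only run the cleanup loop against a real temporary_directory when no pipeline is using it.

```shell
#!/usr/bin/env sh
set -eu
# Demo fixture: a fake temporary_directory holding one orphaned segment.
# The file_id and object path below are invented for illustration.
tmpdir="$(mktemp -d)"
seg="$tmpdir/583d9853-0000-0000-0000-000000000000/path/to"
mkdir -p "$seg"
: > "$seg/object-recovered.txt"     # zero-byte recovered file
: | gzip > "$seg/object.txt.gz"     # 20-byte "empty" gzip

# Cleanup: remove each empty *-recovered.txt along with its <=20-byte
# *.txt.gz counterpart.
find "$tmpdir" -name '*-recovered.txt' -empty | while read -r recovered; do
  gz="${recovered%-recovered.txt}.txt.gz"
  if [ -f "$gz" ] && [ "$(wc -c < "$gz")" -le 20 ]; then
    rm -f "$recovered" "$gz"
  fi
done
```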

Analysis of the bug:

Each target file gets a distinct directory (which I'll call file_id) inside the plugin's temporary_directory, in which its nested-by-path-name file can live (whether that is a *.txt or a *.txt.gz). This holder directory is deleted once the file has been uploaded, either in the normal course of things or after a successful "recovery" upload.
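As an illustration, that layout can be mimicked with a throwaway directory (the file_id and object path below are invented, not real plugin output):

```shell
#!/usr/bin/env sh
set -eu
# Sketch of the per-object layout; file_id and paths are hypothetical.
temporary_directory="$(mktemp -d)"
file_id="4a1f0d3c-0000-0000-0000-000000000000"   # stand-in for the UUID-based dir
mkdir -p "$temporary_directory/$file_id/path/to"
: | gzip > "$temporary_directory/$file_id/path/to/object.txt.gz"
find "$temporary_directory" -mindepth 1          # show the nested structure
# After a successful upload (or recovery upload), the holder directory goes away:
rm -rf "$temporary_directory/$file_id"
```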

During recovery, we:

  • observe file ${temporary_directory}/${file_id}/${nested_filename}.txt.gz (file is 20 bytes, and only has the GZIP header 1f8b 0800 0000 0000 00ff 0300 0000 0000 0000 0000)
  • decompress to ${temporary_directory}/${file_id}/${nested_filename}-recovered.txt, which has zero bytes
  • skip recompressing to ${temporary_directory}/${file_id}/${nested_filename}-recovered.txt.gz because ${temporary_directory}/${file_id}/${nested_filename}-recovered.txt was empty
  • observe that ${temporary_directory}/${file_id}/${nested_filename}-recovered.txt.gz doesn't exist on disk, which makes it "unrecoverable" so we do not clean up ${temporary_directory}/${file_id}

The next time the plugin is started, it observes the -recovered.txt and logs about it being in a mid- or failed-recovery state; the associated *.txt.gz file is recovered again, but doing so has the same effect.
