
S3 Output does not clean up "empty" gzip fragments during restore #56

@yaauie

Description

  • When the S3 output plugin is started with restore => true and encoding => gzip after a crash,
  • AND the previous run had left behind GZIP files in its temporary_directory whose uncompressed representation is an empty string
  • The restore process leaves behind temporary zero-length files and fails to clean up the segment.
  • Subsequent starts of the plugin observe the previously-left temporary uncompressed representations and log noisily:
    [2025-05-21T14:27:20,179][WARN ][logstash.outputs.s3      ][main] The ${TEMPORARY_DIRECTORY}/${FILE_ID}/${NESTED_FILENAME}-recovered.txt file either under recover process or failed to recover before.
    

    WHERE:

    • TEMPORARY_DIRECTORY is the plugin's temporary_directory
    • FILE_ID is a UUID-based directory only containing components of a single object destined to s3
    • NESTED_FILENAME is the object-id in a nested directory structure

The continued presence of these *-recovered.txt files prevents further attempts to "restore" the gzip file in question (which is empty, and therefore does NOT need to be restored), and results in log noise each time the plugin is started.

Workaround

The orphaned zero-byte *-recovered.txt files and their associated 20-byte *.txt.gz files within a given temporary_directory can be safely deleted when no active pipelines are processing with that temporary_directory.

  1. Ensure no pipelines are actively processing from the given temporary_directory
  2. cd into the temporary_directory
  3. List the empty recovered files and their associated 20-byte, effectively-empty gzips (note: stat -f %z is BSD/macOS syntax; with GNU coreutils, use stat -c %s instead):
    for recovered_tmp in $(find . -name '*-recovered.txt' -empty -print); do
      (
        echo "$recovered_tmp"
        find "$(sed 's/-recovered.txt$/.txt.gz/' <<<"${recovered_tmp}")" \
          -exec bash -c '(($(stat -f %z {}) <= 20))' ';' -print
      ) | xargs ls -la
    done
    Example output:
    -rw-r--r--@ 1 logstash  logstash   0 Jun  6 18:51 ./583d9853-d56e-45ec-bd75-c61d4e3dc57e/path/to/object-recovered.txt
    -rw-r--r--@ 1 logstash  logstash  20 Jun  6 18:48 ./583d9853-d56e-45ec-bd75-c61d4e3dc57e/path/to/object.txt.gz
    -rw-r--r--@ 1 logstash  logstash   0 Jun  6 18:51 ./76f825d8-f05e-4dcd-9e5c-23b84a906052/path/to/object-recovered.txt
    -rw-r--r--@ 1 logstash  logstash  20 Jun  6 18:48 ./76f825d8-f05e-4dcd-9e5c-23b84a906052/path/to/object.txt.gz
    
  4. Delete the offending files.
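Steps 3 and 4 above can be rolled into one small script. The following is a best-effort sketch, not part of the plugin: the fixture directory, file_id, and object path are invented for demonstration, and wc -c is used in place of stat for portability across GNU and BSD. Only run the cleanup loop against a real temporary_directory when no pipeline is using it.

```shell
#!/usr/bin/env sh
set -eu
# Demo fixture: a fake temporary_directory holding one orphaned segment.
# The file_id and object path below are invented for illustration.
tmpdir="$(mktemp -d)"
seg="$tmpdir/583d9853-0000-0000-0000-000000000000/path/to"
mkdir -p "$seg"
: > "$seg/object-recovered.txt"     # zero-byte recovered file
: | gzip > "$seg/object.txt.gz"     # 20-byte "empty" gzip

# Cleanup: remove each empty *-recovered.txt along with its <=20-byte
# *.txt.gz counterpart.
find "$tmpdir" -name '*-recovered.txt' -empty | while read -r recovered; do
  gz="${recovered%-recovered.txt}.txt.gz"
  if [ -f "$gz" ] && [ "$(wc -c < "$gz")" -le 20 ]; then
    rm -f "$recovered" "$gz"
  fi
done
```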

Analysis of the bug:

Each target file gets a distinct directory (which I'll call file_id) inside the plugin's temporary_directory, in which its nested-by-path-name file can live (whether that is a *.txt or a *.txt.gz). This holder directory is deleted once the file has been uploaded, either in the normal course of things or after a successful "recovery" upload.
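As an illustration, that layout can be mimicked with a throwaway directory (the file_id and object path below are invented, not real plugin output):

```shell
#!/usr/bin/env sh
set -eu
# Sketch of the per-object layout; file_id and paths are hypothetical.
temporary_directory="$(mktemp -d)"
file_id="4a1f0d3c-0000-0000-0000-000000000000"   # stand-in for the UUID-based dir
mkdir -p "$temporary_directory/$file_id/path/to"
: | gzip > "$temporary_directory/$file_id/path/to/object.txt.gz"
find "$temporary_directory" -mindepth 1          # show the nested structure
# After a successful upload (or recovery upload), the holder directory goes away:
rm -rf "$temporary_directory/$file_id"
```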

During recovery, we:

  • observe file ${temporary_directory}/${file_id}/${nested_filename}.txt.gz (file is 20 bytes, and only has the GZIP header 1f8b 0800 0000 0000 00ff 0300 0000 0000 0000 0000)
  • decompress to ${temporary_directory}/${file_id}/${nested_filename}-recovered.txt, which has zero bytes
  • skip recompressing to ${temporary_directory}/${file_id}/${nested_filename}-recovered.txt.gz because ${temporary_directory}/${file_id}/${nested_filename}-recovered.txt was empty
  • observe that ${temporary_directory}/${file_id}/${nested_filename}-recovered.txt.gz doesn't exist on disk, which makes it "unrecoverable" so we do not clean up ${temporary_directory}/${file_id}

The next time the plugin is started, it observes the -recovered.txt and logs about it being in a mid- or failed-recovery state; the associated *.txt.gz file is recovered again, but doing so has the same effect.
