Wednesday, February 06, 2013

File upload and copy pattern

I was asked a different question, but ended up with this instead. Shame to waste it...

The pattern that I've used in the past has directories like this (under some root directory like /var/lib/uploads):

/partial
/ready
/working
/success
/error

This set of directories is required for all traffic to a given recipient. Essentially, a file makes its way through the directories, from top to bottom. All directories should be on a single file system, because the pattern relies on moving files between them. File uploaders/clients should be able to write to /partial and /ready. File receivers/servers should be able to write to everything except /partial.
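Here's a minimal sketch of setting that up in Python. The function names are mine, and comparing st_dev is just one way of checking the "single file system" requirement; it doesn't deal with the permissions at all.

```python
import os
from pathlib import Path

# The five stages, in the order a file passes through them.
STAGES = ("partial", "ready", "working", "success", "error")


def create_stage_dirs(root):
    """Create the stage directories under root and return them keyed by name."""
    root = Path(root)
    dirs = {}
    for name in STAGES:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)
        dirs[name] = path
    return dirs


def on_single_file_system(dirs):
    """Return True if every stage directory reports the same device ID."""
    devices = {os.stat(path).st_dev for path in dirs.values()}
    return len(devices) == 1
```

Something like create_stage_dirs("/var/lib/uploads") would be run once per recipient, with on_single_file_system() as a sanity check before any traffic starts.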

Step 1: A file is uploaded/copied into the /partial directory, with a (globally) unique file name. This step completes when there's sufficient confidence that the file has been copied (usually that just means that the expected number of bytes has been written without an error being thrown).
Step 2: The file uploader/client moves the newly uploaded file into the /ready directory. DO NOT COPY THE FILE! Moving/renaming a file within a single file system is generally an atomic operation (POSIX rename() is); copying is not. The move signifies that the file is ready (from the perspective of the client).
Step 3: When the recipient application/process is ready to process a file in /ready, it should first move the file into its /working directory.
Step 4: When the recipient application has finished processing a file (due to completion, or error) it should move the file into the /success, or /error, directory.
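The four steps can be sketched roughly like this. It's an illustration, not the code I actually ran: upload(), process_one() and the handler callback are names made up for the example, the UUID file name is just one way to get uniqueness, and the stage directories are assumed to exist already.

```python
import os
import shutil
import uuid
from pathlib import Path


def upload(source, root):
    """Steps 1 and 2: copy into partial/, then rename into ready/."""
    root = Path(root)
    name = uuid.uuid4().hex  # assumption: a random UUID is "unique enough"
    partial = root / "partial" / name
    shutil.copyfile(source, partial)  # step 1: the slow, non-atomic part
    if partial.stat().st_size != Path(source).stat().st_size:
        raise IOError("short copy: " + name)
    ready = root / "ready" / name
    os.rename(partial, ready)  # step 2: a move, never a copy
    return ready


def process_one(root, handler):
    """Steps 3 and 4: claim one file from ready/, process it, file the result.

    Returns the final path, or None if there was nothing to claim.
    """
    root = Path(root)
    for candidate in sorted((root / "ready").iterdir()):
        working = root / "working" / candidate.name
        try:
            os.rename(candidate, working)  # step 3: claim the file
        except FileNotFoundError:
            continue  # another process got there first
        try:
            handler(working)
        except Exception:
            outcome = "error"
        else:
            outcome = "success"
        final = root / outcome / working.name
        os.rename(working, final)  # step 4: record the outcome
        return final
    return None
```

The uploader only ever touches partial/ and ready/, and the receiver never looks in partial/, which matches the permissions described above.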

Things to watch
- More than one file in the working directory is likely to indicate a failure.
- Any file in the error directory is likely to indicate a failure.
- "old" files in the partial directory indicate unsuccessful copies/uploads.
- "old" files in the ready directory indicate that processing has failed/slowed.
- "old" files in the working directory indicate a failure/ABEND.
- Make sure the file system doesn't fill.
- Archiving is not covered here; that's a different pattern.
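Most of the checks above are easy to automate. The sketch below is one way to do it; the names and the one hour threshold are arbitrary choices for the example. Note that a rename keeps the file's modification time, so "old" here means "a long time since the file was last written", not "a long time in this directory".

```python
import shutil
import time
from pathlib import Path


def stale_files(directory, max_age_seconds, now=None):
    """Return files in directory last modified more than max_age_seconds ago."""
    now = time.time() if now is None else now
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and now - p.stat().st_mtime > max_age_seconds
    )


def find_problems(root, max_age_seconds=3600):
    """Return a list of human-readable warnings for the directories under root."""
    root = Path(root)
    problems = []
    for stage in ("partial", "ready", "working"):
        for path in stale_files(root / stage, max_age_seconds):
            problems.append(f"old file in {stage}: {path.name}")
    in_working = [p for p in (root / "working").iterdir() if p.is_file()]
    if len(in_working) > 1:
        problems.append(f"{len(in_working)} files in working")
    for path in sorted((root / "error").iterdir()):
        problems.append(f"file in error: {path.name}")
    return problems


def free_fraction(root):
    """Return the fraction of the file system holding root that is still free."""
    usage = shutil.disk_usage(root)
    return usage.free / usage.total
```

Run from cron (or whatever scheduler is handy), find_problems() returning anything other than an empty list is worth an alert, as is free_fraction() dropping below whatever limit suits the volume.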

Other notes:
- If there is more than one recipient (e.g.: a multi-process server) there should be multiple working directories (working01, working02, working03, etc.), one for each process.
- This is essentially a queue implementation with single phase commit transactions.
- You can implement exactly the same pattern using file renaming, rather than separate directories. I prefer directories.
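For the multi-process case, a sketch of how each process might claim work into its own directory follows. The worker_index parameter and the function names are hypothetical; the point is only that the rename either succeeds for exactly one process or fails, so two processes never end up with the same file.

```python
import os
from pathlib import Path


def working_dir_for(root, worker_index):
    """Return (creating it if needed) this worker's directory, e.g. working01."""
    path = Path(root) / f"working{worker_index:02d}"
    path.mkdir(exist_ok=True)
    return path


def claim_next(root, worker_index):
    """Move one file from ready/ into this worker's directory.

    Returns the claimed path, or None if ready/ is empty.
    """
    working = working_dir_for(root, worker_index)
    for candidate in sorted((Path(root) / "ready").iterdir()):
        target = working / candidate.name
        try:
            os.rename(candidate, target)
        except FileNotFoundError:
            continue  # another worker claimed it between the listing and the rename
        return target
    return None
```

With one directory per process, the "more than one file in the working directory" check still applies, just per directory.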
