One of our goals at GEO Analytics Canada is to use Kubernetes to process and store large amounts of data in S3-like buckets in a scalable manner. We are interested in using Filesystem in Userspace (FUSE) modules to mount S3 buckets as folders on the compute file system. This greatly simplifies scaling up our current workflows, because applications designed to work on a local machine can interact with cloud storage as if it were just another folder in the file system. Our candidates for an open-source FUSE implementation were GCSFuse and Goofys. Here are the results of comparing their performance and usability.
As a test case for our performance evaluations, we decided to use Sen2Cor to process Earth observation data residing on cloud storage. Out of the box, Sen2Cor processes one granule at a time and uses a fair amount of resources on our local machines (roughly 1-2 cores and 4-6 GB of RAM, taking 25-45 minutes per granule depending on the Sen2Cor version). Sen2Cor seemed like a good candidate for the cloud because we could scale the number of compute nodes to match the number of instances we wanted to run, letting us process many granules in parallel. The input data for Sen2Cor was available in a public Google Cloud Storage bucket.
GCSFuse with direct file access
We had estimated that running Sen2Cor in parallel using FUSE-mounted buckets would take a few extra minutes per granule due to reading/writing directly into a bucket rather than local storage. Our first choice of FUSE implementation was GCSFuse, because it is released by the Google Cloud Platform team and we were using Google Cloud Storage (GCS) buckets. We copied some input data from the public GCS bucket into our own private GCS bucket, then mounted that with GCSFuse. A single Sen2Cor (v2.5.5) instance reading and writing directly to the mounted bucket took 3 hours and 18 minutes. Running three Sen2Cor instances in parallel on the same set of inputs took 3 hours and 32 minutes, which was quite unexpected: we had expected parallel execution to be much faster than serial. The reason parallel processing wasn't faster is that Sen2Cor does a lot of random reads/writes, which GCSFuse handles poorly: because GCS objects are immutable, each modified file has to be staged locally and the entire object rewritten back to the bucket.
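For reference, mounting a bucket with GCSFuse is a one-line command. Below is a minimal sketch of how it can be invoked from Python; the bucket and mount point names are placeholders, and `--implicit-dirs` is optional but useful when objects have no explicit directory placeholders:

```python
import os
import subprocess

def mount_gcsfuse(bucket: str, mount_point: str) -> None:
    """Mount a GCS bucket at mount_point using the gcsfuse CLI."""
    os.makedirs(mount_point, exist_ok=True)
    # --implicit-dirs exposes objects like "a/b/file" under directory "a/b"
    # even when no explicit directory placeholder object exists.
    subprocess.run(["gcsfuse", "--implicit-dirs", bucket, mount_point],
                   check=True)

mount_gcsfuse("our-private-bucket", "/mnt/gcs")  # hypothetical names
```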
GCSFuse with file copy then process
Instead of reading/writing directly into the bucket, our next approach was to copy the same test data from the public bucket into local storage, launch an instance of Sen2Cor on the local data, then transfer the processed results back into a bucket. Although this approach has more steps, eliminating random reads/writes to the mounted bucket significantly reduced processing time (a sketch of the copy-process-copy loop follows the timings below). Here are the average times it took to transfer data with GCSFuse:
- 18s to copy 1 L1C input folder (a 779 MB SAFE file) from the mounted bucket to the home directory
- 27s to copy 1 L1C input folder (a 779 MB SAFE file) from the home directory to the mounted bucket
- 29s to copy 2 L1C input folders (SAFE files) simultaneously to the mounted bucket in the same container
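A minimal sketch of that copy-process-copy loop, assuming the bucket is already mounted at `/mnt/gcs` and Sen2Cor's `L2A_Process` command is on the PATH (the paths and bucket layout are placeholders):

```python
import shutil
import subprocess
from pathlib import Path

BUCKET = Path("/mnt/gcs")    # hypothetical FUSE mount point
SCRATCH = Path("/scratch")   # hypothetical local scratch directory

def process_granule(l1c_name: str) -> None:
    """Stage an L1C granule locally, run Sen2Cor, copy the L2A result back."""
    local_input = SCRATCH / l1c_name
    # 1. Copy the input SAFE folder from the mounted bucket to local disk.
    shutil.copytree(BUCKET / "L1C" / l1c_name, local_input)
    # 2. Run Sen2Cor on the local copy; it writes the L2A product
    #    alongside the input.
    subprocess.run(["L2A_Process", str(local_input)], check=True)
    # 3. Copy the processed product back into the bucket (the output name
    #    follows Sen2Cor's L1C -> L2A naming convention; approximated here).
    l2a_name = l1c_name.replace("MSIL1C", "MSIL2A")
    shutil.copytree(SCRATCH / l2a_name, BUCKET / "L2A" / l2a_name)
```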
Goofys with file copy then process
We also decided to try another FUSE solution to compare its performance with GCSFuse. We came across Goofys, which can treat GCS as an S3-like provider through GCS's S3-interoperability API. We ran the same tests as before, but mounted our buckets with Goofys (a sketch of the mount follows the timings below). Here are the average times it took to transfer data with Goofys:
- 10s to copy 1 L1C input folder (a 779 MB SAFE file) from the mounted bucket to the home directory
- 22s to copy 1 L1C input folder (a 779 MB SAFE file) from the home directory to the mounted bucket
- 23s to copy 2 L1C input folders (SAFE files) simultaneously to the mounted bucket in the same container
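Because Goofys speaks the S3 protocol, pointing it at GCS means supplying GCS's interoperability endpoint plus an HMAC key pair in the usual AWS credential variables. A minimal sketch, with placeholder bucket name, mount point, and credentials:

```python
import os
import subprocess

def mount_goofys(bucket: str, mount_point: str) -> None:
    """Mount a GCS bucket via Goofys using GCS's S3-compatible endpoint."""
    os.makedirs(mount_point, exist_ok=True)
    env = os.environ.copy()
    # Goofys reads S3-style credentials; GCS issues HMAC keys for its
    # interoperability API (the values below are placeholders).
    env["AWS_ACCESS_KEY_ID"] = "GOOG1E..."
    env["AWS_SECRET_ACCESS_KEY"] = "..."
    subprocess.run(
        ["goofys", "--endpoint", "https://storage.googleapis.com",
         bucket, mount_point],
        check=True, env=env)

mount_goofys("our-private-bucket", "/mnt/goofys")  # hypothetical names
```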
GCSFuse vs. Goofys
A key difference between the two FUSE implementations is that GCSFuse supports both sequential and random writes but not atomic renaming, while Goofys supports atomic renaming but not random writes. These limitations were important for us to know because Sen2Cor v2.5.5 performs random writes, and Sen2Cor v2.8.0 performs random writes and also renames files and folders. This restricted our ability to have Sen2Cor interact with the mounted buckets directly.
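To make the distinction concrete: a random write modifies a file in place at an arbitrary offset, rather than streaming it sequentially. A small sketch of both patterns against a hypothetical mount point (on a Goofys mount, the in-place write typically fails with an I/O error):

```python
# Sequential write: stream the file from start to finish.
# Both GCSFuse and Goofys support this pattern.
with open("/mnt/bucket/output.dat", "wb") as f:
    f.write(b"header" + b"\x00" * 1024)

# Random write: seek back into the file and overwrite bytes in place.
# Sen2Cor does this; GCSFuse supports it (slowly, by rewriting the whole
# object on flush), while Goofys rejects it.
with open("/mnt/bucket/output.dat", "r+b") as f:
    f.seek(128)        # jump to an arbitrary offset
    f.write(b"patch")  # in-place modification
```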
When using GCSFuse, transfers sometimes stalled: copying a folder containing 800 MB of files could repeatedly take up to 12 minutes, which seemed to happen when writing files to the bucket and then performing repeated copy operations. With Goofys, we never saw a delay of more than 10-15 seconds beyond the baseline times shown above. Another benefit of Goofys is that it supports many S3 providers, which gives us flexibility in choosing our storage provider. For these reasons, we decided to move forward with Goofys as our FUSE implementation.
| | GCSFuse | Goofys |
| --- | --- | --- |
| Average transfer time, bucket to local filesystem (779 MB) | 18 seconds (highly variable) | 10 seconds (very regular) |
| Average transfer time, local filesystem to bucket (779 MB) | 27 seconds | 22 seconds |
| Consistent data transfer times | No | Yes |
| Supports random writes | Yes, but very poorly | No |
| Supports multiple S3 providers | No, only Google Cloud Storage | Yes |
Final Implementation
After upgrading our version of Sen2Cor to 2.8.0 and switching our FUSE implementation to Goofys, we were able to run 10 instances of Sen2Cor simultaneously and finish processing all 10 input granules in 27 minutes and 58 seconds, roughly the time it normally takes to process a single granule locally. This total includes provisioning the infrastructure to handle the resources requested by the containers, start-up time (mounting the buckets and transferring the data from the public bucket into local storage), and clean-up time (transferring the processed data into our private bucket and unmounting the buckets). We believe this approach is highly scalable, because each instance of Sen2Cor runs in a separate container in its own pod and doesn't affect the runtime of any other pod.
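Putting the pieces together, each pod's entrypoint boils down to the mount, stage-in, process, stage-out, unmount sequence described above. A minimal sketch, where the bucket names and paths are placeholders and the L1C-to-L2A rename is approximate:

```python
import shutil
import subprocess

def pod_entrypoint(tile_path: str, l1c_name: str) -> None:
    """One pod's lifecycle: mount, stage in, process, stage out, unmount."""
    # Start-up: mount our private results bucket with Goofys.
    subprocess.run(
        ["goofys", "--endpoint", "https://storage.googleapis.com",
         "results-bucket", "/mnt/results"], check=True)
    # Stage in: copy the L1C granule from the public bucket to local disk.
    subprocess.run(
        ["gsutil", "-m", "cp", "-r",
         f"gs://gcp-public-data-sentinel-2/tiles/{tile_path}/{l1c_name}",
         "/scratch/"], check=True)
    # Process locally with Sen2Cor 2.8.0.
    subprocess.run(["L2A_Process", f"/scratch/{l1c_name}"], check=True)
    # Clean up: copy the L2A product into the bucket, then unmount.
    l2a_name = l1c_name.replace("MSIL1C", "MSIL2A")  # approximate rename
    shutil.copytree(f"/scratch/{l2a_name}", f"/mnt/results/{l2a_name}")
    subprocess.run(["fusermount", "-u", "/mnt/results"], check=True)
```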