Rclone - rsync for cloud storage
Rclone can copy files between a user's Microsoft OneDrive or Google Drive cloud storage and HPCC disk space. It can also mount a user's cloud storage on HPCC so that the cloud storage can be used as extended disk space.
Rclone is installed system-wide on HPCC. To use it, users should first load the software module into their environment with the command:
module load rclone
For more details on using Rclone, users can visit the Rclone website at https://rclone.org/.
To start using Rclone, users first need to run the following command to configure it:
rclone config
The instructions for this command can be found at https://rclone.org/commands/rclone_config/.
To configure Google Drive, see https://rclone.org/drive/; to configure Microsoft OneDrive, see https://rclone.org/onedrive/. Details specific to getting started with this software on HPCC can be found in the document Rclone.pdf.
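Once "rclone config" has finished, a quick non-interactive check can confirm that the remote was saved. The following is a minimal sketch: has_remote is a hypothetical helper (not part of rclone), and "MyOneDrive" is only a placeholder for whatever name was chosen during configuration.

```shell
# has_remote NAME: succeed if NAME appears in the rclone configuration.
# "rclone listremotes" prints one remote per line with a trailing colon,
# for example "MyOneDrive:", so we match the whole line exactly.
has_remote() {
    rclone listremotes | grep -Fxq "$1:"
}

# Usage (the remote name is a placeholder):
#   module load rclone
#   has_remote MyOneDrive && echo "MyOneDrive is configured"
```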
After successfully configuring the software, users can use the "rclone" command to copy files to and from their cloud storage or to mount it on HPCC. Rclone provides many sub-commands for transferring and managing files on HPCC and in cloud storage. To get help, use "rclone --help" as shown below:
[hpc@dev-intel16-k80 ~]$ module load rclone
[hpc@dev-intel16-k80 ~]$ rclone --help
Rclone syncs files to and from cloud storage providers as well as
mounting them, listing them in lots of different ways.
See the home page (https://rclone.org/) for installation, usage,
documentation, changelog and configuration walkthroughs.
Usage:
rclone [flags]
rclone [command]
Available Commands:
about Get quota information from the remote.
authorize Remote authorization.
cachestats Print cache stats for a remote
cat Concatenates any files and sends them to stdout.
check Checks the files in the source and destination match.
cleanup Clean up the remote if possible
config Enter an interactive configuration session.
copy Copy files from source to dest, skipping already copied
copyto Copy files from source to dest, skipping already copied
copyurl Copy url content to dest.
cryptcheck Cryptcheck checks the integrity of a crypted remote.
cryptdecode Cryptdecode returns unencrypted file names.
dbhashsum Produces a Dropbox hash file for all the objects in the path.
dedupe Interactively find duplicate files and delete/rename them.
delete Remove the contents of path.
deletefile Remove a single file from remote.
genautocomplete Output completion script for a given shell.
gendocs Output markdown docs for rclone to the directory supplied.
hashsum Produces an hashsum file for all the objects in the path.
help Show help for rclone commands, flags and backends.
link Generate public link to file/folder.
listremotes List all the remotes in the config file.
ls List the objects in the path with size and path.
lsd List all directories/containers/buckets in the path.
lsf List directories and objects in remote:path formatted for parsing
lsjson List directories and objects in the path in JSON format.
lsl List the objects in path with modification time, size and path.
md5sum Produces an md5sum file for all the objects in the path.
mkdir Make the path if it does not already exist.
mount Mount the remote as file system on a mountpoint.
move Move files from source to dest.
moveto Move file or directory from source to dest.
ncdu Explore a remote with a text based user interface.
obscure Obscure password for use in the rclone.conf
purge Remove the path and all of its contents.
rc Run a command against a running rclone.
rcat Copies standard input to file on remote.
rcd Run rclone listening to remote control commands only.
rmdir Remove the path if empty.
rmdirs Remove empty directories under the path.
serve Serve a remote over a protocol.
settier Changes storage class/tier of objects in remote.
sha1sum Produces an sha1sum file for all the objects in the path.
size Prints the total size and number of objects in remote:path.
sync Make source and dest identical, modifying destination only.
touch Create new file or change file modification time.
tree List the contents of the remote in a tree like fashion.
version Show the version number.
Use "rclone [command] --help" for more information about a command.
Use "rclone help flags" for to see the global flags.
Use "rclone help backends" for a list of supported services.
[hpc@dev-intel16-k80 ~]$
The tool "cloudSync" was developed to help users synchronize files between their cloud storage services. It is accessible through "powertools", which should be loaded automatically upon logging into HPCC but can be loaded manually with 'ml load powertools' if need be. Users are welcome to try it and report any problems to us via the contact form.
The following are a few examples of running rclone commands after the cloud storage has been configured. Assume that the cloud storage is configured under the name "MyOneDrive".
(1) See current remote storage
We can check the current rclone configuration using 'rclone config'. As shown below, two remote cloud storage locations are currently configured: "MyOneDrive" and "googledoc".
[user@dev-intel18 ~]$ rclone config
Current remotes:
Name Type
==== ====
MyOneDrive onedrive
googledoc drive
e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q
[user@dev-intel18 ~]$
(2) Check the remote storage information
We can see the remote storage usage and quota using the "rclone about" command.
[user@dev-intel16-k80 ~]$ rclone about MyOneDrive:
Total: 5T
Used: 450.999M
Free: 4.998T
Trashed: 404.576k
(3) List the contents of the cloud storage
[user@dev-intel18 ~]$ rclone lsd MyOneDrive:
-1 2018-02-02 08:57:54 0 Attachments
-1 2019-08-27 15:43:33 1 IMPACT
-1 2019-08-22 15:50:10 42 Matlab
-1 2019-02-26 17:12:01 16 Microsoft Teams Chat Files
-1 2018-08-24 08:56:32 1 Notebooks
(4) Copy files on HPCC to remote cloud:
[user@dev-intel18 ~]$ rclone copy Project MyOneDrive:Project # copy the content of directory "Project" to remote cloud storage
[user@dev-intel18 ~]$ rclone lsd MyOneDrive: # view the contents of cloud storage to confirm the copy
-1 2018-02-02 08:57:54 0 Attachments
-1 2019-08-27 15:43:33 1 IMPACT
-1 2019-08-22 15:50:10 42 Matlab
-1 2019-02-26 17:12:01 16 Microsoft Teams Chat Files
-1 2018-08-24 08:56:32 1 Notebooks
-1 2020-04-27 15:43:25 2 Project
[user@dev-intel18 ~]$ rclone lsd MyOneDrive:Project
-1 2020-04-27 15:44:39 1 GPAW
-1 2020-04-27 15:43:26 3 MATLAB
(5) Copy files on cloud storage to HPCC:
[user@dev-intel18 Project]$ ls # current content of Project directory before copy
GPAW MATLAB
[user@dev-intel18 Project]$ rclone copy MyOneDrive:IMPACT ./ # copy the content of IMPACT in cloud to current directory
[user@dev-intel18 Project]$ ls # confirm that the copy is done
GPAW impact_run MATLAB
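Before running a large transfer, it can help to preview it. The sketch below uses rclone's global --dry-run flag so that rclone only reports what it would transfer; it also shows "rclone copyto", which can upload a single file under a different name. All paths and file names are hypothetical, and copy_examples is just a wrapper so that nothing runs until it is called.

```shell
# All paths are hypothetical; "MyOneDrive" is the remote name used on this page.
copy_examples() {
    # Directory source: the CONTENTS of Project/ go into MyOneDrive:Backup.
    # Backup is created if needed; no extra Backup/Project level is added.
    rclone copy --dry-run Project MyOneDrive:Backup

    # Single-file source with copyto: upload under a different file name.
    rclone copyto --dry-run results.txt MyOneDrive:Backup/results-2020-04-27.txt
}

# Usage:
#   copy_examples     # remove --dry-run above to perform the transfers
```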
Note
Although "rclone copy" is similar to the Unix commands rsync and cp, users should be aware of the differences and understand the details of its behavior.
(1) "rclone copy" does not transfer unchanged files; it tests by size and modification time or by MD5SUM. In this sense, it is similar to the Linux command rsync.
(2) When running "rclone copy source:sourcepath dest:destpath", if source:sourcepath is a directory, dest:destpath should also be a directory. It does not copy the directory source:sourcepath; instead, it will copy the content of the directory source:sourcepath to the destination dest:destpath. If dest:destpath does not exist, it will be created and the content of source:sourcepath will be stored in it.
(3) "rclone copyto" is very similar to "rclone copy". The only difference is that it can upload a single file under a name different from its current one. When running "rclone copyto source:sourcepath dest:destpath", if source:sourcepath is a file, dest:destpath can be a new file name. If source:sourcepath is a directory, the result is the same as using "rclone copy".
(6) Check that the files in the source and destination match
[user@dev-intel18 Project]$ rclone check impact_run MyOneDrive:IMPACT/impact_run # check that both sides match
2020/04/27 16:19:01 NOTICE: One drive root 'IMPACT/impact_run': 0 differences found
2020/04/27 16:19:01 NOTICE: One drive root 'IMPACT/impact_run': 21 matching files
Note
When archiving files to your cloud storage, if the connection between HPCC and the cloud storage is not stable, we do NOT recommend using "rclone move" because data may be lost during the transfer. Instead, we recommend using "rclone copy" to copy the files over and then running "rclone check" to verify that the files are identical. After that, it is safe to delete the local copy of the files.
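The copy, check, then delete sequence recommended above can be sketched as a small shell function. This is an illustration only: archive_to_cloud is a hypothetical helper, and it relies on "rclone copy" and "rclone check" returning a non-zero exit status on failure or when differences are found.

```shell
# archive_to_cloud LOCAL_DIR REMOTE_PATH
# Copy LOCAL_DIR to REMOTE_PATH, verify that the two sides match, and only
# then delete the local copy. Any failure leaves the local files untouched.
archive_to_cloud() {
    src=$1
    dest=$2
    rclone copy "$src" "$dest" || { echo "copy failed; keeping $src" >&2; return 1; }
    # A non-zero status from "rclone check" means the two sides do not match.
    rclone check "$src" "$dest" || { echo "check failed; keeping $src" >&2; return 1; }
    rm -rf -- "$src"
}

# Usage (names are placeholders):
#   archive_to_cloud impact_run MyOneDrive:IMPACT/impact_run
```

Chaining each step with "||" means the delete is reached only when both the copy and the check have succeeded.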
Note
When using the "rclone mount" command to mount your cloud storage on HPCC, there are two things users should be careful about:
(1) When running rclone mount, the process does NOT run as the user; instead, it runs as a "root" of the cloud storage. Therefore, users may see an error message like "mount helper error: fusermount: failed to open mountpoint for reading: Permission denied". Users can use /tmp space for the mount point because that space is accessible to all users. Users should be very careful about opening permissions to others for the purpose of using rclone mount.
(2) Users of "rclone mount" should unmount the storage after use with "fusermount -u <endpoint_dir>". Note that sometimes the endpoint is not unmounted from some nodes because of a timeout or another reason; in that case, you may see a message like "Transport endpoint is not connected" when accessing the endpoint directory on the node. Manually unmounting it again should resolve the issue.
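A mount and unmount cycle following the two points above might look like the sketch below. The helper names and the /tmp path are hypothetical, and the sketch assumes that the installed rclone version supports the --daemon flag of "rclone mount", which runs the mount in the background.

```shell
# mount_cloud REMOTE MOUNTPOINT: create the mount point, then mount the
# remote in the background (--daemon returns control to the shell).
mount_cloud() {
    mkdir -p "$2" && rclone mount --daemon "$1" "$2"
}

# unmount_cloud MOUNTPOINT: unmount, retrying once if the first attempt
# fails (for example after "Transport endpoint is not connected").
unmount_cloud() {
    fusermount -u "$1" || { sleep 2; fusermount -u "$1"; }
}

# Usage (remote name and path are placeholders):
#   mount_cloud MyOneDrive: "/tmp/$USER-onedrive"
#   ls "/tmp/$USER-onedrive"
#   unmount_cloud "/tmp/$USER-onedrive"
```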
Note
When using the "rclone config" command to configure your cloud storage on HPCC, the command will guide you through an interactive setup process. At the auto config step, after you choose "y", it will start authentication. You will see something like:
If your browser doesn't open automatically go to the following link: http://127.0.0.1:53682/auth
Log in and authorize rclone for access
Waiting for code...
At this point, a Firefox browser should open. If you do not get the browser window, check whether you used the -X option to allow X11 forwarding when you ran ssh. You may follow the instructions at Connect to HPCC System to get the display working.
It may take a few minutes for the browser to open and connect. Please be patient. If the browser window opens but does not show the authentication page, you can manually enter the link provided by the "rclone config" command into the Firefox browser's address bar to connect to the site. DO NOT use the link in your personal computer's browser. The authentication has to use the browser on the HPCC development node.
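A quick way to confirm that X11 forwarding is active before starting "rclone config" is to check the DISPLAY environment variable, which ssh sets when forwarding is enabled. The helper name x11_ready below is hypothetical, and a non-empty DISPLAY is only a necessary condition, not proof that the browser will open.

```shell
# x11_ready: succeed if an X11 display appears to be available.
# "ssh -X" sets DISPLAY on the remote side; an empty or unset value
# means no forwarding is active in this session.
x11_ready() {
    [ -n "${DISPLAY:-}" ]
}

# Usage:
#   x11_ready || echo "DISPLAY is not set; reconnect with 'ssh -X' first" >&2
```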