Skip to content

Using GLOBUS

We are using the GLOBUS system to curate the data used in the Virtual Ecosystem data science repository.

  • All paths to data files within the repository should be relative paths to a data file location in the data directory.
  • However, to avoid adding large and/or binary data files to the GitHub repository itself, the contents of the data directory are managed using GLOBUS.

At any point, we should be able to re-run analyses by cloning the code from the GitHub repository and then populating the data directory using GLOBUS.

GLOBUS Overview

GLOBUS is a web-based system that provides access to data files.

  • GLOBUS does not store the files itself - it is not cloud storage - but it provides configured connections to data in existing networked storage.

  • GLOBUS also manages access privileges and authentication to connect to data: users register with GLOBUS and can then be granted access to data sets

A single data repository is called a collection. A collection is basically just a configured connection to a particular set of files. Individual users can then be given access to collections. Users can also be made part of a group and that group can be given access to collections.

For the VE Data Science team, we are using GLOBUS to connect to a collection of files hosted on the Imperial College London Research Data Store.

Once you have logged into the GLOBUS web application, you will end up on a page with a set of different tabs on the left hand side.

Access permissions

Globus frequently requests extra authentication steps. This is usually when you are accessing a new part of the GLOBUS functionality. It will typically take you to a page with a prompt like "Session reauthentication required (Globus Transfer)". The page will then also show your login ID email - this is actually a clickable link to start the authentication for the action and should just complete and take you to the page you were trying to access.

The Collections tab

The Collections tab is used to provide an overview of the data collections that you have access to.

  • Start by opening the Collections tab and then clicking the "Shared with you" Button. The URL https://app.globus.org/collections?scope=shared-with-me should take you straight to this page.

  • You should see the "Virtual Ecosystem data science" collection.

  • If you click on that link, you'll see an unfriendly overview page with collection details.
  • You should also see a button marked "Open in File Manager" - click this!

The File Manager

The File Manager tab is used to view the files and folders within a collection and to interact with the data repository. You can access the tab from a particular collection (as above), from the tab button on the left or directly using the URL https://app.globus.org/file-manager

Once you have opened a collection in the pane then you should be able to see the files and folders in the collection and can open folders to explore the data.

Collection paths

When you open the VE Data Science collection, you will see that it shows a path at the top: ve_data_science/data. This is because the collection shares all of the files in our Research Data Store. This includes a clone of the ve_data_science repo but also some other data resources. We are managing access to the files using a GLOBUS group ('VE Data Science team') that only has access to the files under the ve_data_science/data path, so you can't see the other data on that drive.

File Manager actions

The bar in the centre of the file manager provides action buttons to work with files and folders.

  • New Folder, Rename and Delete Selected can be used with any selected folder or file in the collection.

You'd need a very good reason to rename or delete files in the collection and this should be done through GitHub and not GLOBUS. If you change the names or files or folders then this will break any existing scripts that use those files, so if you want to change or delete file paths, you should create a PR that explains the rationale for the changes and also updates any affected scripts.

  • Download and Upload can be used with single files and folders: these allow you to drop a single file or folder from any location into a folder in the collection or download a file or folder

These tools may be all you need for day to day work - if you have a few files to upload this may well be what you want to do. However, if you want to upload a more complex set of files or download a large number of files, this is going to be a problem.

This is where the Transfer or Sync to... option comes in - it allows files and folders to be copied between two collections. To do so, you need to configure your own computer as a collection.

Globus Connect Personal

The Globus Connect Personal application https://www.globus.org/globus-connect-personal is a local application that you install to your computer that sets up a GLOBUS collection on your computer.

  • Download and install the program.
  • When you open it for the first time, it will ask you to log in with your GLOBUS credentials:
  • This will first take you to the GLOBUS website to authorise your GLOBUS account to create and manage a collection.
  • It will then ask for the collection details to create on your computer.
  • It will then start the Globus Connect Personal application.

If you now go to the web application and look at the collections administered by you, you should see the a new Private Mapped Connection:

https://app.globus.org/collections?scope=administered-by-me

In the File Manager tab of the web application, you can now select your personal collection and use the File Manager action buttons to manage your files and transfer folders between the two collections.

Local file access permissions

By default, Globus Connect Personal (GCP) has access to your home directory. Only you have access to the collection, but you can also configure GCP to only be able to access a subset of files. Under the GCP > Preferences settings, you can select the Access tab and specify which files GCP can access and whether GCP is allowed to write to those folders.

Within the GLOBUS web application, you can also check the visibility of your local collection through the Collections tab: click your local collection and then explore the visibility options to check if other users can see the existence of your collection.

The GLOBUS Transfer system

Transfer is used to copy files from a source collection to a destination collection. Here, you could be uploading a folder from your personal collection (source) to the RDS repo (destination) or downloading data from the RDS (source) to your local collection (destination) for analysis. Or possibly doing both to synchronise the two folders!

To transfer files or folders between collections:

  • Select the files or folders on the source collection
  • In the destination folder, open the location where the selected data will be transferred. Do not_ select the folders or files on the destination but instead make sure you have the location that you want to copy to open.

For example, if you are synchronising the data/derived data folder, you will need to select the derived folder in the source collection, but just have the data folder open in the destination. If you do not do this then the data will be transfered into the selected folder.

  • Press the "Start button" above the source collection. GLOBUS will schedule and run the transfer in the background: you can open the activity monitor link to see the progress of the transfer.