Databases in common-data

Overview

The HPCC hosts a number of large, widely used genetics databases in the common-data research space (/mnt/research/common-data). These databases are publicly readable on the HPCC, but writing to these folders is limited to ICER staff. We try to keep these databases up to date; if you find that something is missing or you encounter problems, please open a ticket with us.

NCBI BLAST

NCBI maintains a number of nucleotide and protein sequence databases for use with their BLAST/BLAST+ tools. Single sequences or small sets of sequences can be compared against these databases using the NCBI BLAST webtool. For larger or more customizable comparisons, the HPCC maintains a local copy of these databases at:

/mnt/research/common-data/Bio/blast_databases/blastdb_current

We also maintain a number of BLAST tools as part of our software module system:

module avail BLAST

For details of the individual databases, please refer to the NCBI documentation.
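As a rough illustration of how the local copy might be used, the sketch below points BLAST+ at the directory above through the BLASTDB environment variable and runs a nucleotide search. The query file name (query.fasta) and the database name (nt) are assumptions made for the example, not confirmed contents of the directory, and a BLAST module from the list above must be loaded first so that blastn is on your PATH.

```shell
# Database directory quoted from the text above; BLAST+ resolves -db names against $BLASTDB.
export BLASTDB="/mnt/research/common-data/Bio/blast_databases/blastdb_current"

# Hypothetical input file and database name; replace with your own.
QUERY="query.fasta"
DB_NAME="nt"

if command -v blastn >/dev/null 2>&1 && [ -f "$QUERY" ]; then
    # -outfmt 6 writes one tab-separated line per hit.
    blastn -query "$QUERY" -db "$DB_NAME" -outfmt 6 -out "${QUERY%.fasta}.blastn.tsv"
else
    echo "blastn or $QUERY not found; load a BLAST module and check the query path" >&2
fi
```

Listing the directory (ls "$BLASTDB") shows which database names are actually available before you choose a value for -db.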

AlphaFold

The protein structure prediction software AlphaFold requires a set of protein sequence databases to run. Although all versions of the software require similar data, the folder structure differs slightly between them, so we host three separate copies of the databases: one for AlphaFold 3, one for AlphaFold 2.3, and one for older versions of AlphaFold 2.x.

/mnt/research/common-data/alphafold/database_3 # AlphaFold 3
/mnt/research/common-data/alphafold/database_230 # AlphaFold 2.3
/mnt/research/common-data/alphafold/database # AlphaFold 2 Legacy

If you are using one of the AlphaFold modules, the path to the correct database should be set automatically. For more details on running AlphaFold on the HPCC, see our documentation on AlphaFold 2.3.2 and AlphaFold3 (Coming Soon).
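If you run AlphaFold outside the provided modules (for example, from your own installation), you have to choose the database directory yourself. The following is a minimal sketch of the version-to-directory mapping listed above; the function name alphafold_db_dir is hypothetical and is not part of any HPCC module.

```shell
# Hypothetical helper: print the database directory for a given AlphaFold version.
alphafold_db_dir() {
    base="/mnt/research/common-data/alphafold"
    case "$1" in
        3|3.*)     echo "$base/database_3" ;;    # AlphaFold 3
        2.3|2.3.*) echo "$base/database_230" ;;  # AlphaFold 2.3 (checked before the 2.x fallback)
        2|2.*)     echo "$base/database" ;;      # older AlphaFold 2.x
        *)         echo "unknown AlphaFold version: $1" >&2; return 1 ;;
    esac
}

alphafold_db_dir 2.3.2
```

The printed path can then be passed to whichever database option your AlphaFold installation expects.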

4D Nucleosome

The 4D Nucleosome dataset contains chromatin contact frequency maps for a large panel of cell types and tissues from different species. The datasets are generated by genome-wide Hi-C experiments, followed by standard batch-effect correction and normalization. Chromatin contact frequency maps capture chromatin interactions and are useful for analyzing 3D genome folding, multi-scale chromatin organization, gene regulation, epigenomics, evolution, and other functional genomics questions.

The copy on the HPCC includes .hic files from dilution, DNaseHiC, insitu, MicroC and TCC experiments, all stored at:

/mnt/research/common-data/4D_Nucleosome/database
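To see which contact maps are available locally, one option is to search the directory for .hic files. The sketch below assumes nothing about how the files are arranged beneath the directory above; list_hic is a hypothetical helper name, and HIC_DIR can be overridden to search elsewhere.

```shell
# Directory quoted from the text above; override HIC_DIR to search another location.
HIC_DIR="${HIC_DIR:-/mnt/research/common-data/4D_Nucleosome/database}"

# Hypothetical helper: list every .hic file under a directory, sorted by path.
list_hic() {
    find "$1" -type f -name '*.hic' 2>/dev/null | sort
}

# Show the first 20 matches; prints nothing if the directory is not mounted.
list_hic "$HIC_DIR" | head -n 20
```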

To search the database and find additional metadata on 4D Nucleosome, please refer to the project data portal.