In addition to downloadable files, an API, and the text search, RNAcentral provides a public Postgres database that can be used to query the data using SQL syntax. The database is updated with every RNAcentral release and contains a copy of the data available through the RNAcentral website.
To connect to the database using command line:
Pro tip: if you don't have
psql installed on your machine, consider using Docker to get started with a pre-configured Postgres image:
docker pull postgres
docker run -it postgres psql postgres://reader:NWDMCE5xdipIjRrp@hh-pgsql-public.ebi.ac.uk:5432/pfmegrnargs
If your computer is behind a firewall, please ensure that outgoing TCP/IP connections to the corresponding ports are allowed.
The following diagram was generated based on RNAcentral release 16.
The entire RNAcentral schema contains more than 40 tables, but the following tables are good starting points for exploring the data:
rna- contains RNA sequences and URS identifiers
xref- contains cross-references to Expert Databases
rnc_database- contains a list of Expert Databases
rnc_accessions- contains metadata associated with each cross-reference
rnc_rna_precomputed- contains RNA types and descriptions for all sequences
Although the VEGA database has been archived and its identifiers are no longer searchable using the RNAcentral text search, you can still query RNAcentral using VEGA identifiers:
SELECT upi, -- RNAcentral URS identifier taxid, -- NCBI taxid ac -- external accession FROM xref WHERE ac IN ('OTTHUMT00000106564.1', 'OTTHUMT00000416802.1')
URS00000B15DA 9606 OTTHUMT00000106564.1 URS00000A54A6 9606 OTTHUMT00000416802.1
The RNAcentral text search can only export up to 1,000,000 search results. If you need to export more sequences, you can use the following workflow:
Create a file
SELECT precomputed.id FROM rnc_rna_precomputed precomputed JOIN rnc_taxonomy tax ON tax.id = precomputed.taxid WHERE tax.lineage LIKE 'cellular organisms; Bacteria; %' AND precomputed.is_active = true -- exclude sequences without active cross-references AND rna_type = 'rRNA'
Run the following command to execute the query:
docker run -v `pwd`:/rnacentral -it postgres /bin/sh -c 'cd /rnacentral && psql -t -A -f query.sql postgres://reader:NWDMCE5xdipIjRrp@hh-pgsql-public.ebi.ac.uk:5432/pfmegrnargs > ids.txt'
The command will create a file
ids.txt with a list of RNAcentral identifiers.
Download the following RNAcentral FASTA file:
Extract the sequences using seqkit:
seqkit grep -f ids.txt rnacentral_species_specific_ids.fasta.gz > output.fasta
output.fasta will contain the desired subset of RNAcentral sequences in FASTA format.
Requires psycopg2 to connect to Postgres:
pip install psycopg2
The public Postgres database allows anyone to run the RNAcentral website locally and contribute new code to RNAcentral using the instructions available on GitHub.