Aine Fairbrother-Browne: ensemblQueryR: fast, flexible and high-throughput querying of Ensembl LD API endpoints in R

Abstract

We present ensemblQueryR, an R package for querying Ensembl linkage disequilibrium (LD) endpoints. This package is flexible, fast and user-friendly, and optimised for high-throughput querying.

ensemblQueryR provides functions that are intuitive and amenable to custom code integration, uses familiar R object types as inputs and outputs, and offers parallelisation functionality. For each Ensembl LD endpoint, ensemblQueryR provides two functions, permitting both single- and multi-query modes of operation. The multi-query functions are optimised for large query sizes and provide optional parallelisation to leverage available computational resources and minimise processing time. We demonstrate improved computational performance of ensemblQueryR over an existing tool in terms of random access memory (RAM) usage and speed, delivering a 10-fold speed increase whilst using a third of the RAM. Finally, ensemblQueryR is near-agnostic to operating system and computational architecture through Docker and Singularity images, making this tool widely accessible to the scientific community.

Background

Linkage disequilibrium (LD) is the non-random association of alleles arising from different loci [1]. In population genetics, LD is a measure of the frequency with which an allele of one variant is correlated with an allele of a proximal variant within a particular population [2]. There are many applications for LD measures in genomics workflows. For example, in the context of genome-wide association studies (GWAS), which have been used to detect associations between genetic variants and a wide range of human phenotypes, downstream interrogation of local LD structure is required to identify the potential ‘causal’ variant at a nominated locus that exerts an effect on the downstream phenotype. Equally, in expression quantitative trait loci (eQTL) analyses, which aim to uncover associations between genetic variants and the expression of a cis or trans gene (eGene), LD information is required for the identification of the potential causal variant affecting the expression of the eGene. Further downstream, LD information is useful for functional annotations, where genetic variants or regions in LD with a target variant can aid in the identification of biological processes that might be affected by the GWAS- or eQTL-implicated target variant. As such, it is important that the LD information for a range of human populations can be easily queried by researchers in an efficient and accessible way.
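To make the definition concrete, the following minimal Python sketch computes the three standard pairwise LD statistics (D, D′ and r²) for two biallelic loci from haplotype and allele frequencies. It is an illustration of the textbook formulas only, not code from ensemblQueryR; the function name `ld_metrics` and its argument names are hypothetical.

```python
def ld_metrics(p_ab: float, p_a: float, p_b: float) -> dict:
    """Return D, D' and r^2 for two biallelic loci.

    p_ab : frequency of the haplotype carrying allele A and allele B
    p_a  : frequency of allele A at the first locus
    p_b  : frequency of allele B at the second locus
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both loci must be polymorphic (0 < frequency < 1)")

    # D: deviation of the observed haplotype frequency from the value
    # expected if the two alleles were inherited independently.
    d = p_ab - p_a * p_b

    # D_max: largest |D| attainable given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))

    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2: squared correlation between the alleles at the two loci.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

    return {"D": d, "D_prime": d_prime, "r2": r2}
```

When the two alleles always travel together (for example `ld_metrics(0.5, 0.5, 0.5)`), both D′ and r² equal 1; when the haplotype frequency equals the product of the allele frequencies, both are 0.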

Despite the widespread usage of LD measures in genomic research, the majority of tools available at present are web-based. Although these offer user-friendly interfaces and can be useful for one-off or small queries, they do not promote reproducibility and are not suited to workflow-oriented researchers wishing to submit multiple large queries. Programmatic tools offer a solution to these problems; however, very few tools for the retrieval of LD metrics exist.

To our knowledge, only one R package provides a programmatic interface for LD metric retrieval. LDlinkR (version 1.2.3) [3] provides an R-based interface to the web-based tool LDlink [4], permitting retrieval of LD metrics using a range of query types. However, LDlinkR has a number of key limitations with respect to speed and query handling. Firstly, the user is required to obtain an access token by signing up on the NCBI website, which is then supplied as an argument to all LDlinkR functions. This requirement is in place to limit user queries, meaning that attempts to speed up the tool using parallelisation easily exceed query limits and cause the tool to return timeout errors. This can result in the user’s access token being blocked. Secondly, a number of functions for retrieving LD metrics, such as the LDpair and LDproxy functions, are configured for single queries only, meaning that the user must write custom code to submit more than one query at a time. As such, although LDlinkR is a useful programmatic alternative to the LDlink web tool, it is not suited to fast, high-throughput multi-query retrieval of LD metrics.

Ensembl (RRID:SCR_002344) is another widely used source of LD metrics, offering an application programming interface (API) that supports an array of query configurations [5, 6]. However, using the API directly presents some challenges, as it requires some technical expertise.

Additionally, it is not easily integrable with typical R workflows, precludes the input of standard R objects (such as data frames, lists or vectors), does not output data in an intuitive format and is not easily adaptable to high-throughput workflows. To our knowledge, no R package has been developed to facilitate querying the Ensembl API and, in particular, to retrieve Ensembl LD metrics. In light of this, and to address the limitations of current tools, we present ensemblQueryR. Our R package provides fast, efficient, user-friendly querying of Ensembl LD data, with a focus on intuitive, high-throughput R workflow integration. ensemblQueryR has been made freely available (DOI: 10.5281/zenodo.7837882) [7, 8]. The package can also be used in Docker (RRID:SCR_016445) [9] or Singularity [10] containers, for which the images can be found on Docker Hub [11] or the Singularity image repository [12].
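For readers unfamiliar with what direct API usage involves, the sketch below shows, in Python and using only the standard library, roughly what a single query for LD around one variant looks like. It is not ensemblQueryR's implementation. The URL layout (`/ld/<species>/<variant>/<population>`) and the response field names (`variation1`, `variation2`, `r2`, `d_prime`, `population_name`) reflect our reading of the public Ensembl REST documentation and should be treated as assumptions to verify; the helper names `build_ld_url`, `parse_ld_response` and `fetch_ld` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

ENSEMBL_REST = "https://rest.ensembl.org"  # public Ensembl REST server


def build_ld_url(rsid, population, species="human"):
    """Build the URL for one LD query (path layout is an assumption)."""
    return (
        f"{ENSEMBL_REST}/ld/{quote(species)}/{quote(rsid)}/"
        f"{quote(population, safe=':')}?content-type=application/json"
    )


def parse_ld_response(body):
    """Convert the JSON array returned by the endpoint into plain dicts.

    The numeric fields are parsed with float() because they may be
    serialised as strings.
    """
    rows = []
    for record in json.loads(body):
        rows.append({
            "variation1": record["variation1"],
            "variation2": record["variation2"],
            "r2": float(record["r2"]),
            "d_prime": float(record["d_prime"]),
            "population_name": record["population_name"],
        })
    return rows


def fetch_ld(rsid, population, species="human", timeout=30):
    """Send one query over the network and return the parsed rows."""
    url = build_ld_url(rsid, population, species)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_ld_response(response.read().decode("utf-8"))
```

Even this small example requires manual URL construction, JSON decoding and type conversion for every query, which illustrates the overhead that a dedicated package is meant to remove.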
