Title: | An R Implementation of the 'Align-GVGD' Method |
---|---|
Description: | 'Align-GVGD' ('A-GVGD') is a method to predict the impact of 'missense' substitutions based on the properties of amino acid side chains and protein multiple sequence alignments <doi:10.1136/jmg.2005.033878>. 'A-GVGD' is an extension of the original 'Grantham' distance to multiple sequence alignments. This package provides an alternative R implementation to the web version found on <http://agvgd.hci.utah.edu/>. |
Authors: | Ramiro Magno [aut, cre] , Isabel Duarte [aut] , Ana-Teresa Maia [aut] , CINTESIS [cph, fnd] |
Maintainer: | Ramiro Magno <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.2 |
Built: | 2024-11-12 04:41:29 UTC |
Source: | https://github.com/maialab/agvgd |
This function implements the Align-GVGD (A-GVGD) method described in Tavtigian et al. (2006).
A-GVGD combines multiple sequence alignment of orthologous sequences with the Grantham distance to classify missense variants, i.e. to distinguish human disease susceptibility missense changes from changes of little clinical significance.
The biochemical variation at each alignment position is converted to a Grantham Variation score (GV) and the difference between these properties and those of the variant amino acid being assessed are calculated and a Grantham Difference score generated (GD). The predicted effect is classed as C0, C15, C25, C35, C45, C55, or C65, with C65 most likely to interfere with function and C0 least likely.
agvgd( alignment, poi, sub, mode = c("recycle", "expand_grid"), sort = FALSE, keep_self = TRUE, digits = 2L )
agvgd( alignment, poi, sub, mode = c("recycle", "expand_grid"), sort = FALSE, keep_self = TRUE, digits = 2L )
alignment |
A character matrix or an alignment object obtained with
|
poi |
A whole number indicating the position of interest (POI). |
sub |
A character vector of protein residue substitutions to be classified. The amino acids must be provided as one-letter symbols. |
mode |
If both |
sort |
Whether to sort the output by |
keep_self |
Whether to keep those results in the output that correspond
to residues being the same in |
digits |
Integer indicating the number of decimal places to be used in
rounding |
A tibble whose observations refer to the combination alignment position and amino acid substitution; consists of seven variables:
Position of the amino acid residue in the reference protein
(first sequence in the alignment). This position corresponds to poi
minus
the gaps in the alignment.
Position of interest, i.e. the alignment position at which the amino acid substitution is being assessed.
Reference amino acid, i.e. the amino acid in the first sequence of the alignment, at the position of interest.
Amino acid substitution being assessed.
Grantham variation score.
Grantham difference score.
Predicted effect of the amino acid substitution. This is classed as C0, C15, C25, C35, C45, C55, or C65, with C65 most likely to interfere with function and C0 least likely.
Tavtigian, S.V., Deffenbaugh, A. M., Yin, L., Judkins, T., Scholl, T., Samollow, P.B., de Silva, D., Zharkikh, A., Thomas, A. Comprehensive statistical study of 452 BRCA1 missense substitutions with classification of eight recurrent substitutions as neutral. Journal of Medical Genetics 43, 295–305 (2006). doi:10.1136/jmg.2005.033878.
Mathe, E., Olivier, M., Kato, S., Ishioka, C., Hainaut, P., Tavtigian, S.V. Computational approaches for predicting the biological effect of p53 missense mutations: a comparison of three sequence analysis based methods. Nucleic Acids Research 34, 1317–1325 (2006). doi:10.1093/nar/gkj518.
# Read an alignment into R, e.g. the alignment for gene ATM. alignment_ATM <- read_alignment(gene = 'ATM') # Predict the impact of changing the first residue (Met) to a Serine (S). agvgd(alignment = alignment_ATM, poi = 1, sub = 'S') # `poi` can be a vector of positions, e.g., 3 thru 10, allow for prediction # of multiple positions at once. agvgd(alignment = alignment_ATM, poi = 3:10, sub = 'S') # `poi` expects a position in the frame of reference of the alignment, i.e. # an alignment position (a column index). However, if you know instead # the residue position in the reference sequence (first sequence in the # alignment), then you may use the function `res_to_poi()` # to convert from residue position to alignment position. # # Example: The second residue in the reference sequence of the ATM alignment # is a Serine, after a Methionine. In the alignment, there is a gap between # the two residues, so the alignment is 3 but the residue position on the # protein is 2. (poi2 <- res_to_poi(alignment_ATM, 2)) agvgd(alignment = alignment_ATM, poi = poi2, sub = 'A') # Because changes are context-dependent, i.e. they depend on the residue # variation observed at a given alignment position, the same reference # residue when replaced with the same substitution will in general have # a different predicted impact. agvgd(alignment = alignment_ATM, poi = 9:10, sub = 'S') # Use the ancillary function `amino_acids()` to get a vector of one-letter # residue substitutions if you want to quickly assess the impact of all # possible substitutions. agvgd(alignment = alignment_ATM, poi = 1, sub = amino_acids()) # Parameter `mode` gives you flexibility on how to combine `poi` and `sub`. agvgd(alignment = alignment_ATM, poi = 3:4, sub = c('A', 'V')) # Use 'expand_grid' for all combinations. agvgd(alignment = alignment_ATM, poi = 3:4, sub = c('A', 'V'), mode = 'expand_grid')
# Read an alignment into R, e.g. the alignment for gene ATM. alignment_ATM <- read_alignment(gene = 'ATM') # Predict the impact of changing the first residue (Met) to a Serine (S). agvgd(alignment = alignment_ATM, poi = 1, sub = 'S') # `poi` can be a vector of positions, e.g., 3 thru 10, allow for prediction # of multiple positions at once. agvgd(alignment = alignment_ATM, poi = 3:10, sub = 'S') # `poi` expects a position in the frame of reference of the alignment, i.e. # an alignment position (a column index). However, if you know instead # the residue position in the reference sequence (first sequence in the # alignment), then you may use the function `res_to_poi()` # to convert from residue position to alignment position. # # Example: The second residue in the reference sequence of the ATM alignment # is a Serine, after a Methionine. In the alignment, there is a gap between # the two residues, so the alignment is 3 but the residue position on the # protein is 2. (poi2 <- res_to_poi(alignment_ATM, 2)) agvgd(alignment = alignment_ATM, poi = poi2, sub = 'A') # Because changes are context-dependent, i.e. they depend on the residue # variation observed at a given alignment position, the same reference # residue when replaced with the same substitution will in general have # a different predicted impact. agvgd(alignment = alignment_ATM, poi = 9:10, sub = 'S') # Use the ancillary function `amino_acids()` to get a vector of one-letter # residue substitutions if you want to quickly assess the impact of all # possible substitutions. agvgd(alignment = alignment_ATM, poi = 1, sub = amino_acids()) # Parameter `mode` gives you flexibility on how to combine `poi` and `sub`. agvgd(alignment = alignment_ATM, poi = 3:4, sub = c('A', 'V')) # Use 'expand_grid' for all combinations. agvgd(alignment = alignment_ATM, poi = 3:4, sub = c('A', 'V'), mode = 'expand_grid')
This function returns either a data frame of the pre-bundled alignments if
parameter gene
is missing (default behaviour), or the file name of the
alignment of a supplied gene name.
alignment_file(gene)
alignment_file(gene)
gene |
The gene name of one of the pre-bundled alignments. Run
|
Either a data frame of the pre-bundled alignments if parameter gene
is missing (default behaviour), or the file name of the alignment of a
supplied gene name.
# List pre-bundled alignment file names and associated genes alignment_file() # Retrieve the file name of an alignment alignment_file("BRCA1") # You may get the full path to an alignment file with `system.file()` system.file("extdata", alignment_file("BRCA1"), package = "agvgd")
# List pre-bundled alignment file names and associated genes alignment_file() # Retrieve the file name of an alignment alignment_file("BRCA1") # You may get the full path to an alignment file with `system.file()` system.file("extdata", alignment_file("BRCA1"), package = "agvgd")
The 20 amino acids that are encoded directly by the codons of the universal genetic code.
amino_acids(code = c("one_letter", "three_letter"))
amino_acids(code = c("one_letter", "three_letter"))
code |
The type of amino acid symbol to be returned, one-letter ('one_letter') or three-letter ('three_letter') codes. |
A character vector of the 20 standard amino acids.
# By default `amino_acids` returns one-letter symbols amino_acids() # Use code = 'three_letter' instead for three-letter symbols amino_acids(code = 'three_letter')
# By default `amino_acids` returns one-letter symbols amino_acids() # Use code = 'three_letter' instead for three-letter symbols amino_acids(code = 'three_letter')
This function determines the range (minimum and maximum) values for the three amino acid side chain property values — composition, polarity and molecular volume — from the amino acids at the alignment position of interest.
The alignment passed in alignment
must be an already focused alignment of
three columns whose second column is the position of interest.
cpv_ranges(alignment, exclude = c("-", "X", NA_character_))
cpv_ranges(alignment, exclude = c("-", "X", NA_character_))
alignment |
A character matrix or an alignment object obtained with
|
exclude |
A vector of character values to be ignored when collecting the amino acids at the position of interest. |
A tibble with one single row, of six
variables, i.e., the minimum and maximum values for composition (c_min
and c_max
), polarity (p_min
and p_max
) and molecular volume (v_min
and v_max
).
# You need to first focus the alignment around the position of interest. The # position of interest is position 4 in the example below. After subsetting # the alignment, it becomes position 2. alignment <- read_alignment('ATM') alignment[, 3:5] cpv_ranges(alignment[, 3:5]) # If at the position of interest there are symbols other than amino acid # symbols, e.g. gaps ("-"), then these are ignored and the calculated ranges # are based only on the observed amino acids. alignment[, 270:272] cpv_ranges(alignment[, 270:272])
# You need to first focus the alignment around the position of interest. The # position of interest is position 4 in the example below. After subsetting # the alignment, it becomes position 2. alignment <- read_alignment('ATM') alignment[, 3:5] cpv_ranges(alignment[, 3:5]) # If at the position of interest there are symbols other than amino acid # symbols, e.g. gaps ("-"), then these are ignored and the calculated ranges # are based only on the observed amino acids. alignment[, 270:272] cpv_ranges(alignment[, 270:272])
This function calculates the deviation in the sense of the Grantham deviation
as introduced by Tavtigian et al. (2006). Essentially, if x
lies within the
range [min
, max
], then dev()
returns 0. If x
is either below min
, or
above max
, then dev()
returns the absolute difference between x
and
min
or max
, respectively.
Inputs are recycled in the sense of vctrs::vec_recycle()
.
dev(x, min, max)
dev(x, min, max)
x |
A numeric vector. |
min |
A numeric vector. |
max |
A numeric vector. |
Here's a plot showcasing dev()
with min = -4
and max = 3
:
x <- -10:10; min <- -4; max <- 3 plot(x, y = dev(x, min, max), type = 'l', xlab = 'x', ylab = 'deviation')
A numeric vector of deviations.
# `dev()` returns absolute differences from either min or max (whichever is # closest). dev(10, min = -4, max = 4) dev(-10, min = -4, max = 4) # `x` can be a vector dev(-10:10, min = -4, max = 4) # `min` and `max` can also be vectors, they will be recycled dev(-10:10, min = -4:16, max = 4:24) # If `x` contains `NA` values, then `dev()` will return `NA` for # those cases dev(c(10, NA), min = -4, max = 4) # For each calculation of deviation, only either `min` or `max` is used. If # the unused parameter is `NA` it won't affect the calculation: dev(c(10, 3), min = c(NA, -4), max = 4) dev(c(10, -5), min = -4, max = c(4, NA))
# `dev()` returns absolute differences from either min or max (whichever is # closest). dev(10, min = -4, max = 4) dev(-10, min = -4, max = 4) # `x` can be a vector dev(-10:10, min = -4, max = 4) # `min` and `max` can also be vectors, they will be recycled dev(-10:10, min = -4:16, max = 4:24) # If `x` contains `NA` values, then `dev()` will return `NA` for # those cases dev(c(10, NA), min = -4, max = 4) # For each calculation of deviation, only either `min` or `max` is used. If # the unused parameter is `NA` it won't affect the calculation: dev(c(10, 3), min = c(NA, -4), max = 4) dev(c(10, -5), min = -4, max = c(4, NA))
This function calculates the Grantham deviation ():
where is the value for composition
of amino acid
,
i.e. the atomic weight ratio of hetero (noncarbon) elements in end groups or
rings to carbons in the side chain;
is the value for polarity
of amino acid
; and,
is the value for molecular
volume
of amino acid
.
,
and
are looked up in
grantham::amino_acids_properties based on the amino acid identities passed
in
x
. The function is implemented in
dev()
. Remaining
variables in the equation are arguments to gd()
and hence are explained
below in the Arguments section.
gd( x, c_min, c_max, p_min, p_max, v_min, v_max, alpha = 1.833, beta = 0.1018, gamma = 0.000399, rho = 50.723 )
gd( x, c_min, c_max, p_min, p_max, v_min, v_max, alpha = 1.833, beta = 0.1018, gamma = 0.000399, rho = 50.723 )
x |
A character vector of one-letter amino acid codes, indicating missense substitutions. |
c_min |
Amino acid composition, minimum value. |
c_max |
Amino acid, composition, maximum value. |
p_min |
Amino acid polarity, minimum value. |
p_max |
Amino acid polarity, maximum value. |
v_min |
Amino acid molecular volume, maximum value. |
v_max |
Amino acid molecular volume, maximum value. |
alpha |
The constant |
beta |
The constant |
gamma |
The constant |
rho |
Grantham's distances reported in Table 2, Science (1974).
185(4154): 862–4 by R. Grantham, are scaled by a factor (here named
|
A numeric vector of Grantham deviations. Each deviation corresponds
to one of the amino acids indicated in x
.
gd('S', c_min = 0.39, c_max = 0.74, p_min =4.9, p_max =8.6, v_min = 3, v_max = 32.5)
gd('S', c_min = 0.39, c_max = 0.74, p_min =4.9, p_max =8.6, v_min = 3, v_max = 32.5)
This function calculates the Grantham variation ():
The minimum and maximum values are those observed for a set of amino acid residues at the alignment position of interest.
gv( c_min, c_max, p_min, p_max, v_min, v_max, alpha = 1.833, beta = 0.1018, gamma = 0.000399, rho = 50.723 )
gv( c_min, c_max, p_min, p_max, v_min, v_max, alpha = 1.833, beta = 0.1018, gamma = 0.000399, rho = 50.723 )
c_min |
Amino acid composition, minimum value. |
c_max |
Amino acid, composition, maximum value. |
p_min |
Amino acid polarity, minimum value. |
p_max |
Amino acid polarity, maximum value. |
v_min |
Amino acid molecular volume, maximum value. |
v_max |
Amino acid molecular volume, maximum value. |
alpha |
The constant |
beta |
The constant |
gamma |
The constant |
rho |
Grantham's distances reported in Table 2, Science (1974).
185(4154): 862–4 by R. Grantham, are scaled by a factor (here named
|
A numeric vector of grantham variation values.
# Example based on values from Figure 1C of Tavtigian et al. (2006), # https://doi.org/10.1136/jmg.2005.033878. gv(c_min = 0, c_max = 0, p_min = 5.7, p_max = 4.9, v_min = 132, v_max = 105)
# Example based on values from Figure 1C of Tavtigian et al. (2006), # https://doi.org/10.1136/jmg.2005.033878. gv(c_min = 0, c_max = 0, p_min = 5.7, p_max = 4.9, v_min = 132, v_max = 105)
This function converts an alignment position to a position in the frame of the reference protein sequence, i.e., to the positions of the amino acids in the first sequence of the alignment.
poi_to_res(alignment, poi)
poi_to_res(alignment, poi)
alignment |
An alignment. |
poi |
An alignment position. |
An integer vector of positions of the amino acid residues in the reference sequence.
align_ATM <- read_alignment('ATM') align_ATM[, 1:5] # Convert the positions of the first five alignment positions to residue positions poi_to_res(align_ATM, 1:5)
align_ATM <- read_alignment('ATM') align_ATM[, 1:5] # Convert the positions of the first five alignment positions to residue positions poi_to_res(align_ATM, 1:5)
This function converts a sequence profile as provided in the format of the
package {protean}
to an {agvgd}
alignment — an alignment in this sense
is simply a character matrix whose elements are protein residues in
one-letter notation, rows are sequences and columns correspond to alignment
positions.
profile_to_alignment(profile)
profile_to_alignment(profile)
profile |
A sequence profile object as returned by
|
An alignment object, i.e. a character matrix whose elements are protein residues in one-letter notation. Rows are sequences and columns are alignment positions.
Reads a protein sequence multiple alignment (PSMA) from either a set of pre-bundled alignments, by gene name, or from a Multi-FASTA file.
read_alignment( gene = c("ATM", "BRCA1", "BRCA2", "CHEK2", "MRE11", "MSH6", "NBN", "PALB2", "PMS2", "RAD50", "RAD51", "XRCC2"), file = NULL )
read_alignment( gene = c("ATM", "BRCA1", "BRCA2", "CHEK2", "MRE11", "MSH6", "NBN", "PALB2", "PMS2", "RAD50", "RAD51", "XRCC2"), file = NULL )
gene |
The gene name for which an alignment is provided with this
package. Use the function |
file |
The path to a Multi-FASTA file. If this argument is given, it
takes precedence over the |
An alignment object; essentially, a character matrix, whose elements are protein residues in one-letter notation. Rows are sequences and columns are alignment positions.
# Read in the alignment for the gene XRCC2 read_alignment('XRCC2') # Also read in the alignment for the gene XRCC2, but now by specifying # directly the path to the file. path <- system.file("extdata", alignment_file("XRCC2"), package = "agvgd") read_alignment(file = path)
# Read in the alignment for the gene XRCC2 read_alignment('XRCC2') # Also read in the alignment for the gene XRCC2, but now by specifying # directly the path to the file. path <- system.file("extdata", alignment_file("XRCC2"), package = "agvgd") read_alignment(file = path)
This function reads a file with amino acid substitutions. The format of should be the same one as requested by the web version of Align-GVGD.
read_substitutions( file = stop("`file` must be specified"), amino_acid_code = c("one_letter", "three_letter") )
read_substitutions( file = stop("`file` must be specified"), amino_acid_code = c("one_letter", "three_letter") )
file |
The path to a file with amino acid substitutions. |
amino_acid_code |
The type of symbol used for amino acids in the returned output. |
A tibble listing the amino acids substitutions.
# "sub.txt" is an example file containing missense substitutions formatted # according to the requirements indicated in http://agvgd.hci.utah.edu/help.php. my_file <- system.file("extdata", "sub.txt", package = "agvgd") cat(readLines(my_file), sep = "\n") read_substitutions(file = my_file) # lee2010_sub.txt is a file containing the missense variants studied by # Lee et al. (2010): https://doi.org/10.1158/0008-5472.CAN-09-4563. read_substitutions(file = system.file("extdata", "lee2010_sub.txt", package = "agvgd"))
# "sub.txt" is an example file containing missense substitutions formatted # according to the requirements indicated in http://agvgd.hci.utah.edu/help.php. my_file <- system.file("extdata", "sub.txt", package = "agvgd") cat(readLines(my_file), sep = "\n") read_substitutions(file = my_file) # lee2010_sub.txt is a file containing the missense variants studied by # Lee et al. (2010): https://doi.org/10.1158/0008-5472.CAN-09-4563. read_substitutions(file = system.file("extdata", "lee2010_sub.txt", package = "agvgd"))
This function converts an residue position to a position in the frame of the alignment.
res_to_poi(alignment, res)
res_to_poi(alignment, res)
alignment |
An alignment. |
res |
A residue position. |
An integer vector of alignment positions corresponding to residue position in the reference sequence.
align_ATM <- read_alignment('ATM') align_ATM[, 1:6] # Convert the positions of the first five residues to alignment positions res_to_poi(align_ATM, 1:5)
align_ATM <- read_alignment('ATM') align_ATM[, 1:6] # Convert the positions of the first five residues to alignment positions res_to_poi(align_ATM, 1:5)
This function takes an alignment and exports it to a FASTA file.
write_alignment(alignment, file)
write_alignment(alignment, file)
alignment |
An alignment. It may be a simple matrix or an object
obtained with |
file |
A file path. |
This function is run for its side effect of writing a file. But it
returns the file path passed in file
.
alignment <- matrix( c('P', 'M', 'I', 'P', 'I', 'I', 'P', 'L', 'I'), nrow = 3, byrow = TRUE ) # Export an alignment based on a matrix write_alignment(alignment, "my_alignment.fasta") cat(readLines("my_alignment.fasta"), sep = "\n") # Export one of the bundled alignments write_alignment(read_alignment(gene = 'BRCA1'), "BRCA1.fasta") cat(readLines("BRCA1.fasta")[1:10], sep = "\n")
alignment <- matrix( c('P', 'M', 'I', 'P', 'I', 'I', 'P', 'L', 'I'), nrow = 3, byrow = TRUE ) # Export an alignment based on a matrix write_alignment(alignment, "my_alignment.fasta") cat(readLines("my_alignment.fasta"), sep = "\n") # Export one of the bundled alignments write_alignment(read_alignment(gene = 'BRCA1'), "BRCA1.fasta") cat(readLines("BRCA1.fasta")[1:10], sep = "\n")
This function exports to a file a list of residue substitutions. The format used will be the same one as requested by the web version of Align-GVGD.
write_substitutions( file, alignment, poi, sub, mode = c("recycle", "expand_grid") )
write_substitutions( file, alignment, poi, sub, mode = c("recycle", "expand_grid") )
file |
A file path. |
alignment |
A character matrix or an alignment object obtained with
|
poi |
A whole number indicating the position of interest (POI). |
sub |
A character vector of protein residue substitutions to be classified. The amino acids must be provided as one-letter symbols. |
mode |
If both |
This function is run for its side effect of writing a file. But it
returns the file path passed in file
.
write_substitutions(file = "ex01.csv", alignment = read_alignment("ATM"), poi = 20:25, sub = amino_acids()) cat(readLines("ex01.csv"), sep = "\n") write_substitutions(file = "ex02.csv", alignment = read_alignment("ATM"), poi = 20:21, sub = amino_acids(), mode = 'expand_grid') cat(readLines("ex02.csv"), sep = "\n")
write_substitutions(file = "ex01.csv", alignment = read_alignment("ATM"), poi = 20:25, sub = amino_acids()) cat(readLines("ex01.csv"), sep = "\n") write_substitutions(file = "ex02.csv", alignment = read_alignment("ATM"), poi = 20:21, sub = amino_acids(), mode = 'expand_grid') cat(readLines("ex02.csv"), sep = "\n")