阅读量:0
HTSJDK
是一个用于处理高通量测序数据的 Java 库,其中VariantContext
是这个库中的一个重要类,用于表示变异数据。VariantContext
类是htsjdk.variant
包的一部分,主要用于处理和描述变异(如 SNPs、插入和删除等)。
VariantContext
类介绍
功能
VariantContext
类用于存储和操作与变异相关的信息,包括:
- 变异的位置(位置坐标)
- 变异的类型(SNP、插入、删除等)
- 变异的等位基因
- 变异的质量分数
- 变异的过滤信息
- 样本级别的信息(如基因型)
主要属性和方法
位置:
getContig()
: 返回变异所在的染色体或 contig 名称。getStart()
: 返回变异的起始位置(1-based)。getEnd()
: 返回变异的结束位置(1-based)。
变异类型:
isSNP()
: 检查变异是否是一个单核苷酸多态性(SNP)。isIndel()
: 检查变异是否是插入或删除(Indel)。
等位基因:
getAlleles()
: 返回变异的等位基因列表。getReference()
: 返回参考等位基因。getAlternateAlleles()
: 返回变异等位基因列表。
基因型:
getGenotypes()
: 返回所有样本的基因型信息。getGenotype(String sampleName)
: 获取特定样本的基因型。
过滤和质量:
isFiltered()
: 检查变异是否被过滤掉。getPhredScaledQual()
: 返回变异的质量分数。
附加信息:
getAttributes()
: 获取变异的附加信息(例如注释、额外的标记等)。
源代码:
/* * Copyright (c) 2012 The Broad Institute * * Permission is hereby granted, free of charge, to any person * obtaining a copy of this software and associated documentation * files (the "Software"), to deal in the Software without * restriction, including without limitation the rights to use, * copy, modify, merge, publish, distribute, sublicense, and/or sell * copies of the Software, and to permit persons to whom the * Software is furnished to do so, subject to the following * conditions: * * The above copyright notice and this permission notice shall be * included in all copies or substantial portions of the Software. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR * THE USE OR OTHER DEALINGS IN THE SOFTWARE. */ package htsjdk.variant.variantcontext; import htsjdk.beta.plugin.HtsRecord; import htsjdk.tribble.Feature; import htsjdk.tribble.TribbleException; import htsjdk.tribble.util.ParsingUtils; import htsjdk.variant.utils.GeneralUtils; import htsjdk.variant.vcf.*; import java.io.Serializable; import java.util.ArrayList; import java.util.Collection; import java.util.Collections; import java.util.Comparator; import java.util.EnumSet; import java.util.HashMap; import java.util.HashSet; import java.util.List; import java.util.Map; import java.util.Set; import java.util.regex.Pattern; import java.util.stream.Collectors; /** * * <h3> High-level overview </h3> * * The VariantContext object is a single general class system for representing genetic variation data composed of: * <ul> * <li>Allele: representing single genetic haplotypes (A, T, ATC, -) (note that null alleles are used here for illustration; see the Allele class for how to represent indels)</li> * <li>Genotype: an assignment of alleles for each chromosome of a single named sample at a particular locus</li> * <li>VariantContext: an abstract class holding all segregating alleles at a locus as well as genotypes * for multiple individuals containing alleles at that locus</li> * </ul> * <p> * The class system works by defining segregating alleles, creating a variant context representing the segregating * information at a locus, and potentially creating and associating genotypes with individuals in the context. *</p> *<p> * All of the classes are highly validating -- call <code>validate()</code> if you modify them -- so you can rely on the * self-consistency of the data once you have a <code>VariantContext</code> in hand. The system has a rich set of assessor * and manipulator routines, as well as more complex static support routines in <code>VariantContextUtils</code>. *</p> *<p> * The <code>VariantContext</code> (and <code>Genotype</code>) objects are attributed (supporting addition of arbitrary key/value pairs) and * filtered (can represent a variation that is viewed as suspect). *</p> *<p> *<code>VariantContext</code>s are dynamically typed, so whether a <code>VariantContext</code> is a SNP, Indel, or NoVariant depends * on the properties of the alleles in the context. See the detailed documentation on the <code>Type</code> parameter below. *</p> *<p> * It's also easy to create subcontexts based on selected genotypes. *</p> * <h3>Working with Variant Contexts</h3> * By default, VariantContexts are immutable. In order to access (in the rare circumstances where you need them) * setter routines, you need to create <code>MutableVariantContext</code>s and <code>MutableGenotype</code>s. * * <h3>Some example data </h3> *<pre> * Allele A, Aref, T, Tref; * Allele del, delRef, ATC, ATCref; *</pre> *<p> * A [ref] / T at 10 *</p> *<pre> * GenomeLoc snpLoc = GenomeLocParser.createGenomeLoc("chr1", 10, 10); *</pre> *<p> * A / ATC [ref] from 20-23 *</p> *<pre> * GenomeLoc delLoc = GenomeLocParser.createGenomeLoc("chr1", 20, 22); *</pre> *<p> * // A [ref] / ATC immediately after 20 * </p> * <pre> * GenomeLoc insLoc = GenomeLocParser.createGenomeLoc("chr1", 20, 20); *</pre> * <h3> Alleles </h3> * * See the documentation in the <code>Allele</code> class itself * * <h4>What are they?</h4> * * <p>Alleles can be either reference or non-reference</p> * * <p>Examples of alleles used here:</p> *<pre> * A = new Allele("A"); * Aref = new Allele("A", true); * T = new Allele("T"); * ATC = new Allele("ATC"); *</pre> * <h3> Creating variant contexts </h3> * * <h4> By hand </h4> * * Here's an example of a A/T polymorphism with the A being reference: * * <pre> * VariantContext vc = new VariantContext(name, snpLoc, Arrays.asList(Aref, T)); * </pre> * * If you want to create a non-variant site, just put in a single reference allele * * <pre> * VariantContext vc = new VariantContext(name, snpLoc, Arrays.asList(Aref)); * </pre> * * A deletion is just as easy: * * <pre> * VariantContext vc = new VariantContext(name, delLoc, Arrays.asList(ATCref, del)); * </pre> * * The only thing that distinguishes between an insertion and deletion is which is the reference allele. * An insertion has a reference allele that is smaller than the non-reference allele, and vice versa for deletions. * * <pre> * VariantContext vc = new VariantContext("name", insLoc, Arrays.asList(delRef, ATC)); * </pre> * * <h4> Converting rods and other data structures to <code>VariantContext</code>s </h4> * * You can convert many common types into VariantContexts using the general function: * * <pre> * VariantContextAdaptors.convertToVariantContext(name, myObject) * </pre> * * dbSNP and VCFs, for example, can be passed in as <code>myObject</code> and a <code>VariantContext</code> corresponding to that * object will be returned. A <code>null</code> return value indicates that the type isn't yet supported. This is the best * and easiest way to create contexts using RODs. * * * <h3> Working with genotypes </h3> * * <pre> * List<Allele> alleles = Arrays.asList(Aref, T); * Genotype g1 = new Genotype(Arrays.asList(Aref, Aref), "g1", 10); * Genotype g2 = new Genotype(Arrays.asList(Aref, T), "g2", 10); * Genotype g3 = new Genotype(Arrays.asList(T, T), "g3", 10); * VariantContext vc = new VariantContext(snpLoc, alleles, Arrays.asList(g1, g2, g3)); * </pre> * * At this point we have 3 genotypes in our context, g1-g3. * * You can assess a good deal of information about the genotypes through the <code>VariantContext</code>: * * <pre> * vc.hasGenotypes() * vc.isMonomorphicInSamples() * vc.isPolymorphicInSamples() * vc.getSamples().size() * * vc.getGenotypes() * vc.getGenotypes().get("g1") * vc.hasGenotype("g1") * * vc.getCalledChrCount() * vc.getCalledChrCount(Aref) * vc.getCalledChrCount(T) * </pre> * * <h3> NO_CALL alleles </h3> * * The system allows one to create <code>Genotype</code>s carrying special NO_CALL alleles that aren't present in the * set of context alleles and that represent undetermined alleles in a genotype: *<pre> * Genotype g4 = new Genotype(Arrays.asList(Allele.NO_CALL, Allele.NO_CALL), "NO_DATA_FOR_SAMPLE", 10); *</pre> * * <h3> subcontexts </h3> * It's also very easy get subcontext based only the data in a subset of the genotypes: * * <pre> * VariantContext vc12 = vc.subContextFromGenotypes(Arrays.asList(g1,g2)); * VariantContext vc1 = vc.subContextFromGenotypes(Arrays.asList(g1)); * </pre> * * <!-- comment by jdenvir: not sure what this tag is supposed to do:--> * <!-- <s3> --> * <h3>Fully decoding.</h3> * Currently <code>VariantContext</code>s support some fields, particularly those * stored as generic attributes, to be of any type. For example, a field AB might * be naturally a floating point number, 0.51, but when it's read into a VC its * not decoded into the Java presentation but left as a string "0.51". A fully * decoded <code>VariantContext</code> is one where all values have been converted to their * corresponding Java object types, based on the types declared in a <code>VCFHeader</code>. * * The <code>fullyDecode(...)</code> method takes a header object and creates a new fully decoded <code>VariantContext</code> * where all fields are converted to their true java representation. The <code>VCBuilder</code> * can be told that all fields are fully decoded, in which case no work is done when * asking for a fully decoded version of the VC. * <!-- </s3> --> * */ public class VariantContext implements HtsRecord, Feature, Serializable { public static final long serialVersionUID = 1L; private static final boolean WARN_ABOUT_BAD_END = true; private static final int MAX_ALLELE_SIZE_FOR_NON_SV = 150; private boolean fullyDecoded = false; protected CommonInfo commonInfo = null; public static final double NO_LOG10_PERROR = CommonInfo.NO_LOG10_PERROR; public static final Set<String> PASSES_FILTERS = Collections.emptySet(); /** The location of this VariantContext */ protected final String contig; protected final long start; protected final long stop; private final String ID; /** The type (cached for performance reasons) of this context */ protected Type type = null; /** The type of this context, cached separately if ignoreNonRef is true */ protected Type typeIgnoringNonRef = null; /** A set of the alleles segregating in this context */ protected final List<Allele> alleles; /** A mapping from sampleName -> genotype objects for all genotypes associated with this context */ protected GenotypesContext genotypes = null; /** Counts for each of the possible Genotype types in this context */ protected int[] genotypeCounts = null; public static final GenotypesContext NO_GENOTYPES = GenotypesContext.NO_GENOTYPES; // a fast cached access point to the ref / alt alleles for biallelic case private Allele REF = null; // set to the alt allele when biallelic, otherwise == null private Allele ALT = null; /* cached monomorphic value: null -> not yet computed, False, True */ private Boolean monomorphic = null; /* * Determine which genotype fields are in use in the genotypes in VC * @return an ordered list of genotype fields in use in VC. If vc has genotypes this will always include GT first */ public List<String> calcVCFGenotypeKeys(final VCFHeader header) { final Set<String> keys = new HashSet<>(); boolean sawGoodGT = false; boolean sawGoodQual = false; boolean sawGenotypeFilter = false; boolean sawDP = false; boolean sawAD = false; boolean sawPL = false; for (final Genotype g : this.getGenotypes()) { keys.addAll(g.getExtendedAttributes().keySet()); if ( g.isAvailable() ) sawGoodGT = true; if ( g.hasGQ() ) sawGoodQual = true; if ( g.hasDP() ) sawDP = true; if ( g.hasAD() ) sawAD = true; if ( g.hasPL() ) sawPL = true; if (g.isFiltered()) sawGenotypeFilter = true; } if ( sawGoodQual ) keys.add(VCFConstants.GENOTYPE_QUALITY_KEY); if ( sawDP ) keys.add(VCFConstants.DEPTH_KEY); if ( sawAD ) keys.add(VCFConstants.GENOTYPE_ALLELE_DEPTHS); if ( sawPL ) keys.add(VCFConstants.GENOTYPE_PL_KEY); if ( sawGenotypeFilter ) keys.add(VCFConstants.GENOTYPE_FILTER_KEY); List<String> sortedList = ParsingUtils.sortList(new ArrayList<>(keys)); // make sure the GT is first if (sawGoodGT) { final List<String> newList = new ArrayList<>(sortedList.size() + 1); newList.add(VCFConstants.GENOTYPE_KEY); newList.addAll(sortedList); sortedList = newList; } if (sortedList.isEmpty() && header.hasGenotypingData()) { // this needs to be done in case all samples are no-calls return Collections.singletonList(VCFConstants.GENOTYPE_KEY); } else { return sortedList; } } // --------------------------------------------------------------------------------------------------------- // // validation mode // // --------------------------------------------------------------------------------------------------------- //no controls and white-spaces characters, no semicolon. public static final Pattern VALID_FILTER = Pattern.compile("^[!-:<-~]+$"); public enum Validation { ALLELES() { void validate(VariantContext variantContext) { validateAlleles(variantContext); } }, GENOTYPES() { void validate(VariantContext variantContext) { validateGenotypes(variantContext); } }, FILTERS { void validate(VariantContext variantContext) { validateFilters(variantContext); } }; abstract void validate(VariantContext variantContext); private static void validateAlleles(fina