htsjdk库VariantContext类介绍

avatar
作者
筋斗云
阅读量:0

HTSJDK是一个用于处理高通量测序数据的 Java 库,其中VariantContext是这个库中的一个重要类,用于表示变异数据。VariantContext类是htsjdk.variant包的一部分,主要用于处理和描述变异(如 SNPs、插入和删除等)。

VariantContext类介绍

功能

VariantContext类用于存储和操作与变异相关的信息,包括:

  • 变异的位置(位置坐标)
  • 变异的类型(SNP、插入、删除等)
  • 变异的等位基因
  • 变异的质量分数
  • 变异的过滤信息
  • 样本级别的信息(如基因型)
主要属性和方法
  1. 位置

    • getContig(): 返回变异所在的染色体或 contig 名称。
    • getStart(): 返回变异的起始位置(1-based)。
    • getEnd(): 返回变异的结束位置(1-based)。
  2. 变异类型

    • isSNP(): 检查变异是否是一个单核苷酸多态性(SNP)。
    • isIndel(): 检查变异是否是插入或删除(Indel)。
  3. 等位基因

    • getAlleles(): 返回变异的等位基因列表。
    • getReference(): 返回参考等位基因。
    • getAlternateAlleles(): 返回变异等位基因列表。
  4. 基因型

    • getGenotypes(): 返回所有样本的基因型信息。
    • getGenotype(String sampleName): 获取特定样本的基因型。
  5. 过滤和质量

    • isFiltered(): 检查变异是否被过滤掉。
    • getPhredScaledQual(): 返回变异的质量分数。
  6. 附加信息

    • getAttributes(): 获取变异的附加信息(例如注释、额外的标记等)。

源代码:

/* * Copyright (c) 2012 The Broad Institute *  * Permission is hereby granted, free of charge, to any person * obtaining a copy of this software and associated documentation * files (the "Software"), to deal in the Software without * restriction, including without limitation the rights to use, * copy, modify, merge, publish, distribute, sublicense, and/or sell * copies of the Software, and to permit persons to whom the * Software is furnished to do so, subject to the following * conditions: *  * The above copyright notice and this permission notice shall be * included in all copies or substantial portions of the Software. *  * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR * THE USE OR OTHER DEALINGS IN THE SOFTWARE. */  package htsjdk.variant.variantcontext;  import htsjdk.beta.plugin.HtsRecord; import htsjdk.tribble.Feature; import htsjdk.tribble.TribbleException; import htsjdk.tribble.util.ParsingUtils; import htsjdk.variant.utils.GeneralUtils; import htsjdk.variant.vcf.*;  import java.io.Serializable; import java.util.ArrayList; import java.util.Collection; import java.util.Collections; import java.util.Comparator; import java.util.EnumSet; import java.util.HashMap; import java.util.HashSet; import java.util.List; import java.util.Map; import java.util.Set; import java.util.regex.Pattern; import java.util.stream.Collectors;  /**  *   * <h3> High-level overview </h3>  *  * The VariantContext object is a single general class system for representing genetic variation data composed of:  * <ul>  * <li>Allele: representing single genetic haplotypes (A, T, ATC, -) (note that null alleles are used here for illustration; see the Allele class for how to represent indels)</li>  * <li>Genotype: an assignment of alleles for each chromosome of a single named sample at a particular locus</li>  * <li>VariantContext: an abstract class holding all segregating alleles at a locus as well as genotypes  *    for multiple individuals containing alleles at that locus</li>  * </ul>  * <p>  * The class system works by defining segregating alleles, creating a variant context representing the segregating  * information at a locus, and potentially creating and associating genotypes with individuals in the context.  *</p>  *<p>  * All of the classes are highly validating -- call <code>validate()</code> if you modify them -- so you can rely on the  * self-consistency of the data once you have a <code>VariantContext</code> in hand.  The system has a rich set of assessor  * and manipulator routines, as well as more complex static support routines in <code>VariantContextUtils</code>.  *</p>  *<p>  * The <code>VariantContext</code> (and <code>Genotype</code>) objects are attributed (supporting addition of arbitrary key/value pairs) and  * filtered (can represent a variation that is viewed as suspect).  *</p>  *<p>  *<code>VariantContext</code>s are dynamically typed, so whether a <code>VariantContext</code> is a SNP, Indel, or NoVariant depends  * on the properties of the alleles in the context.  See the detailed documentation on the <code>Type</code> parameter below.  *</p>  *<p>  * It's also easy to create subcontexts based on selected genotypes.  *</p>  * <h3>Working with Variant Contexts</h3>  * By default, VariantContexts are immutable.  In order to access (in the rare circumstances where you need them)  * setter routines, you need to create <code>MutableVariantContext</code>s and <code>MutableGenotype</code>s.  *  * <h3>Some example data </h3>  *<pre>  * Allele A, Aref, T, Tref;  * Allele del, delRef, ATC, ATCref;  *</pre>  *<p>  * A [ref] / T at 10  *</p>  *<pre>   * GenomeLoc snpLoc = GenomeLocParser.createGenomeLoc("chr1", 10, 10);  *</pre>  *<p>  * A / ATC [ref] from 20-23  *</p>  *<pre>  * GenomeLoc delLoc = GenomeLocParser.createGenomeLoc("chr1", 20, 22);  *</pre>  *<p>  *  // A [ref] / ATC immediately after 20  *  </p>  *  <pre>  * GenomeLoc insLoc = GenomeLocParser.createGenomeLoc("chr1", 20, 20);  *</pre>  * <h3> Alleles </h3>  *  * See the documentation in the <code>Allele</code> class itself  *  * <h4>What are they?</h4>  *  * <p>Alleles can be either reference or non-reference</p>  *  * <p>Examples of alleles used here:</p>  *<pre>  *   A = new Allele("A");  *   Aref = new Allele("A", true);  *   T = new Allele("T");  *   ATC = new Allele("ATC");  *</pre>  * <h3> Creating variant contexts </h3>  *  * <h4> By hand </h4>  *  * Here's an example of a A/T polymorphism with the A being reference:  *  * <pre>  * VariantContext vc = new VariantContext(name, snpLoc, Arrays.asList(Aref, T));  * </pre>  *  * If you want to create a non-variant site, just put in a single reference allele  *  * <pre>  * VariantContext vc = new VariantContext(name, snpLoc, Arrays.asList(Aref));  * </pre>  *  * A deletion is just as easy:  *  * <pre>  * VariantContext vc = new VariantContext(name, delLoc, Arrays.asList(ATCref, del));  * </pre>  *  * The only thing that distinguishes between an insertion and deletion is which is the reference allele.  * An insertion has a reference allele that is smaller than the non-reference allele, and vice versa for deletions.  *  * <pre>  * VariantContext vc = new VariantContext("name", insLoc, Arrays.asList(delRef, ATC));  * </pre>  *  * <h4> Converting rods and other data structures to <code>VariantContext</code>s </h4>  *  * You can convert many common types into VariantContexts using the general function:  *  * <pre>  * VariantContextAdaptors.convertToVariantContext(name, myObject)  * </pre>  *  * dbSNP and VCFs, for example, can be passed in as <code>myObject</code> and a <code>VariantContext</code> corresponding to that  * object will be returned.  A <code>null</code> return value indicates that the type isn't yet supported.  This is the best  * and easiest way to create contexts using RODs.  *  *  * <h3> Working with genotypes </h3>  *  * <pre>  * List&lt;Allele&gt; alleles = Arrays.asList(Aref, T);  * Genotype g1 = new Genotype(Arrays.asList(Aref, Aref), "g1", 10);  * Genotype g2 = new Genotype(Arrays.asList(Aref, T), "g2", 10);  * Genotype g3 = new Genotype(Arrays.asList(T, T), "g3", 10);  * VariantContext vc = new VariantContext(snpLoc, alleles, Arrays.asList(g1, g2, g3));  * </pre>  *  * At this point we have 3 genotypes in our context, g1-g3.  *  * You can assess a good deal of information about the genotypes through the <code>VariantContext</code>:  *  * <pre>  * vc.hasGenotypes()  * vc.isMonomorphicInSamples()  * vc.isPolymorphicInSamples()  * vc.getSamples().size()  *  * vc.getGenotypes()  * vc.getGenotypes().get("g1")  * vc.hasGenotype("g1")  *  * vc.getCalledChrCount()  * vc.getCalledChrCount(Aref)  * vc.getCalledChrCount(T)  * </pre>  *  * <h3> NO_CALL alleles </h3>  *  * The system allows one to create <code>Genotype</code>s carrying special NO_CALL alleles that aren't present in the  * set of context alleles and that represent undetermined alleles in a genotype:  *<pre>  * Genotype g4 = new Genotype(Arrays.asList(Allele.NO_CALL, Allele.NO_CALL), "NO_DATA_FOR_SAMPLE", 10);  *</pre>  *  * <h3> subcontexts </h3>  * It's also very easy get subcontext based only the data in a subset of the genotypes:  *  * <pre>  * VariantContext vc12 = vc.subContextFromGenotypes(Arrays.asList(g1,g2));  * VariantContext vc1 = vc.subContextFromGenotypes(Arrays.asList(g1));  * </pre>  *  * <!-- comment by jdenvir: not sure what this tag is supposed to do:-->  * <!-- <s3> -->  *     <h3>Fully decoding.</h3>    *     Currently <code>VariantContext</code>s support some fields, particularly those  *     stored as generic attributes, to be of any type.  For example, a field AB might  *     be naturally a floating point number, 0.51, but when it's read into a VC its  *     not decoded into the Java presentation but left as a string "0.51".  A fully  *     decoded <code>VariantContext</code> is one where all values have been converted to their  *     corresponding Java object types, based on the types declared in a <code>VCFHeader</code>.  *  *     The <code>fullyDecode(...)</code> method takes a header object and creates a new fully decoded <code>VariantContext</code>  *     where all fields are converted to their true java representation.  The <code>VCBuilder</code>  *     can be told that all fields are fully decoded, in which case no work is done when  *     asking for a fully decoded version of the VC.  * <!-- </s3> -->  *  */ public class VariantContext implements HtsRecord, Feature, Serializable {     public static final long serialVersionUID = 1L;      private static final boolean WARN_ABOUT_BAD_END = true;     private static final int MAX_ALLELE_SIZE_FOR_NON_SV = 150;     private boolean fullyDecoded = false;     protected CommonInfo commonInfo = null;     public static final double NO_LOG10_PERROR = CommonInfo.NO_LOG10_PERROR;      public static final Set<String> PASSES_FILTERS = Collections.emptySet();      /** The location of this VariantContext */     protected final String contig;     protected final long start;     protected final long stop;     private final String ID;      /** The type (cached for performance reasons) of this context */     protected Type type = null;      /** The type of this context, cached separately if ignoreNonRef is true */     protected Type typeIgnoringNonRef = null;      /** A set of the alleles segregating in this context */     protected final List<Allele> alleles;      /** A mapping from sampleName -&gt; genotype objects for all genotypes associated with this context */     protected GenotypesContext genotypes = null;      /** Counts for each of the possible Genotype types in this context */     protected int[] genotypeCounts = null;      public static final GenotypesContext NO_GENOTYPES = GenotypesContext.NO_GENOTYPES;      // a fast cached access point to the ref / alt alleles for biallelic case     private Allele REF = null;      // set to the alt allele when biallelic, otherwise == null     private Allele ALT = null;      /* cached monomorphic value: null -> not yet computed, False, True */     private Boolean monomorphic = null;      /*      * Determine which genotype fields are in use in the genotypes in VC      * @return an ordered list of genotype fields in use in VC.  If vc has genotypes this will always include GT first      */      public List<String> calcVCFGenotypeKeys(final VCFHeader header) {         final Set<String> keys = new HashSet<>();          boolean sawGoodGT = false;         boolean sawGoodQual = false;         boolean sawGenotypeFilter = false;         boolean sawDP = false;         boolean sawAD = false;         boolean sawPL = false;         for (final Genotype g : this.getGenotypes()) {             keys.addAll(g.getExtendedAttributes().keySet());             if ( g.isAvailable() ) sawGoodGT = true;             if ( g.hasGQ() ) sawGoodQual = true;             if ( g.hasDP() ) sawDP = true;             if ( g.hasAD() ) sawAD = true;             if ( g.hasPL() ) sawPL = true;             if (g.isFiltered()) sawGenotypeFilter = true;         }          if ( sawGoodQual ) keys.add(VCFConstants.GENOTYPE_QUALITY_KEY);         if ( sawDP ) keys.add(VCFConstants.DEPTH_KEY);         if ( sawAD ) keys.add(VCFConstants.GENOTYPE_ALLELE_DEPTHS);         if ( sawPL ) keys.add(VCFConstants.GENOTYPE_PL_KEY);         if ( sawGenotypeFilter ) keys.add(VCFConstants.GENOTYPE_FILTER_KEY);          List<String> sortedList = ParsingUtils.sortList(new ArrayList<>(keys));          // make sure the GT is first         if (sawGoodGT) {             final List<String> newList = new ArrayList<>(sortedList.size() + 1);             newList.add(VCFConstants.GENOTYPE_KEY);             newList.addAll(sortedList);             sortedList = newList;         }          if (sortedList.isEmpty() && header.hasGenotypingData()) {             // this needs to be done in case all samples are no-calls             return Collections.singletonList(VCFConstants.GENOTYPE_KEY);         } else {             return sortedList;         }     }       // ---------------------------------------------------------------------------------------------------------     //     // validation mode     //     // ---------------------------------------------------------------------------------------------------------      //no controls and white-spaces characters, no semicolon.     public static final Pattern VALID_FILTER = Pattern.compile("^[!-:<-~]+$");      public enum Validation {         ALLELES() {             void validate(VariantContext variantContext) {                 validateAlleles(variantContext);             }         },         GENOTYPES() {             void validate(VariantContext variantContext) {                 validateGenotypes(variantContext);             }         },         FILTERS {             void validate(VariantContext variantContext) {                 validateFilters(variantContext);             }         };          abstract void validate(VariantContext variantContext);           private static void validateAlleles(fina

广告一刻

为您即时展示最新活动产品广告消息,让您随时掌握产品活动新动态!