2012-04-03

Missing feature locations in GenBank "with parts"

My last blog post was on a problem with missing NCBI GenBank feature location information for trans-spliced genes when using GenBank 'with parts'. That was fixed, but while using EFetch a colleague just stumbled over something possibly related where the location is given as just a question mark. This took a little digging to understand...

Unfortunately this example is quite large file, rice chromosome one - to be precise 43342410 bp Oryza sativa Japonica Group chr1 (accession BA000010.8). If you look at it in the default GenBank representation, the file is very small because it contains no gene features and no sequence:

http://www.ncbi.nlm.nih.gov/nuccore/BA000010.8?report=genbank&log$=seqview

Essentially all you get is a long CONTIG entry telling you how to construct the complete record from the parts - here's an except with the triple dots indicating omitted text:
...
CONTIG      join(AP003727.3:1..173729,AP002818.2:17598..144644,
            ...
            AP003073.2:4706..172541,AP003260.5:19200..169071,
            AP003247.4:62447..155432,AP003570.3:77008..131918,
            AP004331.2:2247..88088,AP003411.3:35707..132059,
            ...
            AP003228.3:32902..149195)
//


The reason I've highlighted part of that in yellow will be explained below. Interpreting that CONTIG information is pretty fiddly, so the NCBI kindly do this for your if you switch the display to "GenBank (full)" aka "GenBank with parts" (from the EFetch terminology gbwithparts). This gives an 11MB file which does include the genes and sequence. Here's the problem section:
     gene            complement(<34564734..>34566625)
                     /gene="P0468B07.50"
     mRNA            complement(join(<34564734..34565162,34565386..34565711,
                     34566175..>34566625))
                     /gene="P0468B07.50"
                     /note="start and end point are not identified"
     CDS             complement(join(34564734..34565162,34565386..34565711,
                     34566175..34566625))
                     /gene="P0468B07.50"
                     /note="predicted by FGENESH etc."
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="BAB89653.1"
                     /db_xref="GI:20160710"
     gene            ?
                     /gene="P0425G02.14"
     misc_feature    ?
                     /gene="P0425G02.14"
                     /note="hypothetical ORF;
                     predicted by GENSCAN;
                     this category is not included in IRGSP standard"
     gene            join(34567919..34568144,34569041..34569118,
                     34572036..34572162,34573059..34573271,34573332..34573422,
                     34574365..34574565)
                     /gene="P0425G02.15"
     misc_feature    join(34567919..34568144,34569041..34569118,
                     34572036..34572162,34573059..34573271,34573332..34573422,
                     34574365..34574565)
                     /gene="P0425G02.15"
                     /note="hypothetical ORF;
                     predicted by GENSCAN;
                     this category is not included in IRGSP standard"

Highlighted in yellow are the gene and misc_feature entries for a hypothetical gene,  P0425G02.14, and in orange the invalid location string of a question mark. This means a strict GenBank parser will reject this file.

Hypothetical gene P0425G02.14 is a 177bp fragment from within the 155432 bp sequence AP003247.4. Here's the relevant chunk of annotation, again in GenBank format:

     gene            complement(<59783..>61674)
                     /gene="P0425G02.13"
     mRNA            complement(join(<59783..60211,60435..60760,61224..>61674))
                     /gene="P0425G02.13"
                     /note="start and end point are not identified"
     CDS             complement(join(59783..60211,60435..60760,61224..61674))
                     /gene="P0425G02.13"
                     /note="predicted by FGENESH etc."
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="BAB85237.1"
                     /db_xref="GI:18844766"
                     /translation="MAESSESAWPQQSQQLQISSTMPAGSAWPEEENLENLEQPLPLL
                     MPSSEDHREQQLVPVPWLQQDQDQEWHEQEQFLPLKNQNQEQLQDQQPLQDQEETRRY
                     LGVPGIRFVPSDIELILDFLRPKLRGEQLPSYSYMHVCDVYSDHPKELTSKLGPSREG
                     NWYMFSPRNRKYNKGKRPSRSTGQLGFWKSTTKNEAVLDALSDNMLIGYKACLTYHEY
                     DESMPTPKLKKENAIKTPWKMWEFVCSNSNRPFDAEEEPMRLNDWVLCKVTNKDNKVT
                     TKKFKPQRSKKPKKPKKLQQEEQPQNQGIVIRQPSESGSASSSHQEIPGSSLPGAGGD
                     AAAAAATAAAVVDPMPLHMIPPSSWNYFSTGVTADGIVMDDSTGVDSYGCVDGAGALN
                     FQRNIFYHR"
     gene            join(62210..62212,62274..62447)
                     /gene="P0425G02.14"
     misc_feature    join(62210..62212,62274..62447)
                     /gene="P0425G02.14"
                     /note="hypothetical ORF;
                     predicted by GENSCAN;
                     this category is not included in IRGSP standard"
     gene            join(62968..63193,64090..64167,67085..67211,68108..68320,
                     68381..68471,69414..69614)
                     /gene="P0425G02.15"
     misc_feature    join(62968..63193,64090..64167,67085..67211,68108..68320,
                     68381..68471,69414..69614)
                     /gene="P0425G02.15"
                     /note="hypothetical ORF;
                     predicted by GENSCAN;
                     this category is not included in IRGSP standard"

Referring to the CONTIG line earlier from the original GenBank file, we see the assembled chromosome includes only bases 62447 to 155432 from AP003247.4, which seems to explain the problem. This only covers part of the P0425G02.14 gene (but all of the subsequent gene P0425G02.15 is taken, which transferred fine).

In fact, the only bit of the hypothetical gene P0425G02.14 included is the very last base! What does this mean? I don't know how the NCBI normally handle this, but I suspect that either the feature should be absent completely, or it should have a location <3450495..3450495 indicating it starts before base 3450495 and ends at base 3450495 making it at least one bp in length. Ugly I agree.

[That is based on the offset of 34504951 being applied to P0425G02.14, where the start AP003247.4:62968 maps to BA000010.8:34567919 - we don't have to worry about flipping the strand here. Thus the final base of the feature, AP003247.462447, maps to BA000010.8:34504951 in the assembled chromosome.]

I presume that normal best practice when splice together the fragments into a scaffolded chromosome it to avoid breaking any annotated features like this. Moving the cut one base forward (right on the boundary between P0425G02.14 and P0425G02.15 which abut) would solve this case. Of course in general, this isn't going to be possible - so the NCBI code for gbwithparts needs to cover transferring partial features.

I'm about to email the NCBI a bug report pointing at this blog post for the details.

Update 24 May 2012

I've not heard back from the NCBI since an email on 10 April 2012 saying the issue was being passed to the RefSeq team. Since the problem is still there, I've sent a follow up email.

Update 24 May 2012

In reply to my email, "The developers have identified the problem and are working on a fix."

No comments:

Post a Comment