Title: Format of RTPS files (2003-10-10)

1 Terms

[TU (Transcriptional Unit)]
Genomic (discontinuous) region from which one mature mRNA is derived. If multiple transcripts overlap on the genome, their TUs are merged into one.
[TU-based clusters]
Transcripts grouped by the TU from which they are transcribed.
[RTS (Representative Transcript Set)]
A set of transcripts, one selected from every TU. The number of transcripts in RTS = the number of TUs.
[RPS (Representative Protein Set)]
A set of proteins that are selected from every TU that has translated sequences
[variant-based clusters]
Grouped transcripts based on splicing patterns
[VTS (Variant-based representative Transcript Set)]
A set of transcripts that are selected from every splicing pattern
[VPS (Variant-based representative Protein Set)]
A set of proteins that are selected from every splicing pattern that has translated sequences
[CTS (Consensus Transcript Set)]
A set of consensus transcripts that are generated from genomic sequences in TU

2 ID format

[TU ID]
integer
[Variant ID]
(TU ID) '.' integer
[RTS ID]
'T' ('A' or 'B') (TU ID)
[RPS ID]
'PA' (TU ID)
[CTS ID]
'CA' (TU ID)
[VTS ID]
'T' ('A' or 'B') (Variant ID)
[VPS ID]
'PA' (Variant ID)
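
The ID grammar above can be checked mechanically. The following is a minimal sketch in Python, assuming that IDs contain no whitespace and that the integer parts are plain decimal digits. The function name parse_rtps_id and the pattern are illustrative and not part of the RTPS distribution.

```python
import re

# Optional prefix ('TA', 'TB', 'PA' or 'CA'), a TU ID (integer) and an
# optional '.' integer suffix that turns the TU ID into a Variant ID.
ID_PATTERN = re.compile(r"(TA|TB|PA|CA)?(\d+)(?:\.(\d+))?")


def parse_rtps_id(identifier):
    """Split an ID into (prefix, TU ID, variant number or None).

    Hypothetical helper: bare TU IDs and Variant IDs get an empty prefix.
    """
    match = ID_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a recognised ID: {identifier!r}")
    prefix, tu_id, variant = match.groups()
    # According to the list above, a CTS ID is built from a TU ID only.
    if prefix == "CA" and variant is not None:
        raise ValueError(f"CTS ID must not carry a variant part: {identifier!r}")
    return prefix or "", int(tu_id), int(variant) if variant is not None else None
```

With this sketch, an ID whose variant part is None is TU-based (TU, RTS, RPS, CTS), and one with a variant number is variant-based (Variant, VTS, VPS).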
[References to external databases]

(DB name) '|' (ID/Accession number in the DB)

List of DB names

'GB'
DDBJ/EMBL/GenBank DNA sequences
'GP'
GenPept sequences (Translated peptides from DDBJ/EMBL/GenBank sequences)
'REFSEQ'
NCBI RefSeq sequences
'ENSEMBL'
Ensembl predicted transcripts/proteins
'LocusLink'
NCBI LocusLink database
'UniGene'
NCBI UniGene database
'SWISSPROT'
SWISSPROT protein database
'TrEMBL'
TrEMBL protein database
'MGD'
MGD database in the Jackson Laboratory
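
A database reference can be split at the first '|' and its DB name checked against the list above. This is a minimal sketch in Python, assuming that DB names are case-sensitive exactly as listed. The function name parse_dbref is hypothetical.

```python
# DB names exactly as listed above.
KNOWN_DB_NAMES = {
    "GB", "GP", "REFSEQ", "ENSEMBL", "LocusLink",
    "UniGene", "SWISSPROT", "TrEMBL", "MGD",
}


def parse_dbref(reference):
    """Split '(DB name)|(ID)' into a (db_name, identifier) tuple."""
    # Split at the first '|' only, in case the identifier itself contains one.
    db_name, separator, identifier = reference.partition("|")
    if not separator or not identifier:
        raise ValueError(f"missing '|' or identifier: {reference!r}")
    if db_name not in KNOWN_DB_NAMES:
        raise ValueError(f"unknown DB name: {db_name!r}")
    return db_name, identifier
```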
[Transcript ID (TPacc)]
[CDS/Longest ORF region]
(region: location format in DDBJ/EMBL/GenBank features) ' ' '+' (codon start: '1', '2' or '3') ' ' (frameshift positions, if any: 'pos-X', 'pos+N')

3 Formats of files

[RELEASE]
Release file containing the build date, the number of sequences and the list of public databases
[rts.fasta]
FASTA file of RTS
[rts.txt]
RTS information in tab-delimited file
  1. RTS ID
  2. Accession number of representative transcript
  3. Definition in GenBank/RefSeq
  4. LocusLink ID
  5. Definition in LocusLink
  6. UniGene ID
  7. Definition in UniGene
  8. MGI Gene Marker ID
  9. Definition in MGD
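
The nine columns above can be read with Python's csv module. This is a minimal sketch, assuming that the file has no header line, that empty columns are kept as empty strings, and that no quoting is used. The field names and the function read_set_table are illustrative only.

```python
import csv
from collections import namedtuple

# One field per column, in the order listed above.
SetRecord = namedtuple("SetRecord", [
    "set_id", "accession", "definition",
    "locuslink_id", "locuslink_definition",
    "unigene_id", "unigene_definition",
    "mgi_id", "mgd_definition",
])


def read_set_table(lines):
    """Yield one SetRecord per non-empty tab-delimited line."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != 9:
            raise ValueError(f"expected 9 columns, got {len(row)}")
        yield SetRecord(*row)
```

An open file object can be passed as lines, for example read_set_table(open("rts.txt", newline="")).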
[rts.das]

Genomic coordinates of RTS. The format is the one used in LDAS.

Format of annotation section (1 exon=1 line)

  1. 'RTS'
  2. RTS ID
  3. 'exon'
  4. 'similarity'
  5. Chromosome reference
  6. Start position on the genome
  7. End position on the genome
  8. strand ('+' or '-')
  9. '.'
  10. %-identity
  11. Start position on the transcript
  12. End position on the transcript
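
One annotation line can be turned into a typed record as sketched below in Python. The sketch assumes that the twelve columns are tab-separated and that %-identity is a plain number (a trailing '%' is tolerated). Neither point is stated above, and the names ExonLine and parse_exon_line are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExonLine:
    set_name: str          # column 1, e.g. 'RTS'
    set_id: str            # column 2
    chromosome: str        # column 5
    genome_start: int      # column 6
    genome_end: int        # column 7
    strand: str            # column 8, '+' or '-'
    identity: float        # column 10, %-identity
    transcript_start: int  # column 11
    transcript_end: int    # column 12


def parse_exon_line(line):
    """Parse one exon line of the annotation section."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    if fields[7] not in ("+", "-"):
        raise ValueError(f"invalid strand: {fields[7]!r}")
    return ExonLine(
        set_name=fields[0],
        set_id=fields[1],
        chromosome=fields[4],
        genome_start=int(fields[5]),
        genome_end=int(fields[6]),
        strand=fields[7],
        identity=float(fields[9].rstrip("%")),
        transcript_start=int(fields[10]),
        transcript_end=int(fields[11]),
    )
```

Because one line describes one exon, the exons of a transcript can be collected by grouping the parsed records on set_id.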
[rps.fasta]
FASTA file of RPS
[rps.txt]
RPS information in tab-delimited file
  1. RPS ID
  2. Accession number of representative protein
  3. Definition in GenBank/RefSeq
  4. LocusLink ID
  5. Definition in LocusLink
  6. UniGene ID
  7. Definition in UniGene
  8. MGI Gene Marker ID
  9. Definition in MGD
[allseq.txt]
Result of TU-based clustering
  1. RTS ID
  2. Transcript ID of representative transcript
  3. RPS ID
  4. Transcript ID of representative protein
  5. Transcript IDs of all transcripts in TU [delimiter: ' ' (single space)]
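
A line of this file can be split as sketched below in Python. The sketch assumes that the five columns are tab-separated, which is not stated above; only the single-space delimiter inside column 5 comes from this document. The function name parse_cluster_line and the dictionary keys are illustrative.

```python
def parse_cluster_line(line):
    """Parse one cluster line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}")
    set_id, set_tpacc, protein_id, protein_tpacc, members = fields
    return {
        "transcript_set_id": set_id,          # column 1
        "transcript_tpacc": set_tpacc,        # column 2
        "protein_set_id": protein_id,         # column 3
        "protein_tpacc": protein_tpacc,       # column 4
        # Column 5: Transcript IDs separated by a single space.
        "members": members.split(" ") if members else [],
    }
```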
[rtps.dat]
Information about TU-based cluster/RTS/RPS (this file will become obsolete)
[vts.fasta]
FASTA file of VTS
[vts.txt]
VTS information in tab-delimited file
  1. VTS ID
  2. Accession number of representative transcript
  3. Definition in GenBank/RefSeq
  4. LocusLink ID
  5. Definition in LocusLink
  6. UniGene ID
  7. Definition in UniGene
  8. MGI Gene Marker ID
  9. Definition in MGD
[vts.das]
Genomic coordinates of VTS. The format is the one used in LDAS.
  1. 'VTS'
  2. VTS ID
  3. 'exon'
  4. 'similarity'
  5. Chromosome reference
  6. Start position on the genome
  7. End position on the genome
  8. strand ('+' or '-')
  9. '.'
  10. %-identity
  11. Start position on the transcript
  12. End position on the transcript
[vps.fasta]
FASTA file of VPS
[vps.txt]
VPS information in tab-delimited file
  1. VPS ID
  2. Accession number of representative protein
  3. Definition in GenBank/RefSeq
  4. LocusLink ID
  5. Definition in LocusLink
  6. UniGene ID
  7. Definition in UniGene
  8. MGI Gene Marker ID
  9. Definition in MGD
[allvseq.txt]
Result of splicing pattern-based clustering
  1. VTS ID
  2. Transcript ID of representative transcript
  3. VPS ID
  4. Transcript ID of representative protein
  5. Transcript IDs of all transcripts in variant group [delimiter: ' ']
[vtps.dat]
Information about variant-based cluster/VTS/VPS (this file will become obsolete)
[excluded_transcripts.txt]
A list of transcripts excluded from the RTPS build.
  1. Accession number of excluded transcript
  2. Reason
    'Immune'
    T-cell receptor or immunoglobulin transcript
    'No info'
    There is no information to group into TUs (Not mapped & not recorded in any public database)
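
As a quick check of a build, the exclusion reasons can be tallied. This is a minimal sketch in Python, assuming that the two columns are tab-separated; the function name count_exclusion_reasons is hypothetical.

```python
from collections import Counter


def count_exclusion_reasons(lines):
    """Count how many excluded transcripts carry each reason."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Column 1 is the accession number, column 2 the reason.
        _accession, _separator, reason = line.partition("\t")
        counts[reason] += 1
    return counts
```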
[cts.das]
Genomic coordinates of CTS. The format is the one used in LDAS.
  1. 'CTS'
  2. CTS ID
  3. 'exon'
  4. 'similarity'
  5. Chromosome reference
  6. Start position on the genome
  7. End position on the genome
  8. strand ('+' or '-')
  9. '.'
  10. %-identity
  11. Start position on the transcript
  12. End position on the transcript
[rtps.yaml]
Information about TU-based cluster/RTS/RPS. YAML format
'TUID':
TU ID
'RTS':
Information about a representative transcript
'RTS_ID':
RTS ID
'RTS_acc':
Accession number of the representative transcript (database reference)
'RTS_TPacc':
Transcript ID of the representative transcript
'RPS':
Information about a representative protein
'RPS_ID':
RPS ID
'RPS_acc':
Accession number of the representative protein (database reference)
'RPS_TPacc':
Transcript ID of the representative protein
'DESC':
gene names (YAML-Sequence)
'text':
gene name
'dbref':
database reference from which the gene name is retrieved
'SYMBOL':
gene symbol (YAML-Sequence)
'text':
gene symbol
'dbref':
database reference from which the gene symbol is retrieved
'SYNONYM':
synonym (YAML-Sequence)
'text':
synonyms
'dbref':
database reference from which the synonyms are retrieved
'GO':
Gene Ontology (YAML-Sequence)
'goid':
Gene Ontology ID
'evidence':
Evidence code
'dbref':
database reference from which the assignments are retrieved
'Transcripts':
Information about all transcripts in TU (YAML-Sequence)
'TPacc':
Transcript ID
'rank':
Representative rank of the transcript (1=representative, 2..n=non-representative)
'status':
'Overlap'=Transcript overlapping multiple TUs, 'Not-Mapped'=Not mapped transcript, 'OK'=Non-overlapping transcript
'ntlen':
Length of the transcript (bp)
'aalen':
Length of the translated sequence (aa)
'lorflen':
Length of the longest ORF (bp) (given when CDS information is not present; the lower bound is 100 bp)
'CDS':
CDS region in the transcript
'LORF':
Longest ORF region in the transcript
'map_chr':
mapped chromosome number
'map_strand':
mapped strand
'map_gstart':
Start position on the genome
'map_gstop':
End position on the genome
'map_gstart':
Start position on the transcript
'map_gstop':
End position on the transcript
'dbref':
Database references about the transcript
'DBREFS':
Database references about the TU (YAML-Sequence)
'Antisense':
TU ID of antisense TUs (YAML-Sequence)
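
Once a record has been loaded by a YAML parser of your choice, the 'rank' key identifies the representative transcript. The following Python sketch assumes that a record is a mapping whose 'Transcripts' key holds a sequence of mappings with 'TPacc' and 'rank' keys. The exact nesting is not spelled out above, so treat this layout as an assumption; the function name pick_representative is hypothetical.

```python
def pick_representative(record):
    """Return the TPacc of the rank-1 transcript, or None if there is none.

    `record` is one TU entry already converted to Python dicts and lists.
    """
    for transcript in record.get("Transcripts", []):
        # rank 1 = representative, 2..n = non-representative
        if int(transcript["rank"]) == 1:
            return transcript["TPacc"]
    return None
```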
[vtps.yaml]
Information about variant-based cluster/VTS/VPS. YAML format. Its content is similar to "rtps.yaml"
[allseq.das]
Genomic coordinates of all transcripts. The format is the one used in LDAS.
[allseq_lowermap.das]
Genome-mapped positions of all transcripts with lower scores. The format is the one used in LDAS.