Title: Format of RTPS files (2003-10-10)

1 Terms

[TU (Transcriptional Unit)]
Genomic (discontinuous) regions from which one mature mRNA is derived. If multiple transcripts overlap on the genome, their TUs are merged into one.
[TU-based clusters]
Transcripts grouped by the TU from which they are transcribed.
[RTS (Representative Transcript Set)]
A set of transcripts selected from every TU. The number of transcripts in RTS = the number of TUs.
[RPS (Representative Protein Set)]
A set of proteins that are selected from every TU that has translated sequences
[variant-based clusters]
Transcripts grouped by splicing pattern.
[VTS (Variant-based representative Transcript Set)]
A set of transcripts that are selected from every splicing pattern
[VPS (Variant-based representative Protein Set)]
A set of proteins that are selected from every splicing pattern that has translated sequences
[CTS (Consensus Transcript Set)]
A set of consensus transcripts that are generated from genomic sequences in TU

2 ID format

[Variant ID]
(TU ID) '.' integer
'T' ('A' or 'B') (TU ID)
'PA' (TU ID)
'CA' (TU ID)
'T' ('A' or 'B') (Variant ID)
'PA' (Variant ID)
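
The ID patterns above can be taken apart mechanically. The following is a minimal sketch, not part of the RTPS distribution. It assumes that a TU ID itself contains no '.', so that a trailing '.' integer always marks a Variant ID. The names SetId and parse_set_id and the TU ID '12345' are hypothetical.

```python
from typing import NamedTuple, Optional

# Two-letter prefixes listed in the ID format: 'T' + ('A' or 'B'), 'PA', 'CA'.
PREFIXES = ("TA", "TB", "PA", "CA")


class SetId(NamedTuple):
    prefix: str             # one of PREFIXES
    tu_id: str              # the TU ID part
    variant: Optional[int]  # integer after '.', or None for a TU-based ID


def parse_set_id(text: str) -> SetId:
    """Split a prefixed ID into prefix, TU ID and optional variant number."""
    prefix, body = text[:2], text[2:]
    if prefix not in PREFIXES or not body:
        raise ValueError(f"unrecognised ID: {text!r}")
    # Assumption: a TU ID has no '.', so the last '.' starts the variant number.
    tu_id, dot, number = body.rpartition(".")
    if dot and tu_id and number.isdigit():
        return SetId(prefix, tu_id, int(number))
    return SetId(prefix, body, None)


# Hypothetical IDs, for illustration only.
print(parse_set_id("TA12345"))
print(parse_set_id("PA12345.2"))
```

The sketch accepts a 'CA' prefix followed by a Variant ID, although the list above gives 'CA' only with a TU ID. A stricter reader could reject that combination.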
[References to external databases]

(DB name) '|' (ID/Accession number in the DB)

List of DB names

DDBJ/EMBL/GenBank DNA sequences
GenPept sequences (Translated peptides from DDBJ/EMBL/GenBank sequences)
NCBI RefSeq sequences
Ensembl predicted transcripts/proteins
NCBI LocusLink database
NCBI UniGene database
SWISSPROT protein database
TrEMBL protein database
MGD database in the Jackson Laboratory
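
A reference of the form (DB name) '|' (ID/Accession number in the DB) can be split on the first '|'. This is a minimal sketch. DbRef, parse_db_ref and the DB name 'EXAMPLEDB' are hypothetical, and the actual DB name strings are not assumed here.

```python
from typing import NamedTuple


class DbRef(NamedTuple):
    db: str         # DB name, the part before '|'
    accession: str  # ID/accession number in that DB


def parse_db_ref(text: str) -> DbRef:
    """Split '(DB name)|(ID)' on the first '|' only."""
    db, sep, accession = text.partition("|")
    if not sep or not db or not accession:
        raise ValueError(f"not a '(DB name)|(ID)' reference: {text!r}")
    return DbRef(db, accession)


# 'EXAMPLEDB' is a placeholder, not one of the real DB names.
print(parse_db_ref("EXAMPLEDB|AB000001"))
```

Splitting on the first '|' only keeps any later '|' characters inside the accession part.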
[Transcript ID (TPacc)]
[CDS/Longest ORF region]
(region: location format in DDBJ/EMBL/GenBank features) ' ' '+' (codon start: '1', '2' or '3') ' ' (frameshift positions, if any: 'pos-X', 'pos+N')
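
The region string above can be read field by field. This is a sketch under assumptions: the location contains no spaces, and any frameshift entries follow as further space-separated tokens. Frameshift tokens are kept as raw strings because their exact syntax is not spelled out here. OrfRegion and parse_orf_region are hypothetical names.

```python
from typing import List, NamedTuple


class OrfRegion(NamedTuple):
    location: str           # feature location string, kept unparsed
    codon_start: int        # 1, 2 or 3
    frameshifts: List[str]  # raw frameshift tokens, possibly empty


def parse_orf_region(text: str) -> OrfRegion:
    """Parse '(region) +(codon start) [frameshift ...]'."""
    fields = text.strip().split(" ")
    if len(fields) < 2:
        raise ValueError(f"expected region and codon start: {text!r}")
    location, codon = fields[0], fields[1]
    if len(codon) != 2 or codon[0] != "+" or codon[1] not in "123":
        raise ValueError(f"bad codon start field: {codon!r}")
    # Everything after the codon start is treated as frameshift information.
    extras = [token for token in fields[2:] if token]
    return OrfRegion(location, int(codon[1]), extras)


# Hypothetical region, for illustration only.
print(parse_orf_region("12..345 +1"))
```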

3 Formats of files

Release file containing the build date, the number of sequences and the list of public databases
FASTA file of RTS
RTS information in tab-delimited file
  1. RTS ID
  2. Accession number of representative transcript
  3. Definition in GenBank/RefSeq
  4. LocusLink ID
  5. Definition in LocusLink
  6. UniGene ID
  7. Definition in UniGene
  8. MGI Gene Marker ID
  9. Definition in MGD
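
The nine columns above map directly onto a record type. This is a minimal sketch using the Python csv module. It assumes there is no header line and that missing trailing columns may be treated as empty. SetInfo, read_set_info and the sample values are hypothetical.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO

COLUMNS = 9


class SetInfo(NamedTuple):
    set_id: str                # 1. RTS ID
    accession: str             # 2. accession number of the representative
    definition: str            # 3. definition in GenBank/RefSeq
    locuslink_id: str          # 4.
    locuslink_definition: str  # 5.
    unigene_id: str            # 6.
    unigene_definition: str    # 7.
    mgi_marker_id: str         # 8.
    mgd_definition: str        # 9.


def read_set_info(handle: TextIO) -> Iterator[SetInfo]:
    """Yield one SetInfo per non-empty tab-delimited line."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) > COLUMNS:
            raise ValueError(f"expected {COLUMNS} columns, got {len(row)}")
        # Assumption: short rows only lack trailing (empty) columns.
        row = row + [""] * (COLUMNS - len(row))
        yield SetInfo(*row)


# Hypothetical record; none of these values come from a real release.
sample = "TA1\tAB000001\texample definition\t10\tlocus text\tMm.1\tunigene text\tMGI:1\tmarker text\n"
records = list(read_set_info(io.StringIO(sample)))
print(records[0].accession)
```

The RPS, VTS and VPS information files list the same nine columns, so the same reader should apply to them with only the meaning of column 1 changed.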

Genomic coordinates of RTS, in the format used in LDAS

Format of annotation section (1 exon=1 line)

  1. 'RTS'
  2. RTS ID
  3. 'exon'
  4. 'similarity'
  5. Chromosome reference
  6. Start position on the genome
  7. End position on the genome
  8. strand ('+' or '-')
  9. '.'
  10. %-identity
  11. Start position on the transcript
  12. End position on the transcript
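
One annotation line as listed above can be checked and converted as follows. This is a sketch, assuming the twelve columns are tab-separated and that %-identity is a plain number (a trailing '%' is tolerated). ExonLine and parse_exon_line are hypothetical names.

```python
from typing import NamedTuple


class ExonLine(NamedTuple):
    set_name: str          # column 1, e.g. 'RTS'
    set_id: str            # column 2
    chromosome: str        # column 5
    genome_start: int      # column 6
    genome_end: int        # column 7
    strand: str            # column 8, '+' or '-'
    identity: float        # column 10
    transcript_start: int  # column 11
    transcript_end: int    # column 12


def parse_exon_line(line: str) -> ExonLine:
    """Parse one exon line (1 exon = 1 line) of the annotation section."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) != 12:
        raise ValueError(f"expected 12 columns, got {len(cols)}")
    # Columns 3, 4 and 9 are the constants 'exon', 'similarity' and '.'.
    if cols[2] != "exon" or cols[3] != "similarity" or cols[8] != ".":
        raise ValueError("unexpected constant column value")
    if cols[7] not in ("+", "-"):
        raise ValueError(f"bad strand: {cols[7]!r}")
    return ExonLine(
        cols[0], cols[1], cols[4],
        int(cols[5]), int(cols[6]), cols[7],
        float(cols[9].rstrip("%")),
        int(cols[10]), int(cols[11]),
    )


# Hypothetical line, for illustration only.
exon = parse_exon_line("RTS\tTA1\texon\tsimilarity\tchr1\t100\t200\t+\t.\t99.5\t1\t101")
print(exon.genome_start, exon.genome_end, exon.strand)
```

Grouping the parsed lines by column 2 gives the exon structure of each transcript.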
FASTA file of RPS
RPS information in tab-delimited file
  1. RPS ID
  2. Accession number of representative protein
  3. Definition in GenBank/RefSeq
  4. LocusLink ID
  5. Definition in LocusLink
  6. UniGene ID
  7. Definition in UniGene
  8. MGI Gene Marker ID
  9. Definition in MGD
Result of TU-based clustering
  1. RTS ID
  2. Transcript ID of representative transcript
  3. RPS ID
  4. Transcript ID of representative protein
  5. Transcript IDs of all transcripts in TU [delimiter: ' ' (single space)]
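
A clustering line can be read by splitting the outer columns first and the member list second. This is a minimal sketch, assuming the five columns are tab-separated and that the RPS columns may be empty for a TU without translated sequences. Cluster, parse_cluster_line and the IDs shown are hypothetical.

```python
from typing import List, NamedTuple


class Cluster(NamedTuple):
    rts_id: str                     # 1. RTS ID
    representative_transcript: str  # 2. Transcript ID of representative transcript
    rps_id: str                     # 3. RPS ID
    representative_protein: str     # 4. Transcript ID of representative protein
    members: List[str]              # 5. all transcripts in the TU


def parse_cluster_line(line: str) -> Cluster:
    """Parse one line of the TU-based clustering result."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) != 5:
        raise ValueError(f"expected 5 columns, got {len(cols)}")
    # Column 5 uses a single space as its delimiter.
    members = cols[4].split(" ") if cols[4] else []
    return Cluster(cols[0], cols[1], cols[2], cols[3], members)


# Hypothetical line, for illustration only.
cluster = parse_cluster_line("TA1\tT001\tPA1\tT002\tT001 T002 T003")
print(len(cluster.members))
```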
Information about TU-based cluster/RTS/RPS (this file will become obsolete)
FASTA file of VTS
VTS information in tab-delimited file
  1. VTS ID
  2. Accession number of representative transcript
  3. Definition in GenBank/RefSeq
  4. LocusLink ID
  5. Definition in LocusLink
  6. UniGene ID
  7. Definition in UniGene
  8. MGI Gene Marker ID
  9. Definition in MGD
Genomic coordinates of VTS, in the format used in LDAS
  1. 'VTS'
  2. VTS ID
  3. 'exon'
  4. 'similarity'
  5. Chromosome reference
  6. Start position on the genome
  7. End position on the genome
  8. strand ('+' or '-')
  9. '.'
  10. %-identity
  11. Start position on the transcript
  12. End position on the transcript
FASTA file of VPS
VPS information in tab-delimited file
  1. VPS ID
  2. Accession number of representative protein
  3. Definition in GenBank/RefSeq
  4. LocusLink ID
  5. Definition in LocusLink
  6. UniGene ID
  7. Definition in UniGene
  8. MGI Gene Marker ID
  9. Definition in MGD
Result of splicing pattern-based clustering
  1. VTS ID
  2. Transcript ID of representative transcript
  3. VPS ID
  4. Transcript ID of representative protein
  5. Transcript IDs of all transcripts in variant group [delimiter: ' ']
Information about variant-based cluster/VTS/VPS (this file will become obsolete)
A list of transcripts excluded from the RTPS build.
  1. Accession number of excluded transcript
  2. Reason
    T-cell receptor or immunoglobulin transcript
    'No info': There is no information with which to group the transcript into a TU (not mapped and not recorded in any public database)
Genomic coordinates of CTS, in the format used in LDAS
  1. 'CTS'
  2. CTS ID
  3. 'exon'
  4. 'similarity'
  5. Chromosome reference
  6. Start position on the genome
  7. End position on the genome
  8. strand ('+' or '-')
  9. '.'
  10. %-identity
  11. Start position on the transcript
  12. End position on the transcript
Information about TU-based cluster/RTS/RPS. YAML format
Information about a representative transcript
Accession number of the representative transcript (database reference)
Transcript ID of the representative transcript
Information about a representative protein
Accession number of the representative protein (database reference)
Transcript ID of the representative protein
gene names (YAML-Sequence)
gene name
database reference from which the gene name is retrieved
gene symbol (YAML-Sequence)
gene symbol
database reference from which the gene symbol is retrieved
synonym (YAML-Sequence)
database reference from which the synonyms are retrieved
Gene Ontology (YAML-Sequence)
Gene Ontology ID
Evidence code
database reference from which the assignments are retrieved
Information about all transcripts in TU (YAML-Sequence)
Transcript ID
Representative rank of the transcript (1=representative, 2..n=non-representative)
'Overlap' = transcript overlapping multiple TUs, 'Not-Mapped' = transcript not mapped, 'OK' = non-overlapping transcript
Length of the transcript (bp)
Length of the translated sequence (aa)
Length of the longest ORF (bp) (given when CDS information is not present; the lower bound is 100 bp)
CDS region in the transcript
Longest ORF region in the transcript
mapped chromosome number
mapped strand
Start position on the genome
End position on the genome
Start position on the transcript
End position on the transcript
Database references about the transcript
Database reference about the TU (YAML-Sequence)
TU ID of antisense TUs (YAML-Sequence)
Information about variant-based cluster/VTS/VPS. YAML format. Its content is similar to "rtps.yaml"
Genomic coordinates of all transcripts, in the format used in LDAS
Genome-mapped positions of all transcripts with lower scores, in the format used in LDAS.