Title: Format of RTPS files (2003-10-10)

1 Terms

[TU (Transcriptional Unit)]: Genomic (discontinuous) regions from which one mature mRNA is derived. If multiple transcripts overlap on the genome, their TUs are merged into one.
[TU-based clusters]: Grouped transcripts based on TUs from which they are transcribed.
[RTS (Representative Transcript Set)]: A set of transcripts that are selected from every TU. The number of transcripts in RTS = the number of TUs.
[RPS (Representative Protein Set)]: A set of proteins that are selected from every TU that has translated sequences
[variant-based clusters]: Grouped transcripts based on splicing patterns
[VTS (Variant-based representative Transcript Set)]: A set of transcripts that are selected from every splicing pattern
[VPS (Variant-based representative Protein Set)]: A set of proteins that are selected from every splicing pattern that has translated sequences
[CTS (Consensus Transcript Set)]: A set of consensus transcripts that are generated from genomic sequences in TU

2 ID format

[TU ID]

integer

[Variant ID]

(TU ID) '.' integer

[RTS ID]

'T' ('A' or 'B') (TU ID)

[RPS ID]

'PA' (TU ID)

[CTS ID]

'CA' (TU ID)

[VTS ID]

'T' ('A' or 'B') (Variant ID)

[VPS ID]

'PA' (Variant ID)
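
The ID grammar above can be checked mechanically. The following is a minimal sketch, assuming that a TU ID is an unsigned decimal integer and that IDs carry no surrounding whitespace; the function name `parse_rtps_id` is hypothetical and not part of the RTPS distribution.

```python
import re

# Prefix, TU ID, and an optional '.integer' variant suffix.
# Assumption: TU IDs and variant numbers are unsigned decimal integers.
_ID_RE = re.compile(r"(T[AB]|PA|CA)(\d+)(?:\.(\d+))?")

# (prefix family, has variant suffix) -> kind of ID
_KINDS = {
    ("T", False): "RTS",
    ("T", True): "VTS",
    ("PA", False): "RPS",
    ("PA", True): "VPS",
    ("CA", False): "CTS",
}


def parse_rtps_id(identifier):
    """Return (kind, prefix, tu_id, variant_no) for an RTS/RPS/CTS/VTS/VPS ID."""
    m = _ID_RE.fullmatch(identifier)
    if m is None:
        raise ValueError(f"not an RTPS ID: {identifier!r}")
    prefix, tu, variant = m.groups()
    family = "T" if prefix.startswith("T") else prefix
    kind = _KINDS.get((family, variant is not None))
    if kind is None:
        # Only 'CA' can reach this branch: CTS IDs take a TU ID, not a Variant ID.
        raise ValueError(f"prefix {prefix!r} does not take a Variant ID")
    return kind, prefix, int(tu), (int(variant) if variant is not None else None)
```

For example, `parse_rtps_id("TA12")` is classified as an RTS ID and `parse_rtps_id("PA12.3")` as a VPS ID for variant 3 of TU 12.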

[References to external databases]

(DB name) '|' (ID/Accession number in the DB)

List of DB names

'GB': DDBJ/EMBL/GenBank DNA sequences
'GP': GenPept sequences (Translated peptides from DDBJ/EMBL/GenBank sequences)
'REFSEQ': NCBI RefSeq sequences
'ENSEMBL': Ensembl predicted transcripts/proteins
'LocusLink': NCBI LocusLink database
'UniGene': NCBI UniGene database
'SWISSPROT': SWISSPROT protein database
'TrEMBL': TrEMBL protein database
'MGD': MGD database in the Jackson Laboratory

[Transcript ID (TPacc)]

without translated sequences (transcript DB name) '|' (Acc No. of transcript)
with translated sequences (transcript DB name) '|' (Acc No. of transcript) '|' (Protein DB name) '|' (Acc No. of protein)
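
A Transcript ID can be split on '|' into its database references. The sketch below assumes that accession numbers never contain '|' themselves; `TranscriptID` and `parse_tpacc` are hypothetical names, and the accession numbers in the usage note are made up for illustration.

```python
from typing import NamedTuple, Optional


class TranscriptID(NamedTuple):
    transcript_db: str
    transcript_acc: str
    protein_db: Optional[str] = None   # None when there is no translated sequence
    protein_acc: Optional[str] = None


def parse_tpacc(tpacc):
    """Split a Transcript ID (TPacc) into transcript and optional protein parts."""
    parts = tpacc.split("|")
    if len(parts) == 2:
        # without translated sequences: (transcript DB)|(Acc No.)
        return TranscriptID(parts[0], parts[1])
    if len(parts) == 4:
        # with translated sequences: transcript pair followed by protein pair
        return TranscriptID(*parts)
    raise ValueError(f"expected 2 or 4 '|'-separated parts: {tpacc!r}")
```

With made-up accessions, `parse_tpacc("GB|AK000001|GP|BAA00001")` yields the transcript reference `GB`/`AK000001` and the protein reference `GP`/`BAA00001`.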

[CDS/Longest ORF region]

(region: location format in DDBJ/EMBL/GenBank features) ' ' '+' (codon start: '1','2' or '3') ' ' (frameshift positions, if any: 'pos-X', 'pos+N')
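
A rough sketch of splitting a CDS/Longest ORF field into its parts follows. It assumes that the location string contains no spaces and that frameshift entries, when present, are separated by single spaces; the frameshift tokens are returned unparsed because their exact syntax is not detailed here. `parse_region` is a hypothetical name.

```python
def parse_region(field):
    """Return (location, codon_start, frameshift_tokens) from a CDS/LORF field."""
    tokens = field.split(" ")
    if len(tokens) < 2:
        raise ValueError(f"expected '<location> +<codon start>': {field!r}")
    location, codon = tokens[0], tokens[1]
    # The codon start is written as '+' followed by 1, 2 or 3.
    if len(codon) != 2 or codon[0] != "+" or codon[1] not in "123":
        raise ValueError(f"bad codon start {codon!r} in {field!r}")
    # Remaining tokens are frameshift positions; kept as raw strings.
    frameshifts = [t for t in tokens[2:] if t]
    return location, int(codon[1]), frameshifts
```

For instance, `parse_region("10..300 +1")` gives the location `10..300`, codon start 1 and no frameshift entries.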

3 Formats of files

[RELEASE]

Release file containing the build date, the number of sequences and the list of public databases

[rts.fasta]

FASTA file of RTS

[rts.txt]

RTS information in tab-delimited file

RTS ID
Accession number of representative transcript
Definition in GenBank/RefSeq
LocusLink ID
Definition in LocusLink
UniGene ID
Definition in UniGene
MGI Gene Marker ID
Definition in MGD
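
The nine columns above can be read into named fields as sketched below. This assumes one record per line with no header line, exactly nine tab-separated columns, and empty strings for missing values; none of this is stated explicitly here. `RtsRow` and `read_rts_rows` are hypothetical names.

```python
import csv
from typing import NamedTuple


class RtsRow(NamedTuple):
    rts_id: str
    accession: str
    definition: str
    locuslink_id: str
    locuslink_definition: str
    unigene_id: str
    unigene_definition: str
    mgi_marker_id: str
    mgd_definition: str


def read_rts_rows(handle):
    """Yield one RtsRow per non-empty line of an open rts.txt file."""
    # QUOTE_NONE: treat quote characters inside definitions as ordinary text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(RtsRow._fields):
            raise ValueError(f"expected 9 columns, got {len(fields)}: {fields!r}")
        yield RtsRow(*fields)
```

Typical use would be `with open("rts.txt", newline="") as fh: rows = list(read_rts_rows(fh))`. The same column layout is listed for rps.txt, vts.txt and vps.txt, so the reader should apply there as well under the same assumptions.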

[rts.das]

Genomic coordinates of RTS. The format is the one used by LDAS.

[reference] section

Information about genome sequences

[annotation] section

Annotation on the genome (Mapped positions of transcripts)

Format of annotation section (1 exon=1 line)

'RTS'
RTS ID
'exon'
'similarity'
Chromosome reference
Start position on the genome
End position on the genome
strand ('+' or '-')
'.'
%-identity
Start position on the transcript
End position on the transcript
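
One exon line of the annotation section can be parsed as sketched below. The sketch assumes that the twelve fields are tab-separated, which is not stated above; `ExonAnnotation` and `parse_annotation_line` are hypothetical names, and the chromosome name in the usage note is made up.

```python
from typing import NamedTuple


class ExonAnnotation(NamedTuple):
    set_name: str       # 'RTS' (or 'VTS', 'CTS' in the other .das files)
    seq_id: str
    chromosome: str
    genome_start: int
    genome_end: int
    strand: str
    identity: float     # %-identity
    transcript_start: int
    transcript_end: int


def parse_annotation_line(line):
    """Parse one exon line; the constant fields are checked and then dropped."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 12:
        raise ValueError(f"expected 12 fields, got {len(f)}")
    if f[2] != "exon" or f[3] != "similarity" or f[8] != ".":
        raise ValueError(f"unexpected constant fields: {f[2]!r}, {f[3]!r}, {f[8]!r}")
    if f[7] not in ("+", "-"):
        raise ValueError(f"bad strand: {f[7]!r}")
    return ExonAnnotation(f[0], f[1], f[4], int(f[5]), int(f[6]), f[7],
                          float(f[9]), int(f[10]), int(f[11]))
```

Since each exon occupies one line, the exons of a transcript can be collected by grouping the parsed lines on `seq_id`.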

[rps.fasta]

FASTA file of RPS

[rps.txt]

RPS information in tab-delimited file

RPS ID
Accession number of representative protein
Definition in GenBank/RefSeq
LocusLink ID
Definition in LocusLink
UniGene ID
Definition in UniGene
MGI Gene Marker ID
Definition in MGD

[allseq.txt]

Result of TU-based clustering

RTS ID
Transcript ID of representative transcript
RPS ID
Transcript ID of representative protein
Transcript IDs of all transcripts in TU [delimiter: ' ' (single space)]
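
A line of allseq.txt can be split as sketched below. The sketch assumes that the five columns are tab-separated and that the RPS columns are empty for a TU without translated sequences; both points are assumptions. `parse_allseq_line` is a hypothetical name.

```python
def parse_allseq_line(line):
    """Return one TU-based cluster as a dict; the member list is split on ' '."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}")
    rts_id, rts_tpacc, rps_id, rps_tpacc, members = fields
    return {
        "rts_id": rts_id,
        "rts_tpacc": rts_tpacc,
        # Assumption: empty RPS columns mean the TU has no representative protein.
        "rps_id": rps_id or None,
        "rps_tpacc": rps_tpacc or None,
        # Transcript IDs contain '|' but no spaces, so a plain split is enough.
        "members": members.split(" ") if members else [],
    }
```

The same layout is listed for allvseq.txt, with VTS/VPS IDs in place of RTS/RPS IDs.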

[rtps.dat]

Information about TU-based cluster/RTS/RPS (This file will become obsolete)

[vts.fasta]

FASTA file of VTS

[vts.txt]

VTS information in tab-delimited file

VTS ID
Accession number of representative transcript
Definition in GenBank/RefSeq
LocusLink ID
Definition in LocusLink
UniGene ID
Definition in UniGene
MGI Gene Marker ID
Definition in MGD

[vts.das]

Genomic coordinates of VTS. The format is the one used by LDAS.

'VTS'
VTS ID
'exon'
'similarity'
Chromosome reference
Start position on the genome
End position on the genome
strand ('+' or '-')
'.'
%-identity
Start position on the transcript
End position on the transcript

[vps.fasta]

FASTA file of VPS

[vps.txt]

VPS information in tab-delimited file

VPS ID
Accession number of representative protein
Definition in GenBank/RefSeq
LocusLink ID
Definition in LocusLink
UniGene ID
Definition in UniGene
MGI Gene Marker ID
Definition in MGD

[allvseq.txt]

Result of splicing pattern-based clustering

VTS ID
Transcript ID of representative transcript
VPS ID
Transcript ID of representative protein
Transcript IDs of all transcripts in variant group [delimiter: ' ']

[vtps.dat]

Information about variant-based cluster/VTS/VPS (This file will become obsolete)

[excluded_transcripts.txt]

A list of transcripts excluded from the RTPS build.

Accession number of excluded transcript
Reason

'Immune'

T-cell receptor or immunoglobulin transcript

'No info'

No information is available to group the transcript into a TU (not mapped and not recorded in any public database)

[cts.das]

Genomic coordinates of CTS. The format is the one used by LDAS.

'CTS'
CTS ID
'exon'
'similarity'
Chromosome reference
Start position on the genome
End position on the genome
strand ('+' or '-')
'.'
%-identity
Start position on the transcript
End position on the transcript

[rtps.yaml]

Information about TU-based cluster/RTS/RPS. YAML format

'TUID':

TU ID

'RTS':

Information about a representative transcript

'RTS_ID':: RTS ID
'RTS_acc':: Accession number of the representative transcript (database reference)
'RTS_TPacc':: Transcript ID of the representative transcript

'RPS':

Information about a representative protein

'RPS_ID':: RPS ID
'RPS_acc':: Accession number of the representative protein (database reference)
'RPS_TPacc':: Transcript ID of the representative protein

'DESC':

gene names (YAML-Sequence)

'text':: gene name
'dbref':: database reference from which the gene name is retrieved

'SYMBOL':

gene symbol (YAML-Sequence)

'text':: gene symbol
'dbref':: database reference from which the gene symbol is retrieved

'SYNONYM':

synonym (YAML-Sequence)

'text':: synonyms
'dbref':: database reference from which the synonyms are retrieved

'GO':

Gene Ontology (YAML-Sequence)

'goid':: Gene Ontology ID
'evidence':: Evidence code
'dbref':: database reference from which the assignments are retrieved

'Transcripts':

Information about all transcripts in TU (YAML-Sequence)

'TPacc':: Transcript ID
'rank':: Representative rank of the transcript (1=representative, 2..n=non-representative)
'status':: 'Overlap' = transcript overlapping multiple TUs, 'Not-Mapped' = transcript not mapped to the genome, 'OK' = non-overlapping transcript
'ntlen':: Length of the transcript (bp)
'aalen':: Length of the translated sequence (aa)
'lorflen':: Length of the longest ORF (bp) (used when CDS information is not present; lower bound is 100 bp)
'CDS':: CDS region in the transcript
'LORF':: Longest ORF region in the transcript
'map_chr':: mapped chromosome number
'map_strand':: mapped strand
'map_gstart':: Start position on the genome
'map_gstop':: End position on the genome
'map_gstart':: Start position on the transcript
'map_gstop':: End position on the transcript
'dbref':: Database references about the transcript

'DBREFS':

Database reference about the TU (YAML-Sequence)

'Antisense':

TU ID of antisense TUs (YAML-Sequence)
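
Once a record has been loaded into Python objects by a YAML parser, the keys above can be walked as in the sketch below. How records are arranged at the top level of the file is not described here, so the sketch takes a single record as a plain dict; it also assumes that 'rank' is loaded as an integer. `summarize_tu` is a hypothetical helper and the record in the usage note is hand-built sample data.

```python
from collections import Counter


def summarize_tu(record):
    """Summarize one TU record (a dict with the keys listed for rtps.yaml)."""
    transcripts = record.get("Transcripts") or []
    return {
        "tu_id": record["TUID"],
        # rank 1 marks the representative transcript
        "representative": [t["TPacc"] for t in transcripts if t.get("rank") == 1],
        # count transcripts per status value ('OK', 'Overlap', 'Not-Mapped')
        "status_counts": dict(Counter(t.get("status") for t in transcripts)),
        # DESC is a sequence of {'text': ..., 'dbref': ...} entries
        "gene_names": [d["text"] for d in record.get("DESC") or []],
    }


# Hand-built sample record; all values are made up for illustration.
sample = {
    "TUID": 1,
    "DESC": [{"text": "example gene", "dbref": "MGD|MGI:0"}],
    "Transcripts": [
        {"TPacc": "GB|X1", "rank": 1, "status": "OK"},
        {"TPacc": "GB|X2", "rank": 2, "status": "Overlap"},
    ],
}
summary = summarize_tu(sample)
```

Here `summary["representative"]` contains only `GB|X1`, the transcript with rank 1.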

[vtps.yaml]

Information about variant-based cluster/VTS/VPS. YAML format. Its content is similar to "rtps.yaml"

[allseq.das]

Genomic coordinates of all transcripts. The format is the one used by LDAS.

[allseq_lowermap.das]

Genome-mapped positions of all transcripts with lower scores. The format is the one used by LDAS.