{{short description|Database containing reference sequences of genes, proteins and transcripts}}
{{infobox biodatabase
|title = RefSeq
|logo =[[File:US-NLM-NCBI-Logo.svg|60px]]
|description = curated non-redundant sequence database of genomes, transcripts and proteins.
|scope =
|organism =
|center = [[National Center for Biotechnology Information]]
|laboratory =
|author =
|citation = [[Kim D. Pruitt|Pruitt KD]] & al. (2005)
|released =
|standard =
|format =
|url = https://www.ncbi.nlm.nih.gov/RefSeq
|download =
|webservice =
|sql =
|sparql =
|webapp =
|standalone =
|license =
|versioning =
|frequency =
|curation =
|bookmark =
|version=
}}
The Reference Sequence ('''RefSeq''') [[sequence database|database]]<ref>{{cite journal | vauthors = Pruitt KD, Tatusova T, Maglott DR | title = NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins | journal = Nucleic Acids Research | volume = 33 | issue = Database issue | pages = D501–D504 | date = January 2005 | pmid = 15608248 | pmc = 539979 | doi = 10.1093/nar/gki025 | author-link3 = Donna R. Maglott | author-link = Kim D. Pruitt }}</ref> is an [[Open access (publishing)|open access]], annotated and curated collection of publicly available [[nucleotide]] sequences ([[DNA]], [[RNA]]) and their [[protein]] products. RefSeq was introduced in 2000.<ref>{{cite journal | vauthors = Maglott DR, Katz KS, Sicotte H, Pruitt KD | title = NCBI's LocusLink and RefSeq | journal = Nucleic Acids Research | volume = 28 | issue = 1 | pages = 126–128 | date = January 2000 | pmid = 10592200 | pmc = 102393 | doi = 10.1093/nar/28.1.126 | author-link = Donna R. Maglott }}</ref><ref>{{cite journal | vauthors = Pruitt KD, Katz KS, Sicotte H, Maglott DR | title = Introducing RefSeq and LocusLink: curated human genome resources at the NCBI | journal = Trends in Genetics | volume = 16 | issue = 1 | pages = 44–47 | date = January 2000 | pmid = 10637631 | doi = 10.1016/s0168-9525(99)01882-x }}</ref> This database is built by the [[National Center for Biotechnology Information]] (NCBI) and, unlike [[GenBank]], provides only a single record for each natural biological molecule (i.e. DNA, RNA or protein) for major organisms ranging from [[Virus|viruses]] to [[bacteria]] to [[Eukaryote|eukaryotes]].

For each [[model organism]], ''RefSeq'' aims to provide separate and linked records for the genomic DNA, the gene transcripts, and the proteins arising from those transcripts. ''RefSeq'' is limited to major organisms for which sufficient data are available (121,461 distinct "named" [[Organism|organisms]] as of July 2022),<ref>{{Cite report |url=http://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/ |title=RefSeq Release 213 Statistics |date=11 July 2022 |publisher=[[National Library of Medicine]] |access-date=20 July 2022}}</ref> while [[GenBank]] includes sequences for any organism submitted (approximately 504,000 formally described [[species]]).<ref>{{cite journal | vauthors = Sayers EW, Cavanaugh M, Clark K, Pruitt KD, Schoch CL, Sherry ST, Karsch-Mizrachi I | title = GenBank | journal = Nucleic Acids Research | volume = 50 | issue = D1 | pages = D161–D164 | date = January 2022 | pmid = 34850943 | doi = 10.1093/nar/gkab1135 | pmc = 8690257 | doi-access = free }}</ref>

== RefSeq categories ==
The RefSeq collection comprises different data types with different origins, so standard categories and identifiers are needed to store each data type. The most important categories are:
{| class="wikitable centered" style="text-align:center"
|+ RefSeq accession categories and molecule types
|- class="hintergrundfarbe6"
! Category
! Description
|-
| NC
| Complete genomic molecules
|-
| NG
| Incomplete genomic region
|-
| NM
| [[Messenger RNA|mRNA]]
|-
| NR
| [[Non-coding RNA|ncRNA]]
|-
| NP
| [[Protein]]
|-
| XM
| predicted [[Messenger RNA|mRNA]] model
|-
| XR
| predicted [[Non-coding RNA|ncRNA]] model
|-
| XP
| predicted [[Protein]] model (eukaryotic sequences)
|-
| WP
| predicted [[Protein]] model (prokaryotic sequences)
|}

For more details and more categories, see [https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.entrez_queries_to_retrieve_sets_o Table 1] in [https://www.ncbi.nlm.nih.gov/books/NBK21091 Chapter 18 of the book ''The Reference Sequence (RefSeq) Database''].

== RefSeq Projects ==
Several projects to improve ''RefSeq'' services are currently in development by the NCBI, often in collaboration with research centers such as EMBL-EBI:

* '''[[Consensus CDS Project|Consensus CDS]] (CCDS):''' This project aims to identify a core set of human and mouse [[Coding region|protein-coding regions]] and standardize sets of genes with high and consistent levels of genomic annotation quality. This project was announced in 2009 and is still in development.<ref>{{cite journal | vauthors = Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ, Hart E, Suner MM, Landrum MJ, Aken B, Ayling S, Baertsch R, Fernandez-Banet J, Cherry JL, Curwen V, Dicuccio M, Kellis M, Lee J, Lin MF, Schuster M, Shkeda A, Amid C, Brown G, Dukhanina O, Frankish A, Hart J, Maidak BL, Mudge J, Murphy MR, Murphy T, Rajan J, Rajput B, Riddick LD, Snow C, Steward C, Webb D, Weber JA, Wilming L, Wu W, Birney E, Haussler D, Hubbard T, Ostell J, Durbin R, Lipman D | display-authors = 6 | title = The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes | journal = Genome Research | volume = 19 | issue = 7 | pages = 1316–1323 | date = July 2009 | pmid = 19498102 | pmc = 2704439 | doi = 10.1101/gr.080531.108 }}</ref><ref>{{cite journal | vauthors = Pujar S, O'Leary NA, Farrell CM, Loveland JE, Mudge JM, Wallin C, Girón CG, Diekhans M, Barnes I, Bennett R, Berry AE, Cox E, Davidson C, Goldfarb T, Gonzalez JM, Hunt T, Jackson J, Joardar V, Kay MP, Kodali VK, Martin FJ, McAndrews M, McGarvey KM, Murphy M, Rajput B, Rangwala SH, Riddick LD, Seal RL, Suner MM, Webb D, Zhu S, Aken BL, Bruford EA, Bult CJ, Frankish A, Murphy T, Pruitt KD | display-authors = 6 | title = Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation | journal = Nucleic Acids Research | volume = 46 | issue = D1 | pages = D221–D228 | date = January 2018 | pmid = 29126148 | pmc = 5753299 | doi = 10.1093/nar/gkx1031 }}</ref>
* '''RefSeq Functional Elements (RefSeqFE):''' This project focuses on describing non-genic functional elements, including gene regulatory regions such as [[Enhancer (genetics)|enhancers]], [[Silencer (genetics)|silencers]] and [[DNase I hypersensitive site|DNase I hypersensitive regions]], as well as [[Origin of replication|DNA replication origins]]. The current scope of this project is restricted to the human and mouse genomes.<ref>{{cite journal | vauthors = Farrell CM, Goldfarb T, Rangwala SH, Astashyn A, Ermolaeva OD, Hem V, Katz KS, Kodali VK, Ludwig F, Wallin CL, Pruitt KD, Murphy TD | display-authors = 6 | title = RefSeq Functional Elements as experimentally assayed nongenic reference standards and functional interactions in human and mouse | journal = Genome Research | volume = 32 | issue = 1 | pages = 175–188 | date = January 2022 | pmid = 34876495 | pmc = 8744684 | doi = 10.1101/gr.275819.121 }}</ref>
* '''RefSeqGene:''' Its main goal is to define genomic sequences to be used as reference standards for well-characterized genes. Previously described [[Messenger RNA|mRNA]] and protein sequences do not provide explicit genomic coordinates for gene-flanking and intronic regions, while chromosome sequences have awkwardly large coordinates that change with every new genome assembly. The RefSeqGene project is designed to overcome these weaknesses.<ref>{{cite journal | vauthors = Gulley ML, Braziel RM, Halling KC, Hsi ED, Kant JA, Nikiforova MN, Nowak JA, Ogino S, Oliveira A, Polesky HF, Silverman L, Tubbs RR, Van Deerlin VM, Vance GH, Versalovic J | display-authors = 6 | title = Clinical laboratory reports in molecular pathology | journal = Archives of Pathology & Laboratory Medicine | volume = 131 | issue = 6 | pages = 852–863 | date = June 2007 | pmid = 17550311 | doi = 10.5858/2007-131-852-CLRIMP }}</ref>
* '''Targeted Loci:''' This project records molecular markers, especially protein-coding and [[ribosomal RNA]] loci, that are used for [[Phylogenetics|phylogenetic]] and [[DNA barcoding|barcoding analysis]]. The scope of this project includes sequences for [[Archaea]], [[Bacteria]] and [[Fungus|Fungi]], accessible via [[Entrez]] and [[BLAST (biotechnology)|BLAST]] queries. It also includes [[GenBank]] sequences for [[Animal|Animals]], [[Plant|Plants]] and [[Protist|Protists]], accessible via BLAST queries.<ref>{{Cite web |title=NCBI RefSeq Targeted Loci Project |url=https://www.ncbi.nlm.nih.gov/refseq/targetedloci/ |access-date=2022-07-27 |website=www.ncbi.nlm.nih.gov}}</ref>
* '''Virus Variation (ViV):''' This resource provides sequence data processing pipelines and analysis tools for the display and retrieval of sequences from several viral groups such as [[Orthomyxoviridae|influenza virus]], [[ebolavirus]], [[MERS-CoV|MERS coronavirus]] and [[Zika virus]]. New viruses, processing pipelines, tools and other features are included regularly.<ref>{{cite journal | vauthors = Hatcher EL, Zhdanov SA, Bao Y, Blinkova O, Nawrocki EP, Ostapchuck Y, Schäffer AA, Brister JR | display-authors = 6 | title = Virus Variation Resource - improved response to emergent viral outbreaks | journal = Nucleic Acids Research | volume = 45 | issue = D1 | pages = D482–D490 | date = January 2017 | pmid = 27899678 | pmc = 5210549 | doi = 10.1093/nar/gkw1065 }}</ref>
* '''RefSeq Select:''' This project aims to select a representative '''RefSeq Select''' transcript for every protein-coding gene, based on multiple criteria: prior use in clinical databases, transcript expression, [[Conserved sequence|evolutionary conservation]] of the coding region, etc. Many genes are represented by multiple ''RefSeq'' transcripts and proteins because of [[alternative splicing]], and this complexity is problematic for studies such as [[comparative genomics]] or the exchange of clinical variant data.<ref>{{Cite web |title=NCBI RefSeq Select |url=https://www.ncbi.nlm.nih.gov/refseq/refseq_select/ |access-date=2022-07-27 |website=www.ncbi.nlm.nih.gov}}</ref>
* '''MANE''' ('''M'''atched '''A'''nnotation from the '''N'''CBI and '''E'''MBL-EBI): This is a collaborative project between [[National Center for Biotechnology Information|NCBI]] and [[European Molecular Biology Laboratory|EMBL]]-[[European Bioinformatics Institute|EBI]] whose main goal is to define a set of transcripts and their proteins for all the protein-coding genes in the human genome. Doing so reduces the differences in transcript annotation between the ''RefSeq'' and [[Ensembl genome database project|Ensembl]]/[[GENCODE]] annotation systems. A '''MANE Select''' transcript set is created as a universal standard for clinical reporting and comparative or evolutionary genomics. A second '''MANE Plus Clinical''' set is also created, with the additional transcripts needed to report all ''Pathogenic'' (P) or ''Likely Pathogenic'' (LP) clinical variants available in public resources.<ref>{{cite journal | vauthors = Morales J, Pujar S, Loveland JE, Astashyn A, Bennett R, Berry A, Cox E, Davidson C, Ermolaeva O, Farrell CM, Fatima R, Gil L, Goldfarb T, Gonzalez JM, Haddad D, Hardy M, Hunt T, Jackson J, Joardar VS, Kay M, Kodali VK, McGarvey KM, McMahon A, Mudge JM, Murphy DN, Murphy MR, Rajput B, Rangwala SH, Riddick LD, Thibaud-Nissen F, Threadgold G, Vatsan AR, Wallin C, Webb D, Flicek P, Birney E, Pruitt KD, Frankish A, Cunningham F, Murphy TD | display-authors = 6 | title = A joint NCBI and EMBL-EBI transcript set for clinical genomics and research | journal = Nature | volume = 604 | issue = 7905 | pages = 310–315 | date = April 2022 | pmid = 35388217 | doi = 10.1038/s41586-022-04558-8 | pmc = 9007741 }}</ref> This project was announced in 2018 and is expected to finish in 2022.

== Statistics ==
According to RefSeq release 213 (July 2022), the number of species represented in the database, counted as distinct taxonomic IDs, is as follows:
{| class="wikitable sortable"
|+
!Taxonomic group
!Species (distinct taxonomic IDs)
|-
|[[Archaea]]
| align="right" | 1443
|-
|[[Bacteria]]
| align="right" | 69122
|-
|[[Fungus|Fungi]]
| align="right" | 16869
|-
|[[Invertebrate]]
| align="right" | 5715
|-
|[[Mitochondrion]]
| align="right" | 13648
|-
|[[Plant]]
| align="right" | 9177
|-
|[[Plasmid]]
| align="right" | 6073
|-
|[[Plastid]]
| align="right" | 9430
|-
|[[Protozoa]]
| align="right" | 746
|-
|Vertebrate ([[Mammal|mammalian]])
| align="right" | 1509
|-
|[[Virus|Viral]]
| align="right" | 11620
|-
|Vertebrate (other)
| align="right" | 5237
|-
|Other
| align="right" | 4
|-
|Complete
| align="right" | 121461
|}
The counts of accessions and base pairs (or residues) per molecule type are:
{| class="wikitable sortable"
|+
!Molecule type
!Accessions
!Basepairs/residues
|-
|Genomic
| align="middle" | {{Nts|40758769}}
| align="middle" | {{Nts|2923212393984}}
|-
|RNA
| align="middle" | {{Nts|45781716}}
| align="middle" | {{Nts|122253022047}}
|-
|Protein
| align="middle" | {{Nts|234520053}}
| align="middle" | {{Nts|91290623940}}
|}

== See also ==
* [[GenBank]]
* [[Sequence analysis]]
* [[Sequence profiling tool]]
* [[Sequence motif]]
* [[UniProt]]
* [[List of sequenced eukaryotic genomes]]
* [[List of sequenced archaeal genomes]]

== References ==
{{reflist}}

== Sources ==
*{{NCBI-handbook}}

== External links ==
* [https://www.ncbi.nlm.nih.gov/RefSeq RefSeq]
* [https://www.ncbi.nlm.nih.gov/books/NBK21105/#ch1.Appendix_GenBank_RefSeq_TPA_and_UniP GenBank, RefSeq, TPA and UniProt: What's in a Name?]

[[Category:Genetics databases]]
[[Category:National Institutes of Health]]
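The accession prefixes listed in the RefSeq categories table can be used to classify an identifier programmatically. The following is a minimal Python sketch, not an NCBI tool: the prefix-to-category mapping is copied from that table, while the function name <code>classify_accession</code> and the assumed accession layout (two-letter prefix, underscore, digits, optional version suffix, as in <code>NM_000546.6</code>) are illustrative assumptions that do not come from this article.

```python
import re

# Prefix-to-category mapping copied from the RefSeq categories table.
REFSEQ_PREFIXES = {
    "NC": "Complete genomic molecule",
    "NG": "Incomplete genomic region",
    "NM": "mRNA",
    "NR": "ncRNA",
    "NP": "Protein",
    "XM": "Predicted mRNA model",
    "XR": "Predicted ncRNA model",
    "XP": "Predicted protein model (eukaryotic sequences)",
    "WP": "Predicted protein model (prokaryotic sequences)",
}

# Assumed layout (not stated in the article): two uppercase letters,
# an underscore, digits, and an optional ".version" suffix.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_accession(accession):
    """Return the category for a RefSeq-style accession, or None if unrecognized."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return None
    # Prefixes outside the table (other RefSeq categories exist) also give None.
    return REFSEQ_PREFIXES.get(match.group(1))


if __name__ == "__main__":
    for acc in ["NM_000546.6", "XP_011524321.1", "not-an-accession"]:
        print(acc, "->", classify_accession(acc))
```

Because the table lists only the most important categories, a result of <code>None</code> means that the identifier is not covered by this sketch, not that it is invalid.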