Bioinformatics for Omics Data

Publisher: Science Press
Publication date: January 2013
ISBN: 9787030359308
Author: Bernd Mayer
Pages: 584

Chapter Excerpt

Chapter 1: Omics Technologies, Data and Bioinformatics Principles

Maria V. Schneider and Sandra Orchard

Abstract

We provide an overview of the state of the art of the Omics technologies, the types of Omics data and the bioinformatics resources relevant and related to Omics. We also illustrate the bioinformatics challenges of dealing with high-throughput data. This overview touches on several fundamental aspects of Omics and bioinformatics: data standardisation, data sharing, storing Omics data appropriately and exploring Omics data in bioinformatics. Though the principles and concepts presented hold for the various different technological fields, we concentrate on three main Omics fields, namely genomics, transcriptomics and proteomics. Finally, we address the integration of Omics data and provide several useful links for bioinformatics and Omics.

Key words: Omics, Bioinformatics, High-throughput, Genomics, Transcriptomics, Proteomics, Interactomics, Data integration, Omics databases, Omics tools

1. Introduction

The last decade has seen an explosion in the amount of biological data generated by an ever-increasing number of techniques enabling the simultaneous detection of a large number of alterations in molecular components (1). The Omics technologies utilise these high-throughput (HT) screening techniques to generate the large amounts of data required to enable a system-level understanding of correlations and dependencies between molecular components.

Omics techniques are required to be high throughput because they need to analyse very large numbers of genes, gene expression events, or proteins either in a single procedure or a combination of procedures. Computational analysis, i.e., the discipline now known as bioinformatics, is a key requirement for the study of the vast amounts of data generated. Omics requires the use of techniques that can handle extremely complex biological samples in large quantities (i.e. at high throughput) with high sensitivity and specificity. Next-generation analytical tools require improved robustness, flexibility and cost efficiency. All of these aspects are being continuously improved, potentially enabling institutes such as the Wellcome Trust Sanger Sequencing Centre (see Note 1) to generate thousands of millions of base pairs per day, rather than the current output of 100 million per day (http://www.yourgenome.org/sc/nt).

However, all this data production makes sense only if one is equipped with the necessary analytical resources and tools to understand it. The evolution of laboratory techniques therefore has to occur in parallel with a corresponding improvement in analytical methodology and tools to handle the data. The phrase Omics, a suffix signifying the measurement of the entire complement of a given level of biological molecules and information, encompasses a variety of new technologies that can help explain both normal and abnormal cell pathways, networks, and processes via the simultaneous monitoring of thousands of molecular components.
Bioinformaticians use computers and statistics to perform extensive Omics-related research by searching biological databases and comparing gene sequences and proteins on a vast scale, to identify sequences or proteins that differ between diseased and healthy tissues, or more generally between different phenotypes.

"Omics" spans an increasingly wide range of fields, which now range from genomics (the quantitative study of protein-coding genes, regulatory elements and noncoding sequences), transcriptomics (RNA and gene expression), proteomics (e.g. focusing on protein abundance) and metabolomics (metabolites and metabolic networks) to advances in the era of post-genomic biology and medicine: pharmacogenomics (the quantitative study of how genetics affects a host's response to drugs), physiomics (physiological dynamics and functions of whole organisms) and other fields such as nutrigenomics (a rapidly growing discipline that focuses on identifying the genetic factors that influence the body's response to diet, and that studies how the bioactive constituents of food affect gene expression), phylogenomics (analysis involving genome data and evolutionary reconstructions, especially phylogenetics) and interactomics (molecular interaction networks). Though in the remainder of this chapter we concentrate on an isolated few examples of Omics technologies, much of what is said, for example about data standardisation, data sharing, storage and analysis requirements, is true for all of these different technological fields.

There are already large amounts of data generated by these technologies and this trend is increasing; for example, second- and third-generation sequencing technologies are leading to an exponential increase in the amount of sequencing data available. From a computational point of view, in order to address the complexity of these data, understand molecular regulation and gain the most from such a comprehensive set of information, knowledge discovery, the process of automatically searching large volumes of data for patterns, is a crucial step. This process of bioinformatics analysis includes: (1) data processing and molecule (e.g. protein) identification, (2) statistical data analysis, (3) pathway analysis, and (4) data modelling in a system-wide context. In this chapter we will present some of these analytical methods and discuss ways in which data can be made accessible to both the specialised bioinformatician and, in particular, the research scientist.
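As a rough illustration of step (2), statistical data analysis, the short sketch below runs one two-sample test per gene and applies a Benjamini–Hochberg false-discovery-rate correction, a common choice for HT data. This is a minimal sketch, not a method prescribed by the chapter: the expression values and gene symbols are invented, and SciPy is assumed to be available.

```python
# Minimal sketch of step (2), statistical data analysis: per-gene two-sample
# t-tests between two phenotypes, with Benjamini-Hochberg FDR correction.
# The toy expression values and gene names are invented for illustration.
from scipy import stats

# rows: genes; columns: replicate expression measurements per phenotype
expression = {
    "TP53":  ([5.1, 4.8, 5.3], [7.9, 8.2, 7.6]),   # (healthy, diseased)
    "GAPDH": ([9.0, 9.2, 8.9], [9.1, 8.8, 9.3]),
    "MYC":   ([3.2, 3.5, 3.1], [6.0, 5.7, 6.4]),
}

# One Welch t-test per gene
pvals = {gene: stats.ttest_ind(a, b, equal_var=False).pvalue
         for gene, (a, b) in expression.items()}

# Benjamini-Hochberg step-up: adjusted p = p * m / rank, made monotone
# by sweeping from the largest p-value downwards
ranked = sorted(pvals.items(), key=lambda kv: kv[1])
m = len(ranked)
adjusted, running_min = {}, 1.0
for rank, (gene, p) in reversed(list(enumerate(ranked, start=1))):
    running_min = min(running_min, p * m / rank)
    adjusted[gene] = running_min

for gene in sorted(adjusted, key=adjusted.get):
    print(f"{gene}: p={pvals[gene]:.4g}, BH-adjusted={adjusted[gene]:.4g}")
```

On real HT datasets the same pattern is applied to thousands of genes at once, which is exactly why the multiple-testing correction in the second half of the sketch matters.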
2. Materials

There are a variety of definitions of the term HT; however, we can loosely apply this term to cases where automation is used to increase the throughput of an experimental procedure. HT technologies exploit robotics, optics, chemistry, biology and image analysis research. The explosion in data production in the public domain is a consequence of falling equipment prices, the opening of major national screening centres and new HT core facilities at universities and other academic institutes. The role of bioinformatics in HT technologies is of essential importance.

2.1. Genomics High-Throughput Technologies

High-Throughput Sequencing (HTS) technologies are used not only for traditional applications in genomics and metagenomics (see Note 2), but also for novel applications in the fields of transcriptomics, metatranscriptomics (see Note 3), epigenomics (see Note 4), and studies of genome variation (see Note 5). Next-generation sequencing platforms allow the determination of sequence data from amplified single DNA fragments and have been developed specifically to lend themselves to robotics and parallelisation. Current methods can directly sequence only relatively short (300–1,000 nucleotides long) DNA fragments in a single reaction. Short-read sequencing technologies dramatically reduce the sequencing cost. There were initial fears that the increase in quantity might result in a decrease in quality, and improvements in accuracy and read length are still being sought. Despite this, these advances have significantly reduced the cost of several sequencing applications, such as resequencing individual genomes (2) and readout assays (e.g. ChIP-seq (3) and RNA-seq (4)).

2.2. Transcriptomics High-Throughput Technologies

The transcriptome is the set of all messenger RNA (mRNA) molecules, or "transcripts", produced in one cell or a population of cells. Several methods have been developed in order to gain expression information at a high-throughput level. Global gene expression analysis has been conducted either by hybridisation with oligonucleotide microarrays, or by counting of sequence tags. Digital transcriptomics with pyrophosphatase-based ultra-high-throughput DNA sequencing of ditags represents a revolutionary approach to expression analysis, which generates genome-wide expression profiles. ChIP-Seq is a technique that combines chromatin immunoprecipitation with sequencing technology to identify and quantify in vivo protein–DNA interactions on a genome-wide scale. Many of these applications are directly comparable to microarray experiments; for example, ChIP-chip and ChIP-Seq are for all intents and purposes the same (5). The most recent increase in data generation in this evolving field is due to novel cycle-array sequencing methods (see Note 6), also known as next-generation sequencing (NGS), more commonly described as second-generation sequencing, which are already being used by technologies such as next-generation expressed-sequence-tag sequencing (see Note 7).

2.3. Proteomics High-Throughput Technologies

Proteomics is the large-scale study of proteins, particularly their expression patterns, structures and functions, and there are various HT techniques applied to this area. Here we explore two main proteomics fields: high-throughput mass spectrometry and protein–protein interactions (PPIs).

2.3.1. Mass Spectrometry High-Throughput Technologies

Mass spectrometry is an important emerging method for the characterisation of proteins. It is also a rapidly developing field which is currently moving towards large-scale quantification of specific proteins in particular cell types under defined conditions. The rise of gel-free protein separation techniques, coupled with advances in MS instrumentation sensitivity and automation, has provided a foundation for high-throughput approaches to the study of proteins. The identification of parent proteins from derived peptides now relies almost entirely on the software of search engines, which perform in silico digests of protein sequences to generate peptides; the molecular mass of each peptide is then matched to the mass of the experimentally derived protein fragments.
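The matching step just described can be sketched in a few lines of Python. The cleavage rule used below (cut after K or R except before P, the classic trypsin rule), the reduced residue-mass table, the toy protein sequence and the "observed" masses are all illustrative assumptions rather than the chapter's own procedure; real search engines operate on full sequence databases with configurable enzymes, modifications and mass tolerances.

```python
# Toy sketch of a search-engine step: in silico tryptic digest of a protein
# sequence, then matching peptide monoisotopic masses against observed masses.
# The sequence, tolerance and observed masses are invented for illustration.

# Monoisotopic residue masses (Da) for a few amino acids; a real engine covers all 20.
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "V": 99.06841, "L": 113.08406, "K": 128.09496, "R": 156.10111,
           "E": 129.04259, "F": 147.06841}
WATER = 18.01056  # added once per peptide (free N- and C-termini)

def tryptic_digest(seq):
    """Cleave after K or R, except when the next residue is P (trypsin rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        if aa in "KR" and (i + 1 == len(seq) or seq[i + 1] != "P"):
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides

def peptide_mass(pep):
    return sum(RESIDUE[aa] for aa in pep) + WATER

protein = "GASPVKLEFRPALKSGVR"          # hypothetical sequence
observed = [557.3173, 417.2336]          # hypothetical spectrum-derived masses

for pep in tryptic_digest(protein):
    mass = peptide_mass(pep)
    hits = [m for m in observed if abs(m - mass) < 0.01]  # 0.01 Da tolerance
    print(f"{pep:>10} {mass:9.4f} Da {'<-- match' if hits else ''}")
```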
2.3.2. Interactomics High-Throughput Technologies

Studying protein–protein interactions provides valuable insights into many fields by helping precisely understand a protein's role inside a specific cell type, and many of the techniques commonly used to experimentally determine protein interactions lend themselves to high-throughput methodologies. Complementation assays (e.g. two-hybrid) measure the oligomerisation-assisted complementation of two fragments of a single protein which, when united, result in a simple biological readout; the two protein fragments are fused to the potential bait and prey interacting partners, respectively. This methodology is easily scalable to HT since it can yield very high numbers of coding sequences assayed in a relatively simple experiment, and a wide variety of interactions can be detected and characterised following one single, commonly used protocol. However, the proteins are being expressed in an alien cell system with a loss of temporal and physiological control of expression patterns, resulting in a large number of false-positive interactions. Affinity-based assays, such as affinity chromatography, pull-down and coimmunoprecipitation, rely on the strength of the interaction between two entities. These techniques can be used on interactions which form under physiological conditions, but are only as good as the reagents and techniques used to identify the participating proteins. High-throughput mass spectrometry is increasingly used for the rapid identification of the participants in an affinity complex. Physical methods depend on the properties of molecules to enable measurement of an interaction, as typified by techniques such as X-ray crystallography and enzymatic assays. High-quality data can be produced, but highly purified proteins are required, which has always proved a rate-limiting step. The availability of automated chromatography systems and custom robotic systems that streamline the whole process, from cell harvesting and lysis through to sample clarification and chromatography, has changed this, and increasing amounts of data are being generated by such experiments.

2.4. Challenges in HT Technologies

It is now largely the case that high-throughput methods exist for all or most of the Omics domains. The challenge now is to prevent bottlenecks appearing in the storage, annotation and analysis of the data. First, the data required to describe both how an experiment was performed and the results generated by it must be defined. A place to store that information must be identified, a means by which it will be gathered has to be agreed upon, and ways in which the information will be queried, retrieved and analysed must also be decided. Data in isolation is of limited use, so ideally the data format chosen should be appropriate to enable the combination and comparison of multiple datasets, both in-house and with other groups working in the same area. HT data is increasingly used in a broader context beyond the individual project; consequently, it is becoming more important to standardise and share this information appropriately and to pre-interpret it for scientists who were not involved with the experiment, whilst still making the raw data available for those who wish to perform their own analyses.

2.5. Bioinformatics Concepts

In high-throughput research, knowledge discovery starts by collecting, selecting and cleaning the data in order to fill a database. A database is a collection of files (an archive) of consistent data that are stored in a uniform and efficient manner. A relational database consists of a set of tables, each storing records (instances). A record is represented as a set of attributes, each of which defines a property of the record. Attributes can be identified by their name and store a value. All records in a table have the same number and type of attributes.
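These terms map directly onto a concrete implementation. The minimal sketch below, using Python's built-in sqlite3 module, creates one table whose records share the same named, typed attributes and whose accession column acts as a primary key, anticipating the unique identifiers discussed next. The schema and rows are our own illustration, not a design taken from the chapter.

```python
# Minimal sketch of the relational concepts above using Python's built-in
# sqlite3: one table, records as rows, attributes as named, typed columns,
# and a unique identifier (accession) as the primary key. The schema and
# example rows are invented for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE protein (
        accession   TEXT PRIMARY KEY,   -- unique identifier for each record
        gene_name   TEXT NOT NULL,      -- attribute: associated gene symbol
        organism    TEXT NOT NULL,      -- attribute: source organism
        length_aa   INTEGER NOT NULL    -- attribute: sequence length in residues
    )
""")
con.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    [("P04637", "TP53", "Homo sapiens", 393),
     ("P38398", "BRCA1", "Homo sapiens", 1863)],
)

# Every record has the same attributes; the primary key makes lookups unambiguous.
for row in con.execute("SELECT accession, gene_name, length_aa FROM protein"):
    print(row)
```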
Database design is a crucial step in which the data requirements of the application first have to be defined (conceptual design), including the entities and their relationships. Logical design is the implementation of the database using a database management system, which ensures that the process is scalable. Finally, the physical design phase estimates the workload and refines the database design accordingly. It is during this phase that table designs are optimised, indexing is implemented and clustering approaches are tuned. These steps are fundamental in order to obtain fast responses to frequent queries without jeopardising the database's integrity (e.g. through redundancy). Primary or archival databases contain information directly deposited by submitters and give an exact representation of their published data, for example DNA sequences, DNA and protein structures, and DNA and protein expression profiles. Secondary or derived databases are so called because they contain the results of analyses of the primary resources, including information on sequence patterns or motifs, variants and mutations, and evolutionary relationships.

The fundamental characteristic of a database record is a unique identifier. This is crucial in biology given the large number of situations where a single entity has many names, or one name refers to multiple entities. To some extent, this problem can be overcome by the use of an accession number, a primary key assigned by a reference database to describe the appearance of that entity in that database. For example, the UniProtKB protein sequence database accession number of the human p53 gene products (P04637) gives access to the sequences of all the isoforms of these proteins, gene and protein nomenclature, as well as a wealth of information about the protein's function and role in the cell. More than one protein sequence database exists, and the vast majority of protein sequences exist in all of them. Fortunately, resources to translate between these multiple accession numbers now exist, for example the Protein Identifier Cross-Reference (PICR) Service at the European Bioinformatics Institute (EBI) (see Note 8).

The Omics fields share with all of biology the challenge of handling ever-increasing amounts of complex information effectively and flexibly. A crucial step in bioinformatics is therefore to choose the appropriate representation of the data. One of the simplest but most efficient approaches has been the use of controlled vocabularies (CVs), which provide a standardised dictionary of terms for representing and managing information. Ontologies are structured CVs. An excellent example of this methodology is the Gene Ontology (GO), which describes gene products in terms of their associated biological processes.
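To make the idea of a structured CV concrete, the toy sketch below encodes a hand-picked fragment of the GO biological-process hierarchy as a directed acyclic graph and walks its is_a edges. The term selection and the simplified parentage are illustrative assumptions; the real ontology contains tens of thousands of terms and several relation types.

```python
# Toy sketch of a structured controlled vocabulary: a tiny fragment of the
# Gene Ontology represented as a directed acyclic graph of is_a relations.
# Only a handful of GO terms are included and their parentage is simplified;
# the traversal helper is purely illustrative.

# child term -> list of parent terms (is_a edges)
IS_A = {
    "GO:0006281": ["GO:0006974", "GO:0090304"],   # DNA repair
    "GO:0006974": ["GO:0033554"],                 # response to DNA damage
    "GO:0033554": ["GO:0050896"],                 # cellular response to stress
    "GO:0090304": ["GO:0008152"],                 # nucleic acid metabolic process
    "GO:0050896": ["GO:0008150"],                 # response to stimulus
    "GO:0008152": ["GO:0008150"],                 # metabolic process
    "GO:0008150": [],                             # biological_process (root)
}

def ancestors(term):
    """All terms reachable via is_a edges from the given term."""
    seen, stack = set(), [term]
    while stack:
        for parent in IS_A.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# A gene product annotated with "DNA repair" is implicitly annotated with
# every ancestor term, up to the biological_process root.
print(sorted(ancestors("GO:0006281")))
```

Because the vocabulary is structured, an annotation to a specific term automatically carries all of its more general ancestors, which is what makes cross-dataset queries over CV-annotated data possible.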

Content Overview

Author: Bernd Mayer (Austria)

Table of Contents

Preface
Contributors
Part I: Fundamentals of Omics Bioinformatics
  Chapter 1: Omics Technologies, Data and Bioinformatics Principles
  Chapter 2: Data Standards for Omics Data: Data Sharing and Reuse
  Chapter 3: Omics Data Management and Annotation
  Chapter 4: Data and Knowledge Management in Cross-Omics Research Projects
  Chapter 5: Statistical Principles for the Analysis of Omics Data
  Chapter 6: Statistical Methods and Models for the Integrated Analysis of Omics Data Across Different Levels
  Chapter 7: Analysis of Time-Course Omics Datasets
  Chapter 8: The Proper Use of "Omics" Terminology
Part II: Common Omics Data Types and Analysis Methods
  Chapter 9: Computational Analysis of High-Throughput Sequencing Data
  Chapter 10: Single Nucleotide Polymorphism Analysis in Case-Control Studies
  Chapter 11: Bioinformatics Analysis of Copy Number Variation Data
  Chapter 12: Processing Immunoprecipitation-Based Microarray Data: From Raw Image Generation to Browsing of Analysis Results
  Chapter 13: Global Mechanism Analysis and Disease Relevance Based on Gene Expression Profiles
  Chapter 14: Bioinformatics Analysis of Transcriptome Data
  Chapter 15: Bioinformatics Analysis of Qualitative and Quantitative Proteome Data
  Chapter 16: Bioinformatics Analysis of Mass Spectrometry Metabolome Data
Part III: Applied Omics Bioinformatics
  Chapter 17: Computational Analysis Workflows for Omics Data Processing
  Chapter 18: Strategies for the Integration, Storage and Analysis of Omics Data
  Chapter 19: Integration of Omics Data in Signalling Pathway and Interaction Network Construction and Functional Analysis
  Chapter 20: Network Inference from Time-Dependent Omics Data
  Chapter 21: Omics and Literature Mining
  Chapter 22: Applications of Omics and Bioinformatics in Clinical Data Processing
  Chapter 23: Omics-Based Analysis of Pathological and Physiological Processes
  Chapter 24: Data Mining Methods for Omics-Based Biomarker Discovery
  Chapter 25: Integrative Bioinformatics Analysis for Cancer Target Identification
  Chapter 26: Omics-Based Identification of Molecular Targets and Biomarkers
Index

Editorial Recommendation

Bioinformatics for Omics Data: Methods and Protocols (guided reading edition) offers a thorough introduction to Omics data bioinformatics from multiple perspectives. The book is organised into three parts. Part I covers core analysis strategies, standard analysis conventions, data management guidelines, and the basic statistical methods used to analyse Omics data. Part II presents bioinformatics analysis methods for the various kinds of Omics data, including genomic, transcriptomic, proteomic and metabolomic data, covering basic concepts and experimental background as well as fundamental methods for raw data preprocessing and in-depth analysis. Part III presents worked examples of Omics data analysis with bioinformatics, including the identification of biomarkers and targets relevant to human disease. Edited by Bernd Mayer.

About the Author

Laboratory Solutions: Bioinformatics for Omics Data: Methods and Protocols (guided reading edition) was written by invited specialists in the field in order to provide readers with a practical guide. It introduces readers to an entirely new research area, the bioinformatics of Omics data, a field that intersects and integrates molecular biology, applied informatics, statistics and other disciplines.
The book's coverage is thorough, and it is divided into three major parts. The first introduces fundamental analysis strategies for Omics data, standardisation, management guidelines, and basic statistics. The second presents analysis strategies specific to each kind of data, organised by topic: genomics, transcriptomics, proteomics, metabolomics and so on. The final part illustrates concrete applications of Omics bioinformatics, using disease-related biomarker and target identification as examples. In keeping with the style of Springer's Methods in Molecular Biology series, the book is clearly written and easy to use; each chapter includes an introduction to its topic, a list of required materials, readily reproducible protocols, notes on troubleshooting, and tips on avoiding common mistakes.
Both authoritative and accessible, the volume serves as an ideal guide for researchers from different professional backgrounds, and paints a fascinating picture of this research field.

