Background on Molecular Biology
To understand the fundamentals of microbial ecology, we first need to cover some basics of molecular biology. The central "dogma," or deeply held truth, of molecular biology describes how cells make proteins: DNA is transcribed into RNA, and RNA is translated into protein.
Some quick terms if it's been a while since you've taken a biology class.
DNA (deoxyribonucleic acid)-- double stranded-- very stable
RNA (ribonucleic acid)-- single stranded-- very unstable-- mostly known as messenger RNA or mRNA. Another form is rRNA, or ribosomal RNA, which is what we are talking about today.
Proteins-- mostly enzymes that make reactions happen; they are made from amino acids. I think of them as the "capital equipment" of the cellular world.
When a cell wants to make a protein, usually an enzyme, it opens up its DNA where the gene that encodes the protein is located. It transcribes the ATGC's of DNA into a single-stranded mRNA. The mRNA finds a ribosome (pictured below), which reads it and puts the corresponding amino acids together to make a protein.
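As a toy illustration of those two steps, here is a minimal Python sketch. The 15-base "gene" is made up, the codon table lists only the five codons that example needs (the real table has 64), and the function names are my own. It also skips the detail that the cell actually reads the opposite (template) strand.

```python
# Toy codon table: only the codons used in this example (the real table has 64).
CODON_TABLE = {
    "AUG": "Met",  # start codon
    "GCU": "Ala",
    "AAA": "Lys",
    "UGG": "Trp",
    "UAA": None,   # stop codon
}


def transcribe(coding_dna):
    """Return the mRNA matching the coding strand: same letters, with T swapped for U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Read the mRNA three bases at a time, collecting amino acids until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid is None:  # stop codon reached
            break
        protein.append(amino_acid)
    return protein


gene = "ATGGCTAAATGGTAA"  # made-up gene, for illustration only
mrna = transcribe(gene)
print(mrna)             # AUGGCUAAAUGGUAA
print(translate(mrna))  # ['Met', 'Ala', 'Lys', 'Trp']
```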
Great! Easy, right? This is where it gets a little strange. The ribosome above is usually thought of as a protein; however, only the purple is protein. All the orange-tan stuff is actually RNA! This particular RNA strand is encoded by the 16S rRNA gene. This is the gene we are sequencing to identify bacteria.
This rRNA "breaks" the central dogma of biology because RNA is doing the job of a "protein" in order to make a protein. It also answers the chicken-and-egg question: what came first, proteins or DNA? Neither, really; RNA likely came first. A Nobel Prize in Chemistry was awarded in 1989 to Dr. Thomas R. Cech (shared with Dr. Sidney Altman) for discovering that RNA can act as a catalyst. Interestingly, the RNA he used for his studies was actually isolated and originally identified by Dr. Norman R. Pace, one of the greatest microbial ecologists of all time, if not the greatest. Dr. Pace also collaborated with Dr. Carl Woese, and together they used this gene to map out the modern tree of life (pictured below middle). He did most of his work at CU Boulder, just down the road from my alma mater, which gave me the privilege of hearing him speak several times.
If you read only one academic publication in your life, I would recommend "A Molecular View of Microbial Diversity and the Biosphere," written by Dr. Pace in 1997.
My favorite quote:
At most only a few of these microbes would be known to us; only about 5000 noneukaryotic organisms have been formally described (in contrast to the half-million described insect species). We know so little about microbial biology, despite it being a part of biology that looms so large in the sustenance of life on this planet.
It seems crazy that it wasn't really until 1997 that we began to realize the true diversity of microorganisms all around us, just unseen. The majority of Dr. Pace's work was done without next-gen sequencing and powerful computational tools. We can now nearly automate what used to take a decade of work by many labs and do it in about 30 days with 2 people (assuming there isn't a pandemic). This, of course, is only possible because of the foundational work done by people like Dr. Pace.
So why are we talking about this?
Why does all this 16S gene stuff matter? It matters because pretty much all life has this gene. If we are looking for a "universal biomarker gene" to identify as much life as possible, this gene is a great choice for that reason. Additionally, because parts of the gene are essential to life, they haven't mutated much in hundreds of millions of years. These are called "conserved regions" of DNA, and they make it possible to design primers that amplify parts of this gene from, ideally, all life. Interestingly, some non-essential parts, or "un-conserved regions," of the 16S gene are considered "hyper-variable." The differences in these hyper-variable regions allow us to use this gene to identify bacteria based on how similar or different they are. The mutations can also be traced to form what is called a phylogenetic tree, like the one in the middle of the pictures above.
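To make "conserved" versus "hyper-variable" a bit more concrete, here is a small Python sketch. The four aligned sequences are invented for illustration (real 16S genes are roughly 1,500 bases long), and `variable_positions` is a hypothetical helper, not part of any sequencing tool. It simply flags the alignment columns where the sequences disagree.

```python
# Four made-up, already-aligned sequences; real 16S genes are ~1,500 bases long.
aligned = [
    "ACGTACGTTAGC",
    "ACGTACCTTAGC",
    "ACGTAGATTAGC",
    "ACGTATGTTAGC",
]


def variable_positions(seqs):
    """Return the indexes of alignment columns where the sequences do not all agree."""
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]


variable = variable_positions(aligned)
print(variable)  # [5, 6]

# Mark conserved columns with '-' and variable columns with '*'.
mask = "".join("*" if i in variable else "-" for i in range(len(aligned[0])))
print(mask)  # -----**-----
```

In this toy alignment, the flanking columns are identical in every sequence (the kind of place a primer could bind), while the two middle columns are the ones that tell the sequences apart.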
You can think of the 16S gene as a barcode that all life has and that can be compared to other barcodes. The more similar the "barcodes" of ATCG's are, the more closely related the organisms are. At one point it was established that all 16S gene sequences with <3% difference in base pairs would equate to a "species." The definition of a species in the microbial world is fundamentally different from the one in our macro world. This causes some problems with using traditional ecological metrics, which I'll try to touch on later; but ultimately the lines of what a "species" is get really blurry.
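The barcode comparison can be sketched in a few lines of Python. This assumes the two sequences are already aligned to the same length, which real tools have to work out first; the sequences and function names are made up for illustration. The 3% cutoff is the one mentioned above.

```python
def percent_difference(seq_a, seq_b):
    """Percentage of positions that differ between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return 100.0 * mismatches / len(seq_a)


def same_species(seq_a, seq_b, max_difference=3.0):
    """Apply the historical rule of thumb: under 3% difference counts as one 'species'."""
    return percent_difference(seq_a, seq_b) < max_difference


reference = "ACGT" * 25           # made-up 100-base barcode
close = "TT" + reference[2:]      # differs at 2 of 100 positions
distant = "TTAAT" + reference[5:] # differs at 5 of 100 positions

print(percent_difference(reference, close))    # 2.0
print(same_species(reference, close))          # True
print(percent_difference(reference, distant))  # 5.0
print(same_species(reference, distant))        # False
```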
The current state of the field is not to cluster sequences at all, but to look at each unique sequence independently. As you can imagine, this leads to incredible complexity very quickly. This is why we need ecological metrics and statistical models... so we humans can start to understand what it all means. You can think of them as a data filter, because you are going to get nowhere fast looking strictly at a list of >500 different species in diverse systems.
In the upcoming posts, I am going to go through the set of sequencing data I briefly presented at the NC State conference. This data set has only 17 samples in total: mostly bokashi samples, plus some teas, manure, and vermicompost. I think this will be a good starting point to understand some of the basic diversity metrics and how we can use them to compare samples.
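Here is a rough Python sketch of what "looking at each unique sequence" and then boiling it down to a metric could look like. The six reads are made up, and the Shannon index is just one commonly used diversity metric that I am using as an example; it is not necessarily the one a given pipeline reports.

```python
import math
from collections import Counter


def count_unique_sequences(reads):
    """Tally how many times each exact sequence appears, with no clustering."""
    return Counter(reads)


def shannon_index(counts):
    """Shannon diversity: -sum(p * ln(p)) over the proportion p of each unique sequence."""
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


reads = ["ACGT", "ACGT", "ACGA", "ACGT", "TTGA", "ACGA"]  # made-up reads
table = count_unique_sequences(reads)
richness = len(table)  # number of unique sequences observed

print(table.most_common())  # [('ACGT', 3), ('ACGA', 2), ('TTGA', 1)]
print(richness)             # 3
print(shannon_index(table))
```

The point of the metric is the filtering: a table with hundreds of rows collapses into one or two numbers that can be compared across samples.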
Any questions, please ask away. I hope this is the most technical post I'll make.