Final Reflections

Since building the pipeline to analyze the alleles of an individual, I’ve been waiting anxiously to actually put it to the test and see whether any variables need to be adjusted. Of the 80 samples we had sent out to be sequenced, however, only 29 were successful. This means that rather than being halfway through the 160 samples that were collected, we are only around a fifth of the way through. Despite this, analysis has begun on the samples we do have, though with so few successfully sequenced we will be unable to draw any meaningful conclusions at this stage.

Preliminary analysis reveals that most of the individuals seem to contain around three MHC alleles. As the data was received less than a week ago, analysis of individuals is still ongoing, and nothing can yet be said at the population level. Testing with these new individuals has led to minor tweaking of the variables used to trim the hierarchical cluster, as visual inspection did not quite agree with the expected results. This blog will be updated once final results are in.

Further work is needed not only to amplify the second half of the samples, but also to retry those samples that failed to be sequenced. Once these are done, further minor tweaking of the pipeline may be required as the new sequences are passed through; for the most part, however, this part of the project is done, and running sequences through should be trivial. To get a better feel for what different alleles do to the structure of the MHC protein, 3D modeling will also be done, to reveal what conformational changes different residues can cause. I hope to work not only on getting all of the sequencing to work, but also on the 3D modeling of the proteins.

This project has taught me a great deal about bioinformatics and the higher mathematical tools used for these analyses, which I would never have learned about in my regular curriculum. By taking the time to truly understand how each of these tools and algorithms works, rather than just following directions for their use, I’ve gained a great appreciation for the complexities of these technologies. Many of the problems involved in the analysis are what are known as NP-hard problems, which cannot be solved exactly in any reasonable amount of time (or even within the lifetime of the universe) once the data being analyzed is large enough. Creative shortcuts are needed to arrive at close-to-optimal solutions, and it is through understanding the algorithms currently in use that new ones may be discovered. Hopefully someday soon I will be contributing new ways of solving these problems to the growing frontier of this field.

Putting It All Together

Thus far, this blog has covered the background of how my work operates conceptually. Having gotten the concepts down, much of my work has been a simple refining process, trying to find the best way to do each task in preparation for analyzing the large stream of data that will soon be headed my way. While most of my conceptual programming was done in C#, the actual full implementation was done in R. The language already has proper implementations of several of the tools I need; it is just a matter of making small tweaks to some of the packages so that they work together more seamlessly.

First, the file has to be read into R using the FASTA reader from the ape package. This creates a DNA object which contains the identifying string for each read and the actual sequence from that read. The information is then passed to MUSCLE, a well-established tool for aligning DNA strands which is far more advanced than the simple pattern-recognition program I had tried in C#. The aligned sequences are then used to calculate a distance matrix, which establishes how genetically related any two sequences are. While my C# program did a simple comparison and calculated a percent difference, this distance calculation takes more into account. The distance matrix is then transformed into a hierarchical tree, as was described in an earlier post. This tree then needs to be cut so that the excess data is excluded. First, any nodes containing less than 4% of the reads are removed; a node with so few sequences is likely either the result of an error or too narrow a perspective. Next, the tree is pruned based on height: any nodes sitting more than a certain height away from the rest of the tree are considered erroneous and ignored. This narrows the tree down into several groups, each of which can be made to return the list of reads it contains. Given the reads in each group, a putative allele for that group can then be calculated. The number of these consensus alleles tells us how many alleles of the MHC gene exist in that individual. Once all of the alleles are established on an individual basis, we can begin to look at the population level.
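
For anyone who wants to see roughly what this looks like, below is a stripped-down sketch of that pipeline in R. It assumes the ape package and the external MUSCLE program are installed; the file name is a placeholder, and the fixed cut height and 4% rule stand in for the slightly more involved trimming described above, so treat it as an outline rather than the actual script.

library(ape)

reads   <- read.FASTA("individual_01.fasta")   # placeholder file name; one individual's reads
aligned <- muscle(reads)                       # align the reads (calls the external MUSCLE program)
d       <- dist.dna(aligned, model = "raw",    # pairwise genetic distances from the alignment
                    pairwise.deletion = TRUE)
tree    <- hclust(d, method = "average")       # build the hierarchical tree from the distances

groups  <- cutree(tree, h = 0.2)               # cut the tree into groups of similar reads
sizes   <- table(groups)
keep    <- as.integer(names(sizes)[sizes >= 0.04 * length(groups)])   # drop groups with < 4% of reads
# each group in 'keep' is a putative allele; a consensus sequence is then computed from its reads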

Determining the Putative Alleles of an Individual

Once the distance matrix, which compares each sequence to every other sequence to determine how similar they are, has been calculated, a graph may be constructed. Self-organizing maps are a type of artificial neural network that uses unsupervised learning to produce a result. While they can be used for a variety of tasks, here they are being used to help group the sequences. On a basic level, the computer creates a virtual node and places it at a random position on the graph. It then chooses a single data point, and the artificial node is tugged towards that point, with the exact amount of tugging determined by the programmer and called the learning rate. This process is repeated hundreds or even thousands of times, until the node converges to a position. The programmer can take a few different approaches here, either declaring a set number of nodes to create or telling the computer to continue creating nodes until some condition is met.

While this may at first seem weird and unintuitive, it can be thought of another way. On our 2-dimensional graph of the reads, certain sections of the graph have higher concentrations of points than others. When an artificial node enters one of these dense regions, it will often become “stuck,” because any movement towards a tug from a different point would make it a less accurate fit to the data. It is as if the node has become caught in a hole and cannot escape. In a way, it has, for the node has encountered something called a “local minimum.” This can be imagined by thinking of a flat sheet overlaid with the data we are using. Each data point has a weight to it and creates a dent in the sheet, so larger groups create larger dents. If a node is tugged on by a point, but moving would make the node fit the data less well, it will not move. Or, continuing with the metaphor, the tug is not strong enough to get the node up out of the hole. Because of this, it is important to set the strength of the tug, or learning rate, properly. If the tug is too strong, the node will never settle down; if it is too weak, the node will become “stuck” too easily. When the computer senses that a node has become stuck, it generates another node at a random position and allows it to go through the same process. Because there are likely several groups of concentrated data, there are other local minima in which the new nodes may become caught.

Once all of our nodes have been caught, or once new nodes have begun to overlap and find the same local minima, the process is complete. Based on where the nodes have ended up, groups are created. Having nodes end up in seven different places would mean that there are seven different groups of data, and in this case, seven different versions of the MHC gene in that individual fish.
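
For the curious, the kohonen package offers one way to play with this in R. The sketch below is only illustrative: it flattens the distance matrix from earlier (assumed here to be stored in an object called d) into 2-D coordinates with classical multidimensional scaling, then trains a small fixed grid of nodes, which is not quite the grow-until-stuck process described above but gives the same flavor of grouping. The grid size, number of iterations, and learning rates are made-up values, not the ones used in the project.

library(kohonen)

# 'd' is assumed to be the pairwise distance matrix between reads (a dist object)
coords <- cmdscale(d, k = 2)             # flatten the distances into 2-D coordinates

som_map <- som(scale(coords),
               grid  = somgrid(xdim = 3, ydim = 3, topo = "hexagonal"),
               rlen  = 500,              # number of training iterations
               alpha = c(0.05, 0.01))    # learning rate, decreasing over training

table(som_map$unit.classif)              # how many reads each node ended up "capturing"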

While self-organizing maps are an effective way of grouping data, they can prove to be quite sensitive to the chosen learning rate. Because of this and because of my inexperience with machine learning, I used this method purely as a backup to another tried and true method: hierarchical clustering.

Hierarchical clustering can be agglomerative, if it starts with small groups and builds up to the whole, or divisive, if it starts with one massive cluster of data and progressively breaks it down into smaller clusters. Essentially, in divisive clustering the computer tries to create two groups out of the cluster and then calculates the distance between the two groups using one of various distance measures. Unfortunately, this problem is what is known as NP-hard, short for non-deterministic polynomial-time hard. In English, what this means is that the amount of time it takes to solve the problem goes up extremely rapidly as the number of items being solved for increases. In this case, the problem is the number of possible ways the data may be split when trying to find the two groups with the most distance between them. The computer needs to go through each and every combination, and the number of combinations grows as roughly 2^n, where n is the number of data points. With a mere 10 data points, that means 2^10, or 1,024 possible combinations. Increase this to 30 data points and there are suddenly 1,073,741,824 possible combinations. Even with only the 100-150 reads we have in this experiment, the problem is not practically solvable; in fact, the time required to check every combination would exceed the lifetime of the universe. Because of this, we need to do some trickery to make this a more feasible problem.
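
To get a feel for that blow-up, one line of R is enough (using the same rough 2^n count as above):

n <- c(10, 30, 100)
2^n    # roughly 1e3, 1e9, and 1.3e30 possible combinations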

If we instead use agglomerative clustering, the problem becomes a good deal better, and in some special cases can be reduced to roughly n^2 operations, where n is the number of data points. Having already calculated the distance matrix, it is fairly easy to begin: take the two points that are most closely related and join them together in the tree. This continues, moving up the tree until all the groups can be traced to a single group. If one thinks of it as a family tree, it is like starting with yourself at the top of the tree and tracing down to your parents, then your grandparents, then your great-grandparents, and so on. Now that we have a complete tree, we use a further bit of trickery to trim out excess data. First, we trim off branches that contain less than 4% of the total number of data points. We can safely say that any group containing so little data is unlikely to represent a variation of the gene, and is more probably either false data or too narrow a view to be useful. Then, we narrow the data further by saying that any branch where the distance between the two clusters is less than 0.2 can also be ignored. This narrows our tree down to a few groups, all of which are considered putative alleles, possible variations of our gene of interest. From each of these, we can take the sequences in the group and generate a consensus allele, essentially the idealized version of all of the sequences in the group. This consensus allele is the gene variant from which all of the sequences in the group came; the individual reads differ slightly from it because of the fast-and-dirty way that next-gen sequencing works.
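
As a hedged illustration of that last step, suppose we already have the aligned reads as a character matrix, the group assignment of each read after cutting the tree, and the list of groups that survived the 4% filter; a crude consensus is then just a per-column majority vote. Real consensus callers (seqinr’s consensus() function, for instance) are more careful about ties, gaps, and ambiguity codes, so the sketch below is an illustration only.

# Illustrative only. Assumes:
#   aligned - the MUSCLE-aligned reads (a DNAbin matrix from ape)
#   groups  - the group assignment of each read after cutting the tree
#   keep    - the group numbers that survived the 4% filter
aln_chars <- as.character(aligned)        # one row per read, one column per aligned position

majority_consensus <- function(rows) {
  # crude per-column majority vote across the reads in one group
  apply(aln_chars[rows, , drop = FALSE], 2,
        function(col) names(which.max(table(col))))
}

putative_alleles <- lapply(keep, function(g) {
  paste(majority_consensus(which(groups == g)), collapse = "")
})
length(putative_alleles)   # how many putative alleles this individual carries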

Now that we have the gene variants from one individual, we need to repeat the same process for every other individual we have gathered. Thank goodness for automation.
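
Mercifully, the automation is the easy part in R: wrap the per-individual steps in a function and map it over the folder of FASTA files. The folder name and the analyze_individual() function below are made up for illustration; the latter just stands in for the read-align-cluster-trim pipeline described in these posts.

fasta_files <- list.files("sequenced_individuals", pattern = "\\.fasta$", full.names = TRUE)
alleles_by_fish <- lapply(fasta_files, analyze_individual)   # one set of putative alleles per fish
names(alleles_by_fish) <- basename(fasta_files)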

Bioinformatics of Next-Gen Sequencing

The primary goal of this project is to determine the genetic differences in fish populations located upstream and downstream of various sources of pollution. It is anticipated that the polluted areas will have an impact on the relationship between fish and the surrounding pathogens. There are several ways in which this relationship could swing: the pollution may be detrimental to the pathogens, decreasing their numbers and allowing the fish to get by with comparatively weaker immune systems than the fish in nonpolluted areas. Alternatively, the pollutants may favor pathogens, resulting in an increase in their populations and diversity, causing the fish to exhibit heightened immune systems compared to fish in nonpolluted areas. The levels of pathogens may remain about the same, but different pathogen species may be more prevalent in polluted versus nonpolluted areas, resulting in different patterns of immune response. The fish themselves may also have their immune systems weakened by the pollutants, requiring them to step up their immune system to achieve the same level of success as fish in nonpolluted areas.

In order to test for these possibilities and determine the relationship between fish, pollutants, and pathogens, we are examining the genes of the MHC, or major histocompatibility complex. In essence, these are the genes which detect whether proteins are self or non-self. An invading pathogen produces proteins which differ from those produced by the fish itself, and it can thus be detected and targeted for attack by the immune system. MHC genes do not always follow the same pattern as most other genes, where there are two copies, one from the mother and one from the father. Instead, MHC genes can be duplicated throughout the genome, sometimes appearing in many more than two copies. While MHC genes are beneficial, too many of them can also be a bad thing. As the number of MHC genes increases, so does the number of “false positives,” in which a self cell is mistaken for non-self and targeted. To limit these mistakes, it is best to strike a balance, so that dangerous non-self pathogens can be detected while the number of false positives is kept low. Because of this, we can use MHC genes to compare the effects of pathogens on fish in polluted environments versus nonpolluted ones by looking at the number of alleles, or variations of the MHC gene, within each population. A high number of alleles in a population implies a greater threat from pathogens, while differences in alleles between populations up- and downstream of pollutants indicate differences in the pathogen variants present.

To determine the alleles in a population, a sample of individuals is taken and a subsection of their genome is sequenced. Because alleles of the MHC gene differ only by small amounts, we can selectively sequence just those genes from each individual. To read the DNA quickly and cheaply, new sequencing technologies referred to as “next-gen sequencing” (which raises the question of what we plan to call the more advanced techniques that have yet to be invented) are used. Rather than slowly reading each gene one base pair at a time, these methods rely on reading a gene several times somewhat more sloppily. This results in several reads of the same genes, most of which have gaps or slightly different base pairs. Putting this jigsaw back together so that the gene which has been read can be viewed as one whole, definitive sequence is not easy, and it is made more complicated by the fact that each set of reads is done not on a gene-by-gene basis, but on an individual-by-individual basis. What this means is that each individual, who has two or more MHC genes in their genome, ends up with one combined set of data. Now, instead of putting together one jigsaw puzzle, there are multiple puzzles mixed together, and you don’t know how many.

So how does one determine not only how many alleles are in an individual, but also the exact sequence of each of those alleles, from a vast pool of data (in this case, around 130 reads)? First, the sequences need to be aligned. This is done pairwise, with each sequence being compared to every other sequence and aligned with it such that similar sections overlap.

Say one is given 3 sequences, as shown below:

AGATTAGC

TTAGCTAT

GATGAGAT

Comparing the first and second sequences, we see that they overlap like so: AGA(TTAGC)TAT

The first and third sequences also overlap: GATG(AGAT)TAGC

Meaning we can line the sequences up like so:

____AGATTAGC___

_______TTAGCTAT

GATGAGAT_______

This is of course a highly simplified version of what is actually done, as over 100 sequences are being compared. Once the sequences are aligned, one can calculate a distance matrix, which is essentially a measure of how similar each sequence is to every other sequence. Given this matrix, one can then begin to plot a graph of where each sequence sits in relation to every other sequence in terms of similarity. This distance plot can then be used to determine groups in the data, discarding junk data and ultimately determining the alleles.
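
With only three toy reads, the distance matrix can even be computed by hand, or with a few lines of base R like the sketch below, where the distance between two reads is simply the fraction of mismatching bases over the positions they both cover. Real tools (ape’s dist.dna, for example) use more sophisticated distance models, so this is just to show the idea.

aln <- c(seq1 = "____AGATTAGC___",
         seq2 = "_______TTAGCTAT",
         seq3 = "GATGAGAT_______")

pair_dist <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  overlap <- x != "_" & y != "_"       # positions covered by both reads
  if (!any(overlap)) return(NA)        # no shared positions, no distance
  mean(x[overlap] != y[overlap])       # fraction of mismatched bases
}

outer(aln, aln, Vectorize(pair_dist))  # the full pairwise distance matrix

For these three reads every overlapping position happens to agree, so all the distances come out as zero; with a hundred or more real reads the values spread out, and that spread is exactly what the grouping methods work on.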

There are two different approaches to this grouping. The first is hierarchical clustering, which relies on the creation of a tree and elimination of branches which stray too far from the rest of the tree as junk data, ultimately leaving only a few nodes. The second is through the use of self-organizing maps, which create nodes that are then progressively drawn towards each point (which represents a sequence) on the graph. New nodes are created until the graph stabilizes, with each node representing a group of sequences, and thus an allele. These approaches will be explored in more detail in the next entry.