Once the distance matrix has been calculated, comparing each individual sequence against every other sequence to determine how similar they are, a graph may be constructed. Self-organizing maps are a type of artificial neural network that uses unsupervised learning to produce a result. While they can be applied to many kinds of problems, here they are being used to help group the sequences. At a basic level, the computer creates a virtual node and places it at a random position on the graph. The computer then chooses a single data point, and the artificial node is tugged towards that point, with the exact amount of tugging determined by the programmer and called the learning rate. This process is repeated hundreds or even thousands of times, until the node converges to a position. The programmer can take a few different approaches to this, either declaring a set number of nodes to create, or telling the computer to continue creating nodes until a condition is met.
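As a rough sketch of that single update step, assuming the reads have already been placed as 2-dimensional points (the function and variable names here are purely illustrative, not part of any particular library):

```python
import numpy as np

def tug_node(node, point, learning_rate):
    """Move the node part of the way toward a chosen data point.

    A learning rate of 0 leaves the node in place; a rate of 1 moves it
    all the way onto the point.
    """
    return node + learning_rate * (point - node)

# One tug: a node at the origin is pulled 10% of the way toward a point.
node = np.array([0.0, 0.0])
point = np.array([4.0, 2.0])
node = tug_node(node, point, learning_rate=0.1)  # node is now at (0.4, 0.2)
```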
While this may at first seem strange and unintuitive, it may be thought of in another way. On our 2-dimensional graph of the reads, certain sections of the graph have higher concentrations of points than others. When an artificial node enters one of these dense regions, it will often become “stuck,” because any movement towards a tug from a different point would make it a less accurate fit to the data. It is as if the node has become caught in a hole and cannot escape. In a way, it has, for the node has encountered something called a “local minimum.” This can be imagined by thinking of a flat sheet, which is then overlaid by the data we are using. Each data point has a weight to it and creates a dent in the sheet, meaning that larger groups create larger dents. If a node is tugged on by a point, but moving the node would make it fit the data less well, it will not move. Or, continuing with the metaphor, the tug is not strong enough to get the node up out of the hole. Because of this, it is important to set the strength of the tug, or learning rate, properly. If the tug is too strong, the node will never settle down properly, while if the tug is too weak, the node will become “stuck” too easily. When the computer senses that the node has become stuck, it generates another node in a random position and allows it to go through the same process. Because there are likely several groups of concentrated data, there are other local minima in which the new nodes may become caught.
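Putting the pieces together, the whole settling process might look something like the sketch below. The convergence check and the decision of when to spawn a new node are simplified here, and all of the numbers are placeholders rather than the actual settings used:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def settle_node(points, learning_rate=0.05, max_steps=5000, tol=1e-3):
    """Drop a node at a random spot and tug it toward randomly chosen
    data points until it stops moving appreciably (it is "stuck")."""
    node = rng.uniform(points.min(axis=0), points.max(axis=0))
    moves = []
    for _ in range(max_steps):
        point = points[rng.integers(len(points))]
        new_node = node + learning_rate * (point - node)
        moves.append(np.linalg.norm(new_node - node))
        node = new_node
        # If the node has barely moved over its last 50 tugs, call it stuck.
        if len(moves) >= 50 and np.mean(moves[-50:]) < tol:
            break
    return node

def settle_many(points, n_nodes=10):
    """Settle several nodes in turn; nodes that land in the same dense
    region have found the same local minimum."""
    return np.array([settle_node(points) for _ in range(n_nodes)])
```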
Once all of our nodes have been caught, or once new nodes begin to overlap and find the same local minima, the process is complete. Groups are then created based on where the nodes have ended up. If the nodes settle in seven different places, there are seven different groups of data, and in this case, seven different versions of the MHC gene in that individual fish.
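Turning the settled node positions into groups can be as simple as assigning every read to its nearest node, something like the following (again just a sketch with made-up names):

```python
import numpy as np

def group_by_nearest_node(points, nodes):
    """Give each data point the index of the settled node it sits closest to.

    Points that share a nearest node form one group, i.e. one putative
    version of the gene.
    """
    # Distance from every point to every node, shape (n_points, n_nodes).
    distances = np.linalg.norm(points[:, None, :] - nodes[None, :, :], axis=2)
    return distances.argmin(axis=1)
```

Seven distinct surviving nodes then translate directly into seven groups of reads.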
While self-organizing maps are an effective way of grouping data, they can prove quite sensitive to the chosen learning rate. Because of this, and because of my inexperience with machine learning, I used this method purely as a backup to another, tried-and-true method: hierarchical clustering.
Hierarchical clustering can be agglomerative, if it starts with small groups and builds up to the whole, or divisive, if it starts with one massive cluster of data and progressively breaks it down into smaller clusters. Essentially, in divisive clustering the computer tries to create two groups out of the cluster, and then calculates the distance between the two groups using one of several possible distance measures. Unfortunately, this problem is what is known as NP-hard, short for non-deterministic polynomial-time hard. In plain English, this means that the time it takes to solve the problem grows extremely rapidly as the number of items involved increases. In this case, the difficulty is the number of possible ways the data may be split when trying to find the two groups with the greatest distance between them. The computer needs to go through every possible assignment of points to the two groups, and there are 2^n of those, where n is the number of data points. Assuming we have a mere 10 data points, this means 2^10, or 1,024, possible combinations. If we increase this to 30 data points, there are suddenly 1,073,741,824 possible combinations. Even with only the 100-150 points of data we have in this experiment, this problem is not solvable in any reasonable amount of time; in fact, the time required may exceed the age of the universe. Because of this, we need to do some trickery to make this a more feasible problem.
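A few lines of Python make the explosion obvious (illustrative only):

```python
# The number of ways to assign n data points to two groups grows as 2**n.
for n in (10, 30, 100):
    print(f"{n} points -> {2**n:,} possible assignments")

# Prints 1,024 for 10 points, 1,073,741,824 for 30 points, and a
# 31-digit number (roughly 1.3 * 10**30) for 100 points.
```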
If we instead use agglomerative clustering, the problem becomes much more tractable, and for some special cases can be reduced to n^2, where n is the number of data points. Having already calculated the distance matrix, it is fairly easy to begin: we take the two points that are most closely related and join them together in the tree. This continues, going up the tree until all of the groups can be traced back to a single root. If one thinks of it as a family tree, it would be like starting with yourself at the top and tracing back to your parents, then your grandparents, then your great-grandparents, and so on. Now that we have a complete tree, we use a further bit of trickery to trim out excess data. First, we trim off branches that contain fewer than 4% of the total number of data points. We can safely say that any group containing so little data is unlikely to represent a real variant of the gene and is more probably erroneous data or too narrow a slice to be useful. Then, we narrow the data further by saying that any split where the distance between the two clusters is less than 0.2 can also be ignored. This narrows our tree down to a few groups, all of which are considered putative alleles, possible variants of our gene of interest. From each group, we can take all of its sequences and generate a consensus allele, essentially the idealized version of all the sequences in that group. This consensus allele is the gene variant from which all of the sequences came; the individual reads differ slightly from it because of the fast-and-dirty way that next-generation sequencing works.
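A rough sketch of how these steps might look in Python, using SciPy's agglomerative clustering on the precomputed distance matrix, is shown below. The cut-off logic is simplified, the reads are assumed to be already aligned to equal length, and none of the names refer to the actual pipeline:

```python
from collections import Counter

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def putative_alleles(dist_matrix, sequences, min_frac=0.04, cut_height=0.2):
    """Cluster the reads from one individual and return one consensus per group.

    dist_matrix : square matrix of pairwise distances between reads
    sequences   : aligned reads (equal length), in the same order as the matrix
    """
    # Build the tree from the precomputed distances (agglomerative clustering).
    tree = linkage(squareform(dist_matrix), method="average")

    # Cut the tree at the chosen height: splits closer than 0.2 are ignored.
    labels = fcluster(tree, t=cut_height, criterion="distance")

    consensuses = []
    for label in np.unique(labels):
        members = [seq for seq, lab in zip(sequences, labels) if lab == label]
        # Drop groups holding fewer than 4% of the reads; too small to be a real variant.
        if len(members) < min_frac * len(sequences):
            continue
        # Consensus: the most common base at each position across the group's reads.
        consensus = "".join(Counter(col).most_common(1)[0][0] for col in zip(*members))
        consensuses.append(consensus)
    return consensuses
```

Average linkage is used here only as a placeholder; the choice of linkage method plays the same role as the choice of distance measure mentioned above.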
Now that we have the gene variants from one individual, we need to repeat the same process for every other individual we have gathered. Thank goodness for automation.