This campaign is to raise money that will allow me to continue my research in genotype/phenotype mapping. For a very short primer, you can check out my poster in the link above... or for a much more in-depth view of the project, please continue to read.
What is the significance of this research? The end goal of this project is to complete a computer program which will be capable of finding the set of genes responsible for given phenotypic traits (including genetic diseases).
This program uses an algorithm which essentially improves upon the idea of a whole-genome association study. The program uses an evolutionary algorithm for supervised learning on whole genomes. The program takes two datasets; one is a list of all variants from whole genomes of people with a particular phenotype (such as a genetic disease). The other is also a list of all variants from whole genomes, but from members without the phenotype. From this information, the program is able to find the genes and variants responsible for the phenotype, even in situations where there are more than one genotype which corresponds to that phenotype.
Woah, what does that mean?
Ok, I'll back up. A whole genome is a set of all of the genetic information from one person. It contains all of the base pairs (fundamental genetic code) in one person's DNA. A variant is one variation of a gene. For instance, if a particular gene determines eye color, a particular variant of that gene will select green eyes over brown.
A phenotype is a physical of physiological characteristic of a person, plant, or animal. The genotype for that phenotype is the set of genes (and the particular variants of each of those genes) which creates this characteristic. Many phenotypes are not decided by just one gene, but multiple genes working in concert. Some genetic diseases, such as Multiple Sclerosis (MS), are due not just to one gene, but a complex interaction of multiple genes. In fact, it gets just a bit more complex... some diseases -- including MS -- can have more than one genotype! This means that there is more than one set of genes and gene variants which can cause the set of symptoms which are diagnosed as MS by a doctor.
Conventionally, to find genotypes for a particular phenotype, researchers perform something called a 'whole-genome association study'. To perform this kind of test, the sets of all variants of all genes for people with and without the phenotype are used. T-tests, a type of statistical tool, are used to determine the correlation factor between each variant and the phenotype.
So what's wrong with that, and how does my process improve upon the existing method? The existing method is statistically based, and only considers one variant individually at a time. It has difficulty when presented with situations in which more than one genotype exists for a phenotype. It also has other problems which affect its accuracy.
My process uses a computational method called an evolutionary algorithm (think neural nets, but different) to 'evolve' the set of variants which correspond to the phenotype. If more than one genotype exists, my process does not have the same difficulty finding the individual genotypes.
So far I have only used artificial data, and on partial genomes (thousands of base pairs, and tens of genes). To scale up to whole genomes (billions of base pairs and tens of thousands of genes), I need better equipment. Also, to get real genetic data, at least some of it will need to be paid for.
So far, on artificial data, my algorithm has had 100% accuracy in situations where only one genotype exists for a phenotype. On situations where more than one genotype exists, my method has been fully accurate 60% of the time, 24% of the time has had type-II error (missing genes in at least one genotype), and 16% of the time has had type-I error (extra genes in at least one genotype). However, I believe with a high level of certainty that I can eliminate the type-II error completely with your help. I believe that I already know how to accomplish this, but need funding to continue the project.
Although I so far have only been able to use artificial data, and on smaller scales than a whole-genome, I do not currently know of any reason why my method could not be scaled up to whole-genomes, with the same success rate on real as on artificial data.
The end results that I predict, even with a few thousand dollars of assistance, is to have a program which will be able to operate on whole-genomes of real people, and even in situations where more than one genotype exists, have at least 84% complete accuracy. The other 16% may still have type-I error (extra predicted gene variants in at least one genotype). But to put this in perspective for you, it is much better to have type-I than type-II error, because type-I error could be eliminated through other means, such as micro-array analysis. Type-I error can also be mitigated simply by having a larger dataset of persons with and without the phenotype of interest.
There is one more significant advantage to my method – I don’t need many observations. It is still relatively expensive to sequence genomes. In my tests with artificial genomes, I have been able to successfully isolate genotypes with as few as two phenotype-positive observations (persons with the disease). So, prospectively, it may only take sequenced genomes of only a few people with a disease to successfully isolate its genotype.
So... bottom line... what I believe this project will result in is a way (with micro-array analysis to help eliminate type-I error) to reliably find all genes and variants (and none extra) associated with any given phenotypic trait. This includes all genetic diseases. This is very exciting, because with that information, cures for previously 'incurable' diseases could be prospectively developed.
With accurate genomic maps provided by this project, medical professionals would be able to deliver to patients 'healthy genes' to replace ones which are causing the disease. Gene therapy, which uses virus vectors to deliver healthy genetic material, is already possible and has been performed to eliminate some types of leukemia.
The idea of gene therapy is to use an altered virus to 'infect' a subset of a patient's cells with healthy genetic material. Not all of the patients cells can be transformed, but the possibility may exist that if enough cells are transformed and begin producing the correct proteins, the symptoms of the disease can be alleviated to an arbitrary degree.
For this project, I need funding for:
- Better hardware – I plan to purchase a better computer and a Tesla GPU to speed processing of larger datasets.
- Some programming assistance – I currently do not know CUDA, which is necessary for the use of a GPU, so I plan to contract out that work to someone else.
- Data! – I can always use more money for data, and the more data the better. I plan to have several persons with MS to have their genomes sequenced for me. Each sequencing runs about $2000.
Post fundraising and completion of the remainder of these current technical goals, I will have a product capable of taking a sequenced set of genomes for a particular disease and finding the genotype for that disease. In addition to disease research, the technology could be used for various other genetic engineering applications. When I reach this stage, I plan to start a company which will use and lease the technology. For this I will eventually start a second round of fundraising.
Thank you very much for considering a donation to this project!!