Research Projects

I first became involved in science research in 10th grade after taking Dr. Truglio's "Research Methods" class.

Now, just two short years later, I have already completed two full-scale projects.

I love doing research for so many reasons. Not only is it super fun to do, but it also has a purpose in the world — to advance the human race. Which, if you ask me, is a pretty cool thing. And it's why I devote so much of my time to it when I'm not busy doing robotics or schoolwork.

Below are the major research projects that I have completed.

Creation of a Machine Learning App to Facilitate Pancreatic Cancer Prediction

January 2020 - June 2021 (may resume in college)

I am very proud of this project, and I plan to continue it in college and beyond.

It all started the summer before my junior year of high school. While looking for a research project to conduct, I found out that the grandfather of one of my good friends had passed away from pancreatic cancer. Because he had received a late diagnosis, he did not survive for long. I had been thinking of conducting a machine learning project, and after hearing that, I knew what I wanted to do: use machine learning for the early diagnosis of pancreatic cancer.

Sounds ambitious, right? It was for me at the time, since I only had some experience with machine learning. I started by looking for a dataset to use (because hey, that’s the most important thing) and found one that was public and free for anyone to use: the IPUMS dataset. I researched variables that were risk factors for pancreatic cancer and made sure to include them in the dataset. That was the easy part. The harder part was actually making these variables work with my model. I won’t get into the details here since they are fairly complicated, but essentially the models that I was using needed a clean, balanced dataset. That meant it needed to be free of missing values and relatively balanced (a roughly equal number of pancreatic cancer and non-pancreatic cancer cases). This took months of revisions, reworkings, re-codings, and developing methods to artificially up-sample the dataset, but eventually I was able to develop a reasonably clean dataset that was ready to feed into the models.
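For readers curious what "clean and balanced" can look like in practice, here is a minimal Python sketch. It is not the code from this project: it assumes the simplest possible approach (dropping rows with missing values, then randomly duplicating minority-class rows), and the `has_cancer` field name and both function names are hypothetical.

```python
import random


def drop_incomplete(rows):
    """Keep only rows in which every field has a value (no None entries)."""
    return [r for r in rows if all(v is not None for v in r.values())]


def upsample_minority(rows, label_key="has_cancer", seed=0):
    """Randomly duplicate minority-class rows until both classes are the same size.

    `label_key` is a hypothetical column name; labels are assumed to be 0 or 1.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to balance the data")

    # Sort the two groups by size so the smaller one is the minority.
    minority, majority = sorted((positives, negatives), key=len)

    # Sample with replacement from the minority class to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]

    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced
```

One caveat with this naive approach: up-sampling should only be applied to the training split, since duplicated rows that leak into the test split make the results look better than they are.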

Now, let’s talk about the models. I essentially had to teach myself how to train an Artificial Neural Network and a Logistic Regression model using TensorFlow in Python. Using YouTube, TowardsDataScience, StackOverflow, TensorFlow’s own tutorials, and many, many, many other websites and resources, I was finally able to run my models for the first time. Comparatively, that was the easy part. The harder part was actually getting the models to perform well. This took another month or so of testing every single possible combination of batch size, activation function, pre-processing technique, feature column, number of layers, number of neurons in each layer, early-stopping method, and all other hyperparameters in order to get the best result. (It went quicker because this part occurred during the COVID-19 pandemic, so I had more time on my hands.)
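Trying every combination of settings is usually called a grid search. The sketch below shows the idea in plain Python; it is an illustration rather than the code used here. The values in `SEARCH_SPACE` are made up, and `score_fn` is a hypothetical stand-in for "train a model with this configuration and return its validation score".

```python
from itertools import product

# Hypothetical search space; the real project explored more settings than this.
SEARCH_SPACE = {
    "batch_size": [32, 64, 128],
    "activation": ["relu", "tanh"],
    "num_layers": [1, 2, 3],
    "units_per_layer": [16, 32],
}


def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(space, score_fn):
    """Score every configuration and return the best (config, score) pair.

    `score_fn` takes a config dict and returns a number where higher is better.
    Ties keep the configuration that was seen first.
    """
    best_config, best_score = None, float("-inf")
    for config in iter_configs(space):
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

The number of combinations is the product of the list lengths (3 × 2 × 3 × 2 = 36 for the space above), which is why an exhaustive search gets slow quickly as more settings are added.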

In the end though, after many months of work, I had a working product (if you would like to learn about every specific step in my project, you may email me and I would be more than happy to discuss) — I was finally done!

…Ehhhh no I wasn’t. I really wanted to improve this project significantly during the summer before 12th grade. The project was far from perfect, mainly with regard to my dataset, and it didn’t offer a practical way for patients to actually use the model. So, to solve those problems, I had to set out on two new paths – requesting better, additional datasets and embedding the model within a mobile app.

The former was the harder of the two. I sent out 20 or so emails to professors and organizations around the country, and only two replied, both saying no. Yet one afternoon, I received an email back from an organization called PanCAN, saying that they would be able to send me a dataset. I had to fill out a TON of paperwork for them, but I was able to obtain the dataset nonetheless. In addition, I applied for a dataset from PLCO, and that application was successful as well. This was a very time-consuming process (I guess I had not realized how much logistics would be involved), but I am satisfied with my progress. I think I have a pretty decent dataset now.

The second part was much more fun. I had created a few apps before (see the apps/websites page on my website), and I was excited to develop the new models in CreateML and place them in an app using CoreML (I had originally planned to convert the TensorFlow models to CoreML using a tool called coremltools, but it did not work for my scenarios). I won’t go into too much detail here, but I created Decision Tree, Random Forest, Boosted Tree, Logistic Regression, and Support Vector Machine models, which were relatively easy to build. All I needed to do was some intuitive UI design and to work around the quirks of CreateML (for instance, I had to write a program in Python just to parse results from CreateML in order to calculate an AUC value). In addition, I investigated the effects of including depression information in the models, as well as the importance of the risk factors. As of November 2020, I have completed the project and submitted it to the Regeneron STS Competition. If you would like to see my findings, read my paper below!
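For anyone wondering what calculating an AUC value involves, here is a minimal Python sketch. It is not the parsing program mentioned above and makes no assumption about CreateML's output format: it simply assumes the true labels (0 or 1) and the predicted probabilities have already been extracted into two lists. The function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed by pairwise comparison.

    AUC equals the probability that a randomly chosen positive case gets a
    higher score than a randomly chosen negative case (ties count as half).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ranked correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise version is quadratic in the number of cases, which is fine for small test sets; a sort-based version is the better choice for large ones.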

- Final Paper


Effect of Blue Light on Planarian Sleep

January 2019 - May 2019

This was my first science research project and was part of my introductory research class.

We all had to choose our own projects, and I came up with this one because I knew about the effects of blue light on sleep and thought it would be cool to investigate its effects on planarians, which were among the only organisms we were allowed to use.

This was definitely a fun project. I worked with a classmate named Irene, and each day we would take care of the planarians by feeding them and changing their water. In order to track their movement, or "sleep", I came up with the idea to film them during the night after they had been exposed to either blue or red light for 30 minutes.

We each built our own separate contraption at home to conduct this project and we even took planarians home with us each week to conduct the experiment!

Afterward, I took the videos we recorded and ran them through a program called WormLab, which calculated the planarians' velocity during the time interval. It was a very time-consuming process, as I needed to process about 30 recordings of 30 minutes of footage each.
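To give a sense of what a velocity measurement from video boils down to, here is a small Python sketch. It is not how WormLab works internally (that is not described here); it just assumes a tracker has already produced one (x, y) position per sampled frame, and the function name and units are hypothetical.

```python
import math


def mean_speed(positions, frame_interval_s):
    """Average speed along a tracked path.

    `positions` is a list of (x, y) points, one per sampled frame, and
    `frame_interval_s` is the time between consecutive samples in seconds.
    The result is in position units per second.
    """
    if len(positions) < 2:
        raise ValueError("need at least two positions")
    if frame_interval_s <= 0:
        raise ValueError("frame interval must be positive")

    # Total distance travelled is the sum of the straight-line steps.
    path_length = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    elapsed = frame_interval_s * (len(positions) - 1)
    return path_length / elapsed
```

A lower average speed over the recording would then be read as less movement, which this project used as its stand-in for "sleep".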

We did have a few problems along the way. We ran a trial during the April break; however, all of our planarians died during that break! In addition, we had originally planned to use a program called "ptracker" to measure the velocity of the planarians, but unfortunately, it did not work as expected. So we scrambled at the last minute and found WormLab, which worked very well.

Overall, this was a simple yet fun first project for me. It will always be in my memories, and it was a great first dive into the world of research.

You may view our poster board here (I apologize for the overly blue color scheme):