
Creation of a Machine Learning App to Facilitate Pancreatic Cancer Prediction

Jan 2020 - Jun 2021
It all started the summer before junior year in high school. While looking for a research project to conduct, I had found out that one of my good friend’s grandfathers had passed away due to pancreatic cancer. Given that he had received a late diagnosis, he was unable to survive for much time. I had been thinking of conducting a machine learning project, and after hearing that, I knew what I wanted to do — use machine learning for the early diagnosis of pancreatic cancer.

Sounds ambitious, right? It was for me at the time, since I only had some experience with machine learning. I started by looking for a dataset to use (because hey, that’s the most important thing) and found one that was public and free for anyone to use — the IPUMS dataset. I researched variables that are risk factors for pancreatic cancer and made sure to include them in the dataset. That was the easy part. The harder part was actually making these variables work with my models. I won’t get into the details here since they are fairly complicated, but essentially the models I was using needed a clean, balanced dataset: one free of missing values, with a roughly equal ratio of pancreatic cancer to non-pancreatic cancer cases. This took months of revisions, reworkings, re-codings, and developing methods to artificially up-sample the dataset, but eventually I had a dataset that was clean and balanced enough to feed into the models.
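To give a flavor of the balancing step: below is a minimal sketch of random up-sampling, where minority-class rows are duplicated until the classes are equal in size. The column names and toy data are made up for illustration — my actual up-sampling methods and dataset were more involved.

```python
import random

def balance_by_oversampling(rows, label_key="has_cancer", seed=0):
    """Randomly duplicate minority-class rows until both classes are
    equal in size. A simple stand-in for the up-sampling step; the
    real pipeline also handled missing values and other cleaning."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[label_key]]
    neg = [r for r in rows if not r[label_key]]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Sample (with replacement) enough extra minority rows to match.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    balanced = rows + extra
    rng.shuffle(balanced)
    return balanced

# Toy example: 2 pancreatic cancer cases vs 6 non-cases.
data = [{"age": 60 + i, "has_cancer": i < 2} for i in range(8)]
balanced = balance_by_oversampling(data)
print(sum(r["has_cancer"] for r in balanced), len(balanced))  # 6 12
```

After balancing, the toy set has an equal number of cases and non-cases, which is what the models downstream expect.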

Now, let's talk about the models. I essentially had to teach myself how to train an Artificial Neural Network and a Logistic Regression model using TensorFlow in Python. Using YouTube, TowardsDataScience, StackOverflow, TensorFlow’s own tutorials, and many, many, many other websites and resources, I was finally able to run my models for the first time. Comparatively, that was the easy part. The harder part was actually getting the models to perform well. This took another month or so (it went quicker because this part occurred during the COVID-19 pandemic, so I had more time on my hands) of testing every possible combination of batch size, activation function, pre-processing technique, feature column, number of layers, number of neurons per layer, early-stopping method, and other hyperparameters to get the best result.
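The "every possible combination" part is just an exhaustive grid search. Here is a small sketch of that idea, with a dummy scoring function standing in for actually fitting the TensorFlow model and returning a validation metric; the grid values shown are illustrative, not the ones I actually searched.

```python
from itertools import product

# Hypothetical hyperparameter grid (illustrative values only).
grid = {
    "batch_size": [16, 32, 64],
    "activation": ["relu", "tanh"],
    "hidden_layers": [1, 2, 3],
    "neurons": [8, 16, 32],
}

def run_grid_search(grid, train_and_eval):
    """Try every combination in the grid and keep the best-scoring one.
    `train_and_eval` stands in for training the model with the given
    hyperparameters and returning a validation score to maximize."""
    best_score, best_params = float("-inf"), None
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = train_and_eval(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Dummy scorer so the sketch runs end-to-end: it happens to prefer
# 2 hidden layers of 16 neurons. A real run would train the network here.
def dummy_eval(p):
    return -abs(p["hidden_layers"] - 2) - abs(p["neurons"] - 16) / 16

params, score = run_grid_search(grid, dummy_eval)
print(params["hidden_layers"], params["neurons"])  # 2 16
```

In practice each call to `train_and_eval` is expensive (a full training run), which is why this stage took a month even with extra time on my hands.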

In the end, though, after many months of work, I had a working product (if you would like to learn about any specific step of the project, feel free to email me — I would be more than happy to discuss). I was finally done!

…Ehhhh, no I wasn’t. I really wanted to improve this project significantly during the summer before 12th grade. The project was far from perfect, mainly with regard to my dataset, and it didn’t give patients a practical way to actually use the model. So, to solve those problems, I set out on two new paths – requesting better, additional datasets and embedding the model within a mobile app.

The former was the harder of the two. I sent out 20 or so emails to professors and organizations around the country and received responses from only two, both saying no. Then one afternoon, I got an email from an organization called PanCAN saying that they would be able to send me a dataset. I had to fill out a TON of paperwork for them, but I obtained the dataset nonetheless. In addition, I applied for a dataset from PLCO, which was granted. This was a very time-consuming process (I had not realized how much logistics would be involved), but I am satisfied with my progress: I think I have a pretty solid dataset now.

The second part was much more fun. I had created a few apps before (see the apps/websites page on my website), and I was excited to develop the new models in CreateML and embed them in an app using CoreML (I had originally planned to convert the TensorFlow models to CoreML using a tool called coremltools, but it did not work for my scenarios). I won’t go into too much detail here, but I created Decision Tree, Random Forest, Boosted Trees, Logistic Regression, and Support Vector Machine models, which weren't especially hard to build. All I needed was some intuitive UI design and a few workarounds for the quirks of CreateML (for instance, I had to write a Python program just to parse CreateML's results in order to calculate an AUC value). In addition, I investigated the effect of including depression information in the models, as well as the importance of the individual risk factors. As of November 2020, I have completed the project and submitted it to the Regeneron STS competition. If you would like to see my findings, read my paper below!
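As a taste of that AUC workaround: the sketch below parses a small CSV of per-row labels and predicted probabilities and computes ROC AUC via the rank-based (Mann-Whitney) formulation. The export format shown here is an assumption for illustration — the actual CreateML output I parsed looked different.

```python
import csv
import io

# Hypothetical export: one row per patient with the true label and the
# model's predicted probability (the real CreateML output differed).
exported = """label,probability
1,0.91
0,0.34
1,0.78
0,0.52
"""

def auc_from_csv(text):
    """Parse (label, probability) rows and return ROC AUC, computed as
    the probability that a random positive case is scored above a
    random negative case, with ties counting half (Mann-Whitney)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    labels = [int(r["label"]) for r in rows]
    scores = [float(r["probability"]) for r in rows]
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_from_csv(exported))  # 1.0 — every positive outranks every negative
```

The pairwise version is O(n²), which is fine for a small exported results file; it avoids having to build an explicit ROC curve just to get one summary number.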