“Less Mosquito, More Virus”
Some of you may be wondering what I did this summer. I don’t know why you would be wondering that unless you’re a close friend of mine, as I don’t expect many strangers wake up one day and think, “Hm, I wonder what Andy Sima is doing?” Well, fear not, because much like last summer, I’m going to tell you anyway. But unlike last summer, it wasn’t nearly as exciting as Philmont. Partially that’s due to Philmont being closed because of COVID. But partially that’s also due to me staying in Illinois this year. And boy, were things different in more than just scenery.
In the summer of 2020, instead of being out hiking around mountains and teaching children how to remove fishing hooks from their fingers, I was inside on my computer coding a machine to tell me how many cases of West Nile Virus it thought there would be in any given United States county this year. No, I did not mistype that; West Nile Virus. Not Coronavirus. Fun fact, there have been more confirmed cases of COVID-19 in the last two months than there have been confirmed cases of West Nile Virus in the twenty years it’s existed within the USA. It was part of my job this summer to figure out how many there might be in the next years. The answer, according to my desktop computer: absolute jack squat.
The true answer is probably closer to a few dozen, or a couple hundred if this year is particularly bad, but that’s less important than the fact that predicting that shit is actually pretty damn hard. Let me explain. This summer I had the genuine pleasure of working for the National Center for Supercomputing Applications (NCSA) at the University of Illinois as a SPIN Intern, or a Student Pushing INnovation Intern. A Spintern, if you will. I was really nervous at first, as I had applied to the job on a whim after seeing an ad for it posted on some university bulletin board. I had some python modeling experience before, as I think I’ve mentioned previously, but not enough to merit having access to a computer with enough power to explode my own computer. Or so I thought, anyway; they must have seen something in me, or had rather low expectations, because I was hired anyway as a researcher to do work, as they put it, “Resolving health disparities by using advanced statistics on complex multidimensional datasets.” I’m still working for them and I don’t know what that means.
It was a really great summer job, honestly, and I quite enjoyed working with my coworkers and with the other staff at the NCSA. The hours were/are flexible, the pay was/is good (especially considering that I could have had an unpaid internship, but I’m glad I didn’t because unpaid internships are borderline exploitation and super problematic), and I got/am getting a lot of great experience both in a professional, scientific workplace and with computer coding. I was originally only hired for the summer session, but after a couple weeks of initial dread and terror, I realized that I enjoyed it so much that I wanted to continue working into the academic school year. I thought for sure I would hate it, but then I came out loving it. That’s either Stockholm Syndrome or a pretty great experience; I’m gonna go with the latter.
As great as the job was, though, it was always kind of a hassle trying to explain what, exactly, I did. I only ever used the supercomputer a few times, so while it was pretty cool that I got to say I worked for the Supercomputer People, my job itself was less crunching giant numbers on a massive processor and more crunching slightly smaller numbers on my own personal processor and hoping that I put my colons in the right place. It was a lot of coding, a lot of iterating over different versions of code, a lot of reading about mosquitoes, a lot of using GIS maps to figure out where the mosquitoes might be, and at one point, a lot of looking at pictures of houses and telling a computer program called YOLO how old the house was. In the end, it was really less about the mosquitoes and the virus and more about how many different variables we can feed python until it farts out the answer we want. But don’t get me wrong, it’s actually very interesting. If you like numbers.
Technically, I was a part of two different projects. Both had to do with predicting future outbreaks of West Nile Virus, but on different scales. West Nile Virus is, by the way, a mosquito-borne illness that usually causes no symptoms, sometimes gives you a fever, and if you’re really unlucky, sometimes makes your brain swell until you die. Don’t worry, it’s pretty easy not to get if you use bug spray and avoid mosquito-heavy areas, and you’re only really at risk if you’re older. There have been a couple major outbreaks in Chicago, a few in Texas, and some others around the country. None of those outbreaks topped a thousand major (or “neuroinvasive”) cases, and I don’t think any of them (luckily) went over 100 deaths. So a pretty small disease overall. But, as COVID has shown, even small outbreaks can easily overwhelm hospitals. So it’s nice to know where the next outbreaks of any illness might be. Enter the CDC.
The Centers for Disease Control and Prevention runs a program that collects different models for disease prediction from any team that wants to participate, and one of the diseases they model for is West Nile Virus. These aren’t like string and tape models, but mathematical models that use machine learning and regression techniques to predict the future based on the past. So, one of the projects I worked on was stripping down and rebuilding a previous iteration of a model that predicted the number of cases for every county in the United States, based on the previous number of cases for every county. I ended up becoming the project “expert” in oversampling and undersampling, as I ran what felt like hundreds but was really just a couple dozen tests on changing the size of the sample data. Turns out that nothing helped. There was nothing I could really change, not with my limited ability, to improve the accuracy of our model. Unfortunately, that meant that our model was predicting zero cases in every county with its own stupid 99.99% accuracy. However, I did manage to lower the model’s accuracy to a spicy 0%, so I was impressed with that.
Ultimately, this didn’t really lead anywhere, but we did learn a couple of valuable things: predicting future outbreaks based only on yearly case numbers isn’t very effective, oversampling and undersampling don’t really help much in this case even though 35,000 samples were just zero, and even including weather patterns didn’t make much difference in model accuracy. Everything just told us there will probably be zero cases next year. Which, while blindly and stupidly hopeful, is almost certainly wrong. But hey, for the next iteration of the model, we know now to take a different approach, and maybe we’ll get it right next time! That’s science!
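For the curious, here is a minimal sketch of why that shiny accuracy number is so misleading, and what random oversampling does to the data. This is not our actual model or data. The 35,000 zero rows come from the story above, but the 35 nonzero rows, the 0/1 labels (standing in for "no cases" versus "some cases"), and every function name are assumptions made up purely for illustration.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def recall_nonzero(y_true, y_pred):
    """Of the rows that really had cases, the fraction predicted to have any."""
    positives = [(t, p) for t, p in zip(y_true, y_pred) if t > 0]
    return sum(p > 0 for _, p in positives) / len(positives)


def oversample_minority(rows, seed=0):
    """Randomly duplicate nonzero rows until they match the zero rows in number."""
    rng = random.Random(seed)
    zeros = [r for r in rows if r[1] == 0]
    nonzeros = [r for r in rows if r[1] > 0]
    extra = [rng.choice(nonzeros) for _ in range(len(zeros) - len(nonzeros))]
    return zeros + nonzeros + extra


# Hypothetical data: 35,000 rows with zero cases, 35 rows with cases.
labels = [0] * 35_000 + [1] * 35

# A "model" that predicts zero everywhere.
always_zero = [0] * len(labels)

acc = accuracy(labels, always_zero)
rec = recall_nonzero(labels, always_zero)
print(f"accuracy: {acc:.4f}")
print(f"recall on rows with cases: {rec:.4f}")

# Rows are (row_id, label) pairs; oversampling only copies existing rows.
rows = list(enumerate(labels))
balanced = oversample_minority(rows)
n_nonzero = sum(1 for _, y in balanced if y > 0)
print(f"balanced set: {len(balanced)} rows, {n_nonzero} with cases")
```

The always-zero model scores very high on accuracy while catching none of the rows that matter, which is why accuracy alone is a poor yardstick here. Oversampling balances the row counts, but it only repeats the same handful of examples, so it adds no new information, which may be part of why it did not rescue our model.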
The second project I worked on was a bit more up my alley. It was looking at, again, West Nile Virus prediction, but instead of every county, it was just predicting for two: Cook County and DuPage County, or as they’re more accurately known, Chicago and Naperville. And wouldn’t you know it, I’m from DuPage County! And this project needed GIS software to help predict the mosquitoes! It’s like it was made for me! Even though it wasn’t!
Anyway, the goal of this project was to help fill out some gaps in an ongoing manuscript from researchers at the U of I. My partner and I were looking at data on the scale of a few meters to find a way to help predict mosquitoes and West Nile Virus in the two-county area. Odd as it may seem, a recurring trend in West Nile Virus statistics in Chicagoland is that the cases are more likely to appear in areas where the houses are older and closer together, but also with some green space. Anyone from the Chicago area will tell you that these circumstances crop up pretty much anywhere you can find the Chicago Bungalow, or those squat little brick houses that form a weird semi-suburban ring around downtown Chicago. And knowing this, I thought to myself, hey, wouldn’t it be cool if you could figure out where the houses are old based on the shape of their roof as seen from a satellite image? Turns out, you absolutely can.
So that’s what we did; I convinced my mentor and work partner that we could help predict West Nile Virus risk by adding the variable of house age into the existing framework of variables. Since old houses are more likely to have cases, it seems to make sense. Why do the old houses have more cases? I’m not entirely clear, but I think it’s a combination of the fact that the houses are more likely to flood the older they get (and standing water is great for mosquitoes), and older houses tend to have older people. Some cross-section of the two probably relates house age to West Nile Virus risk, but I couldn’t tell you for sure. But the numbers don’t lie.
Initially, doing this for Cook County was pretty easy; we just had to make a map of where all the houses were and how old they were. The Cook County Tax Assessor has all this data available for free. But we ran into a roadblock: DuPage County, clearly the less bureaucratically transparent of the two, did not have this data readily available en masse. So that’s where the satellite images came into play. Through some sort of Google magic I barely understand, my coworker was able to download the entirety of DuPage County’s satellite imagery from Google Maps, and then feed it into a machine learning program that we had trained, and it returned a file containing the program’s prediction for the location and age of every house in DuPage County. Which we then promptly used to learn more about mosquitoes. Cook County was much easier, but it just goes to show: sometimes the steps of progress are many where they should be few.
So in our quest to predict who’s going to be patient zero of the next West Nile Virus outbreak, we ended up with an artificial intelligence that knows every possible detail about the shape and color of suburban roofing. It’s a little convoluted, but it’s a pretty useful tool if we want to take this study to anywhere else that doesn’t have housing age data readily available. How useful is this tool? How much did it help the overall prediction of West Nile Virus? I don’t know yet; I still haven’t quite finished up the mapping part. But I guess that’s why I’m still working for them. In part because I enjoy the job, and also in part because, like a vengeful ghost, I have unfinished business and will not stop haunting the trash cans and filing cabinets until I get it done. Hopefully it will contribute something. I’d like to think that all the work my coworker and I have done will help, in some small way, to save lives and prepare hospitals for the future of West Nile Virus. And maybe it will, but only time will tell. Still, it was all pretty fun. And I got to flex my brain muscles. (Is that arrogant to say?)
It was, overall, a pretty computer-intensive summer, with a great deal of mucking about in python help pages, machine learning YouTube videos, Google maps, and lines beyond lines of gibberish numbers that only make sense in very specific contexts. But I feel like I learned a lot about my abilities in computer programming and scientific research, and for that, the experience was/is invaluable. So even though I made less money than I did at Philmont, and even though I climbed exponentially fewer mountains, and even though the entire world seems to be burning down around us, at least I got some great opportunities out of it. Oh, and I also got to make what is probably one of my favorite maps to date. Thank you.