For the very first time in our history, we can collect incredible amounts of data at scale. Modern data infrastructure enables the documentation and recording of billions of species, and data science allows us to collect and analyze that data and to predict and slow down the speed of extinction. Janet walks us through her extensive research and shares tips on infrastructure stacks and strategies for processing massive amounts of information.
Gretchen DeKnikker: All right. Welcome back everybody. We are here for the second section today. Darn it, I just cannot get the camera right. I’m going to turn into a millennial by the end of this, trying to get the right angle. Today we are recording these. You will be able to get access to them later. There is lots of chat activity going on if you want to hover over the chat button. Janet will have a Q and A session at the end of this, so use the Q and A button right there below. Hopefully you guys got some coffee and are ready, because this next talk is going to be amazing. Janet has 15 years of experience in big data, data science, working in labs, long before it was called big data, I’m sure, Janet has been rocking this. And she’s currently the Chief Data Scientist at Western Digital, here to talk to us today about data and climate change. Without further ado, Janet, please.
Janet George: Thank you. I’m going to start sharing my screen here in one minute. All right. Okay. Can you see my screen? Can everybody see my screen?
Gretchen DeKnikker: We can.
Janet George: Okay. Very good. I’m going to get started. I wanted to start out with a little bit of background about myself and how I came to be interested in climate change. Background, as she mentioned, I’m currently with Western Digital. I’ve also worked with companies like Apple and eBay and Yahoo in prior lifetimes. My educational background, I have a bachelor’s and a master’s degree in computer science with a focus on distributed computing, parallel processing, and specifically cognitive computing. My specialization is in artificial intelligence. I do a lot of stuff with CNNs, convolutional neural networks, RNNs, which are recurrent neural networks, and also DNNs, which are deep neural networks, which have gained a lot of traction.
Janet George: How I came to do work around climate is related to my passion. And as some of you know, my passion is nature and sustainable ecosystems. I am a strong believer that we should leave the Earth better than we found it. I’m very interested in oceans, lakes, biodiversity, and really the preservation of natural habitats as we finish our journey on this Earth. With that, I’m actually going to talk a little bit about climate, climate data, and climate change.
Janet George: One of the questions, and I’m going to go through topics. How do we collect, normalize and parse data at the scale at which this data is available? And what data is really available around climate change? The data around climate change, we’ve got a lot of data around climate change. We have data that’s sitting in multiple data banks. This is historical data. We also have data that is found with the USGS. We have new data that’s coming up with sensors that are buried in the ground to watch our insects, to watch our birds, and to watch our plant population. We have lots of weather data from satellites and weather stations. We have atmospheric data, CO2 levels rising, heat waves, and things like that. We have very accurate data around sea level rise and precipitation and frost.
Janet George: More recently, we’ve also been getting data from National Geographic and other image data, which is actually quite new for us. This data comes from photographs that have been taken all around the world. And this gives us a very good idea about our shrinking glaciers. And the goal is to bring all this data together, so what I tend to do is we write agents, ingest agents, that can ingest this data into some sort of a data lake. And later, I’ll talk about what the size of this data lake should be and how large or small it should be. But the goal is to start uniting this data, because when you take data and you look at one dimension of the data, for example, if you’re only looking at insects, you may not get the whole picture.
Janet George: We want to get a full 360 degree view of our data with respect to insects, plants, heat waves, sea level rise, what’s happening to all the different parts. And so the goal is to then bring this data into the data lake, so we have a unified mechanism to actually start looking at this data. And the data lake actually is an object store, so it’s a scalable object store. Once you build this data lake, you can just keep adding more nodes to it. Another advantage of keeping the lake separate from compute is the ability to allow storage to grow indefinitely and compute to grow indefinitely as the scale of data becomes much larger. And usually we start with a small scale, and then we can grow up to as big a scale as we want.
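The ingest-agent idea can be pictured with a minimal sketch. It assumes a local directory stands in for the object store, and the `ingest` function, the source names, and the sidecar metadata layout are hypothetical illustrations, not the stack used in the talk. Each payload is written unchanged and keyed by its content hash, so re-ingesting the same bytes is harmless.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path


def ingest(lake_root: Path, source: str, payload: bytes) -> Path:
    """Store one raw payload unchanged and return the path of the stored object.

    The object name is the SHA-256 of the content, so identical payloads
    from the same source map to the same object (idempotent ingest).
    """
    digest = hashlib.sha256(payload).hexdigest()
    source_dir = lake_root / source          # hypothetical layout: one prefix per source
    source_dir.mkdir(parents=True, exist_ok=True)

    obj_path = source_dir / digest
    obj_path.write_bytes(payload)            # raw bytes, no parsing or schema applied

    # Sidecar metadata kept next to the object (an assumption for illustration).
    meta = {"source": source, "bytes": len(payload), "ingested_at": time.time()}
    (source_dir / f"{digest}.meta.json").write_text(json.dumps(meta))
    return obj_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        ingest(lake, "weather_stations", b"station,temp_c\nA1,22.0\n")
        ingest(lake, "insect_images", b"\x89PNG...raw image bytes...")
        print(sorted(p.name for p in lake.iterdir()))
```

Because nothing is parsed at write time, sources as different as sensor readings and photographs can land in the same lake and be joined later.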
Janet George: The next topic I want to answer is around the focus. How do we focus? Climate change is so huge. It’s so big. It almost seems like, I call it the big, hairy, audacious goal. How do we tackle this big, hairy, audacious goal? And where do we focus? Should we look at the flora and the fauna? Should we look at weather, country, regions? There’s many, many variables that we can go after. And so what’s our focus area? Focus for me, because I come from a strong artificial intelligence machine learning background, I always look at the problem domain and form a hypothesis on what are the most critical variables that we need to watch for that directly informs us about predictions, or directly informs us about a metric that we can use to understand how we are doing in terms of forward progress or backward progress.
Janet George: If we see sea level rise, it’s a primary variable. And it’s somewhat of an independent variable, that is a very strong signal for many things that happen to climate change. So we spend a lot of time focusing on sea level rise and the consequence of the sea level rise and its direct impact for us. We also pick other variables like CO2. Now we have known that in countries, advanced countries, especially Europe, UK, many of these countries have focused on really taking action with reducing the carbon footprint, and have seen direct benefits. Those are some of the areas we want to focus here. Also, scientists are spending a lot of time trying to figure out how to harmonize carbon levels and how to make sure that we can reduce carbon levels by our actions. We know that these two variables are dependent on many things that are happening to us right now.
Janet George: For example, when the sea level rises, we see disappearing land. We see disappearing insect populations. We see tropical storms being much more severe than we’ve ever experienced in the past. We see our ice melting. We see the loss of snow. We see drought that is persistent for years in a row. We see disappearing habitats. Heat waves take a different form. We experience heat waves like we’ve never experienced in generations prior. And we are seeing a lot of species, these invasive species that are surviving through these very high temperatures. And these are usually in the form of pests, which is not very good for the immune system of our habitat. They attack the immune system, the natural immune system of the habitat. Those are not good for us.
Janet George: And we are also seeing other things that are happening to us, like diminished plant population. And when we see that the plant population is diminishing, this has a direct effect on all of us because our healthy food sources disappear along with that. And so that’s something that we want to pay attention to.
Janet George: Next, the question was asked: What are my interesting discoveries around climate change? And so how would I ignore data that has a lot of false positives? And what have I discovered along the way? I think one of the biggest discoveries I’ve made as I’ve studied and looked into this data is that we have a lot of different species that are useful to us and help us along the way. Right? These species range from about 10 million to 14 million. And because of our history and where we came from and our infrastructure and compute, we only documented 1.2 million of these species that are captured. So this is a huge gap between what exists and what is actually captured.
Janet George: And today for the first time, we actually have this huge opportunity. We’re in this era where we can capture all of this. We can capture the current species. The problem with not being able to document these species, or only sporadically documenting these species, is the fact that we don’t understand how and when they become extinct. And when we don’t know the species and the rate at which it is reaching extinction, we are experiencing loss. And this loss is very severe. Now we can use big data and artificial intelligence. It is the right problem domain. We can use a lot of convolutional neural networks. And I’ve been doing a lot of image analysis using convolutional neural networks for insects, watching the different kinds and types of insects, classifying them, and also clustering the different species and documenting them so we can predict when they will be extinct, and the rate at which they are growing and why they are becoming extinct. What factors are contributing to their extinction and so on and so forth?
Janet George: One of the studies that has come out, and you can Google most of these studies, one research paper that is very, very interesting is around the hyper alarming decrease in insect populations. Now you might know that insects are super critical. They’re a foundation for us in our plant economy. When we see a 76% decrease in flying insects in just a matter of a couple of years, we’re not talking a decade, we are not talking five years. We’re just talking year to year. That’s a crisis in our biodiversity. And there are serious ramifications in habitat loss. Note that 35% of the world’s plant crops are pollinated by flying insects, so these are very, very important for cross-pollination and maintaining the delicate balance of our natural ecosystems.
Janet George: What kind of infrastructure and what kind of investment is required? Is the problem so big that it cannot be tackled? Or is the problem bite size, and we can chew? And as a scientist, how do I come into the space? And what can I do, and where can I start? That is what I’m going to answer in this next slide. If you think about, we talked about the data lake, if you think about how economies of scale have allowed us to build a big data distributed computing stack very easily, we can actually start very, very small. We can build on bare metal. We can use commodity hardware. There’s so much software that’s available to us, and AI algorithms that are available to us in open source. You can use Google’s Inception-v4 network. Or you can use Facebook’s Mask R-CNN. You can use all of these technologies that are available to you. I am a big believer that you start small. And when you start small, you start with an investment of a couple million dollars. And based on how big your data becomes, if you have one petabyte of data, then you can do very well with a small compute infrastructure.
Janet George: And then you can grow out that compute infrastructure to as large as you want it to be. And that’s why the price tag is really based on the scale of data you want to process. But on the upper scale, if you think of processing all of those, like data on 14 million insects, and much more than that, we’re not talking about a very large investment. We’re talking about up to $25 million in bare metal, compute and memory and storage, like the data lake. This is not a very massive investment. Traditionally, building infrastructures with big companies and having a [inaudible 00:14:20] software that’s sitting in an IT department, organizations, enterprises tend to pay $50 million, to $100 million, to sometimes up to $200 million on infrastructure alone. We’re talking about $5 million, to $10 million, to $25 million. And we can actually go at the problem and reverse the effect that it has on our ecosystem.
Janet George: For the first time, it’s a very doable problem. It’s something that can be attacked. Today, we don’t even get housing for a few million dollars here in Silicon Valley. But we are able to actually create an entire distributed big data computing stack with a very, very small footprint. And so that allows us to do a very large amount of analysis, given the right compute and memory and storage.
Janet George: What are my lessons learned? Working with this data, working at scale, doing AI on insect images and trying to build prediction models on sea level rise, what I’ve learned around data collection and processing is that we actually have to be very careful about how we collect the data. This is very important because there are three components to what we’re trying to do. The first component is the data itself. The second component is the infrastructure. And then the third component is: How do we actually take the models that we’ve built, and then how do we start to predict and use the predictions to make actual decisions for our future? My first learning around this is the KISS principle, which is really Keep It Simple. Get away from extract, transform, load, which is the traditional method of extracting and loading data.
Janet George: When we do the traditional methods, we actually cause loss of vital data signal, so we lose data in the process of trying to extract and transform. The best advice here is to really store raw data, and do the transformations for that data dynamically as you’re using the data, or learning from the data. I like to keep the data free from entanglements. And by entanglements, I mean schema. I don’t want to enforce a schema on the data because then you will have to spend a lot of time undoing the schema. You want a loose coupling with the format transformations. If you have a tight coupling, you will be in the business of trying to format and reformat data at scale, which will consume all the time and energy required, rather than do the actual analysis.
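To make the "store raw, transform dynamically" advice concrete, here is a minimal sketch under stated assumptions. The record layout, the field names (`temp_f`, `temp_c`, `station`), and the helper functions are hypothetical. The raw lines are kept exactly as they arrived, and the unit conversion happens only when a particular analysis reads them.

```python
import json
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# Raw records as they might arrive from different sources (hypothetical data).
# Nothing is normalized at write time: units, types and extra fields all differ.
RAW_RECORDS = [
    '{"station": "A1", "temp_f": "71.6", "ts": "2019-06-01T12:00:00"}',
    '{"station": "B7", "temp_c": 25.0, "ts": "2019-06-01T12:05:00", "humidity": 0.4}',
    "not json at all",
]


def read_raw(lines: Iterable[str]) -> Iterator[Dict]:
    """Parse lazily at read time; unparseable lines are skipped, not deleted."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def to_celsius(record: Dict) -> Optional[float]:
    """One possible read-time transformation: unify temperature units."""
    if "temp_c" in record:
        return float(record["temp_c"])
    if "temp_f" in record:
        return (float(record["temp_f"]) - 32.0) * 5.0 / 9.0
    return None


def temperatures(lines: Iterable[str]) -> List[Tuple[Optional[str], float]]:
    """A view built on demand; the raw lines themselves are never rewritten."""
    view = []
    for record in read_raw(lines):
        celsius = to_celsius(record)
        if celsius is not None:
            view.append((record.get("station"), round(celsius, 2)))
    return view


if __name__ == "__main__":
    print(temperatures(RAW_RECORDS))
```

A different analysis, say one interested in humidity, can define its own view over the same raw lines without any reformatting of what is stored.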
Janet George: We want to build a near real time processing capability, so what we’ve learned is when we have sensor data, and we are observing the plant, we cannot train on just old plant data. We’ve got to train on new real time data because we can see the plant behavior change. There’s a lot of variability in the data during the day when the plant is exposed to certain climatic conditions, or the plant is exposed to certain insect populations, and the plant starts wilting. Or if the weed is taking over, you can see how slowly the plant composition is changing. And in order for us to manage and monitor and learn, and also train our machine learning models in near real time, we want to be able to observe and train almost consistently and constantly.
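One way to picture training "almost constantly" is an online model that takes a small update for every new observation instead of being retrained on an old batch. The sketch below is an illustrative stand-in, not the models used in the talk: `OnlineLinearModel` is a hypothetical name, and it fits a plain linear model with per-sample stochastic gradient descent on squared error.

```python
from typing import List, Sequence


class OnlineLinearModel:
    """Linear model updated one observation at a time (per-sample SGD)."""

    def __init__(self, n_features: int, lr: float = 0.1) -> None:
        self.weights: List[float] = [0.0] * n_features
        self.bias = 0.0
        self.lr = lr  # learning rate, an assumed value for illustration

    def predict(self, x: Sequence[float]) -> float:
        return sum(w * xi for w, xi in zip(self.weights, x)) + self.bias

    def update(self, x: Sequence[float], y: float) -> float:
        """Take one gradient step on the squared error for (x, y).

        Returns the prediction error before the step, which can be logged
        to watch for drift as conditions change during the day.
        """
        error = self.predict(x) - y
        self.weights = [w - self.lr * error * xi for w, xi in zip(self.weights, x)]
        self.bias -= self.lr * error
        return error


if __name__ == "__main__":
    # Hypothetical stream: a normalized sensor reading and an observed response.
    model = OnlineLinearModel(n_features=1)
    stream = [([0.2], 1.4), ([0.8], 2.6), ([0.5], 2.0)] * 500
    for features, target in stream:
        model.update(features, target)
    print(round(model.predict([0.5]), 3))
```

The same loop can run behind a sensor feed, so the model reflects the latest readings rather than last season's data.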
Janet George: We want to assess the signal strength of the data at the time we are ingesting the data, not after we ingest the data. Traditionally, we have spent a lot of time trying to get on top of the data quality. And we want to try giving up controlling the data. We want to just work with the data in its natural form, so we try to understand the data as it comes to us, and especially at scale, petabyte scale. We’re not going to be able to control and manage all of the data quality. We just have to make sure that we have enough signals in the data that we can do the predictions with a great deal of accuracy.
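Assessing signal strength at ingest time could look something like the following sketch. The talk does not define a metric, so the two measures here (completeness and variance) and the thresholds are assumptions chosen for illustration. The batch is only scored and flagged; nothing is cleaned or dropped.

```python
import math
from statistics import pvariance
from typing import Dict, Optional, Sequence


def signal_report(
    values: Sequence[Optional[float]],
    min_completeness: float = 0.5,  # assumed threshold, tune per data source
) -> Dict[str, object]:
    """Score one incoming batch of readings without modifying it."""
    present = [
        v for v in values
        if isinstance(v, (int, float)) and not isinstance(v, bool) and not math.isnan(v)
    ]
    completeness = len(present) / len(values) if values else 0.0
    # A sensor stuck on one value carries no information, so variance is a
    # crude proxy for "is there any signal here at all".
    variance = pvariance(present) if present else 0.0
    return {
        "completeness": completeness,
        "variance": variance,
        "usable": completeness >= min_completeness and variance > 0.0,
    }


if __name__ == "__main__":
    print(signal_report([21.5, None, 22.1, 23.4]))   # mostly complete, varying
    print(signal_report([5.0, 5.0, 5.0]))            # complete but flat
```

Recording the report alongside the raw batch lets a later training job decide which batches to weight or skip, without the ingest path ever rewriting data.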
Janet George: And my third most important learning is that when we build our infrastructure, we want to make sure that it’s future proofed, so that we don’t have to continuously keep rebuilding and re-architecting our infrastructure. Rather, we simply add to our infrastructure as the scale of data grows, and we also modernize our platform and our technologies so that we can be ready for the amount of data. So when we go from one petabyte of data to 30 petabytes of data, we simply add compute and storage nodes. But we don’t re-architect our infrastructure. Rather, we spend time on understanding the actual effects of the data.
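The "simply add nodes" point can be illustrated with some back-of-the-envelope arithmetic. The usable capacity per node and the replication factor below are assumed figures for illustration only, not numbers from the talk, and `storage_nodes_needed` is a hypothetical helper.

```python
def storage_nodes_needed(
    data_pb: int,
    usable_tb_per_node: int = 400,  # assumed usable capacity per storage node
    replication: int = 3,           # assumed number of copies kept of each object
) -> int:
    """Return how many storage nodes hold `data_pb` petabytes (1 PB = 1000 TB)."""
    total_tb = data_pb * 1000 * replication
    # Ceiling division with integers avoids floating-point rounding surprises.
    return -(-total_tb // usable_tb_per_node)


if __name__ == "__main__":
    start = storage_nodes_needed(1)
    later = storage_nodes_needed(30)
    print(f"1 PB: {start} nodes, 30 PB: {later} nodes, add {later - start} nodes")
```

Under these assumptions, growth from 1 PB to 30 PB changes only the node count, which is the property a future-proofed, scale-out design is meant to provide.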
Gretchen DeKnikker: Great. That was amazing, Janet. Thank you so much. [crosstalk 00:19:30]
Janet George: Key takeaway slide, one key takeaway I think is around how all of us can help transform the impact of climate in our daily lives. We are irrevocably connected as humans and Earth, and we can do our share.
Gretchen DeKnikker: Awesome. Thank you so much, Janet. Actually, we have time for one quick question. I don’t know how quick this question is, but we’ll give it a shot. Can you elaborate on the dynamic schemas? And do you have any advice on how to manage them?
Janet George: Yes. There is actually, within the Hadoop Ecosystem Stack, there is Avro, and Avro is a dynamic format. You can use Avro and you can do schema on read or write, so you don’t have to enforce a schema. You can do the schema as you’re trying to analyze the data.
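The schema-on-read idea in this answer can be sketched without any library. The code below does not call Avro; it mimics in plain Python two of the rules Avro applies when a reader's schema differs from the one the data was written with: fields the reader does not ask for are ignored, and fields the reader expects but the record lacks are filled from the reader schema's defaults. The schema contents and the `resolve` helper are hypothetical.

```python
from typing import Any, Dict

# A reader schema written in the JSON style Avro uses (hypothetical fields).
READER_SCHEMA: Dict[str, Any] = {
    "type": "record",
    "name": "Observation",
    "fields": [
        {"name": "species", "type": "string"},
        {"name": "count", "type": "int", "default": 0},
        {"name": "region", "type": "string", "default": "unknown"},
    ],
}


def resolve(record: Dict[str, Any], reader_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Project a stored record onto the reader's schema at read time."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]   # reader's default fills the gap
        else:
            raise ValueError(f"record is missing required field {name!r}")
    return resolved                              # extra stored fields are dropped


if __name__ == "__main__":
    stored = {"species": "Apis mellifera", "count": 12, "trap_id": "T-9"}
    print(resolve(stored, READER_SCHEMA))
```

Because the projection happens at read time, two analyses can read the same stored records through different schemas, and adding a field with a default does not require rewriting older data.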
Gretchen DeKnikker: Amazing. Okay. Thank you so much for your time today, and this wonderful, timely topic.
Janet George: Thank you. And I appreciate the interest very much.
Gretchen DeKnikker: All right. Thanks, Janet. Bye bye.
Janet George: Bye.