Following up on his excellent talk on Pig vs. MapReduce, Donald Miner spoke to the NYC Data Science Meetup about using Hadoop for data science. If you can set aside the time to watch, it’s a terrific and detailed talk. However, if you’re pressed for time, you can use our time-stamped summary to skip to specific sections. (Quick note: The video ran out towards the end of the Q&A, but the audio is still perfect.)
Here’s the summary of Don’s talk, with the video, slides, and the full transcript below:
- Don’s background (0:10)
- Topic discussion (2:21)
- Intro to Hadoop, HDFS, MapReduce, and other technologies in Hadoop ecosystem (4:07)
- Pig (also see Don’s talk on Pig vs. MapReduce) (12:43)
- Mahout (14:57)
- 4 reasons to use Hadoop for data science (16:23)
- Example use cases (28:50)
- Data exploration (29:18)
- Filtering and sampling (31:18)
- Summarization (34:57)
- Evaluating data cleanliness (38:33)
- Classification (47:11)
- Data volumes in training (51:52)
- Natural language pre-processing (70:21)
- NLP Tools: Python, NLTK, and Pig; OpenNLP and MapReduce (70:58)
- TF-IDF approach to NLP (72:42)
- Recommender systems (75:26)
- Collaborative filtering (77:25)
- Graphs (81:26)
- Clustering (84:33)
- R and Hadoop (85:06)
- Q&A (86:30)
- SQL and key value stores vs. Hadoop for data science (86:48)
- Why not just use a graph database? (88:39)
- Delegating and working with a data science team (90:36)
Here’s the transcript:
(0:10) My talk today is called “Hadoop for Data Science,” and I’ll talk a little bit more about what that means in a second, but I’m going to introduce myself first. I work for this company called ClearEdge IT Solutions. We’re a small boutique Hadoop data science shop. We do a lot of work for the government. We also do work elsewhere, and we’ve been working with Hadoop for a long time, so we kind of get called into a lot of that stuff.
(0:33) I went to a school called UMBC in Maryland, and I got my PhD there in machine learning. When I graduated with my PhD, I spent several years of my life doing machine learning just because I liked it, but I couldn’t find a job doing it, really. So, I decided to get a “real job,” and they asked me on the first day, “Do you know how to do Hadoop?” and I said no, and the rest is all history, but the neat thing is that now it’s coming full circle where now I get to use both. I don’t know if that’s luck, but all I know is that I’m doing both of these things now.
(1:08) Me, personally; I’ve always lived in Maryland. I’m a Ravens fan, but my dad is a Giants fan, so I like the Giants, too. If you want to follow me on Twitter, @donaldpminer. I don’t tweet very much, but every now and then, I’ll try to say something deep. That’s my email address if you want to get in contact with me (email@example.com). I like using Stack Overflow a lot. I like to answer questions on the Hadoop and Pig tags. I wrote this book called MapReduce Design Patterns which has absolutely nothing to do with this talk, but I brought some books for free. I don’t know how we’re going to give them out, but I’ve got about twenty something.
(1:42) I wrote this book; this book is about problem solving in Hadoop, so it kind of assumes you have some remedial knowledge in Hadoop and that you want to solve some more complex problems.
(1:54) I kind of put some technologies that I like using down at the bottom. I’m a long-time user of Python. I like Hadoop a lot. I like Pig a lot. I’ll be talking a little bit about Pig, and I use this technology called Accumulo, which is a BigTable implementation a lot like HBase that the NSA put out. So, that’s kind of like what I use day-to-day, and I’ll talk about some of the other technologies that I use day-to-day as a data scientist.
(2:21) Let me talk a little bit about what I’ll talk about. This is a meta-slide. I’m not sure if this is cliché yet – is this cliché yet? The Step 1: Collect underpants, Step 2: ??? – that’s cliché now, right? I’m not sure if that’s transitioned yet. That’s what this talk is about. Not about collecting underpants, but more about – more than often, I’ll say you see Hadoop and data science go together for some reason, but nobody ever talks about what that thing in between is. It’s like ok, I’m trying to do data science, and I’m trying to do Hadoop, but what the hell does that actually mean? I guess what I’m trying to focus on in this talk is what those question marks are, not so much like what Hadoop is, necessarily – which, I’ll get a little bit more into what data science is, if that makes sense. I’m trying to talk about what that gap is.
(3:10) I’m going to start out with an Intro to Hadoop because this is a data science meetup, so I’m not assuming everyone knows Hadoop. I’m going to try and convince you why Hadoop is cool, and then I’m going to talk and go through some examples of data science stuff on Hadoop to motivate some stories.
(3:24) One thing I want to caveat this with, as well, is that this might be one of the hardest talks I’ve ever written, in terms of sitting at my desk, deleting slides, creating new slides and stuff. What I realized in writing this talk is this is a really, really complicated, intricate discussion. I’m going to caveat this with – I’m definitely kind of nervous about coming out and saying some of the things I’m doing, so I would definitely like to open up to discussion, make your own conclusions and things like that. This is kind of my snapshot of what I’ve seen. I’m not like the Word of God on Hadoop and data science or anything; just take from this what you will, I guess.
Intro to Hadoop, HDFS, MapReduce, and other technologies in Hadoop ecosystem
(4:07) To kind of get into the talk, let’s talk about Hadoop a little bit. So, very high level of Hadoop is just like distributed platforms, works on thousands of nodes, hundreds of nodes, tens of nodes, whatever. It’s got two major components. One is the data storage platform and also the analytics platform. It’s kind of two pieces. I guess you could use HDFS as a storage platform or MapReduce just as a computation platform, or you can swap out HDFS for like S3, things like that.
(4:34) Next big thing is that it’s open source. It was open sourced by Yahoo! – or, it was open source and then it originated from Yahoo!, and that’s really cool, too. Anybody could use it. Hadoop is unseating enterprise technologies like Oracle from organizations, and that’s a pretty big thing for an open source technology. And then the next thing is that it runs on “commodity hardware.” Hadoop itself is software; there’s no hardware associated with it. The hardware you can get is kind of generic. I mean, I’ve built Hadoop clusters out of workstations before, and it’ll just work. It won’t work great, but it’ll work, so that’s kind of neat. I don’t really like the term “commodity hardware,” because you can spend like 20K a node on really nice Hadoop nodes, but it is what it is.
(5:23) The first piece I’ll talk about is HDFS very briefly. Pretty much, HDFS isn’t that fancy. All it does is store files in folders; that’s it. You can read files, you can write files. You can’t modify files, but you can delete them. There are some limitations on what you can do. It’s got really big block files – or, really big blocks. So, I might have a 100MB file that may be split into one or two blocks, meaning that chunk of data is stored together. The next big thing is that it’s got three replicas of each block, so effectively, you’re running at a 33% storage efficiency rate. A storage guy would be kind of like, “oh my God, that’s awful,” but it is what it is, and there’s a number of reasons for that. And two, blocks are scattered all over the place. This one’s an interesting point in that – I’m going to split my files up into blocks and scatter this relatively evenly across my cluster in a kind of random fashion. It’s not like I’m sharding and I know this block is directly here or not, and there’s a bunch of infrastructure that goes around supporting this. But this provides you a storage layer for the next thing, which is MapReduce.
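As a rough illustration of the block and replica arithmetic above (not from the talk), here is a minimal Python sketch. The 64 MB and 128 MB block sizes are assumptions chosen because they are common HDFS defaults and match the “one or two blocks” remark; `hdfs_footprint` is a hypothetical helper, not an HDFS API.

```python
import math

def hdfs_footprint(file_mb, block_mb=128, replicas=3):
    """Return (number of blocks, raw MB stored across the cluster)."""
    blocks = math.ceil(file_mb / block_mb)
    # Every block is stored `replicas` times, so raw usage is a multiple
    # of the logical file size.
    raw_mb = file_mb * replicas
    return blocks, raw_mb

# Block sizes below are assumed values, not figures quoted in the talk.
for block_mb in (64, 128):
    blocks, raw_mb = hdfs_footprint(100, block_mb)
    efficiency = 100 / raw_mb
    print(f"{block_mb} MB blocks: {blocks} block(s), "
          f"{raw_mb} MB raw, {efficiency:.0%} storage efficiency")
```

With three replicas, one third of the raw disk holds unique data, which is where the 33% figure comes from.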
(6:32) What MapReduce does is it analyzes raw data in HDFS, and it tries to do it where that data is. That was kind of the big point; it’s going to take this little chunk of Java code that you wrote, and it’s going to ship it over to all the different nodes, and it’s going to run over data that’s local there. And the idea there is that that gives you lots of good things, which we’ll be talking about in a second.
(6:51) The name MapReduce comes from the fact that jobs – so, you think of a job as a thing that you’re trying to do – are going to be split into Mappers and Reducers. Mappers and Reducers come from functional language type terminology. Map basically means Record in/Record Out. A record comes in, I do something, and it comes out. And then a reducer takes a list of objects of things and reduces it down into some individual number.
(7:24) If you think about what MapReduce might do, what Mappers do is they slurp the data out of HDFS in a relatively data-local way. And the kinds of things you might do here are you might do filtering, transformation, you might do parsing. In a data science context, this is where you’re doing a lot of cleansing or maybe some preliminary NLP or tokenizing. You may be doing things like filtering out records that are crap that you don’t want, all kinds of things. Then the Mapper, once it’s done doing all of this record-by-record stuff, it’s going to output a key value pair, and I’ll get to what that means in a second.
(8:01) But look at what the Mapper does; for each record, it does this – one record doesn’t care what the other record is doing, right? There’s a couple things. One record doesn’t care what the next record is doing; it doesn’t care how the next record is transformed; it doesn’t care how the record on the other node was transformed, and it doesn’t care about the order, either, of which things are transformed. That’s where the parallelism comes from. The fact that I don’t care about any of those things means that I can just run things wherever I want, whenever I want, and I don’t have to worry about synchronization and all that crap. I output this key value pair, and what the MapReduce framework is going to do is something that I like to call the Magic Step, which is the shuffle and sort. What it does is it takes the keys and it groups your stuff by key.
(8:49) If the key is like a state in the United States, it’s going to group all of the values that you associated with that key together. Let’s say I wanted to group by state, and my values were people. In the Reducer, I would get – here is the state is New York, and here are all of the people that I have records for in New York. And then the Reducer – think about what the Reducer has now; the Reducer has this grouping, and in that grouping, you’re going to do things like aggregating, counting. You can do statistics, standard deviations, medians, all kinds of stuff on that little subset of data that you group by. And then the Reducer, once it’s done with that, is going to output to HDFS. Now, the real neat thing is just that very simple thing with Mappers and Reducers.
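The map, shuffle, and reduce flow described above can be sketched in plain Python. This is a single-process toy under stated assumptions, not Hadoop code: the comma-separated record layout and the function names `mapper`, `shuffle_and_sort`, and `reducer` are made up for illustration.

```python
from collections import defaultdict

def mapper(record):
    """Record in, key-value pair out: parse one line and emit (state, name)."""
    name, age, state = record.split(",")
    yield state, name

def shuffle_and_sort(pairs):
    """Group every value under its key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(state, names):
    """Aggregate one group: here, count the people recorded for the state."""
    return state, len(names)

# Hypothetical input records in "name,age,state" form.
records = ["alice,31,NY", "bob,27,MD", "carol,45,NY"]

# Each record is mapped independently, which is what makes the map phase parallel.
pairs = [pair for record in records for pair in mapper(record)]
results = [reducer(state, names) for state, names in shuffle_and_sort(pairs)]
print(results)
```

On a real cluster the mapper calls run on different nodes and the grouping happens over the network, but the contract between the three steps is the same.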
(9:31) The way I like to think about this is MapReduce is kind of like a game. If you can figure out how to play the game right, you can do a whole ton of things. I had a buddy that built video gambling things, and he told me – I don’t know if this is true or not – everything boils down to Bingo; it’s like Bingo Complete. For those of you who don’t understand, like NP-completeness and stuff. Basically, Bingo is okay to gamble on, so if you can make a game that’s effectively Bingo, then that game is legal. It’s almost like, here, take your problem and make it MapReduce, then it just takes care of the rest. That’s where this gets, but you can imagine that’s hard, so if I can’t fit my problem into Mappers and Reducers, and I try and do something else, then solving things in MapReduce is hard.
(10:24) Those are the two core components of Hadoop. There’s also this thing called the Hadoop Ecosystem, and for some reason, they’re all animals. Two major things to call out – one is Pig and one is Hive. Pig is a higher-level language; it’s called a data flow language where you basically describe data flow transformation. I’ll be talking about Pig in a second.
(10:47) Hive is a SQL-like language on top of Hadoop; so, if you have some sort of comma-separated data file, I can write this SQL language. If you’ve been writing Oracle SQL for 20 years, you’ll be disappointed in the amount of SQL you can write, but it gets the job done. Those need to exist because MapReduce, you’re typically writing in Java, and you’re writing these things in Java, and it’s very verbose, and it’s very manual. Pig and Hive will give you this higher data language, and you can get the job done faster, but the problem is that you have less control. That’s the trade-off you’re making.
(11:20) The next one is – there’s also been these data systems that are built on top of HDFS, like HBase and Accumulo. I’m not going to spend too much time talking about these, but these have been useful to me in a data science context. These are basically large-scale key value stores. They’re sorted maps, so if you know what a sorted map is, that’s great. But basically, you can fetch things by key very quickly, you can write things very quickly, so it solves a lot of the real-time problems that you may have with Hadoop, and these are good transactional stores and things like that.
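For readers unfamiliar with the term, a sorted map can be sketched in a few lines of Python. This is an in-memory toy meant only to show the idea of keyed lookups plus ordered range scans; the `SortedMap` class and the key naming scheme are assumptions for illustration and do not reflect the HBase or Accumulo APIs.

```python
import bisect

class SortedMap:
    """Toy sorted key-value map: point lookups plus ordered range scans."""

    def __init__(self):
        self._keys = []    # keys kept in sorted order
        self._values = {}  # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan(self, start, stop):
        """Yield (key, value) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]

table = SortedMap()
table.put("user:carol", {"state": "NY"})
table.put("item:42", {"title": "book"})
table.put("user:alice", {"state": "MD"})

print(table.get("item:42"))
# ";" sorts immediately after ":", so this range covers every "user:" key.
print(list(table.scan("user:", "user;")))
```

Because keys are kept in order, related rows that share a key prefix sit next to each other and can be read back with one scan.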
(12:08) Then you’ve got this whole bunch of other just random crap that seems to be just tossed in that has to do with Hadoop, like ZooKeeper is a thing where you can keep little tidbits of information and have some synchronization properties. It actually has nothing to do with Hadoop whatsoever but for some reason is considered part of the Hadoop ecosystem. Flume and Storm are these streaming data collection frameworks. Cassandra is another key value store but doesn’t necessarily use HDFS, and then Avro is like a data format. So, as you can imagine, there’s all these projects that keep coming out, and they’re all part of this ecosystem. You have a lot of tools available to you.
Pig (also see Don’s talk on Pig vs. MapReduce)
(12:43) One that I want to dive into a little bit is Pig, and the reason that I want to do that is because I use Pig a lot for data science. Pig is really good for things like data exploration. It’s good for grouping things together. It’s also really good glue – so, if I’m trying to use an external library like OpenNLP or NLTK or Weka, or something like that, as long as that thing’s got an API, whether that’s Java or Python or something like that, I can use Pig to glue that into MapReduce, which is kind of neat. That’s one thing I use it a lot for. It’s a good base. I really like Pig for doing data science. What Pig does is it provides you some higher-level concepts. You have things like GROUP BY, and DISTINCT, and FOREACH, and FILTER.
(13:30) As you see, I’m not talking about Mappers and Reducers here. What Pig does is it looks at, say, FILTER and GROUP BY, and it builds this query plan of saying, how would I solve this problem with MapReduce? And that’s really neat – it saves a lot of time. If you were trying to learn Hadoop, I would almost say write Word Count, which is like the Hello World in Java, but then right away get into Pig and then maybe go back to Java MapReduce later. That’s usually what I suggest.
(13:56) Pig is cool, too, because you can write custom loaders and storage functions as well so you have a lot of flexibility in terms of what kind of data you work with and, like I said, I use it a lot. Here’s an example of a Pig script (see slide 8) that – so, it’s going to load this data in this text file that I have in HDFS. I’ve got the name, I’ve got the age, and I’ve got the state that person lives in. I’m going to group by state, so I’m going to say collect all the guys by state, like I said – that was the key in my Mapper in my previous example, and then for each of the groupings, I’m going to go through and output the grouping, which was the state. So, maybe this will output Maryland or New York or something. It will count how many items were in that group, and then it will give me the average of the ages in that group. It will be something like – imagine if this was Meetup’s database. You could say, Meetup members in Maryland are of average age 27, while meanwhile in New York, they’re 29. I don’t know if that would be useful, but that’s the kind of thing that you can do in Pig easily.
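The slide’s Pig script is not reproduced in this transcript. As a stand-in, the same load, group by state, then count and average logic can be sketched in Python. The sample rows are made up for illustration, chosen so the averages match the numbers mentioned above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows standing in for the text file: (name, age, state).
people = [
    ("alice", 30, "NY"),
    ("bob", 28, "NY"),
    ("carol", 25, "MD"),
    ("dave", 29, "MD"),
]

# Equivalent of grouping by state: collect the ages under each state key.
ages_by_state = defaultdict(list)
for name, age, state in people:
    ages_by_state[state].append(age)

# Equivalent of the per-group step: emit the key, the count, and the average age.
summary = {
    state: (len(ages), mean(ages))
    for state, ages in ages_by_state.items()
}

for state, (count, average_age) in sorted(summary.items()):
    print(state, count, average_age)
```

In Pig the grouping and per-group aggregation are declared in a couple of statements and compiled into MapReduce jobs; the Python version only mirrors the data flow on one machine.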
(14:57) The next one I really want to point out is Mahout. Mahout is considered part of the Hadoop ecosystem, and Mahout is a machine learning library. Mahout is kind of de facto the machine learning library for Hadoop, but I really kind of like to toss it around. I think it’s a machine learning library that happens to use Hadoop sometimes. There’s a lot of things in Mahout that don’t use Hadoop at all and do everything locally. It’s kind of this weird hybrid between a Hadoop library and not, but if you’re looking for a machine learning library for Hadoop, you should probably look at Mahout first.
(15:33) The three major categories that it tries to handle is one is recommenders, which I’ve probably used the most. If you’re trying to just use a really basic recommendation system, you want to take your first stab at it, I would highly suggest Mahout. You could probably get going in a short period of time. I think most people that take recommendation kind of seriously, though, at a certain point start getting a little bit frustrated, and I think that’s true of most libraries in general. After you’ve solved the problem initially, you run into something that your framework doesn’t do and then you have to write everything from scratch anyways. That’s kind of the same situation here. It also handles clustering and classification as well. It’s got some classifiers and clustering, which I’ll be getting into a little bit more later.
4 reasons to use Hadoop for data science
(16:23) At this point now, I’m going to try and convince you of all these cool things about Hadoop and also to kind of weave in why it has anything to do with data science, and I’m going to go over four cool things.
(16:35) Cool thing number one is linear scalability. Linear scalability is this general rule where if I have a job that, let’s say, takes 10 minutes – if I double the number of computers, keep the same data size and the same job, that job should now take five minutes. If I double the amount of data, it should take twice as long to run. If I double the amount of data and double the amount of nodes, it should take the same time. This is this general rule of thumb; this is not actually the case, but for the most part, it’s actually pretty good. When you’re not scaling down, like at very high scales – if I’m talking about the difference between 500 nodes and 700 nodes, linear scalability does hold true, but when we’re talking like five versus two nodes, maybe you won’t see that, but at the higher scales, you’ll definitely see linear scalability.
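The rule of thumb above amounts to a one-line formula. The sketch below is an idealized model only, with a hypothetical `estimated_minutes` helper; as the talk notes, real clusters deviate from it, especially at small node counts.

```python
def estimated_minutes(base_minutes, data_factor=1.0, node_factor=1.0):
    """Idealized linear scaling: runtime grows with data and shrinks with nodes."""
    return base_minutes * data_factor / node_factor

baseline = 10  # the 10-minute job from the example above

print(estimated_minutes(baseline, node_factor=2))                 # double the nodes
print(estimated_minutes(baseline, data_factor=2))                 # double the data
print(estimated_minutes(baseline, data_factor=2, node_factor=2))  # double both
```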
(17:34) This is important for data science because you can solve problems with hardware. If I want my data exploration to be going faster, then I can just buy more computers. If I want to store more data, I just need to buy more computers. And the scale of it is really why we’re even talking about this at a data science meetup. Hadoop and MapReduce in general are not very good technologies for data scientists to use, in my opinion. I’d much prefer to be in this pristine environment that’s interactive, the tool has visualizations in it – Hadoop is almost more like a thing that you have to use at a certain point, and actually this is true outside of data science, as well. If you don’t have to use Hadoop, there’s no reason why. MapReduce isn’t a feature; MapReduce is like a constraint. Everything Hadoop is constraining. The constraints I have on HDFS, the constraints I have on MapReduce, the constraints I have in Pig, these aren’t making your life easier; they’re making your life harder. The trade-off you’re making is, okay, I’m going to make my life a little bit harder by using this system, but it’s going to take care of a bunch of stuff for me. I’m really trying to drive this home. These four cool things I’m going over are reasons why it’s good and reasons why you should use it. If you don’t have to use it, don’t.
(18:48) The next cool thing is schema on read. This is probably my favorite one. I’ve got these two pictures here to contrast (see slide 11). Imagine US Congress versus a vigilante cowboy. That’s like the difference between before and after Hadoop. So, before when I was working on data projects, we would have these long meetings with lots of people where we would decide what the schema was, and we would decide what the ETL process was, and then we would go off and do them, and then we’d be like, oh that didn’t work, we have to have more meetings. You had to do a lot of planning before; if you made a mistake on the schema, that could be catastrophic for the success of the project in the future. You had to do a lot of thinking ahead of time.
(19:32) With Hadoop, it’s kind of neat because you just kind of load the data first and ask questions later, and this is really cool for a number of reasons. First, I can keep the original data around and not really care. Before, as we were sitting in meetings, we were dropping data on the floor. If I couldn’t fit it into my SQL database, that record wasn’t going anywhere; we were dropping it. The neat thing is we can start the data flow and start setting that data up and keep that original data around, and there’s a number of good reasons. First of all, obviously, I don’t have to do any transformation processes to get that data in. The other good one is that ETL processes are software; they can be buggy. If you’re dealing with original data, you’re safe. You don’t have to go back and fix the bugs of ETL processes.
(20:33) The other cool thing is you can have multiple views of the same data. I have the original data, and then what Schema on Read means is I’m applying the structure, or the schema, or the structure of the understanding of the data at runtime. What that means is I can have different interpretations of that structure every time I run a job. So, a good example, let’s say I have some analytics against books. I’ve got lots of books in my corpus, and I’m reading books. I might want to structure my data in different ways depending on what work I’m doing. Maybe a “record” for me is the entire book, maybe a record is a chapter, maybe a record is a paragraph, maybe it’s a sentence, maybe it’s a word, maybe it’s a character. Depending on what I’m trying to do, the structure of that can be different. That’s kind of neat, because otherwise, if I wanted to structure the data in different ways, I may have to store the data in different ways, and that sucks. I’m not going to do that. The other cool thing is I can work with unstructured data, which I’ll be talking about a little bit more soon. Unstructured data makes it even nastier of a problem, a nastier ETL problem. That means I can just – I can dump it in, and I can call my data scientists up and say hey, the data is there; go do something with it. Like I said, the biggest point to take home is store first, figure out what to do with it later. By the time you figure out what you’re going to be doing with your data, you’re already going to have like a month of data sitting there, fantastic.
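To make the book example concrete, here is a small Python sketch of schema on read: one raw text is stored once, and each job decides at read time what a “record” is. The sample text, the `CHAPTER n` marker convention, and the reader function names are all assumptions for illustration.

```python
import re

# One stored copy of the raw data; the chapter marker format is an assumption.
raw_book = (
    "CHAPTER 1\nIt was a dark night. The rain fell.\n\nNobody came.\n"
    "CHAPTER 2\nMorning arrived."
)

def as_chapters(text):
    """A record is a chapter: split on lines that look like 'CHAPTER n'."""
    parts = re.split(r"^CHAPTER \d+\n", text, flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]

def as_paragraphs(text):
    """A record is a paragraph: blank lines separate paragraphs within a chapter."""
    return [
        paragraph.strip()
        for chapter in as_chapters(text)
        for paragraph in chapter.split("\n\n")
        if paragraph.strip()
    ]

def as_sentences(text):
    """A record is a sentence: naive split on terminal punctuation."""
    return [
        sentence.strip()
        for paragraph in as_paragraphs(text)
        for sentence in re.findall(r"[^.!?]+[.!?]", paragraph)
    ]

def as_words(text):
    """A record is a word."""
    return [word for sentence in as_sentences(text) for word in re.findall(r"\w+", sentence)]

for reader in (as_chapters, as_paragraphs, as_sentences, as_words):
    print(reader.__name__, len(reader(raw_book)))
```

None of the readers changes what is stored; choosing a different one is the whole “schema” decision, and it can be revisited on every run.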
(21:54) The next really cool thing is transparent parallelism. What I mean by this is you don’t have to worry about any of the things that you would typically have to worry about when you’re dealing with parallelism, or parallel systems. Things like, do I have to worry about how to scale the system, do I need to worry about network programming, do I have to open sockets and things and send packets, do I have to deal with synchronizing and locking, inter-process communication, data storage, fault tolerance, threading, RPC, message passing, distributed stuff. These are things I don’t have to care about anymore. I don’t have to be a master of distributed systems to do this stuff. So, the trade-off, which I kind of mentioned earlier, is all I’ve got to do – I like to visualize this as I’ve got this big, messy solution, and it’s your job to put that into this little box that’s about this big, and that’s the MapReduce box. If you can execute it in MapReduce, then you don’t have to worry about all these things. But, if you try to do something outside of that box – I’m trying to do something that’s not necessarily MapReduce, then you start to have to worry about some of these things.
(22:56) One story that I like to tell about this is that I had a problem – so I was trying to solve a streaming problem before I had things like Flume and Storm. Streaming is I’ve got little records trickling in, and I want to process those records as they’re coming in. So we, in our infinite wisdom, had this little Perl script that would run that would process this data and then do something to it. And we wanted to parallelize it, right? So we would just SSH into a random node and then execute that script. What happened was we got a locking problem, and that script just stalled and started spinning in circles, and then it kept spawning and spawning and spawning, and the entire thing was running a bunch of Perl, and the cluster was just dead. So, the issue that we ran into was that we went outside of this MapReduce box, and we ignored the fact that we had to deal with locking and synchronization and inter-process communication. If you keep inside of the box, you can pretty much say that Hadoop will take care of it. That’s really the hard part of doing data science with Hadoop is, I’ve got to figure out how do I solve this problem here and there’s a lot of ways that you’d like to solve problems today outside of Hadoop that aren’t going to fit, honestly. I was just talking to a guy before the talk started about doing some matrix stuff and putting that in there, and it’s not going to happen. Well, hopefully it’ll happen – sorry!
(24:25) Cool thing number four is unstructured data. Unstructured data is typically associated with Hadoop, for some reason. The cool thing is I can store the data in HDFS, and I don’t have to really care – like I said, it’s just a file, right? It could be anything; it could be a compressed file; it could be an encrypted file; it could be a text file, CSV file, image, whatever. Then all I have to do is write some Java code to read that file. The cool thing is – some examples of unstructured data might be media, text, free forms like medical forms, maybe. Log data is one in particular that may have structure but different lines may have different meaning based on context. My favorite form of unstructured data is a bunch of structured data that’s not the same structured data. So, somebody may have given you like – here’s like all of our data, and there’s 90 different formats of structured data; that would be something I call unstructured. Unfortunately, the higher-level languages like SQL and Pig or Hive, let’s say, require some sort of structure to that. Every time you bring up your job, Schema on Read happens, and then you need to say ok, well this is what this data means. But one thing to realize is MapReduce in the Java MapReduce sense is just Java. You’re writing Java methods, so you’re using the real JVM. Basically, anything Java can do, you can do in MapReduce. It might not necessarily be a good idea, but you could do it. If I want to read an image file or an audio file or something like that, if I want to do OCR – if I can do it in Java, I can at least try to do it in MapReduce. Is anybody not convinced that Hadoop is cool? Alright. Get out. I’m just kidding, you can stay.
Example use cases
(28:50) The rest of the talk – I don’t actually want to talk about these four things (see slide 14; data exploration, classification, NLP, recommender systems), so I decided to just talk about some real things and hopefully I’ll make some points along the way. That’s how the rest of this talk is going to go, and I’m going to talk about data exploration; I’m going to talk about classification, and I’m going to talk about Natural Language Processing and recommender systems, and then I’ll briefly talk about some other random crap.
(29:18) The first thing is data exploration. I think Hadoop is actually really, really good at data exploration. If I have a bunch of data that I don’t understand and I want to figure it out, Hadoop is really good. A couple reasons. One is, obviously, I can store the data like I just said. I don’t need to understand the data first; that’s the whole point of exploration is I don’t understand the data. And if I need to understand the data in order to get the data in, that’s bad. Hadoop is good because I don’t have to worry about that. The other thing, too, is it’s obviously fast, from an “I want to deal with lots of data” perspective. So, sometimes large data sets, exploring large data sets, that’s good. When I go into a consulting engagement – for example, I’m about to do some data science, I wanted to list out some of the things I do, like maybe the first week. Just as a general rule of thumb, if I’m on a four-week engagement, let’s say, I want to spend about 50% of my time doing exploration. In my experience, exploration is really, really important, and documenting it is really important. I think the bad, unfortunate thing is that most of my customers don’t agree with me, right? So I get hired to solve, build some magical machine learning classifier thing to solve all their problems and what they don’t realize is that really I should do some data exploration first and then tell them what value they could be getting. There’s this kind of step that you need to get to first to get value. I think data exploration is really important. Usually what happens is I’ll say yeah, ok, I’ll build whatever classifier to solve your problem; we do data exploration, and I show them the raw, hard facts of what’s in their data, and they’re like, ok for the next two weeks we can do what you said. Right, that’s usually what happens. The kinds of things that I like to do that I’ll talk a little bit more – one is filtering, the other is sampling, summarization, and evaluating cleanliness.
Filtering and sampling
(31:18) Filtering – I like to use this analogy – is a lot like a microscope. Filtering is, I want to look at the record in its true form, but I only want to see some of them; I don’t want to look at all of them. I’m not going to look through 10 billion records, I just want to look at 100 of them. In MapReduce, it’s pretty easy. In the Mapper, I just say: do I want to keep this record or not, and if I don’t, then I throw it away; and if I do, I say yes, then I’ll output it, and then I just look at the output. Some examples might be I only want to look at data about New York – that may even be too broad, right? Maybe I only want to look at data in New York about people in their 30’s that are married. Another thing is, at this point, when you’re filtering, you may start picking up on gibberish in your data, things that are dirty, things that shouldn’t be there, things that are useless. Twitter seems to be a pretty popular data set to work with, but I thought – for people that I follow on Twitter versus the mass population of tweets, you would be surprised how much just crap there is. Literally zero information in a tweet. Like, you have to try hard to write 140 characters and not give any information. There’s a lot of just sitting there and picking up on things as you start filtering, and you start writing down these rules of things I’m going to remove later because it’s crap. And then also, too, by time – you may slice by time, for example.
(32:59) Filtering is really cool because you can take all these microscopic looks on little pieces of data, and what I’m kind of hunting for when I’m doing filtering with Hadoop is I’m kind of looking for something that I may build a classifier for or something. I would say that it’s very rare that I build a classifier on all the data. What I’ll do is I’ll take a subset of the data and solve a specific problem.
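As a rough illustration of the map-only filter described above, here is a minimal Python sketch. The tab-separated layout and the column names (`user_id`, `state`, `age`, `married`) are assumptions made up for the example, not anything from the talk; under something like Hadoop Streaming the same function could be fed lines from standard input.

```python
def keep(record):
    """Return True for records worth a closer look (hypothetical fields)."""
    try:
        age = int(record["age"])
    except (KeyError, ValueError):
        return False  # missing or non-numeric age: drop the record
    return (
        record.get("state") == "NY"
        and record.get("married") == "yes"
        and 30 <= age <= 39
    )


def filter_lines(lines, columns=("user_id", "state", "age", "married")):
    """Map-only filter: yield each tab-separated line that passes keep()."""
    for line in lines:
        text = line.rstrip("\n")
        values = text.split("\t")
        if len(values) != len(columns):
            continue  # malformed row; a real job might count these instead
        if keep(dict(zip(columns, values))):
            yield text
```

There is no reduce step here: each record is judged on its own, which is why this kind of job parallelizes so easily.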
(33:25) The next one is sampling. Sampling is pretty much filtering, but I couldn’t fit it all on one slide. Hadoop isn’t really that good at interactive analysis. There have been a ton of projects that have sprung up around doing better interactive analysis, but for the most part, we’re not talking about putting this thing into Excel and then doing like pivot tables and stuff, or maybe even putting it into a SQL database and running SQL queries against it. We’re not there.
(33:45) Some kind of different types of sampling that I like to do – kind of a random sample. Like, I’ll just pull out something like a random thousand records, or a random 10% of the records. Maybe I’ll do the sub-graph extraction. Say I was doing some sort of social media analysis. I’ll pick a random 1% of the individuals, and then I’ll go pick a random 1% of their friends, and then 1% of their friends, things like that, so I’d at least build complete social networks, but it’s a subset of the social network. That’s another mode of sampling, and I could also do filters. Sampling is good because you can get a feel about what approaches might work or might not work, what kind of data is in there. Because if you just start and arbitrarily take the first 100 records, what I find with data is that, just the way data is written, you’ll find a lot of similar records put together, and then you don’t get really a good view of what’s in your data in its entirety. Pig has a cool keyword called sample that is pretty useful where you can trim down the number of records.
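The two random-sampling modes mentioned above (a random 10% versus a random thousand records) can be sketched in plain Python. This is only an illustration of the idea, not how Pig implements its sample keyword; the function names and the `seed` parameter are assumptions for the example.

```python
import random


def sample_fraction(records, fraction, seed=None):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)
    for record in records:
        if rng.random() < fraction:
            yield record


def sample_fixed(records, k, seed=None):
    """Reservoir sampling: a uniform sample of k records from a stream
    whose length is not known in advance."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record  # replace with decreasing probability
    return reservoir
```

Either approach avoids the trap of taking the first 100 records, which tend to be similar because of how the data was written.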
(34:57) The next one is summarization. Summarization is kind of like the opposite of filtering. Summarization is this bird’s-eye view of your data. Filtering is like I’m taking a real nit-picky look at my data; summarization is like I’m looking at my data as a whole. MapReduce is built for summarization; that’s exactly what it does. Reducers summarize data, mappers get the data, and then summarization is really easy. A lot of the things that I like to do is – for different columns or different pieces of the data that I’m looking for, I like to count how many records are there. Counting the number of records that are there is pretty interesting, a lot of the time. Get things like the standard deviation, get the average, so I can get a feel of what my data is. If I’m trying to be really advanced, maybe I’ll see if something fits some sort of curve. Getting the min/max of records is pretty cool, counting nulls in columns. Pretty much every time I go work on some system that uses some database, old-tech database technologies, probably more than half the columns are just always null, for some reason. I have no idea why. And then one of my favorites is grabbing the top 10 lists. I freaking love top 10 lists. Seriously, build a top 10 list on everything you can find. I’ve gotten so much value for my customers on top 10 lists. Like, probably more than all the snazzy Machine Learning stuff I do, top 10 lists are probably more valuable. I should just make this talk about top 10 lists.
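The per-column summaries listed above (count, nulls, min/max, average, standard deviation) and the top 10 list are small enough to sketch in a few lines of Python. This is a single-machine illustration under assumed inputs (a list of numbers with `None` for nulls), not a MapReduce job; the function names are made up for the example.

```python
import math
from collections import Counter


def summarize(values):
    """Count, nulls, min, max, mean, and population standard deviation
    for one numeric column given as a list (None marks a null)."""
    present = [v for v in values if v is not None]
    n = len(present)
    mean = sum(present) / n if n else None
    std = math.sqrt(sum((v - mean) ** 2 for v in present) / n) if n else None
    return {
        "count": len(values),
        "nulls": len(values) - n,
        "min": min(present) if n else None,
        "max": max(present) if n else None,
        "mean": mean,
        "std": std,
    }


def top_n(keys, n=10):
    """The n most frequent keys with their counts, most frequent first."""
    return Counter(keys).most_common(n)
```

For example, `top_n(caller_numbers)` over a column of calling numbers would surface an outlier like the one in the telecom story that follows.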
(36:32) One example, I was working for a telecommunications company, and we were looking at their call data records, and they wanted to reduce roaming. They knew they were losing a bunch of money on roaming, and basically they wanted to solve it with, hey, if I place the cell tower here, then I would lose less money on roaming. They basically wanted to do automated cost/benefit analysis of placing cell towers. And I grabbed the top 10 list of the top 10 cell phone numbers that placed calls. There was one guy that was like eight orders of magnitude higher than everybody else. It was a ridiculous number of calls in a day. Moreso, it was like the guy was placing something like 30 or 40 calls a minute, which is impossible for the most part. So, they run off and go take a look, and what they found was it was – I forget what system it was, but in cars, you have something like OnStar or something like that that everybody has a different one – supposedly if you weren’t subscribed, it would use this default number instead of using a number specifically for your car or something like that. Basically, this phone number was placing all these roaming calls throughout the day that were like half a second. And I don’t know, but I would imagine if they removed that, then they could save some money.
(38:07) Some other top 10 list things I’ve done, things with cybersecurity type use cases. I’ve seen number of packets in and out, like who is the most chatty person on the network. That’s always interesting. That’s probably been useful to the people every time I’ve done that with cybersecurity use case, the top 10 list of packets. There’s usually nine of them, they’re like, yeah, I understand that one, but that guy, I’m not sure why that guy is there.
Evaluating data cleanliness
(38:33) The next one is evaluating cleanliness. You never know how clean your data is going to be. You never know what problems you’re going to have, and one thing I like to think is I’ve never been burned twice by the same data integrity issue. I have this list in my head because you have a very traumatic experience when you run into one of these things. You’re never going to forget it. Even if I know I’ll probably never see it again, I still take a look for it. One is looking for 1970 in my data set. I got burned by that really bad one time. Basically, for whatever reason, instead of throwing “null,” they just returned UNIX timestamp zero, which turns out to be whatever that date is in 1970 at the beginning of the world, effectively. And that skewed off all my counts and everything, of course, so I always check for 1970s.
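A check for the 1970 problem described above might look like the following Python sketch. It assumes timestamps are non-negative UNIX seconds with `None` for missing values, which is an assumption for the example; real data may use milliseconds or strings.

```python
from datetime import datetime, timezone


def count_epoch_defaults(timestamps):
    """Count UNIX timestamps (in seconds) that are exactly zero, and
    those that fall anywhere in the year 1970 (UTC)."""
    epoch_zero = 0
    year_1970 = 0
    for ts in timestamps:
        if ts is None:
            continue  # real nulls are counted by a separate check
        if ts == 0:
            epoch_zero += 1
        if datetime.fromtimestamp(ts, tz=timezone.utc).year == 1970:
            year_1970 += 1
    return {"epoch_zero": epoch_zero, "year_1970": year_1970}
```

A non-trivial count for either number suggests that some upstream system wrote zero where it should have written a null.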
(39:41) Things that I check for: things that shouldn’t be null, like dates shouldn’t be null and things like that, user IDs. There’s things that shouldn’t be null that are usually null for some reason. I actually spend a lot of time on finding out how duplicated the data is. This is really important because if you train a model or something on that, duplicates sometimes screw up the weights of certain things. So, getting rid of duplicates is pretty important. A couple ways I do this is you can do “distinct” in Pig, you can do a specialized group-by in MapReduce. I spend a lot of time looking at duplicates, and the metric that I’m looking for after I de-dupe the data is how many records I had dropped, and I kind of get a percentage of how many records of the data set are duplicated. And once I find out if it’s bad, I look at why the data is duplicated, because you’d be surprised. Maybe it’s just the all-null line is the duplicate, in which case that’s not really too interesting.
(40:38) One time in a cybersecurity use case I found duplicates, and I counted the times that things were duplicated – this is another thing I do with looking at duplicates is – okay, records are duplicated and I removed 30% of them, but of those, which one are 2X duplicates, 4X duplicates, 5X duplicates, if that makes sense. As in, this record I saw three times, this record I saw four times, and I count that. In this one case, I found that there were a lot of 2X duplicates, which was not surprising. There were a decent amount of three, there were no four, no fives, but there were a ton of 6X duplicates. Now, that’s weird. Why the hell would there be no 4X duplicates and 5X duplicates? Well, it turns out – I don’t completely understand networking entirely, but what they told me was that – I pulled out the 6X duplicates, and it was a pretty small portion of the data set, but what they found was that there were some bug in the routing rules that they had for this one office where it was doing this round robin over all the top – it was going around all the top routers before it was finally getting back to where it needed to go. So, they found this major routing problem for this one office, and hopefully those people in that office now have faster Internet.
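The two duplicate metrics described above, the percentage of records dropped by de-duplication and the breakdown into 2X, 3X, 6X duplicates, can be sketched in Python as follows. It assumes records are hashable values such as strings or tuples; the function name is made up for the example.

```python
from collections import Counter


def duplicate_profile(records):
    """Return (percent of records dropped by de-duplication,
    {times seen: number of distinct records seen that many times})."""
    copies = Counter(records)  # record -> times seen
    total = sum(copies.values())
    distinct = len(copies)
    dropped_pct = 100.0 * (total - distinct) / total if total else 0.0
    multiplicity = Counter(copies.values())  # times seen -> distinct records
    return dropped_pct, dict(sorted(multiplicity.items()))
```

A gap in the second result, such as plenty of 2X and 6X entries with nothing at 4X or 5X, is the kind of pattern that pointed to the routing bug in the story above.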
(41:50) Other kinds of things I look for: things that should be normalized, like am I dealing with a string that needs to be uppercase or lowercase. If you’re dealing with things like people’s names, that could be a real pain in the ass if it’s freeform, so you’re dealing with last name comma first name, are things capitalized. You start getting people’s names with like eñes and stuff. How do you deal with that? There’s all kinds of crazy things that you can look at that should be normalized, and that’s another thing I look at. And sometimes you can look at things and they don’t need to be normalized.
(42:24) (See slide 19) This is one I ran into recently. There was a unique identifier that was a six-digit code, and for some reason, in some of the columns there were spaces here. So, when you do a group-by and I’m trying to group things by things, the spaces actually make it look like a different key to the computer. So, I had to go strip out these spaces. I noticed it because my recommendation system was making me a recommendation for “abc” and “abc ,” which like, what the hell is that. And then I fixed it and my recommendation system was much better. Evaluating cleanliness is a big part of exploration because you don’t know how clean your data is going to be until you get there. Like I said, I do data science services. So, data exploration – I kind of wish I could get my client to pay me in two increments, like pay me at the end of data exploration and then you can decide to pay me more or not, because me, as a consultant, I have to guess going in what I’m going to find. Let’s say they wanted me to write a classifier on predicting how many customers they’re going to lose next month, and they give me a bunch of data that’s corrupt. What am I supposed to do in that situation? This is a nightmare that I live every day.
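The effect of stray spaces on a group-by is easy to reproduce. The Python sketch below groups (key, value) pairs with and without normalization; trimming leading and trailing whitespace with `strip()` is an assumption about where the spaces were, since the slide is not reproduced here.

```python
from collections import defaultdict


def group_by_key(pairs, normalize=False):
    """Group (key, value) pairs by key, optionally trimming whitespace
    from each key first."""
    groups = defaultdict(list)
    for key, value in pairs:
        if normalize:
            key = key.strip()
        groups[key].append(value)
    return dict(groups)
```

Without normalization, "abc" and "abc " land in different groups, which is how the same item ended up being recommended alongside itself.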
(43:42) What’s the point of all this stuff? What does this have to do with Hadoop and data science? The point here is that Hadoop is really good at this. Actually, this is one of the cases where you should maybe use Hadoop when you don’t really have to, because if I was doing all that exploration stuff that I was talking to you about, and I was doing it in R, let’s say on my local machine or my laptop, that may take time, more time than I want to take. Instead, if I have this 10 node Hadoop cluster, 20 node Hadoop cluster, my queries are getting back in one, two, five minutes or something like that.
(44:23) The next thing is, you probably have a lot of data, and a lot of it is garbage. What I mean by that is you don’t really need a lot of the data, but you have a lot of it. It’s almost this kind of problem where you don’t know what data is valuable until you explore it, so you have to collect all of it first, and if you don’t have all of it first, then you don’t really know what it is yet; Hadoop is good for that. My next point is – really, I hope I’ve drilled this home enough; really take the time to do this, because it’s going to make your life a lot easier. When I talk about this pristine data science environment where I’ve got all these – I imagine myself in this futuristic movie where all the visualizations are coming at me and I’m typing stuff and things are returning – I press enter and shit returns back really quick. Like, I can’t wait until they have a data scientist in a movie and he runs a query and it returns back the answer, when in reality in the movie, you’re sitting there waiting like, “alright, you want to go get lunch? I have to wait for my MapReduce to finish.” So, I guess what my point is is if you spend a lot of time the data cleansing and the data exploration stuff – you can set that environment up. You can build these smaller environments that are a little bit faster. You can set it up so that it’s fast. You can set it up in HBase so that it’s fast and things like that.
(45:50) And then the next thing, obviously this is true with Hadoop or without, it’s hard to tell what method you should use until you explore your data, so it’s really important.
(47:11) Getting into the first thing that’s kind of like data science – so, classification. A little bit of background on classification in case you don’t know what classification is. Basically, what I’m going to do is I’m going to have a feature vector that explains to me about what something is, and then I’m going to have some sort of label that tells me what I’ve categorized and said what that thing is. One example that I’ll never forget – this was one of the first things that I did in machine learning class several years ago – basically, I had these features about the day, and I wanted to learn whether or not I was going to play tennis. Example here is the feature vector could be: it’s sunny, it’s Saturday, it’s summer outside, and so I’m going to play tennis. Meanwhile, the other one; it’s rainy, it’s Wednesday, and it’s winter, so I’m not going to play tennis. And the way I train this is I’m going to have all these examples of days that I’ve played tennis and days that I did not play tennis. I’m going to train this classifier so that when I plug in a new feature vector for a day that I don’t know yet, it’s going to tell me whether or not I’m going to play tennis that day. Obviously, you can do all kinds of crazy stuff with this.
(48:17) Here’s a bunch of caveats. Most classification algorithms are not easily parallelizable, or somebody hasn’t put forth the research effort to try to figure out how to do it. So, the number of classification algorithms you can run at scale is actually pretty limited, and they’re the ones that aren’t really that great. The next one is you need a good training set, and this is another luxury that is not always the case. Not only do you need labeled data, but for the most part, you need good labeled data, too. Good, complete labeled data. This is something that I’ve only had happen in a few of my engagements. In some cases, you have the luxury of doing this and everything is dandy. Classification in parallel is kind of really hard, and I’ll talk a little bit about why maybe it’s not so much of a problem. When I look at the overall classification workflow, when I’m working at a site and I’m going to build a classifier, I do some data exploration first, like I’d mentioned in the past, and I’ll try lots of different methods, and I think this is kind of true with most Machine Learning problems that I tackle, whether they’re in Hadoop or not; I’m going to try three or four methods. Maybe I’ll have an intuition about what will work and what won’t, but for the most part, I’m trying lots of different things. One thing that I run into a lot with clients is they have a lot of requirements about what these black boxes are returning. Like, one common ask is I want a confidence score; I want it to return to me and say, “I’m fairly certain that this is true.” Things like logistic regression will give you a score, for example. You can even extract scores from things like Naïve Bayes and things. If I’m talking about a support vector machine – maybe that’s a little bit harder… decision tree! Maybe I could extract some sort of score. They’ll impose requirements like that, and that kind of narrows down the scope of things I can try.
Other things are like, maybe they want to understand why it made that prediction, so neural nets are out. Actually, decision trees are really good for that because then you can show – everybody understands a decision tree. Then you deal with a random forest and you’re like, well, all these thousand decision trees decide if this is the case, that doesn’t really work.
(50:43) Once I try a bunch of things, then I sit there and refine the promising ones. In this kind of larger stage, I’m context switching in and out of Hadoop. I may be doing a lot of stuff just in Weka on just a sample, maybe. There’s lots of different ways of doing things like this. The next one is, in general, how you train a model; I’ve got some data, and then I’m going to do feature extractions, so I’m going to pull out those feature vectors like “sunny,” and “winter,” right? Feature extraction, for example, isn’t necessarily one-to-one, so it’s not like I’m pulling data – feature vectors, there’s a lot going on with feature vectors. Data might be a bunch of records about people, and they’re not like – I can’t look up a person and have all the data about that person; I need to collect all that data about all the people and then build the classifier. The feature vector, that’s an involved process. It’s not just like copying data.
Data volumes in training
(51:52) I’m going to kind of go in to talk about – something I want to point out is data volumes in training, when I’m training stuff. As a general rule of thumb, there are very few MapReduce jobs that I’ve ever seen that create more data than was put into them. Data is going into a MapReduce job, usually the amount of data coming out is less, ignoring compression going in, compression going out. So, there’s no tricks like that. For the most part, you’re taking unstructured data and you’re imposing structure to it, it’s getting smaller, and only extracting pieces of data that I want. I’m filtering out records. It’s kind of hard to make data. I mean, there are certain cases in joins and cross-products where that happens, but for the most part, after I run a MapReduce job, the data is getting smaller. Let’s kind of step through a process of how I would train a model.
(52:47) When I’m first there, I have a ton of data, and the first thing I need to do is feature extraction. So, I’m going to do feature extraction, and what feature extraction is going to do is I’m going to take the raw data, and I’m going to group by the entities. Let’s keep going with this person example. I’m going to group by the person’s identifier, maybe it’s the person’s name, and I have all the records of each individual person together. So, let’s say that I’m a healthcare company, I’m going to have a bunch of medical records, but the medical records are stored in the system in the order of which they arrived, not by person. I need to get all those medical records together by person. And then once I have all those things together, then I can start building a feature vector so I can do things like, what’s this person’s resting heart rate, what is this person’s cholesterol level, has this person ever had an MRI, has this person ever broken a leg, etc, etc, etc. These are kind of numbers. I can also have aggregates; how many medications is this person taking, which I can’t just find from one record. That’s from the collection of records. So, I’m building this feature vector, and the big thing that I want to point out here is – this is a ton of data; this is an arbitrary amount of data; that’s why we’re using Hadoop. Here (pointing to slide 24), this is only as big as the feature vectors I have, and that really depends on the use case.
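The group-by-entity step described above can be sketched in Python: records arrive in arbitrary order, get grouped by patient, and each group is reduced to one feature vector. The field names (`patient_id`, `resting_hr`, `medication`, `procedure`) are hypothetical, chosen only to mirror the examples in the talk.

```python
from collections import defaultdict


def extract_features(records):
    """Group raw records by patient, then build one feature vector each."""
    by_patient = defaultdict(list)
    for record in records:  # "map" side: key every record by patient
        by_patient[record["patient_id"]].append(record)

    features = {}
    for pid, recs in by_patient.items():  # "reduce" side: one vector per key
        heart_rates = [r["resting_hr"] for r in recs
                       if r.get("resting_hr") is not None]
        medications = {r["medication"] for r in recs if r.get("medication")}
        features[pid] = {
            "avg_resting_hr": (sum(heart_rates) / len(heart_rates)
                               if heart_rates else None),
            # An aggregate that no single record can answer on its own.
            "num_medications": len(medications),
            "ever_had_mri": any(r.get("procedure") == "MRI" for r in recs),
        }
    return features
```

The output has one entry per patient regardless of how many raw records went in, which is why the feature set is usually far smaller than the input.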
(54:03) Here are some examples. Let’s say I’m running a cybersecurity use case internally at some large corporation. They only have, let’s say, 9,000 unique IP addresses within the company, okay? That’s kind of reasonable maybe if they have 1,000, 1,500 people in that company. So, I’m talking about having 9,000 records, 9,000 feature vectors. I don’t need Hadoop for 9,000 records. The next one up; let’s say I’ve got medical records. I’ve got 10TB of medical records. Medical records are kind of big, maybe some of them are images; there’s all kinds of things that can bloat medical records. I’m going to distill it down to 50M patients. 50M records is kind of like – aw man, that’s going to take a while, but I’ll still do it. So, this is kind of like, I could go both ways, but if I had 5M patients only, then it wouldn’t be that big of a deal. Alright, the next one. I’ve got 10TB of documents that got distilled into, let’s say – I don’t know how many documents it was, but it ended up becoming 5TB. So I did something like normalization, tokenization, maybe this process is building word bags, which is a common way of building a feature vector for a document. Word bags aren’t going to be that much smaller than the actual text. In which case, then, you know, I’m screwed; I can’t.
(55:32) And then, once I train the model, the model itself is actually really tiny. The model is some sort of rule. Maybe it’s a parametric model, I don’t know; it’s pretty small, so I can do whatever I want with that thing. And then the next thing is, okay, I’ve got the model, now I’m going to go run it over all my data. That is another big data problem. What I’m trying to point out here is there’s things that should be done in Hadoop and things that should not be. If I’m dealing with lots of data volume here, use Hadoop. If you’re not using big data volume here, then don’t. So I guess my point is that a lot of the engagements that I’ve done in the past, I’ve used Hadoop for feature extraction, and then I pull that piece of data down, run some classifiers in R, get my model, push that back up – the model I built – and then use Hadoop to run that model against all of my data. I would say this is more common than using Hadoop here (points to narrow end of cone on slide 26), in my experience.
(62:17) Here are some hurdles from the problem before. One hurdle that I run into a lot is, where do I run that non-Hadoop code? I’ve mentioned that I’m doing feature extraction, pulling that thing down, and then I run R over it or whatever, and then I send that back up. Where is that little non-Hadoop code piece running? Is it on one single node? If it’s on one single node, then we have to worry about node failures, right? Now we’re going outside of the box and we have to start worrying about problems, and I’ve seen that solved in a number of ways. I’ve seen that solved like, you spin up a MapReduce job that only has one Mapper and then in that one Mapper, you run the task. I’ve seen things like having some sort of – if you’re in a virtual infrastructure, like you have an internal vSphere or you’re using Amazon or something, but you can spin up a new node to run that task. I’ve seen that be a solution to that problem. But that’s something, operationally, that you need to think about because it’s outside the scope of Hadoop.
(63:12) The next one is, I’ve built this entire pipeline, and I’ve done two things. One is, I as a data scientist am sitting there and I’m building some sort of report, or I’m answering a question for just one time. So, somebody’s trying to do a data pull and they need to know one answer, and then they never need to know that answer again. In that case, I don’t need to worry about operations or anything like that. I just do it, I get the answer, I put it in Excel and I send it off. One of the problems is, say I’m trying to get that answer every day, so I’m running those classifiers and stuff over the data every day, where do I host out that result? Hadoop is just a data processing platform; it doesn’t necessarily host things out. So, a couple of things that I’ve seen done here is you can use HBase, for example, to store the data. That has a little bit more interaction to it. If the amount of results you’re storing out are relatively small, then you can deal with putting it in a Postgres database or Redis or something, but that’s something you’d have to think about. What I’m saying is, that’s outside the scope of Hadoop; that’s something you have to take care of yourself.
(64:14) Ok, so I trained a model, and I want to run that model over streaming data, meaning okay, I’ve got the model, I want to be making predictions based on the data coming in. How do I operationalize that? One of the problems is, that process I showed you earlier for feature extraction where I’m extracting features from people and collecting all this data about people, and then I’m grouping them together – that’s hard. Am I running the MapReduce job over all the data every time a new record comes in? Probably not. So, building, updating the feature vectors is, again, something you may have to use HBase for, or something along those lines. It’s something you have to figure out. If your model just deals with individual records, then it’s not so much of a big deal. You can also do things like batching, so maybe your customer would be satisfied with getting the answer every hour over an hour’s-old data. That’s usually the easiest way to solve this problem is okay, every hour, I’m going to run this MapReduce job that creates all of the model predictions and all the new records.
(65:22) The next one is automating the performance measurement. If you can just deploy a model – so this happened to me in a retail place one time – you can deploy a model, and let’s say detecting anomalies in data, and then let’s say for some reason the data changes. So, in retail, what had happened was Black Friday happened. So, my model that I had trained, it trained again, but it was completely bogus because the performance was really low. There’s lots of ways to measure performance of your things and some sort of percentage in terms of accuracy. You kind of want to bake that into your operational dataflow of your model training because every now and then you’ll get a day where you re-train the model, and the accuracy sucks, for one reason or another. You just want to be prepared for when that happens, and that’s all. Some days models just aren’t going to work, or sometimes you need to re-evaluate, okay, the model’s been sucking recently; we need to improve the performance.
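One way to bake a performance check into a retraining flow, as suggested above, is to score every retrained model on held-out labeled data and only promote it when it clears a bar. The Python sketch below shows that gate; the 0.9 threshold and the function names are placeholders for the example, not values from the talk.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the held-out labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("need equally sized, non-empty inputs")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def should_promote(predictions, labels, threshold=0.9):
    """True if the retrained model is accurate enough to replace the
    one currently in use; otherwise keep the old model and alert."""
    return accuracy(predictions, labels) >= threshold
```

On a day like Black Friday the check would fail, and the pipeline could keep yesterday’s model rather than silently shipping a bad one.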
(66:25) A kind of other miscellaneous point, one thing I really like doing in Hadoop sometimes is – okay, I might be building one model, but in certain situations, I’ve built a model for every user, let’s say. So, you as an individual user have a model about what you do. In some of the cybersecurity use cases, I’ve trained a model about what’s anomalous for you as an IP address. Basically every day, I plug in your actions into your own model and see if you fit within your model or not. But in order to do that, I need to train something like 10,000 or 5,000 or whatever different classifiers, and that is a parallelized problem; that’s easy to be parallelized. So that’s a kind of neat thing that you can use Hadoop for, which is neat.
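The model-per-entity idea above can be illustrated with a deliberately simple stand-in. Instead of a real classifier, the Python sketch below fits a mean and standard deviation per IP address and flags values more than a few standard deviations away; the z-score rule and the threshold of 3.0 are assumptions for the example, not the models used in the engagements described.

```python
import math
from collections import defaultdict


def train_per_entity(observations):
    """observations: iterable of (entity, value) pairs.
    Returns {entity: (mean, population std)}, one tiny model each."""
    grouped = defaultdict(list)
    for entity, value in observations:
        grouped[entity].append(value)
    models = {}
    for entity, values in grouped.items():
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        models[entity] = (mean, std)
    return models


def is_anomalous(models, entity, value, z_threshold=3.0):
    """Judge a new value against that entity's own history only."""
    mean, std = models[entity]
    if std == 0:
        return value != mean
    return abs(value - mean) / std > z_threshold
```

Each entity’s model depends only on that entity’s records, so the training loop is the part that spreads naturally across reducers keyed by entity.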
(67:15) What’s the point of this whole tangent I went on? I’ve drilled this in; not all stages of this building pipeline need to be done in Hadoop, and if you really need to, Mahout has a couple of classifiers that – like Naïve Bayes is one of them that I’ve used in the past. So, Mahout has a few that you can take a look at. And then two, there’s been some academically documented parallelized machine learning algorithms that you can take a look at as well.
Natural language pre-processing
(70:21) The next one is natural language processing. Natural language processing comes up a lot because you’ve got text data a lot, for some reason. Text data seems to be pretty common these days. I do a lot of stemming, lexical analysis, parsing, tokenization, normalization, removing stop words, spell check, things like that. These are not things that I want to re-implement – I don’t want to re-implement porter stemming every single time, okay? So what I’m going to use is I’m going to use NLTK or – I’m going to give two examples, one with NLTK and one with OpenNLP.
NLP Tools: Python, NLTK, and Pig; OpenNLP and MapReduce
(70:58) NLTK is this Python library that a lot of people like. I like it. I like Python, and so I started using NLTK. I like NLTK a lot. Like I mentioned earlier, Pig is factored on top, and Pig allows you to stream data through arbitrary processes. So, I can write a little program that does NLTK stuff, pass that data in, do the things like porter stemming and stuff, output that data, and then it goes right back into my MapReduce pipeline. Another option is to use UDFs, user-defined functions, but one of the downsides of that is it uses Jython, and Jython doesn’t always work with the Python packages as well as you would like; but, it’s a little bit faster than this, I guess. And the Mortar guys have done something cool with this, too.
(71:48) So I guess what the point here is, you’re going to use Pig to move your data around. You’re using it to group things and do that, but you’re going to use Python and NLTK to really plug in and do the text-processing pieces of it. And the way you would do that in Pig is using the STREAM keyword here, so I’m going to stream my data through this Python script.
(72:11) The next one is OpenNLP and MapReduce. OpenNLP is an Apache project that has Java API so you can basically write this MapReduce job. I’ve never done this, but I have a friend that uses OpenNLP, and he’s used this before. He plugs in OpenNLP — like I said earlier, it’s just Java, so if you have a library that works in Java, you can plug it in there. So, if you were writing Java MapReduce, you can just plug in OpenNLP and usually do things in the Mapper of processing data.
TF-IDF approach to NLP
(72:42) One of my favorite kind of approaches to do, just because it parallelizes well and the results are usually interesting, is TF-IDF. It stands for Term Frequency, Inverse Document Frequency. The TF is how common is the word in the document, and then IDF is how common is the word everywhere else, but the inverse of that. What that’s going to tell you is I’m only going to care about records that are — or, words that are particularly unique to this. The example I have down here is “The quick brown fox jumps over the lazy dog.” If you imagine — if you install this string in a document compared to, let’s say, the rest of the English language, the words that should stand out are probably “fox,” and “dog.” Maybe fox would stand out the most; I would imagine dog is a little bit more common than fox. “The” is basically going to be completely worthless every time. “Jumps,” “lazy,” “quick,” “brown,” those guys are going to be like kind of middle. So, this is really neat because it will allow you to pull out what I think the topics of this thing are. If you’re doing news articles, people’s names will come up at the top of the list, things like that. And I like it because it’s parallelizable. So I kind of use this actually in the data exploration step a lot of the time. I’ve actually used TF-IDF for non-text. The idea is pretty straightforward; I’ve got commonality in one subset and then commonality in this larger subset.
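For reference, one common formulation of the score is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf is the term’s share of the document, N is the number of documents, and df is the number of documents containing the term. The Python sketch below uses that formulation on pre-tokenized documents; other weighting variants exist, and nothing here is specific to how the speaker computed it.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists.
    Returns one {term: score} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores
```

A word that appears in every document, like “the,” gets log(1) = 0 and drops out, while a word confined to one document floats to the top. Both passes are per-term or per-document counts, which is what makes the approach a good fit for MapReduce.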
(74:10) Something I’ve done a lot with text as well is OCR, or speech-to-text. One thing to take a look at here: we’ve done OCR over, let’s say, medical records. One of the issues here is that I’ve got lots of medical records. Hadoop is not going to help you parallelize it so that your OCR is faster on one document; what it will make faster is doing OCR on lots of documents. It’s a pretty important distinction. It’s not like I’m going to say, OCRing this one document only takes a minute; I’m not going to throw it on Hadoop and it’s going to take 10 seconds. What’s going to happen is I’m going to be able to do 10 documents at once in one minute. And same with speech-to-text or any sort of media processing thing.
(74:54) So what’s the whole point of this one? I didn’t want to talk about NLP too much. What I wanted to say was, hey, there are these libraries out there like OpenNLP and NLTK that you should use and you should figure out how to integrate them into your MapReduce job. So, whether it’s through Pig or through MapReduce, you have options, and you don’t want to write your own porter stemmer.
(75:26) I’ve been having a lot of fun with recommendation systems lately, and there’s a couple of reasons why. First of all, recommendation systems like a lot of data. Recommendation system says, I declare a preference for something — maybe it’s like Netflix and I like some movies, I don’t like some movies, and I’m going to come down and say, you should watch this movie or you shouldn’t. It’s kind of neat; once you start working with recommendation systems, you can start to really see which websites suck at it and which ones are good at it.
(76:02) Recommendation systems like a lot of data. They like a lot of preferences, and they like a lot of users, because people are pretty unique, so you need lots of people. And then also, too, I don’t always have all my preferences out there. The whole point of you recommending a movie to me is you know that I’ll like it; if I knew that I’d like it, I’d have watched it. So, that’s kind of why you need a lot of data. And then the other problem, too, is you want to make a lot of recommendations. I actually worked with a company that pre-built all of your recommendations for the day overnight. And then if you asked for that recommendation, it would have it ready for you, and actually a lot of websites do that. That’s kind of interesting because maybe you don’t — only a quarter of people will ever look at their recommendations, but when they want their recommendations, they want them now. So, pre-computing everybody’s recommendations for everything — that’s definitely what I’d call a big data problem. There’s a number of methods in Mahout for doing recommendation. There’s item-based, user-based. There’s Slope One, collaborative filtering. There’s all kinds of different versions of it.
(77:25) One in particular I’ll be talking about is collaborative filtering. I like Mahout for the early stages because I can get started with it easily, but I think if you really, really get serious about recommendation systems, most people kind of start writing their own, or they customize Mahout or something like that.
(77:30) The number one thing that I really like about collaborative filtering is that the system doesn’t have to know what it’s doing. It doesn’t need to know what the items are. There’s all kinds of funny images on the Internet about like, you bought World of Warcraft, so you might like KY Jelly or something like that. So, collaborative filtering doesn’t discriminate. It doesn’t care. All it knows is there are some correlations between two things, and it’s going to recommend it to you, and it may be embarrassing, or it may be good. It’s neat because it’s not caring about what the items are. All it’s looking at are the relationships. And the reason this is important is because relationships are really easy to extract. The fact that I bought something — extracting that from my data is really easy. Extracting good features is really, really hard. Now that I’m taking a look at preferences instead of features — this throws a whole ton of shit out the window that I don’t have to worry about anymore. That’s why I like recommendation systems. You can kind of get them to do the same things; you just need to frame the problem as a recommendation problem. I’ve solved problems that aren’t recommendation, but they fit the recommendation model.
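To make the "all it's looking at are the relationships" point concrete, here is a minimal item co-occurrence sketch in Python. It is not Mahout's implementation; the purchase data, the function names `cooccurrence` and `recommend`, and the simple count-based scoring are all assumptions chosen for illustration. Note that the code never inspects what an item actually is.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(purchases):
    """Count how often each pair of items shows up for the same user."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in purchases.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def recommend(user, purchases, counts, top_n=3):
    """Score unseen items by how often they co-occur with the user's items."""
    owned = set(purchases[user])
    scores = defaultdict(int)
    for item in owned:
        for other, n in counts[item].items():
            if other not in owned:
                scores[other] += n
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

# Hypothetical preference data: user -> items they bought.
purchases = {
    "u1": ["A", "B", "C"],
    "u2": ["A", "B"],
    "u3": ["A", "D"],
    "u4": ["B", "C"],
}
counts = cooccurrence(purchases)
```

Here `recommend("u3", purchases, counts)` ranks "B" ahead of "C", purely because "B" was bought alongside "A" more often. The only input is who bought what, which is the kind of relationship that is easy to extract at scale.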
(78:41) The one thing you can do, though — collaborative filtering is based on similarity of users and items, so you can bake in features like maybe demographic information, where they live, how old they are, into the similarity metric. So you can kind of incrementally fold in features. In the holistic view, you just look at preferences, but you can fold in the features in parts of the framework.
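One way to picture folding a feature into the similarity metric is a weighted blend of preference overlap and a demographic match. This is only a sketch under assumptions: the Jaccard measure, the single `region` feature, the `weight` of 0.8, and the user records are hypothetical and not something described in the talk.

```python
def jaccard(a, b):
    """Overlap of two preference sets, between 0.0 and 1.0."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def blended_similarity(u, v, weight=0.8):
    """Mostly preference similarity, nudged by one demographic feature."""
    pref = jaccard(u["likes"], v["likes"])
    demo = 1.0 if u["region"] == v["region"] else 0.0
    return weight * pref + (1 - weight) * demo

# Hypothetical users: bob and carol like the same things but live in different regions.
alice = {"likes": {"A", "B", "C"}, "region": "east"}
bob = {"likes": {"B", "C", "D"}, "region": "east"}
carol = {"likes": {"B", "C", "D"}, "region": "west"}
```

With these records, bob and carol have identical preference overlap with alice, but bob comes out as the more similar user because he shares her region. Preferences still dominate the score; the feature only adjusts it.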
(79:05) So what’s the point here? One is recommender systems, just by the nature of them, require a lot of data, and they require a lot of processing to perform all the recommendations. So, Hadoop is a natural fit just because it’s a lot of data. If you’re doing a recommendation system on a small amount of data, you’re going to have trouble; it really needs a lot of data. The next one is, relationships are pretty easy to extract with Hadoop at scale. I’m going through all the data, extracting those preferences out. And like I said, if you can fit your problem, whatever it is and whether it’s a recommendation problem or not, into the recommendation framework, you can do something interesting.
(81:26) Graphs are somewhat related to recommendation systems, but they can do other things. A graph is basically entities and connections between entities. If you imagine a time-series database, what you’re doing is you’re orienting data by time, and you exploit that fact. What a graph-oriented data store does is orient things by relationships. I don’t necessarily care so much about what those entities are; I care more about the relationships, and I make it so analyzing those relationships is easier.
(82:07) There’s one project called Giraph that is still really early-stage. I’m still struggling with pieces of it, but it seems like it has promise. Basically, it will do a number of things like shortest path and things like that. The next thing I want to point out is — so, some of the guys at the NSA published this thing called Graph 500 which is a metric, and they explained how they did that in Accumulo. You can build graphs in HBase and Accumulo. I’ve seen that done before, and that’s pretty cool. So, one of the things you can do with graphs is subgraph extraction. Subgraph extraction is something like, let’s say I had — so, this is my graph on LinkedIn (see slide 39). This is just me, so what this shows is — it clusters my people together. Here is, if I’m not mistaken, this is ClearEdge; these are Hadoop people that I don’t work with; I think this is people I knew at GreenPlum. This is interesting, but the way they group them together is by looking at each person’s relationship to the other people in my network. There’s nothing inherent with me in that. The way they calculated that is who are my friends and who are my friends’ friends, and which of those friends’ friends are my friends? So, that’s the kind of analysis you might do in a graph-type database where you’re extracting out little subcommunities.
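The "which of my friends' friends are my friends" question can be sketched on a small in-memory graph. The version below simply takes connected components among one person's friends as a stand-in for subcommunities; that choice, the function name `ego_communities`, and the toy graph are assumptions for illustration, and this is not a claim about how LinkedIn computes its clusters.

```python
def ego_communities(graph, me):
    """Group my friends by the connections they have with each other."""
    friends = graph[me]
    seen, groups = set(), []
    for start in sorted(friends):
        if start in seen:
            continue
        # Walk outward from this friend, but only through my other friends.
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend((graph[node] & friends) - group)
        seen |= group
        groups.append(sorted(group))
    return groups

# Hypothetical undirected graph: node -> set of neighbors.
graph = {
    "me": {"a", "b", "c", "d"},
    "a": {"me", "b"},
    "b": {"me", "a"},
    "c": {"me", "d"},
    "d": {"me", "c", "x"},
    "x": {"d"},
}
groups = ego_communities(graph, "me")
```

For this graph the result is two groups, `["a", "b"]` and `["c", "d"]`. Node "x" is left out because it is a friend of a friend but not a direct friend, and "me" is excluded so that the grouping depends only on how the friends relate to one another.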
(83:49) Another one is missing edge recommendation. You can kind of look at the recommendation problem as I’m recommending an edge. Like, hey, it would really make a lot of sense if this edge existed in your network, and that’s kind of a way to do recommendation. Another thing that I really like about graphs is they really build cool visualizations. I can’t even count the times that I’ve built a graph that’s basically completely, practically worthless, but the executives were like, oh that is the coolest thing I’ve ever seen. And then summarizing relationships, you can kind of infer some things like hey, I knew a lot of people at GreenPlum, but ClearEdge is a lot smaller company. There’s a bunch of people that I’m friends with but I don’t really interact with at work and things like that.
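Missing edge recommendation can be sketched by counting common neighbors: a candidate edge looks more plausible the more neighbors its two endpoints already share. Common-neighbor counting is one simple heuristic chosen here as an assumption, and the graph and the name `suggest_edges` are hypothetical.

```python
from collections import Counter

def suggest_edges(graph, node, top_n=3):
    """Rank non-neighbors of `node` by how many neighbors they share with it."""
    neighbors = graph[node]
    counts = Counter()
    for friend in neighbors:
        for candidate in graph[friend]:
            if candidate != node and candidate not in neighbors:
                counts[candidate] += 1
    # Most shared neighbors first; ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]

# Hypothetical undirected graph: node -> set of neighbors.
graph = {
    "don": {"ann", "bob"},
    "ann": {"don", "cat", "dan"},
    "bob": {"don", "cat"},
    "cat": {"ann", "bob"},
    "dan": {"ann"},
}
suggestions = suggest_edges(graph, "don")
```

Here "cat" is suggested first for "don" because both of don's neighbors know cat, while "dan" shares only one neighbor.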
(84:33) The other thing I wanted to mention was clustering. Clustering is another thing that’s in Mahout, but you can do clustering in other things. I’m not a huge fan of clustering; I think I have a hard time extracting actual value from clustering, but clustering is still interesting, I guess is the best way of putting it, in my experience. I like to run something like k-means clustering and see what clusters pop out. It kind of helps you understand the data. But in terms of training a model, I’ve had mixed results. There’s a number of clustering algorithms, like k-means, for example, that parallelize really well. So, that’s good.
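A small pure-Python sketch of k-means shows why it parallelizes well: assigning each point to its nearest centroid is independent per point (a map-like step), and recomputing each centroid is an average per cluster (a reduce-like step). The points, the starting centroids, and the fixed iteration count are assumptions for the example; this is not Mahout's implementation.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means with caller-supplied starting centroids."""
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D data with two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
```

On this data the two clusters separate after the first pass, which matches the exploratory use described above: run it, look at what pops out, and use that to understand the data.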
R and Hadoop
(85:06) The other thing I wanted to mention was this project — there’s a couple of projects that mix R and Hadoop so you don’t have to pull it out and context-switch. There’s RHIPE, and then there’s RHadoop, and what they basically allow you to do is write MapReduce jobs in R. So, it’s like this layer on top, and I can write my Mapper and Reducer — instead of writing it in Java, I write it in R. And the reason why that’s interesting is not because I like R better than Java — well, I do — but, it’s interesting because I can integrate with all the things I would want to do in R. So, maybe if I want to train a classifier in the Reduce phase, I can just do that right there in R. You can also use Hadoop streaming, which is something I’ll get into, to use R as well. Don’t be fooled; this doesn’t automatically parallelize all your R code. It’s not R with Hadoop underneath. It’s kind of the other way around. It’s that you’re writing Hadoop stuff with R.
(86:12) To wrap up, basically, Hadoop is good at certain things and you kind of have to do the rest.
SQL and key value stores vs. Hadoop for data science
(86:48) SQL is a good data analysis language. It’s been around for a long time; that’s why they built it. So, SQL is really good. I think the reason I don’t like SQL is because of the cool things that I mentioned about Hadoop, like not having to ETL things — so, there’s Hive, for example, that I mentioned, that gives you a SQL interface. You have to prepare data for Hive. So, in certain organizations where you have a bunch of data scientists that don’t necessarily want to go and run a bunch of MapReduce, they like a pristine environment — what you’ll do is you’ll have these — I guess they call them “data engineers” sometimes — you use Hadoop as kind of this pre-processing platform to put it into a system like Hive that maybe people are more productive in. There may be this step where I prepare data in Hive and then I do my machine learning stuff and calculations for like TF-IDF and stuff.
(87:43) Key value stores are interesting. So, key value stores aren’t going to go and replace your SQL databases. Key value stores give you a little bit more flexibility, like flexible schemas and things like that, but you’re pretty much limited to either fetching a certain record or doing range scans or maybe a full table scan. So, your access patterns are kind of the thing — you have to think about it ahead of time, like how I’m going to store the data. You have to mold the data store to the analytic that you’re running, is I guess my point. You still need to be thoughtful about how you’re setting that up. It won’t really solve your problems just by being a key value store; you still have to think, I guess, is the point.
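The point about thinking through access patterns ahead of time can be shown with a toy sorted key-value store. The class `SortedKV`, the `patient|date` key layout, and the sample records are hypothetical; the sketch only mimics the three operations mentioned above (single-record fetch, range scan, full scan) and is not the API of HBase, Accumulo, or any other real store.

```python
import bisect

class SortedKV:
    """Toy key-value store that keeps keys in sorted order."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        """Fetch a single record by exact key."""
        return self._values.get(key)

    def scan(self, start, stop):
        """Range scan: every (key, value) with start <= key < stop."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]

# Hypothetical key design: patient id first, then date, so that one
# patient's events sit next to each other and come back in date order.
store = SortedKV()
store.put("patient42|2013-01-05", "aspirin")
store.put("patient07|2013-02-01", "ibuprofen")
store.put("patient42|2013-03-10", "statin")
events = store.scan("patient42|", "patient42|~")
```

Because the patient id leads the key, "everything for patient42" is one cheap range scan. A question like "everything on 2013-02-01 across all patients" would need a full scan with this layout, which is the sense in which the data store has to be molded to the analytic.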
Why not just use a graph database?
(88:39) I’ve played around with Neo4j; that sounds fantastic. I’ve done some pretty slick graph stuff in just Redis, for example. So, there’s a lot of graph databases, and they’re cool. In Neo4j’s case, for example, scaling is problematic, and I think a lot of the graph databases haven’t really implemented the graph algorithms with parallelism in mind. With a lot of them, you implement them slightly differently. Kind of going back to the problem of pre-processing, extracting the links between things from the raw data, that can definitely be something that’s done in Hadoop where we extract all the links out of the data. Imagine it’s like patients to medications. I’m going to extract all that data from my data in Hadoop, which is a lot of data, because Neo4j also — graph databases in general, because they’re oriented on relationships, they’re not really great at storing the actual data, like the actual record of the data about the event that caused this link. Or maybe there’s an individual record that created multiple links. In the email case, if I send one email, it could be to like three guys on the “To:” line and two on the “CC:” line. I’m creating five relationships from one record. So, a graph-oriented database isn’t going to store records well. I guess what I’m trying to say is yeah, you could do it with just a graph database, but you need other things, too.
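The email example can be sketched as a small link-extraction function: one record goes in, several relationships come out. The record layout (`id`, `from`, `to`, `cc`) and the function name `extract_links` are assumptions for illustration. In a Hadoop job, this would be the kind of logic a mapper applies to each raw record.

```python
def extract_links(email):
    """Yield one (sender, recipient, field, record_id) edge per recipient."""
    sender = email["from"]
    for field in ("to", "cc"):
        for recipient in email.get(field, []):
            # Keep the record id on the edge so the full record can be
            # looked up wherever the raw data is stored.
            yield (sender, recipient, field, email["id"])

# Hypothetical record: three "To:" recipients and two "CC:" recipients.
email = {
    "id": "msg-1",
    "from": "don",
    "to": ["a", "b", "c"],
    "cc": ["d", "e"],
}
links = list(extract_links(email))
```

This one record produces five edges. Only the edges would go to the graph store, while the record itself stays in Hadoop, which reflects the division of labor described above.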
Delegating and working with a data science team
(90:36) I’m the CTO at ClearEdge, so I delegate a lot. I think, with ClearEdge being a small company, I still have to do a lot of things myself just because I can’t find anybody else that can do it. As for what I think makes a really effective team — I posted a lot of crap on this stuff, and I posted a lot of stuff I don’t fully understand, honestly. And the reason for that is because it’s a huge space; you can’t find somebody that understands everything. I’ve dabbled a little bit in NLP, but somebody that’s done a lot of stuff in NLP is going to be an order of magnitude better at it than me. So, I think it’s about finding a team of people that can solve your problem. And I think when I look at a team, you need to have vertical and horizontal experts, I kind of like to call them. The vertical experts are going to give you insight on the data itself; they understand the data. Maybe they were ex-Oracle guys that have been working with this data. You need horizontal expertise, too. I have a feeling that I’m going to be doing stuff with graphs, I have a feeling that I’m going to be doing stuff with text, so I’ll need data scientists there. But then, those guys writing data cleansing code, writing some of the core MapReduce jobs, those could be delegated out to software engineers.
(92:42) I definitely like, from a consulting perspective, hiring people that have broader skills, and it’s more of a business reason. The business reason is, if somebody’s so focused on one random technology, he’s going to be spending a lot of time on the bench. If the guy can do a number of things — like, I could probably pass as an NLP expert, if I had to. That’s the kind of thing I look for, maybe, and then maybe you have one specific thing that you’re really good at. If you have the ability, building teams is really important, or else there’s just too much stuff going on. Also, it’s not just teams of data scientists. You need the operations guys. You need software engineers. Absolutely.
(93:50) It’s about the sum of the knowledge. Getting back to team building, I also like diversity a lot, because there’s a lot of insights going on when you’re working on this type of stuff, and you need a lot of different backgrounds. I usually don’t like to put somebody else on a team if they match up nicely to somebody else. If there’s too much overlap with somebody else on the team, I won’t put them on the team.