Don’s first presentation, given at the NYC Pig User Group Meetup, was entitled “Pig vs. MapReduce: When, Why, and How”. In his presentation, Don discusses how he chooses between Pig and MapReduce by considering developer and processing time, maintainability and deployment, repurposing engineers who are new to Java and Pig, and a number of other factors.
While there are numerous reasons to use Pig, I think a lot of people were surprised that the author of a book on MapReduce is such a huge Pig fan.
Here is a time-stamped overview of Don’s talk, along with the slides and video:
0:20 - Don’s background
3:46 - When should you use Pig?
5:11 - Why should you use Pig?
9:34 - Things that are harder to express in Pig
11:29 - Using Pig’s MapReduce relational operator for combining MapReduce and Pig scripts
14:04 - Calculating time you’ll spend using Pig vs. MapReduce (developer time vs. processing speed)
18:20 - Why is development so much faster in Pig?
21:50 - Speed to productivity in Pig vs. Java/MapReduce (repurposing engineers and SQL programmers)
24:36 - Leveraging UDFs with Pig
30:01 - Maintainability and Deployment of Pig vs. MapReduce
36:07 - Things that are harder to do with Pig
43:17 - Pig vs. Hive vs. MapReduce
56:30 - Pig/MapReduce analogies
58:22 - Presentation summary
Hope you enjoy Don’s presentation as much as we did. We’ll post Don’s talk at the NYC Data Science Meetup on “Hadoop for Data Science” soon. [Update: The slides and video for this talk are now posted here.]
Here’s the transcript:
(0:20) My talk is very plainly Pig vs. MapReduce, so hopefully it’s not too much of a surprise what I’m going to be talking about. A little bit of background on me, just to let you know where I’m coming from — I work for a company called ClearEdge IT Solutions. We’re a small boutique, primarily government contractor. You’ve been hearing about us, I guess, in the news a little bit. We as a company have been working with Hadoop for about four years now, and all of our projects now are Hadoop-related. One project in particular we’re almost 100% Pig, so we’ve been using Pig for a long time. As a company as a whole, we do a lot of that stuff, and now we’ve started working outside of the government as well. I work there.
(1:04) Some of the other stuff about me — I’m from Maryland, and I’ve lived in Maryland my whole life. I’m a Ravens fan. My dad is a Giants fan though, so I always watch the Giants games too, and I like it when they win as well — but I’d rather the Ravens win, unfortunately. I went to UMBC, which is a pretty small school in Maryland. I got my PhD there. I did some machine learning stuff before I even knew what Hadoop was, and then I worked at ClearEdge for a little bit and then I got the idea to write this book. Actually, I’m going to be able to weave this in a little bit into the talk. If you want to follow me on Twitter, it’s @donaldpminer. If you want to e-mail me, it’s firstname.lastname@example.org. I use StackOverflow a lot. I actually frequent the Pig one quite a bit and the Hadoop one quite a bit, so for some reason I get a good feeling about getting Internet points.
(2:01) Some of the technologies I particularly like — I like Python a lot, so I’ve been using Python for a long time. I like Hadoop a lot for some reason. I’m not sure why, but I just like it. And then I like Pig a lot as well. And then something I won’t be talking about — being kind of where I’m from, I use Accumulo a lot, which may be an interesting topic if you guys want to talk about it. So, that’s pretty much it about me.
(2:45) The other thing I kind of want to caveat here is that I don’t even attempt to make my slides look good, so don’t bash me about that; I really don’t even care. So what I’ll be talking about is, I guess — I want you to understand what, currently today, Java MapReduce is good at and what Pig is good at, and when you want to make the decision to use one or the other. You should hopefully walk away from this talk saying, “Oh, I actually have a better understanding of when I should use one or the other.” Other thing I want to caveat with this talk is this is — I’m actually kind of nervous about giving this talk because there’s a lot of opinion in this presentation and a lot of what I feel is right, and I know that’s not true in all cases. So, I may be making generalized statements that, hopefully, you can take into your own context and make your own decisions. I don’t believe I have all the right answers and stuff. I’m trying to be — I’m making a lot of generalizations here, I guess, and I realize that.
When should you use Pig?
(3:51) I built this flowchart (see slides 4 onward), and actually somebody asked me to do this one time, like a manager asked me to do this. This isn’t the one I gave him. So, the first question is, “can I use Pig to do this?” I’ve been presented with a problem, and I look at it and I say, “can I use Pig to do this, yes or no?” And if the answer is yes, then the answer is pretty easy; I’m just going to use Pig. There’s really no reason to not use Pig if you can use Pig. That’s one major thesis statement that I’m making here; I really think that if you can naturally use Pig for a problem, you should definitely use Pig.
(4:28) Let’s say the answer is no. I look at it, and you know — oh, doing that piece will be tricky in Pig, this will probably be slow in Pig. What’s the answer if the answer is no? Well, my answer is try to use Pig anyways. Okay, so I don’t think I can do it, but I’m going to try because sometimes you’ll surprise yourself — and I’ll explain a little why I’m getting at this. You ask yourself, “okay I went and tried it, did that work?” And then if the answer is no, then okay, fine I’ll use Java MapReduce. This is actually 95% of the time what I do. Even if I’m sitting there like, man, this really doesn’t make sense to do in Pig, I’m still going to try it, and I have some reasons why. The most obvious one is if you can do it in Pig and save yourself the pain, that’s really what this talk is all about so I’ll get into that more. I think one thing that it really boils down to a lot of the time — the developer time, is that worth more than your machine time? I mean, it’s not like you’re trying to do a cost-benefit analysis/return on investment every time you’re running a Pig job, but I think this is a pretty good rule of thumb. If it’s going to take me a week versus six hours, maybe I’ll take the six hours.
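The flowchart described above boils down to a very small decision procedure. Here is a minimal Python sketch of it; the function name `choose_tool` and its two parameters are made up for illustration and do not come from the slides.

```python
def choose_tool(looks_natural_in_pig, pig_attempt_worked=False):
    """Return the tool the flowchart from the talk would land on.

    looks_natural_in_pig: first question, "can I use Pig to do this?"
    pig_attempt_worked:   second question, asked only after trying Pig anyway.
    """
    if looks_natural_in_pig:
        # The easy branch: if Pig fits, use Pig.
        return "Pig"
    # Even when Pig looks like a poor fit, the advice is to try it first.
    if pig_attempt_worked:
        return "Pig"
    # Only fall back to Java once the Pig attempt has actually failed.
    return "Java MapReduce"


print(choose_tool(True))          # natural fit
print(choose_tool(False, True))   # awkward, but the attempt worked
print(choose_tool(False, False))  # awkward, and the attempt failed
```

The point of writing it out is that Java MapReduce is reachable by exactly one path: Pig looked wrong and the attempt failed.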
Why should you use Pig?
(5:42) The other thing, too, is that just trying it out in Pig is not that risky of a proposition, I think, so that is the major point. If I try it out in Pig, and I bang away at it for an hour and a half, did I really lose that much? If anything, I actually learned about my problem, perhaps. Maybe I understand a little bit more about my data. Maybe I know that I’m going to have some bottlenecks in my reducers, right, because of the way my Pig job ran. Maybe I know I’m going to have to use a custom partitioner, I don’t know, but I learn things by trying it in Pig and knowing it didn’t work.
(6:14) The other thing that kind of happens most of the time is I try it in Pig, and I know it wasn’t natural, but I try it, and I get it to work. And now you’re in this interesting situation where you’re like okay I’ve got this thing that looks hacky and it’s really slow, do I push it into production or not? And I think a lot of times we just put it into production. So, it looks hacky, but it’s a Pig script, so who cares? So it’s a little bit slower, but who cares? In some situations, what’s the difference between 12 minutes and 20 minutes? It’s an extra cup of coffee.
(6:50) The other kind of reason is not only is Pig faster, but it’s a different level of abstraction. (see slide 10) In general in software, you want to use the right tool for the job, so my analogy here is the screw is this big data problem that I have. Java MapReduce is more manual, but I’ve got a little bit more control. Like, I’ve definitely done some crazy shit with a drill before where I just went right through the wall or I hurt myself. So, I guess if I’m trying to do something carefully, personally, I’ll use Java MapReduce. Pig is kind of like the power drill. It will save me time. If it’s a good fit for using a power drill, it will work well, but maybe some problems aren’t good fits for power drills. I just decided to put HTML in it as a hammer. I don’t know why.
(7:37) Going into the next part of my talk, the first question that I think people — I think this is more of a force of habit. I think when somebody doesn’t know what question to ask, they always ask, “how fast is this?” “Oh, Pig seems nice; I have no idea what question I should ask, so I’ll ask how fast this is?” It kind of irritates me, but I think — so, the question is which is faster, Pig or Java MapReduce, and my answer to this question is — well, hypothetically, Pig is just MapReduce. So, if you think about it, MapReduce is like a superset of functionality over Pig. So, in theory, anything I write in Pig I could write in Java MapReduce and have it be faster, or at least as fast, hypothetically. But the way I kind of turn around the question is that — I think the true battle is the Pig optimizer versus the developer. Basically, is my developer that’s writing Pig at his desk, junior developer guy, is he smarter than the Pig optimizer in terms of optimizing his MapReduce? I’ve actually found that I trust the Pig optimizer more than some of my employees. I guess, my point is that a lot of thought has gone into what’s baked into Pig: how to run things and pushing things up and filtering things at different times and aggregating things in different ways.
(8:58) I actually took a screenshot of (see slide 11) — this is all the different tweaks that you can do to the optimizer in Pig in the documentation, and you can see that it’s quite long. They’ve got like 9 or 10 different optimization rules, and you’ve got performance enhancers and all this crap. So, if you’re thinking your junior developer is sitting here thinking about all these things, that’s not the case. My point is is that at the end of the day, that Pig script might end up being faster just because it was able to figure out things your developer didn’t, so that’s one way that Pig can definitely be faster.
Things that are harder to express in Pig
(9:34) One place that I think I see Pig being astronomically slower — I’m not talking about just a 10% or 20% penalty just for abstraction — is when you’re doing something that’s hard to express in Pig. So, Pig’s language is pretty simple, and sometimes you run into problems that can’t be easily expressed, so you’re doing things like you’re doing a bunch of DISTINCTs, maybe you’re loading the same dataset twice or something like if you’re trying to do a cross product on the same dataset, you have to load it twice. There’s all kinds of weird things that you have to do sometimes. I’ve tried to come up with a list here (see slide 12). Sometimes tricky groupings or joins — maybe I’m just doing something weird; maybe I’m combining lots of different datasets together, I guess that’s kind of like joins but different. Some sort of tricky cross products, tricky usage of the distributed cache. You can use replicated joins in Pig to utilize the distributed cache. And then doing nested stuff in the FOREACH, I’ve seen that have problems.
(10:54) I think the problem is that I’m sitting there thinking, “hey, I could implement this in like two MapReduce jobs and Pig is sitting there doing it in like seven,” and the optimizer just didn’t do it — it’s not that the optimizer didn’t do it; it’s just that in terms of expressiveness, I wanted to express a way to do something and Pig couldn’t do it. So instead I had to do this haphazard, backwards way of generating join keys and then doing all kinds of crazy stuff. This is a place where I see — things that are hard to express in Pig are astronomically slower, but if you’re just doing like word count in one or the other, I’ve never cared about that speed difference.
(11:32) And one thing I really like — I actually remember when the MapReduce keyword was released. There’s a keyword in Pig if you don’t know about it. There’s a MapReduce keyword where you can plug in a Java MapReduce job and feed it input and output out of your Pig script. And the reason why this is super cool is you can — as a best practice, I guess, you can replace only the things that are really hard to do in Pig and then keep all of the simple stuff still in Pig. And we like to do this because before this, we would have to replace the entire Pig script. So, this might be an eight or nine MapReduce Pig script — it may be easier for us to just rewrite the entire thing in Java than just re-implement this one little piece.
(12:14) Another cool thing about this is we’ve actually been in situations where we’ve re-used the MapReduce jobs in different Pig scripts, which is kind of neat. And then every now and then, we’ll be surprised when a new keyword comes out and actually replaces some of the MapReduce jobs, which is kind of neat. So, I think that’s a really cool way to extend it. So I guess even if you really did care about not being able to express something, you could always just toss in your own MapReduce job. And actually, too, you can kind of mimic this. Through the usage of UDFs and GROUP BYs and FOREACHes, you can pretty much mimic MapReduce entirely. So, we’ve done some pretty complicated things just by passing back in and out of UDFs, too. This slide (see slide 13) is trying to tell you — okay, even if you’ve made all these arguments about how MapReduce can be faster and stuff, I still don’t care. I can still use MapReduce in the context of Pig.
(13:11) The MapReduce keyword — the way this works is you hand it the .jar file with the MapReduce job — this is something you would pass to hadoop jar — and then you give it a store location, where you’re storing it into, and you’re giving it where it’s loading from, and you give it the schema. The issue is, you need to tell it where the MapReduce job is going to write, because that’s outside the scope of the framework, and then what it’s going to do is Pig is going to run that job with hadoop jar, kind of, I guess, and then it’s going to load the thing that it’s stored into. So, this is the input and this is the output of that job (see slide 13), and then it’s going to load the output of this into the B relation in Pig.
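To make the pieces of that statement concrete (the jar, the store location, the load location, the schema, and the job arguments), here is a small sketch that assembles one as text. It is written in Python only to keep the examples in this post in a single language. The statement shape follows the MAPREDUCE operator as documented for Pig, but the jar name, paths, class name, and schema below are placeholders, not the ones from slide 13.

```python
def mapreduce_statement(jar, store_alias, store_path, load_path, schema, job_args):
    """Render a Pig MAPREDUCE statement as text (all names are illustrative)."""
    return (
        "B = MAPREDUCE '{jar}' "
        "STORE {alias} INTO '{store}' "   # where Pig writes the job's input
        "LOAD '{load}' AS ({schema}) "    # where Pig reads the job's output
        "`{args}`;"                       # arguments handed to the job itself
    ).format(jar=jar, alias=store_alias, store=store_path,
             load=load_path, schema=schema, args=job_args)


statement = mapreduce_statement(
    jar="wordcount.jar",                      # hypothetical jar
    store_alias="A",                          # relation defined earlier in the script
    store_path="inputDir",
    load_path="outputDir",
    schema="word:chararray, count:int",
    job_args="org.myorg.WordCount inputDir outputDir",  # hypothetical class
)
print(statement)
```

Note that the output path appears twice: once in the job's own arguments and once in the LOAD clause. That is the "you need to tell it where the job is going to write" part of the explanation above.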
Calculating time you’ll spend using Pig vs. MapReduce (developer time vs. processing speed)
(14:04) Expanding on one of my points earlier, is developer time really worthless and does speed really matter? These are two questions that me, as a personality, I’m definitely more of a — I want to call myself a hacker. I think some people call themselves hackers because they’re just lazy about process things, and I’m kind of in that category. I just want to get stuff done, and then I want to forget about it, and that’s mostly because I’m always thinking about what better use I could have of my time than sitting there writing documentation. I’m sure it’s valuable, but I don’t feel like doing it. Anyway, so — is developer time really worthless and does speed really matter?
(14:49) The cost-benefit analysis here, where you weigh these two things, is — if I took my hourly rate and I multiply it by the time spent writing this Pig job, meaning I debugged it and all that stuff, and then I add the runtime of the Pig job, times the number of times that job is run, multiplied by the cost of that machine time, does that make sense? It’s hard to really gauge like how much does it cost for me to run a Pig job in a cluster; it’s kind of a weird number, but imagining that number is easy to retrieve, it’s actually pretty cheap, in most cases. I guess in Amazon you know. And then time spent maintaining that Pig job as well. And then on the Java side, it’s the same equation, right? It’s time spent writing the MapReduce job, time running the job, et cetera. I don’t really want to spend time thinking about how this equation works, but I guess to me the interesting point is when does the scale tip drastically in one direction, right?
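The equation described above (developer time, plus run time multiplied by the number of runs and the machine rate, plus maintenance time) can be sketched in a few lines of Python. Every number below is invented purely for illustration; none of them comes from the talk or the slides, and the names `job_cost` and `break_even_runs` are hypothetical.

```python
HOURLY_RATE = 100.0            # assumed developer cost per hour
MACHINE_RATE_PER_HOUR = 30.0   # assumed cluster cost per hour of job run time

# Made-up profiles: Pig is quicker to write and maintain but slower to run.
PIG = dict(dev_hours=6, maintenance_hours=2, run_minutes=20)
JAVA = dict(dev_hours=40, maintenance_hours=8, run_minutes=12)


def job_cost(job, runs):
    """People time plus machine time for a job that is run `runs` times."""
    people = (job["dev_hours"] + job["maintenance_hours"]) * HOURLY_RATE
    machines = job["run_minutes"] / 60.0 * runs * MACHINE_RATE_PER_HOUR
    return people + machines


def break_even_runs(slow_cheap, fast_costly):
    """Number of runs at which the faster, costlier job starts to pay off."""
    extra_people = job_cost(fast_costly, 0) - job_cost(slow_cheap, 0)
    saved_per_run = ((slow_cheap["run_minutes"] - fast_costly["run_minutes"])
                     / 60.0 * MACHINE_RATE_PER_HOUR)
    return extra_people / saved_per_run


for runs in (1, 100, 10000):
    print(runs, round(job_cost(PIG, runs)), round(job_cost(JAVA, runs)))
print("break-even runs:", round(break_even_runs(PIG, JAVA)))
```

With these invented numbers the Java version only wins after roughly a thousand runs, which is the "when does the scale tip" question in concrete form: a job run once is dominated by people time, and a job run constantly is dominated by machine time.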
(15:43) One is will this job only be run once? If you look at the equation when the job is only run once, you’re going to get your major gains on this middle line here, over time, right? (see slide 14) Because the Java MapReduce job is running over and over again. If you’re only running it once, then it’s almost entirely — my time is the factor. This one is my favorite: are your Java programmers sloppy? (see slide 14) I have lots of sloppy Java programmers on my staff. By the way, if you ever wanted to ask me for consulting services, I’d send you the good guys. But, if they’re sloppy, the point I’m trying to make is that maintaining sloppy Java code — if the people do things poorly — and in Java MapReduce, maybe some of you guys have run into this, writing very defensive code in terms of code bombing out because of data integrity issues, or — there’s all kinds of things that can go wrong in Java coding. Just lots of bugs littering the code; all kinds of things can go wrong. If they’re sloppy, it’s going to start being hard to maintain over time. Basically, Pig gives you less opportunity to write stupid stuff. It’s not that they’re not going to be sloppy in the Pig world, too, it’s just that they have less opportunity to be sloppy; it’s less amplified.
(17:06) Next one is, “is Java MapReduce significantly faster in this case?” In the case where it’s hard to express, that kind of idea I was trying to say. If we’re talking about the difference between an hour and like 50 hours, I’ve actually had that be the case. I’ve actually prototyped out a Pig script that was going to take like 40 hours, and I knew there was a better way to do it in MapReduce, and when we did it in MapReduce, it only took an hour. If that’s the case, then okay, you win; I’m going to write it in Java MapReduce. And I say this all the time — is 14 minutes really that different than 20 minutes? I mean, at a certain point, it’s batch and you’re off going and doing something else while you’re waiting for this thing to complete. We’re not talking about the difference between 10 seconds and half a second; that’s a pretty big difference. That’s a — 12 minutes and 20 minutes? I wouldn’t even argue that 12 minutes and 30 minutes are really that different. At that point, you’re almost at that band of batch where, like I said, you can just keep drinking coffee.
Why is development so much faster in Pig?
(18:20) I’d like to back up a little bit and say why — okay, I’m making this claim that writing Pig is so much faster than writing MapReduce, and I want to explain why I think this. The first is you don’t really have the opportunity to write Java-level bugs. So, okay, I write the for loop and I’m iterating too far or I’m doing toString and I’m not catching the exception — there’s all kinds of crap that can go wrong when I’m writing Java code, just from when I’m writing Java code. And I don’t think I can even write Hello World in Java without trying to compile it like eight or nine times — maybe that’s a little exaggerated.
(18:58) The next reason that it’s faster is obviously you’re typing out less characters (see slide 15). Talking about — it’s just less verbose. It takes me less time to write it. This one is my favorite — we had a saying on one of the projects that we were working on: it’s not a normal day unless you’re having CLASSPATH issues. If you go through the day and you didn’t have CLASSPATH issues, there’s something else wrong, and you need to figure out what it is. The compilation and deployment — one of the reasons that I think this is particularly important is the compilation process, in my experience, is a huge pain in the ass because when you’re writing code in a data application, the code kind of lives with the data in a lot of ways. Out of context, the code is meaningless without the data. In order to see if your program is working, you need to run it through the data. So, unfortunately, you have to run iterations because — oh, I had the field mislabeled; oh, that’s a FLOAT instead of a STRING, all kinds of things can happen. The fact that I have to compile the freaking .jar, scp it to the Hadoop cluster, go log in, do it, and then “oh, it didn’t work?” I’m going to go back and re-compile it, go back — I know there’s tools out there to make this better, but that life that I live in is just awful. Pig, on the other hand, I just have the Pig script on the cluster, I’m editing it with vi or whatever. I modify it, I run it, didn’t work, okay I’ll try it again. So that process is a lot quicker, and I actually really like that a lot, and I’ll get into that point a little bit more later.
(20:37) The next thing is, I legitimately think it’s easier to read. So one of the projects I worked on was converting a shit-ton of SQL. Just imagine a steaming pile of SQL shit, and it was our job to convert that all into Pig, and some of those we were converting into MapReduce as well. The issue is that SQL and Pig, in general, I think are much easier for me to understand what the hell the thing is trying to do. Converting SQL to Pig isn’t a one-to-one translation; you kind of have to sit there and read the SQL and — oh, that’s what this analytic is trying to do, and I’m going to go re-implement that in Pig. If I’m reading a Java MapReduce thing that’s like three MapReduce jobs and there’s 18 classes and I’m sitting there trying to read through all this boilerplate and For loops. If it’s not well documented, it’s going to take me a while to figure out what the intent of this analytic is, and that guy’s quit like two years ago, so we’re not calling him. So I think the easier to read thing is pretty important in an analytic project because really, at the end of the day, the analytic you’re writing has purpose, it’s doing something, and with a bunch of Java code, sometimes it’s hard to tell.
Speed to productivity in Pig vs. Java/MapReduce (repurposing engineers and SQL programmers)
(21:51) The next section — to organize this talk in terms of talking points — is, if I use Pig, I can avoid Java, in general. I actually hate Java. Maybe somebody will disagree with me, but I don’t like Java at all. I really like Pig because I don’t have to deal with Java, but some of the real arguments are — well, first of all, not everyone’s a Java expert. So, in a situation where I was starting a Hadoop project four years ago, and I think this is still the case, I’m not just going to go out on the street and be like, hey, are there any Hadoop developers out here that are looking for work? It doesn’t exist, so I think a very natural thing for a company to do is something I’ve done, and other people, is re-purposing SQL guys. At ClearEdge, before we started working on these Hadoop projects, we had a bunch of guys on Oracle projects. Lo and behold, there’s not much Oracle work for us anymore, so we re-purposed a lot of our SQL guys to write Pig and actually it works out pretty well. Let me get into some of these caveats, but people that are familiar with data-querying languages like SQL — I don’t really know too many — I think grasp onto Pig pretty easily. And re-purposing these guys to be useful is a good thing. I think the higher level of abstraction makes Pig, in general, easier to learn. So if I’m trying to teach an analyst — we actually had a project where analysts who had never programmed before were using Pig. That was challenging, I’m not going to lie, but I don’t think we would have ever been able to teach them Java. So in kind of my experience, I think in a project that has well-running Pig and Pig examples on the data and things — we usually have people actually committing story points within the first four days, which I was pretty proud of that because when we started the project and we were running Java MapReduce, it would take maybe a month, two months before someone wrote their first useful MapReduce job. 
With Pig, we would kind of have a list of “easier” ones which were maybe like a GROUP BY or other things. And after four days, those guys were cranking them out.
(24:04) I think at this point, we have some guys that write maybe one or two production Pig jobs a day. That’s not the case with Java MapReduce; you’re not going to do that. I’ve always liked the argument, “Oh, you want to learn Hadoop? Well, go learn Java.” I have lots of ideas for books. I have one idea for a book — and I’m not going to write it, somebody please take it and go write it — it’s Just Enough Java to Learn Hadoop, because it’s actually quite a small subset, but I don’t know. Can I just really completely ignore Java? Well, I don’t think you can (see slide 16).
Leveraging UDFs with Pig
(24:42) One trap that people fall into is they are very nervous about going outside of Pig, meaning they are very nervous about using UDFs and using custom storage functions and things. There’s probably somebody in this room that’s like, you were sitting at your desk one day and you were like, “I really should learn how to write UDFs,” and you were like, “Not today.” You really should learn, because I think Pig, in general, is not meant to live without UDFs. I think they’re there because they have to be and you’re really selling yourself short. So some of the things that I do every day in UDFs are, yeah, string operations. There’s some more complex math that you may want to be doing. Complex aggregates — so, like passing in a bag after a GROUP BY and actually doing something on those things inside of that bag. Any sort of natural language processing like stemming or tokenizing — I actually crossed out dates because I went and looked at the documentation, and in Pig 11 they just added a bunch of DATE functions, so that’s kind of neat.
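As a rough idea of how small a UDF can be, here is a sketch of a tokenizing function written as a Python UDF. It assumes Pig's Python UDF support, where the `outputSchema` decorator (importable from `pig_util`) describes the return type; the fallback decorator exists only so the function can also be tried outside Pig. The name `tokenize` and the schema string are made up for illustration.

```python
try:
    # Available when the script is registered with Pig as a Python UDF.
    from pig_util import outputSchema
except ImportError:
    # Stand-in so the function can be exercised outside of Pig.
    def outputSchema(schema):
        def wrap(func):
            func.output_schema = schema
            return func
        return wrap


@outputSchema("tokens:{(token:chararray)}")
def tokenize(text):
    """Lowercase a string and split it into a bag of single-field tuples."""
    if text is None:
        # Real data has nulls; return an empty bag rather than failing.
        return []
    # Replace anything that is not a letter or digit with a space.
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    # A Pig bag maps to a list of tuples on the Python side.
    return [(token,) for token in cleaned.split()]


print(tokenize("Pig vs. MapReduce!"))
```

In a Pig script, a file like this would be registered and then called inside a FOREACH ... GENERATE, with the returned bag typically passed through FLATTEN.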
(25:43) (see slide 18) I don’t see them going through and picking off these things and re-implementing a bunch of stuff, right? I think in general you need UDFs. I just told you you really need to use UDFs and I’m trying to convince you you don’t need to use Java, but — so I guess, this is my answer. Ok, you still really, really want to avoid Java. Well, what I do is I just have somebody else do it for me. This may be coming off weird, but writing a UDF is actually — for a junior developer, I find that they are actually really good at this. If you just had a junior developer that knows Java and you’re like, “data’s going to come in looking like this, and I want it coming out looking like this,” and they just write this little bite-sized piece of Java code, and I just go and do it. So actually on my larger teams that have a mix of SQL converts and Java MapReduce developers, the Pig guys actually offload this work to the Java guys as a separate story, and then they integrate it into their script. I guess my argument is, hey, if you’re really good at Pig and you don’t like learning Java, then just have somebody else do it. It’s not that much work; you just need to figure out how to get that worked out, organizationally.
(27:32) One question I get a lot about my book — so, a little bit about my book; it’s MapReduce Design Patterns. MapReduce Design Patterns is about writing Java MapReduce. That’s really what it’s about. There’s a bunch of Java MapReduce code in there, a little bit of Pig. One question I get is, why the hell did you write a book on Java MapReduce if you’re such a Pig bigot? And I guess it’s a good question. I like answering this question. I’ve seen lots of dumb stuff done in Pig, and what I mean by dumb stuff is, “oh, there’s a keyword for doing cross product. Alright, I’m just going to use that everywhere!” Or joins for example. If you don’t understand how joins work in MapReduce, using the right type of join, for example, is a complicated task for somebody that doesn’t understand that. So I guess my main point is that if you understand how MapReduce works from a — how would I do this in Java MapReduce, you’re going to be a better Pig developer, and you’re going to write better Pig. And I really think that’s true. It’s almost like understanding assembly is going to make you write better C. I think moreso in this case. That’s really my key argument, and I think that’s true with design patterns outside of MapReduce as well, like the Gang of Four book and things. If you understand these underlying principles, I think lots of things make more sense. That’s really why I think you should still read my book, even if you’re never planning on writing Java MapReduce. Actually in my book I have a section — each pattern has a section called a resemblance where I say how you would do this in Pig, which I was kind of embarrassed to even put this in the book because I’d spend like eight pages explaining how to do this. And then I’d show, like, this is how you do it in Pig in two lines. But I was like, I’m going to put it in there. And this is true of Hive as well, I think. 
Hive is actually an interesting argument here because I think it’s even worse with Hive because the people that are using Hive that were from SQL before — they’re kind of expecting it to behave like Oracle, right? And they’re like, “ah, this join should be fast because the data is ordered in this way and blah blah blah.” And it’s — no, that’s actually not the case. I almost don’t like Hive because it gives people this, like, false hope.
(30:05) This is one of my favorite reasons, and I actually built up a scenario here to illustrate my point. (see slide 20) So, imagine you’re on vacation, and you’re sitting on the beach, and your phone rings. You had your phone on. And so the IT guy is there. He says “hey, your MapReduce job is blowing up the cluster; there’s something wrong here, and we need to fix this.” Actually, this is not too far off from a real scenario. I was in Ocean City, Maryland, and I was on vacation, and one of our jobs was blowing up the cluster and I had to drive home to fix it. Honestly, at the spy agencies, it’s always like, “It’s a matter of national security!” And you’re like, “Alright, fine.” You can never say no to “it’s a matter of national security!” So, you respond with, “Ah, well, you know, that’s pretty easy. I remember that I put this new ‘IF’ statement at the beginning of the mapper, and I know that that’s screwing us up. So all I want you to do is go comment that out.” So the guy says, “Alright. I’m an IT guy; how do I do that?” “It’s pretty easy. First, I want you to check the code out of Git. And then I want you to download, install, and configure Eclipse. And don’t forget to set your CLASSPATH! (see slide 26) And then I want you to go to line 851 in that file (see path in slide 27).” That’s actually similar to a lot of the paths I deal with in Java. Oh, I didn’t capitalize M. That’s bad style on my part. And then, “now I want you to build the .jar. And now ship the .jar to the cluster and replace the old one. And now run the hadoop jar command. Don’t forget the CLASSPATH!” And then I ask if it worked, and the guy says no. And so you basically tell him, okay, let’s try something else and do that again, and the IT guy is pissed.
Maintainability and Deployment of Pig vs. MapReduce
(32:25) Let’s imagine this scenario with Pig and he says, “okay, something’s wrong.” You say, “Comment out the line that looks like filter blah blah whatever.” And he says okay. So in an operational Pig cluster or Hadoop cluster that’s running Pig, we have had to, so many times, go and modify the Pig script in place that the scheduler is picking up just because we needed to fix the problem quickly and that’s it. We needed to fix it right away. We didn’t have time to go — I guess the right way of doing this right is to have this all under version control, source control and stuff, right? That’s probably the right way of doing it, but shit happens like I’m just going to have to go change the Pig code, and I’ll handle the right way of doing it later. I think that problem in general — I think this has a lot to do with arguments of scripting languages versus compiled languages, but here’s some of the reasons why I really like Pig for operational deployment (see slide 38) of Hadoop versus — I think I have to worry less about version mismatch. When we upgraded Hadoop, we’d always be a little bit worried that the .jars that we compiled of the previous versions were going to not work. I think usually they worked, but sometimes we were a little bit nervous. So with Pig, I don’t think there’s been anything major that’s broken backwards compatibility with Pig scripts. The fact that you don’t have to compile it means I just leave these flat files out, right?
(33:51) The other thing that we like is we can have multiple Pig client libraries as well. So we could have Pig 8 and Pig 9 installed as clients and not have to worry about our production jobs blowing up, and we can just kind of keep going on with life. Maybe some people wouldn’t trust a new version, maybe some people wouldn’t blah blah blah, right? Maybe the cluster admins haven’t officially installed Pig 11 and I want to try it out. I found that to be really useful. The other thing that I kind of like to mention, taking compilation out of the build process is just, for me — I guess it’s one of those things that keeps itching day after day. And after working with Hadoop for four years, that thing is just — every time I have to compile a .jar, I just get a little bit pissed off. It’s just been wearing on me a little bit too long. Like I said, you can make changes to scripts in place. I think the fact that you don’t have to compile it means you can iterate a little bit faster and, I’ve kind of mentioned this, too, fewer chances to make Java-level bugs is a big point, too.
(34:58) There’s a couple caveats here (see slide 39). I think Hadoop streaming provides a lot of the same benefits. With Hadoop streaming, I can just write some Python script for the mapper and reducer. I don’t have to compile anything there; that’s quite nice. You get a lot of the same benefits out of this. This isn’t really a bash on MapReduce. It’s more just Pig in general. I think my contrived example before, with the guy on the beach, big problems are still going to be big problems. I guess my point was fixing a simple problem is hard, in some cases, but if you have a serious problem with your Pig script, you’re still going to have a serious problem with your Pig script. And obviously, too, if you’re using Java UDFs, you still have to compile something. That’s actually one of the reasons that I like using Python, in general, with this is because then I’m kind of in a compile-less environment entirely, and that’s kind of nice. This whole problem, I think — I can tell it doesn’t bother new Hadoop developers that much. I think it’s more the — I’ve just compiled so many .jars. I just can’t do it anymore. I can’t do it anymore.
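To make the Hadoop streaming point concrete, here is a minimal sketch of what such a compile-free mapper and reducer could look like in Python. The word-count task and the function names are illustrative assumptions, not something shown in the talk. The only convention relied on is that streaming passes records as lines of text, with key and value separated by a tab, and that the reducer sees its input sorted by key.

```python
from itertools import groupby


def mapper(lines):
    """Emit one "word<TAB>1" record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts for each key.

    Assumes the input is sorted by key, which the framework's sort step
    provides to a streaming reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> sort -> reduce on two made-up input lines.
mapped = sorted(mapper(["pig pig hadoop", "hadoop pig"]))
counts = list(reducer(mapped))
```

In an actual streaming job, each function would presumably live in its own script that reads `sys.stdin` and prints its records. The point that matters here is that a fix is an edit to a text file, with no .jar to rebuild.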
Things that are harder to do with Pig
(36:06) Getting into one that I think maybe Pig struggles a little bit with is unstructured data (see slide 40). I’m getting into some of the things that Pig is not so great at. I think you can probably do anything you’d ever need to do in Pig. And obviously the same is true of Java MapReduce. If I can do it in Pig, I can do it in Java MapReduce, and I’m pretty sure if I can do it in Java MapReduce, I can do it in Pig. There’s enough hooks in Pig that I can modify it. It really comes down to does it feel right? Do I really want to do it? This is maybe one place that I’ve seen a couple problems. First of all, if the data is delimited, Pig is the obvious choice. I’ve never dealt with delimited data where it’s really been a big problem. But I’ve had specific issues in using Pig with these things, and I hope that one day they’re fixed. I don’t think these are inherently something wrong with Pig; I just think that there needs to be more custom stuff written about it. One is media, so images, media, and audio. This is actually hard in Java MapReduce, too; I think it’s just harder in Pig. It’s almost like this is something that Java MapReduce was not built for out of the box, so when you’re trying to do something, it’s almost like you’re trying to use an abstraction over something that wasn’t built for doing something. It just gets worse. And that’s usually because splitting is a big problem in these file formats, and then obviously just dealing with that type of data is a big problem. Again, you’re going to be basically writing so many UDFs that you might as well just be writing Java MapReduce at that point.
(37:41) The next one is time series. One thing that really irritates me about Pig is I wish I could utilize order of data a little bit more. There’s a couple of cases where order of data is important, like if I’m dealing with time and I want to deal with time-local type things where maybe I’m slurping up like 100 lines at once, and they’re all time-local, and I can do some sort of mini-analytic on that. The problem with Pig is that I need to do a GROUP BY and I need to ship all that data to the reducer and then I do that there. In reality, the data is already grouped, so I see that pop up in time series analysis a lot. It actually happens in other places as well, where the data is sorted by something. It just happens to be that sorting by time is pretty easy, typically. And dealing with lists is sometimes problematic, depending on the list and time series.
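As a rough illustration of what "utilizing the order of the data" could mean, here is a Python sketch that computes a small per-window statistic in a single pass over records that are already sorted by time. `itertools.groupby` only groups consecutive records, so nothing is re-sorted or shipped anywhere. The record layout of `(timestamp_seconds, value)`, the 60-second window, and the function name are all assumptions made for the example.

```python
from itertools import groupby


def window_averages(records, window_seconds=60):
    """Average the values in each time window, in one pass.

    records: iterable of (timestamp_seconds, value) pairs that are already
    sorted by timestamp. Only consecutive records are grouped together.
    """
    def bucket(record):
        return record[0] // window_seconds

    for window, group in groupby(records, key=bucket):
        values = [value for _, value in group]
        yield window * window_seconds, sum(values) / len(values)


# Made-up, time-sorted sample data.
samples = [(0, 1.0), (30, 3.0), (61, 10.0), (119, 20.0), (120, 5.0)]
averages = list(window_averages(samples))
```

This is the kind of thing a map-only pass over sorted input can do directly, whereas the approach described in the talk goes through a GROUP BY and a trip to the reducer even though the data is effectively grouped already.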
(38:25) The next one is ambiguously delimited text. So like, quoted CSV. That is my worst nightmare, which actually is not even just true of Pig. I hate quoted CSV all the time, but like now I have to deal with quotes, and how do I split, and it’s just a pain in the ass. Usually a custom store function solves this problem, but I don’t know. The next one is log data in which the different rows have different meanings. So actually if you look at the Pig log output, there’s different meaning in different rows. There’s maybe an error message one day and then a success and then there’s this is how many lines it wrote. The lines have different formats based on what it’s trying to tell you. So, that’s not something that I can easily just put in a tab-delimited store func, right? That is problematic because the columns line up differently or — how do I even model that? To me, I can imagine ways to do it in Pig; it just seems more pain than it’s worth. And then I guess my kind of point here is there’s a certain point where I’m looking at it and I’m like, I’m going to be writing so many UDFs for this thing, I might as well just write it in MapReduce. I guess it’s kind of where my mentality goes.
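The quoted CSV problem is easy to show with a small Python sketch: the delimiter can legally appear inside a quoted field, so splitting on every comma gives the wrong columns, while a quote-aware parser does not. The sample line and the function names are made up for illustration. The parser is the `csv` module from the Python standard library.

```python
import csv


def naive_split(line):
    """Split on every comma and ignore quotes (the approach that breaks)."""
    return line.split(",")


def parse_quoted_csv(line):
    """Parse a single CSV line, honoring double-quoted fields."""
    return next(csv.reader([line]))


line = 'alice,"Ocean City, Maryland",42'  # made-up sample record

broken = naive_split(line)       # the quoted field is cut in two
fields = parse_quoted_csv(line)  # three fields, quotes removed
```

Note that quoted fields may also contain line breaks, which this single-line sketch does not handle, and which is part of why splitting such files across tasks is awkward.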
(39:56) Then you can kind of say, well what about semi-structured data? And I think semi-structured data, I guess in some cases it’s okay and in some cases it’s not, so I kind of wanted to explain the cases where I thought it’s not. Some forms are more natural to do. If you have a really well-designed JSON or XML schema, like the guy that wrote it isn’t an idiot, it usually works out okay. Okay, that’s great. The issue is I’ve had issues using Pig against a couple of these things. Complex operations over unbounded lists (see slide 41). What I mean by unbounded lists is I could have any number of these things. Think of this as the “to:” field in an e-mail, maybe. Obviously I can model that as a bag, but then I’m doing all kinds of crazy things with FLATTEN and back and forth and writing UDFs to run against bags — it’s just been a pain in the ass. Usually, this one typically doesn’t stop me, but it usually requires me, at some point, to write a UDF to properly parse that thing. The next one is what I like to call very flexible schemas. So think of a JSON schema, but it’s very sparse. Like you could have millions of elements and usually only a few of them are filled. We typically see this a lot in HBase or Accumulo in BigTable designs. Sometimes I use the key — the label of the data item as part of the data itself, because the MapReduce jobs I’m running against this thing don’t care about what the name of the key is. As long as I’m not fetching on it, it could be state; it could be Social Security Number. I don’t know; it could be any number of things. Very flexible schema is when you’ve got — you’re talking about using the column space in a column-oriented database as a kind of infinite number of columns, hypothetically. That, I’ve found, is problematic with Pig. And then the worst of all is — even maps are hard. So, I’ve run into this one. Somebody modeled in HBase the “to” field as to1, to2, to3. We still did it, but it was kind of irritating; it wasn’t natural. We still used Pig for it.
On that same point, poorly designed JSON and XML. There’s so much rope to hang yourself in terms of these schemas. Like in that email analysis example, having to1, to2, to3 — you’d be surprised how many times I’ve seen something like that. And that exists out there, and it’s some proprietary format that somebody’s developed, and it’s never going away. You can’t tell some Fortune 500 company, “hey, you really should re-design your JSON!” It’s just not going to happen. When you deal with badly formatted things like this, it’s just a matter of life. You’re going to have to deal with it. This is kind of my point, again; sometimes it’s more pain than it’s worth. I’m not saying that you can’t do these things in Pig. I’m just saying, sometimes I’m sitting there like “ah, I kind of wish this was a little bit easier.”
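Here is a sketch, in Python, of the kind of helper this situation tends to call for: collecting numbered recipient columns (a first, second, third "to" column, written here as `to1`, `to2`, `to3`, which is an assumed naming) into one list, and then flattening that list into one row per recipient. The record layout, the `from` field, and the function names are hypothetical. In Pig the list would be a bag and the second step would be FLATTEN, with the first step being roughly what a UDF would have to do.

```python
import re

TO_FIELD = re.compile(r"^to(\d+)$")  # assumed column naming: to1, to2, ...


def collect_recipients(record):
    """Gather the to1, to2, ... values into one list, ordered by number."""
    numbered = []
    for key, value in record.items():
        match = TO_FIELD.match(key)
        if match:
            numbered.append((int(match.group(1)), value))
    return [value for _, value in sorted(numbered)]


def flatten(record):
    """Yield one (sender, recipient) pair per recipient."""
    for recipient in collect_recipients(record):
        yield record["from"], recipient


# Made-up record with an arbitrary number of recipient columns.
email = {"from": "don", "to2": "bob", "to10": "carol", "to1": "alice",
         "subject": "hi"}
pairs = list(flatten(email))
```

Sorting on the parsed number rather than on the column name keeps `to10` after `to2`, which a plain string sort would get wrong.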
Pig vs. Hive vs. MapReduce
(43:21) So I thought I’d throw a slide in here about Hive (see slide 42). I think for a lot of these arguments I could have probably done a find-and-replace and replaced Pig with Hive. Maybe they would change a little bit. I think if you’re sitting there arguing, “should I use Pig or Hive?” I don’t really think it’s that big of a deal. I think the fact that you’re using Pig or Hive over Java MapReduce, you having that in your environment is a good thing. But I think you’re going to like Pig better than Hive. I guess since this is a Pig meetup, I can be honest, right? So, in my personal experience and in my personal opinion, the people that take Hadoop more seriously use Pig. Hopefully that doesn’t offend anybody. I think the reason why — if I was at a Hive meetup, I wouldn’t say something like that. I think the reason I like Pig more is that when I’m using Hive, I almost feel like I’m using this back-ported thing, like I feel like I’m playing an NES emulator on my laptop or something. It’s weird. It doesn’t feel right. There’s something weird about writing SQL and it doing MapReduce. It just doesn’t feel right. It feels back-ported to me. And the other thing I don’t like is, as somebody that’s written a lot of “real SQL,” when I type something out and Hive doesn’t support it, it just pisses me off.
(45:00) Pig, on the other hand, I feel like was built from the bottom-up. Hive, I think, was top-down design, right? They were like, “we want to build SQL; how do we build that?” Pig was like, “how do we abstract MapReduce?” So, it was much more of a natural transition of level of abstraction. Usually abstraction is built from the bottom up. Abstraction top-down doesn’t always work, I don’t know. That’s my opinion. It’s not science; it’s opinion.
Pig/MapReduce analogies
(56:30) Kind of to wrap up, I’m going to wrap up with an analogy. This analogy has been made before, and there’s two analogies that I like to make, and I’ve had a hard time building just one analogy. One is to scripting languages and compiled languages. So, obviously, the compilation is similar. Python, I don’t have to compile Python per se. There’s some object code stuff going on there, but for the most part I don’t have to compile it. C, I have to compile. Link and all that stuff. Some other things: scripting you can kind of assume is inherently a little bit slower, but maybe you can inherit some sort of underlying optimizations that would make it better. You can do in-line changes. That goes on and on. The other analogy that I like to make — so, the previous analogy is like when you look at Python and C, the reason why that analogy doesn’t fully check out is that Python, in general, is not that much higher of a level of abstraction. Sure, you get rid of pointers and things like that, but it’s not like C to SQL. That’s the other piece of the analogy here where it’s like Pig is to MapReduce in a lot of ways like SQL is to C, where it’s just completely — layer of abstraction entirely devoid of C, where you’re not interacting with strings and For loops and things like that. I’m interacting with the query language. So there’s these two kind of analogies, and you get things out of this like the code is more succinct, for sure, easier to read, easier to understand — I think, in some cases, if you’re good at SQL — the amount of time it takes to write something, the fact that this deals with data. Java MapReduce deals with data, obviously, but the code doesn’t, in a lot of ways. When you look at Pig code, you can tell it’s dealing with data. MapReduce is just If statements and stuff.
Presentation summary
(58:27) This is kind of what I’ve been saying. I kind of listed it out in the past two slides (see slides 43 and 44), so this is a summary of basically all the points we’ve made. Some of the main reasons why I like to differentiate between the two: one, I hope I’ve driven compilation into everybody’s head. The efficiency of the code — I think Pig is less efficient, but I hopefully have convinced you that you shouldn’t care. Lines of code and verbosity. I mean, I had a situation once where a new feature in Pig came out, and we were able to convert about three or four MapReduce jobs that we had tried to write in Pig before but couldn’t get to work, and we added this new keyword, and then they worked and we liked them better because of our maintainability aspects. So, we deleted from the SVN repo the MapReduce jobs and added the Pig code, and at the end of the quarter, it was my job to compile the source lines of code counts and project management was pissed that we lost 300 lines of code at the end of the quarter, because that’s how they were measuring our effectiveness. And so I had to go explain to them, “well, we converted these things into this language called Pig and it’s much more succinct.” And they were like, “well, how many lines of Pig code is equal to lines of Java code?” And I was like, “holy shit,” and I actually had to go back, and it came out to be — by the way, the entire time I was running this, I was pissed. It came out to something like 130 to 140 lines for those four jobs that we deleted. And then after we added some math, it showed us that we were very productive that quarter.
(1:00:07) I think that we touched on optimization a little bit. The optimization in Pig is really good, actually. The optimization in Hive is getting better, but I think Pig is better still. Tez is something that I think Pig might be able to utilize, actually, definitely, I think. I think actually Tez is going to make Pig a lot better. Somebody’s going to have to do the work, but I agree. Hortonworks is pouring a ton of money into Hive right now. I kind of wish they were pouring that money into Pig, but it is what it is. So, code portability — like I said, I like this a lot. The fact that I’m moving around a text file and not some sort of compiled piece of code. Readability is a big thing. One thing that I actually ran into yesterday — underlying bugs in Pig can be a huge pain in the ass. If there’s something that’s buggy in Pig, and I can’t get around it, that’s a huge, huge problem. I don’t know if other people do this, but when I was debugging my Pig code, I put some extra STORE functions to store out, like, limits of each of my stages of my data pipeline, kind of like a beefy ILLUSTRATE. And it worked, I was like what the fuck. So, the job worked when I had these STORE functions in, and I was like, “oh, I must have fixed it and forgot to save it or whatever,” so I delete the STORE functions and I run it and it doesn’t work. And I put the STORE functions back in and it works. At this point, I’m just storing off this terabyte of data just so that my job runs. It’s actually completely temporary; I don’t need it at all. Truthfully, though, thanks for open source, though, right? Sometimes I run into issues in Pig, and if it’s a big enough issue for me, I can go take a look and try to figure out what the hell was wrong with it, and even if I need to, I can fix it. So, usually I don’t have the time or really interest to fix it, but I at least know what the issue is so I can circumvent it. 
Like, if the code was closed away from me, I wouldn’t even know where that issue was coming from. I think this is a problem with layers of abstraction in general; when there’s something wrong underneath a layer of abstraction, things just fall apart. With MapReduce, you’re built directly on top of Hadoop, which I think is pretty bug-free these days, and in the kind of space of possibility, you can argue that in MapReduce you can do more things. I definitely have some patterns in my book, for example, that I would have a hard time writing in Pig. Some of the more complicated ones. So, that’s it.