Today, we’re featuring a conversation Sean Campbell, Cascade Insights’ CEO, recently had with Thomas Miller. Thomas is a faculty member at Northwestern University and has written a really fascinating book on predictive data analytics called Web and Network Data Science: Modeling Techniques in Predictive Analytics.
Sean:
So my first question for you, Tom, is if you would share with us a little bit about the book and also about yourself.
Tom: Certainly. Well, the book that Sean’s referring to is my web and network data science book, which was published at the end of last year, 2014. It covers two areas, web analytics and social network analysis (also referred to as network science), and tries to bring them together to show the overlap. There is, of course, a bit of a mutual admiration society at work: the book refers to Sean’s book, Going Beyond Google, as well.
What we try to do in the web and network data science book, which is intended for a course we do at Northwestern, is provide students with the right kind of skills and overview that they need to gather information from this humongous resource, the world wide web.
There’s just so much out there, and it’s hard to know how to pursue it, and you need to have the skills to do it. It’s not just point and click. You also have to have some computer programming skills to do it. (Well, to do it most efficiently.) That’s what the book is oriented toward — to provide that overview — and then we use it in the course. The update of the course will have the same name, Web and Network Data Science.
I’m also involved in independent consulting. I’m a kind of entrepreneur myself; I have a small company that is pretty much in startup mode right now, oriented around data science.
Sean:
Excellent. I think the book you’ve put together is a great resource. I want to ask you a question that ties into your comment about this propensity among a certain type of individual. I’m not trying to be pejorative here, but I’ve met this kind of individual at conferences and in companies we’ve worked with over and over again. This person is fond of saying something like, “Well you know, Internet data is just data. You can’t derive any real meaning from it. It’s just information.”
Do you run into these types of folks? I imagine you do.
Tom: Well, it could be a matter of people being used to what was done in the past where they were doing primary research, design research, and custom research: they collected data for the purposes of a study. When we’re talking about the internet, we’re talking about this massive secondary data resource where you don’t have to design the study in advance. You’re using data that are already available.
Part of it, I think, is the education that’s needed around the value of secondary research generally, and of secondary information sources.
I think the other part of it is a lack of understanding of unstructured or semi-structured text, and what you can learn from that. The traditional statistics courses you may have had in your undergraduate days (and many people have had them in one form or another) deal with spreadsheet-like data arranged in rows and columns, and with traditional relational databases; they don’t do a lot with text.
The web is text. It’s mostly text. You have to dig into that, and you have to extract. You have to scrape. You have to do the kinds of things that we do in network data science to get some meaning out of it. You have texts, which are unstructured. Somehow you translate those into numbers, which are analyzable in your models. That’s a challenge, and people are not very well educated in that area.
Sean:
Those are all really good points. Particularly the point about quantitative analysis vs. text analysis. The Internet is so text heavy, and people sometimes don’t have the right tools to, in essence, turn that into something that’s a little more quantitative for them.
Could you describe the tools that you think are critical to this type of analysis? For example, I know you mention Python quite a bit in the book.
Tom:
Python is used in the book, and it’s used extensively in data science. Python is primary. That’s because it is such a wonderful language for data munging and data preparation, sometimes called parsing of data. It’s a well-structured language — you don’t have a compile cycle, so you can do things more quickly.
Prototyping is a lot faster in Python than in other languages, and it’s become the de facto standard, you might say, for systems programming work. It has replaced Perl in many contexts. Because of that, it’s a good initial language to learn. With Python, there’s also the possibility of easily working with databases.
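To make that concrete, here is a minimal sketch of the kind of text munging and parsing Tom is describing, using only Python’s standard library; the log line and field names are made up for illustration:

```python
import re

# A made-up line of semi-structured text, the kind of parsing
# task Perl was once the default tool for.
line = "2015-06-01 14:32:07 GET /products/widget?id=42 200"

pattern = re.compile(
    r"(?P<date>\S+) (?P<time>\S+) (?P<method>\S+) (?P<path>\S+) (?P<status>\d+)"
)
match = pattern.match(line)
if match:
    print(match.group("path"), match.group("status"))
```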
When we talk about databases in this context, we’re not talking about the relational databases that you might see in, for example, an accounting environment. In financial accounting, you have all these dollar figures to keep track of in rows and columns. With text, we have to have an unstructured format, or at least a more flexible structure. That’s either XML, which is a markup language, or JSON, and JSON is primary in the web world these days because you can actually read it with ease.
You have the general purpose language for manipulating text, for parsing: that’s Python. You have packages within that world of Python to extract data, so you can scrape, or crawl (sometimes called spidering) the web to gather data.
Then you have to scrape the web pages: they’re marked up with HTML, which is structured text, so you have to get rid of all those extra tags and codes and gather the data that you want, often from within the paragraph tags. That requires a good Python package.
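As a hedged illustration, here is a minimal sketch of that tag-stripping step using requests and BeautifulSoup, one common pairing of Python packages for the job; the URL is a placeholder, and the book’s own examples may use different tools:

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page (the URL is a placeholder) and parse the HTML.
html = requests.get("https://example.com/article").text
soup = BeautifulSoup(html, "html.parser")

# Keep only the text inside paragraph tags, discarding the
# extra markup, tags, and codes.
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
print(paragraphs[:3])
```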
Once you have the text, and it’s organized into different regions, you have to have a place to put that text. That place could be XML, or, more likely these days, the JSON format, which is JavaScript Object Notation.
With JSON, it’s readable and you’re matching up the codes. If you had, say, an email message, you have the “from” node, the “to” node, the cc and the bcc, and then you have the subject, the body, and the web identifier. All of those are keys that link to the values associated with them. Some of those keys take individual values, like the from-node; others take an entire array of values, like the to-node, the cc, and the bcc.
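A sketch of that structure in Python, assuming one plausible set of keys; the addresses and message content are invented:

```python
import json

# One plausible shape for the email-as-JSON structure described
# above: single values for "from", arrays for "to", "cc", "bcc".
message = {
    "from": "alice@example.com",
    "to": ["bob@example.com", "carol@example.com"],
    "cc": ["dave@example.com"],
    "bcc": [],
    "subject": "Quarterly numbers",
    "body": "Please review the attached figures before the call.",
    "message_id": "<1234@example.com>",
}

print(json.dumps(message, indent=2))
```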
All of that can go into this semi-structured format called JSON. That’s the file structure you have to understand: JSON. Once you’ve got the data in JSON, you want to put it in a database so you can actually run queries on it, and there are a lot of ways of doing that. There are specialized databases like MongoDB, but also other tools, for example PostgreSQL, where a JSON extension lets you bring JSON in and work with that semi-structured text within a traditional relational database. You’ve got all that.
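For instance, here is a minimal sketch using the pymongo package; it assumes a MongoDB server running locally, and the database, collection, and document contents are placeholders:

```python
from pymongo import MongoClient

# Assumes a MongoDB server is running locally; database and
# collection names are placeholders.
client = MongoClient("mongodb://localhost:27017")
messages = client["mail"]["messages"]

messages.insert_one({
    "from": "alice@example.com",
    "to": ["bob@example.com"],
    "subject": "Quarterly numbers",
    "body": "Please review the attached figures before the call.",
})

# An exact-match query works when you know the key's value.
for doc in messages.find({"from": "alice@example.com"}):
    print(doc["subject"])
```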
Now you have the other issue of querying the data. Well, how do you query text data? You’ve got to query text data with flexible tools.
The traditional database tools are not sufficient because they look for a specific value, when in text you can say the same thing in many different ways. And if we’re talking about email messages, people are typing, and people make mistakes when they type, so how do you search across that? In that regard, you need a language-aware tool, say, a tool like Elasticsearch, that will do an effective search across the email messages you’re looking at.
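A hedged sketch of that kind of flexible search, assuming an Elasticsearch server at the default local address, an index of email documents named "messages", and the 8.x Python client:

```python
from elasticsearch import Elasticsearch

# Assumes Elasticsearch is running locally and the index name
# "messages" is a placeholder.
es = Elasticsearch("http://localhost:9200")

# "fuzziness" lets the match tolerate the typos people make
# when they type email messages.
results = es.search(
    index="messages",
    query={"match": {"body": {"query": "blackout", "fuzziness": "AUTO"}}},
)
for hit in results["hits"]["hits"]:
    print(hit["_source"]["subject"])
```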
All of those are kind of standard toolbox utilities that you need to have to work effectively in the domain of the web.
Sean:
So how technical do you need to be to mine these type of assets? By that, I mean whether you’re talking about social network analysis or mining communities or web traffic statistics — or anything else for that matter — related to the subject at hand.
Because my experience has been it depends on the data source you’re looking at, as well as how structured that data is and whether the target website gives you an API or at least decent search capabilities. Hence, I know it’s not an easy thing to answer “Yes, you need to be technical” or “No, you don’t need to be technical,” since it obviously depends a lot on the data source and the questions you’re asking.
Tom:
Yes. I think many of the studies we’re talking about are team efforts, so first there’s a modeler. The modeler doesn’t always understand the IT part, the database part.
There’s the person who does the parsing to get the data into the database, and that person may not understand databases themselves. We have forty faculty members and all the skills we could imagine, but if I were to ask one of those faculty members who may be teaching, say, the machine learning course, “Could you take on a section of the database course this term?” often I get resistance: “I’m not a database person.”
It’s hard to know it all, and if you’re an independent consultant, what you need to do is outsource. You bring in the other expertise that you need, because it’s not only hard to know it all, it’s also changing. Every year there’s some new technology out there, the latest and greatest. One person can be aware of what’s out there, yes, but one person can’t be a specialist, an expert, in every one of those areas. It’s just too much.
As far as what you can do easily: yes, there are companies that provide APIs, easy access and search facilities, so you don’t have to know anything technical. You just have to type in the string. Think of all the intelligence that’s behind Google itself and the search engines that are out there. Bing and Google, they have all kinds of algorithms and intelligence underneath them to give you the results, but then anyone can get those results. Any of us can type that in. It’s the people with the most technical skills that have the ability to dig deeper and be more efficient in their searches.
How much do you have to know to do basic stuff? Maybe next to nothing as far as specialized skills. But if you want to do a really detailed study, often you’re going to need to know more. This division between the IT and modelers is a real one. We see it all the time in the field of data science, and some people come to data science from IT and some people come to data science from statistics and from modeling. Often, they don’t speak to one another. Often, they’re speaking different languages as far as computers. It’s a challenge, but both are needed to do the work.
Think about the network science part itself, where you’re looking at links and followers, following a Twitter chain or a Facebook chain, or a network on LinkedIn. To analyze those requires special skills in network data modeling. The IT folks don’t have that, usually. A lot of the modelers don’t have that. Network scientists would have that.
Sean:
That’s the interesting aspect, right? It’s like all new types of analysis efforts borrow a little bit from other disciplines but then are also inherently unique unto themselves.
So with that in mind, let’s address a few more areas. One is social network analysis. I think if you went to an average conference or discussion on mining internet data, a logical sub-topic would be, “Well, how do you do social network analysis?”
This is an area where, the second you crack it open, you start running into terms like “betweenness centrality.” Hence, I think some people quickly feel that social network analysis, in any meaningful way, is beyond them.
Given that, what are some of the baby steps people can take to at least become more aware of the space or the tool set? I mean, you’re educating students all the time, so when you start talking about social network analysis, how do you get them started in a way that they feel like they’re doing something meaningful in a fairly quick manner?
Tom:
Well, to do something meaningful, you’re going to have to employ some sort of program. There is a variety of programs out there, some easier to use than others. I look primarily at open source, and at programs that are used for other purposes as well. In that regard, there’s a Python package called NetworkX. In R there’s igraph, and some specialized packages built on top of it that deal with social network analysis.
You start there and you start with examples that are already completed. I can talk about email a little bit. In the book, I show Enron. We do an analysis of Enron and we use R in that particular analysis. Students in my class learn, first of all, how to produce the example. Then I ask them to take it another step further, to do something new, to explore it more extensively.
Now, there are some nice packages out there. One in particular, Gephi, is interactive. It’s easy to use: you just point and click in a graphical user interface. You don’t have to know a lot of programming to use it. So one way to explore, to learn initially, is to export the data from Python into a format that Gephi can understand. Then you bring the data into Gephi and you point and click your way to a new analysis.
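A minimal sketch of that hand-off, building a toy network in NetworkX and writing it out in GEXF, one of the file formats Gephi reads; the names and edges are invented:

```python
import networkx as nx

# A toy network; nodes are people, edges are "wrote to" links.
G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "carol"),
    ("carol", "dave"),
])

# GEXF is one of the graph formats Gephi can open directly.
nx.write_gexf(G, "email_network.gexf")
```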
The big challenge in network analysis is finding the interesting subnetworks. That’s been a real challenge. Even a network of 300 nodes, a network of 300 people, can get really complicated, dense, and hard to understand unless you can dig into it and find your way to the most important players. That’s what you were talking about when you mentioned betweenness centrality. Betweenness centrality, eigenvector centrality: these ideas are in some ways related to what we understand about Google and the links in Google, about references and prestige, who knows whom and who’s getting referred to. All of that is related.
These are just ways of finding our way to the influential people. It might be people with power, it might be people who know more, it might be people who are more effective, but it’s the people who are referred to more. That’s a tool: the statistics relating to the people, the nodes of the network, are ways of getting to the more important ones. When you identify the more important ones, then you have subnetworks you can work with.
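As a sketch of those node statistics, here is NetworkX computing betweenness and eigenvector centrality on a random 300-node graph that stands in for a real network:

```python
import networkx as nx

# A random graph stands in for a real 300-person network here.
G = nx.erdos_renyi_graph(n=300, p=0.03, seed=42)

# Two of the node statistics mentioned above, each mapping a
# node to a score.
betweenness = nx.betweenness_centrality(G)
eigenvector = nx.eigenvector_centrality(G, max_iter=1000)

# Rank by betweenness to surface the most "important" players,
# then show both scores for those nodes.
top = sorted(betweenness, key=betweenness.get, reverse=True)[:5]
for node in top:
    print(node, round(betweenness[node], 3), round(eigenvector[node], 3))
```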
One approach is what’s called an egocentric network. You can reduce the problem down to a smaller problem where you’re looking at one person. So it’s your network, Sean, and the people that you know. Then maybe one step beyond that, one order beyond that: you look at all the links that you have to the people you know, and then look over one more level to the people they know. You may not go much further than that, because the network grows by orders of magnitude with each step, and you soon have a very large network to deal with. That’s one way of approaching it: have a focus.
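NetworkX has a direct helper for exactly this kind of egocentric extraction; a minimal sketch, using a classic small demo network in place of a real one:

```python
import networkx as nx

# Zachary's karate club, a classic small social network.
G = nx.karate_club_graph()

# The egocentric network around one focal node: that person
# plus everyone within two steps (friends, and their friends).
ego = nx.ego_graph(G, 0, radius=2)
print(ego.number_of_nodes(), "of", G.number_of_nodes(), "nodes")
```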
We have a project in one of our classes, the data prep class, where we’re looking at the Enron email data set, which is in the public domain, and the students are asked to see if they can find the culprits. Where were the problems? Where did they originate? To whom were these people talking? You’re looking for connections. You’re digging for problems in accounting, or for problems with the California blackouts and brownouts that were occurring at that time and were tied to Enron. The students dig into the text to find the people who were talking about those things, then they find to whom those people were talking, and it’s a lot of fun. It’s detective work. Social networks are used that way.
I think one of the reasons social networks are being talked about a lot is that marketers see them as a way of finding people, finding new buyers. If you buy an iPhone, your friends may be more likely to buy iPhones. It’s a higher hit rate, in terms of marketing effort, to go to the friends of the friends of people who are current customers. There’s not a lot of general interest in the theory of networks, I think, so much as in finding a way to new customers. That’s what I think is the driving force behind the interest.
Sean:
I completely agree. I think that, especially in a business-to-business context, that’s a lot of where this is coming from and it’s a lot of where the commercial tool set obviously is pointing at this point.
One last broad question around predictive analytics: I think one of the things that intrigues people about this, beyond the examples you mentioned, which are critically important, is the fact that we can peer into the future using these types of approaches.
The example I give all the time is the one around search traffic statistics. If you had a competitor that was getting a 3x increase in search traffic to their website, specifically if they were selling a SaaS product that customers can just buy without talking to a sales team, that traffic increase is a clear indicator of future sales for that company, meaning that some of that web traffic will translate into a future sales number.
Hence, in many ways, web traffic analysis can be predictive. Because the sales cycle might last six to twelve months, or twelve to eighteen, when you look at today’s statistics you’re seeing a bit of a view into the future, in terms of next year’s sales for the target company.
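One simple way to check that kind of leading-indicator claim, sketched here with pandas and entirely made-up monthly figures, is to lag the traffic series and correlate it with later sales:

```python
import pandas as pd

# Made-up monthly figures, purely for illustration.
df = pd.DataFrame({
    "traffic": [100, 120, 150, 300, 320, 310, 330, 400, 420, 450, 470, 500],
    "sales":   [10, 11, 12, 13, 15, 18, 30, 32, 31, 33, 40, 42],
})

# Shift traffic forward six months: does traffic six months ago
# line up with sales today?
df["traffic_lag6"] = df["traffic"].shift(6)
print(df["traffic_lag6"].corr(df["sales"]))
```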
With that straightforward example in mind, what are some of the more interesting things you’ve seen when it comes to doing predictive analytics?
Tom:
Yes to all of what you just said. Predictive analytics is a term I look at as synonymous with data science. What we’re doing is taking a business problem, an understanding of that problem and the business, taking data and IT, and creating models that put them together. Ultimately, what we have to do is speak to the business problem. We have to solve that problem. A lot of times that problem is, “What are we going to do about sales?” or, “What are we going to do about our competitors, and how are we going to increase sales?”
The tools that we’re talking about in that area are largely traditional statistical models, or machine learning models that do essentially the same thing in a more flexible way. We have explanatory variables, things that we understand right now. You mentioned one of them, which might be search engine performance, however measured. And we have the response: response variables are what we’re trying to predict, say, sales in the next quarter.
We look at the past, and we see how those explanatory variables related to the response, and from that, we build a model. The models that we build are sometimes regression models, when we’re trying to predict sales or another quantitative response. Other times, we’re building a classification model, where we’re trying to predict a group. It might be buy or not buy; it might be pay off the loan or not pay off the loan: some kind of categorical response, often binary. Which brand a person is going to purchase is a categorical, multinomial response.
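A minimal sketch of those two model types using scikit-learn, with invented data; the explanatory variables here (search traffic, ad spend) are illustrative assumptions, not anything from the book:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Explanatory variables (say, search traffic and ad spend);
# the numbers are made up.
X = np.array([[100, 5], [150, 7], [300, 9], [400, 12], [420, 15]])

# A quantitative response (next-quarter sales): regression.
sales = np.array([10.0, 12.0, 20.0, 28.0, 30.0])
reg = LinearRegression().fit(X, sales)

# A binary response (buy / not buy): classification.
bought = np.array([0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, bought)

print(reg.predict([[350, 10]]), clf.predict_proba([[350, 10]]))
```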
We build models that are classification models in that context. All of these things are well understood and there are many methods to deal with these problems, and that’s essentially what we do. I have another book which just came out this month, Marketing Data Science, that shows how to do that chapter by chapter. There’s a business problem in every chapter and it shows what kind of model could be used.
Even traditional models can be used. You don’t have to choose machine learning algorithms to get something useful. You can make predictions about “what if” types of questions. For example, what if we reduced the price of this product by ten percent? What could we expect in terms of sales response, given the competitive environment? You can play those kinds of simulation games within the context of the model. They’re exciting, interesting, and very important applications of predictive analytics.
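A hedged sketch of that simulation game: fit a simple regression with price as an explanatory variable, then rescore with the price cut by ten percent (all numbers invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: our price, competitor's price; values are made up.
X = np.array([[10.0, 11.0], [9.5, 11.0], [9.0, 10.5], [8.5, 10.0]])
sales = np.array([100.0, 115.0, 130.0, 150.0])

model = LinearRegression().fit(X, sales)

current = np.array([[10.0, 11.0]])
discounted = current.copy()
discounted[0, 0] *= 0.9  # what if we cut our price ten percent?

print(model.predict(current), model.predict(discounted))
```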
Sean:
Those are all excellent examples. Well, I think we could keep talking for a while because we’re both pretty interested and invested in this space, but I’m going to have to wrap it for now. I want to thank you for joining me for this intriguing conversation!
Tom: Most welcome, Sean. I enjoyed it!