XLKitLearn

I designed XLKitLearn to teach machine learning to non-technical students without the distraction of code - it exposes the power of scikit-learn through Excel, and works on PC and Mac computers. Students use it to fit random forests and boosted trees, and to carry out Latent Dirichlet Allocation on large datasets, all in Excel. It has changed the way I teach data science and analytics in my Business Analytics 2 class.

Are you looking to use XLKitLearn? If so, skip this page and go to xlkitlearn.com for installation instructions and a link to a user manual.

Are you looking for background on the tool and pedagogical notes? If so, read on. You are welcome to use the add-in for your own classes. Note, however, that if the add-in is run on a computer with internet access, every run logs the user's email address along with the add-in settings and any errors. This is done for debugging purposes, and to warn users if they are running an old version of the add-in.

The add-in is completely open source - the code is available here. Please do reach out if you decide to use it - I'm happy to answer any questions, provide whatever support I can, and discuss potential future improvements.

Highlights

  • Ability to fit a number of linear and tree-based predictive analytic models, as well as basic text analytic capability (encoding, and Latent Dirichlet Allocation).
  • One-step installation on PC and Mac, with all details taken care of.
  • Fully cross-platform - works natively on Windows and Mac computers, with no need for a Parallels environment of any kind.
  • Every run outputs a piece of Python code that carries out the same analysis. The code is dynamically generated for every run to use the simplest sklearn function for the specific model being fit. As such, XLKitLearn can act as a "gateway" to full Python.
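
To give a sense of the kind of scikit-learn code the last highlight refers to, here is a minimal sketch of fitting a random forest and scoring it on held-out data. It is an illustration only, not actual XLKitLearn output: the synthetic dataset, the variable names, and the parameter values are all assumptions chosen for the example.

```python
# Illustrative sketch only: NOT code emitted by XLKitLearn.
# The dataset is synthetic and every parameter value is an assumption.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for data that would normally come from a spreadsheet
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

# Hold out 25% of the rows to estimate out-of-sample performance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a random forest with 100 trees, each limited to depth 5
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
model.fit(X_train, y_train)

# Share of held-out rows that the model classifies correctly
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Out-of-sample accuracy: {accuracy:.3f}")
```

A student who has run the equivalent analysis in Excel could change `n_estimators` or `max_depth` here and re-run the script to see how the held-out accuracy responds.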

The following short demos should give you an idea of how the add-in works:

  • Predictive analytics functionality
  • Text analytics functionality
  • You might also be interested in this video I use to introduce the add-in in my classes; it discusses the general mechanics of changing the add-in settings and running it, and might be useful if you decide to use the add-in for your own classes.

Why design a brand new tool?

Before designing XLKitLearn, I did a broad search to see what other approaches existed to teach non-technical students data science. I found three approaches, but none met my needs exactly, hence my decision to create something new.

  • Other Excel Add-ins. A number of Excel add-ins already exist; they function similarly to XLKitLearn, often with far more advanced functionality. I had two main concerns with using them, however, in order of importance: (1) as far as I know, none of these add-ins are based on any widely-available ML libraries, and they do not make their code publicly available. This makes them limiting - it's hard to know exactly what they're doing under the hood, and there's no way to go beyond the functionality they offer. The whole idea of my class is to provide my students with the tools to communicate with data science teams in their respective companies - in that respect, it was essential for me to have a tool that would link directly back to scikit-learn; (2) these add-ins typically only work natively on a PC, not on a Mac - an increasing number of my students use Mac computers, and find it cumbersome to install a Parallels environment. These Parallels environments are also sometimes a little unstable.
  • No code or hands-on work. A perfectly respectable approach is to eschew all hands-on work, and simply present students with the results of the algorithms for discussion. This works, but I wanted my class to include the excitement - and experience - of actually working with data.
  • Copy-pasted Code Segments. Some classes I've seen take a 'code snippet' approach - they provide bits of Python or R code, and get students to copy and paste them. I'm not a fan of this method, and I think it is the worst of all worlds. Having to worry about syntax makes it harder for students to focus on the underlying principles. It also gives students a false sense of confidence - being able to paste a piece of R code that does one specific task on one specific dataset does not a data scientist make. With the Excel approach, it is at least clear what is and isn't being taught, and students can go in-depth on the principles. A final concern is that for certain complex, highly-tuned models, the code can get pretty hairy.

I have also found that even for technical students who know how to code, using a tool that allows them to focus on the data science without worrying about the syntax can be invaluable. XLKitLearn's code output can then be used to seamlessly transition to scikit-learn.