We’ve all been watching the amazing progress in AI applied to huge datasets. Better yet, those of us not tracking billions of photos or restaurant reviews have not been left out: transfer learning techniques are making it “easy” to improve specialized models with data gathered for more general tasks. And, thanks to frameworks like PyTorch, applied libraries like fast.ai, and the accessibility of massively parallel hardware through companies like FloydHub, the methods are within the reach of small teams or even individual devs.
Our favorite example illustrating the above is ULMFiT, where Jeremy Howard and Sebastian Ruder at fast.ai show how to accurately classify movie reviews with just a few hundred labeled examples — and a model trained on a giant corpus of general English text.
Fast.ai, however, had one trick up their sleeve besides a working knowledge of English and a handful of labeled reviews. They had a mountain of domain-specific text: 100,000 sample reviews, demonstrating the difference between plain English and movie-critic-ese. That got us thinking: just how much domain data is enough to bridge the gap between a few training examples and a generic language model?
It’s not an idle question. Frame helps companies of all sizes label, rate, and otherwise make sense of their customer conversations on channels like Zendesk, Intercom, and Slack. We can confidently state: there’s a big gap between “we hold few enough conversations we can review them by hand” and “we have enough data to train a model from scratch”. What can a growing customer success operation with do with just a few dozen labels and a few thousand relevant conversations?
The answer, it turns out, is a quite a bit! In this post we’re going to use the same movie review dataset to show transfer learning can make a huge difference even when you’re just beginning to collect data in your domain. We’ll review how transfer learning can aide language analysis, present our experiment, and finally show why even companies with scaled datasets should care!
In case you’re tuning in from outside the deep learning community, let’s do a quick refresher. Deep neural networks are the technique behind many recent AI headlines — especially advances in perceptual tasks such as understanding images, audio, or written language. For our purposes, the most important thing to understand about deep nets is that they are made up of layers (that’s the “deep” part), each of which transforms its input into a new representation that’s closer to the answer the network has been trained to find.
A common complaint with deep learning is that we don’t always know what’s happening in those middle layers… but often they’re designed to have clear, interpretable roles! For example, many language models utilize embedding layers that transform individual words or small phrases to a coordinate system that puts similar meanings close together. As an example, this might help a translation AI apply its experience with the word “great” when it runs into a phrase using the word “illustrious”. Helpful, since you don’t see the word “illustrious” in as many contexts!
Now things get interesting: a layer that “knows” that “illustrious = great” isn’t just good for translation — it could help when learning to estimate sentiment, cluster different viewpoints, or much more. That’s transfer learning, where (part of) a model learned for one task makes it easier to pick up another. In fact, this particular example has become so popular that improving general-purpose language models has become a field in itself!
Transfer learning isn’t just good for moving between tasks; it can help us specialize a general model to work better in a specific environment. For example, a general-English sentiment model might be a decent place to start predicting movie reviews, but might not know that a “taut, tense thriller” is considered a good thing.
That’s where Jeremy and Sebastian Rudder’s Universal Language Model Fine-Tuning for Text Classification (ULMFiT) comes in. They refined a general language model with 100,000 IMDB reviews. Even though only a few hundred were labeled, the rest could help their AI learn that reviewers often substitute “taut, tense” with “illustrious” — or just “great” — bridging the gap to our labeled data. The results were impressive: 94% classification accuracy with just 500 labeled examples.
How Much (Unlabeled Data) is Enough?
ULMFiT shows a terrific proof of concept for NLP shops to adapt their models to work more effectively with smaller datasets. For Frame, we work with teams across a few different functions (success, support, and sales) and those teams are found within companies that span a huge spectrum of industry verticals and business maturity. From a data perspective that means we see conversational datasets that are highly contextual to a partner in the language they use and vary dramatically in size — both are situations of which would benefit from what has been demonstrated by ULMFiT.
For this research we focus on answering the following:
If I have a small, fixed budget for labeled examples, how much unlabeled domain-specific data do I have to collect to make effective use of transfer learning?
We answered the above with an experiment that pairs with fast.ai’s like this: they used a large, fixed pool of domain data and varied the number of labeled examples to show how the model improved. We held the number of labeled examples constant and varied the amount of additional unlabeled domain examples. More formally, our experiment consists of
- Language Modeling (variant)
- Language Task (invariant)
Our language task, sentiment classification, is the same as the task in the original ULMFiT paper and uses the IMDB movie review dataset. We hold the number of labeled sentiment training examples to 500 across all experiments, a number we thought was attainable for many small domains and would help emphasize the differential lifting power of different language models.
For language modeling, we vary the amount of domain data available to three alternative language models feeding into the language task:
- ULM Only: this is using Wikitext103 pre-trained English language model
- Domain Only: a domain based language model trained only on IMDB data
- ULM + Domain: the ULMFiT model
Training these models is computationally intense. At the largest domain sizes training can take a few days to complete on a typical work computer. To speed things up and to efficiently execute our experiment grid search in parallel we used FloydHub. (Additionally, the folks over at FloydHub keep their machines on the cutting edge, having both PyTorch v1 and fastai v1 capable GPU machines available almost immediately after they were released.)
After about 50 hours worth of intense GPU processing — but only 3 hours of wall clock time, thanks to FloydHub — we have our results!
What we see above tells a clear story:
- Considering both broad language structures from the ULM and unlabeled domain text always leads to a major improvement — even when domain text is minimal.
- We can get 75% of the performance UMLFiT reported with 33% of the domain data.
- Amazingly, ULM + 2,000 domain examples reached nearly 85% language task prediction accuracy.
Making Machine Learning Work for Everyone
At Frame, we’re obsessed with transfer learning because it enables a collaborative approach to AI: We think NLP should be a tool that our users can work with to explore and express a view about their own data — not just a way of transmitting received wisdom from some other corpus. When you start from that point of view, the question is less “how much data can I aggregate to inform a generic model”, and more “what is the narrowest domain for which I can develop a useful, specialized model”. How can we target specific use cases, and work from day zero, regardless of the amount of available data?
Our results confirm the value of the ULM+domain approach — it allows you to gracefully improve on specialized tasks as more unlabeled domain data becomes available. Moreover, we’ve shown the improvement comes rapidly, generating many of the learning benefits demonstrated in the ULMFiT paper with a fraction of the unlabeled data. Mapping how passively collected domain data improves our models helps us make informed decisions about when to surface domain-specific models, such as our neural tags — important both for our costs and our guidance to customers.
This is great news for any company that delivers NLU data products based on emerging or rapidly evolving language domains. Shortening the time-to insight doesn’t just mean we can target narrower domains. It means models can adapt to distributional drift over-time by shrinking our training windows, and lets us build much more granular hyperlocal models that focus on just one customer segment/location/etc!
To play with these approaches on your own, check out our repo and launch your own workspace on FloydHub by clicking here. Once in the FloydHub workspace, just open up 00_start_here.ipynb and follow along.