No matter how many NLP frameworks/libraries/models are available out there, eventually you’ll need a language model that nobody has published as a ready-made .zip. Life is a kind of N+1 problem, after all. So I’ve written a small article with a basic walk-through of the process of solving such a problem in an easy way.
In such a situation, I think, the highest chances are you’ll need a sentence breaker and/or tokenizer as the first step. Sometimes it might even be the only step you need to build on your own. Sure, there are universal algorithms, like whitespace tokenizers, char-level representations, sub-word embeddings etc. But what if that doesn’t work for you? What if you need a nice & simple sentence breaker that’s able to tell the difference between a period at the end of a sentence and a period inside an abbreviation? Or a tokenizer that’s able to catch specific phrase patterns? Actual requirements may vary, sure, but you get the idea: a real-world problem can easily have special requirements that are not covered by some awesome model you can download from the web.
Deep learning can help us here and, luckily, this kind of model is relatively easy to build & train these days. For my small task the sentence breaker/tokenizer had a couple of size & performance related requirements, so I decided to start with an LSTM prototype.
Disclaimer: any framework can be used to build a solution for this problem, but since this is my article, snippets will be in the language/framework of my choice. There will be no model to download etc. It’s just an overview of a problem/solution, and the real solution might (and most probably will!) be different for your problem.
Step 0. Architecture.
Why LSTM? Because the problem I’m going to solve can be easily framed as a sequence prediction problem. And LSTM is rather simple, efficient for this kind of problem, and has a cuDNN implementation. Since sentences in this subproject will be relatively short, I won’t use optimization tricks like truncated backpropagation through time (TBPTT); I just can’t see any sense in bothering with it in the initial version. But if in the future this part of the project requires bigger chunks of text to be processed at once, it can be added in no time.
Step 1. Corpus.
From my own experience, getting a training corpus for a language that has no publicly available models is the biggest problem. But with the help of luck & Google you might be able to find something suitable. E.g. a PoS-tagged corpus? Even if you don’t need parts of speech, such a corpus will already have tokenization applied. In the worst case you’ll have to apply tokenization manually & use more augmentation to increase the corpus size.
Step 2. Pipeline.
Once you have the corpus ready, it’s almost time to transform it into some format suitable for training. But before that we should consider a couple of things:
- text normalization: mostly task-specific stuff. E.g. if you don’t need numbers, replace them with some common placeholder. Or replace rare symbols with more common ones. As I’ve already said: task-specific :)
- augmentation: a cheap way to extend your corpus. It’s often considered a good idea to introduce some common typos into the corpus, replace some words with common abbreviations, or even make use of thesaurus vocabularies. Lots of options here.
- iterators: the iterator is just a programming pattern. You organize your data as a flow of “sentences”, which is translated into a flow of “characters”, with some optional preprocessing (like the steps mentioned above) applied on the fly.
And make sure all your pipeline “improvements” do not hurt the quality & quantity of the target patterns you’ll be looking for in your data. E.g. if you’re going to detect medical tokens like “dichloro-6-methyl-something-3-hydrid”, make sure you don’t completely remove numbers or hyphens from your corpus.
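As a minimal sketch of the normalization & augmentation ideas above (the class and method names are mine, not from any framework; real rules would be task-specific):

```java
import java.util.Random;

// Hypothetical pipeline helpers illustrating on-the-fly preprocessing.
public class Preprocess {

    // Normalization: collapse every digit into a single placeholder character.
    // Note this keeps digit *positions*, so hyphenated medical tokens survive.
    public static String normalize(String text) {
        return text.replaceAll("[0-9]", "0");
    }

    // Augmentation: duplicate one random character to simulate a common typo.
    // A real pipeline would rather use language-specific typo/abbreviation tables.
    public static String augmentTypo(String text, Random rng) {
        if (text.isEmpty()) return text;
        int i = rng.nextInt(text.length());
        return text.substring(0, i + 1) + text.charAt(i) + text.substring(i + 1);
    }
}
```

Such helpers would be applied inside the sentence/character iterators, so the corpus on disk stays untouched while every epoch sees slightly different data.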
Step 3. Representation.
Since we don’t have a tokenizer yet (we’re kind of building it!), we’re basically limited to sub-word representations. The easiest one is a char-level representation: we build a dictionary of all characters and feed them into the neural network as one-hot encodings. Since I’m going with an LSTM, the data should be represented as sequences of characters in 3D tensors: [batchSize, dictionarySize, timeSteps].
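A sketch of this representation in plain Java arrays (a real implementation would fill the framework’s native tensor type instead; the helper names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Char-level one-hot encoding into the [batchSize, dictionarySize, timeSteps] layout.
public class OneHot {

    // Build the character dictionary from a corpus, assigning indexes in first-seen order.
    public static Map<Character, Integer> buildDict(List<String> corpus) {
        Map<Character, Integer> dict = new LinkedHashMap<>();
        for (String s : corpus)
            for (char c : s.toCharArray())
                dict.putIfAbsent(c, dict.size());
        return dict;
    }

    // Encode a batch of strings; sequences longer than timeSteps are truncated,
    // shorter ones are implicitly zero-padded.
    public static float[][][] encode(List<String> batch, Map<Character, Integer> dict, int timeSteps) {
        float[][][] tensor = new float[batch.size()][dict.size()][timeSteps];
        for (int b = 0; b < batch.size(); b++) {
            String s = batch.get(b);
            for (int t = 0; t < Math.min(s.length(), timeSteps); t++) {
                Integer idx = dict.get(s.charAt(t));
                if (idx != null) tensor[b][idx][t] = 1.0f; // unknown chars stay all-zero
            }
        }
        return tensor;
    }
}
```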
Step 4. Model.
Model design is a tough task and tuning takes time, but we need something to start with, right? For the prototype I decided to go with a single unidirectional LSTM layer, followed by 2 independent output layers; a many-to-many recurrent network, basically. The sentence breaker output is a binary classifier, with class 0 used for regular characters and class 1 used for a sentence break. The tokenizer output uses 3 classes: class 0 for regular characters, class 1 for a “token start” event, and class 2 for a “token end” event. A pretty trivial setup.
So: 1 input, 1 LSTM layer, 2 outputs, one for sentence ends, the other for tokens. Sure, it’s doable as a single task as well, but I decided to go this way due to some “production” requirements.
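To make the two label schemes concrete, here is a hypothetical way to build the per-character training targets from known token spans and sentence ends (class numbers follow the scheme described above; everything else is my own naming):

```java
// Builds per-character target labels for the two output heads.
public class Labels {

    // tokens: array of {startIndex, endIndexInclusive} character spans.
    // 0 = regular char, 1 = token start, 2 = token end.
    // Caveat: a single-character token gets overwritten to 2 here; a real
    // scheme would need an extra class or a convention for that case.
    public static int[] tokenLabels(int textLen, int[][] tokens) {
        int[] labels = new int[textLen];
        for (int[] span : tokens) {
            labels[span[0]] = 1;
            labels[span[1]] = 2;
        }
        return labels;
    }

    // 0 = regular char, 1 = sentence break at this character.
    public static int[] sentenceLabels(int textLen, int[] sentenceEnds) {
        int[] labels = new int[textLen];
        for (int end : sentenceEnds) labels[end] = 1;
        return labels;
    }
}
```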
The only “trick” used here is a weighted loss function, but please take a look at the end of the next step before complaining about it.
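The weighting is needed because these labels are heavily imbalanced: breaks and token boundaries are rare compared to “regular” characters. One common (though not the only) way to pick the weights is inverse class frequency; a sketch:

```java
// Inverse-frequency class weights: rare classes get proportionally larger weight,
// so the loss isn't dominated by the overwhelming "regular character" class.
public class ClassWeights {

    public static double[] inverseFrequency(int[] labels, int numClasses) {
        long[] counts = new long[numClasses];
        for (int l : labels) counts[l]++;
        double[] weights = new double[numClasses];
        for (int c = 0; c < numClasses; c++)
            weights[c] = counts[c] == 0
                    ? 0.0
                    : (double) labels.length / (numClasses * (double) counts[c]);
        return weights;
    }
}
```

The resulting per-class weights would then be passed to whatever weighted cross-entropy loss your framework provides.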
Total Parameters: 302839
It’s a pretty small network, so hopefully it will be affordable on any kind of hardware.
Step 5. Training & Evaluation.
The simplest part here.
Some time later you’ll see something like this:
AUC (Area under ROC Curve): 0.9845094309720785
AUPRC (Area under Precision/Recall Curve): 0.96599434700799
Sure, there’s always room for improvement, but there are lots of articles out there explaining how to build better models.
Step 6. Use in production.
This step will be different for everyone, I guess. In my case it’s a pretty simple functional wrapper that takes a Java String in and:
- applies some minimal preprocessing
- converts text to chars & one-hot representation
- calls model.output()
- rolls over predictions to build sequence of tokens split into sentences
- returns those back
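The “rolls over predictions” step above can be sketched like this (assuming per-character argmax’ed class predictions from the two heads; the class and method names are illustrative, not the article’s actual code):

```java
import java.util.ArrayList;
import java.util.List;

// Rebuilds tokens grouped into sentences from per-character predictions:
// tokenPred uses 0/1/2 (regular/token start/token end),
// sentPred uses 0/1 (regular/sentence break).
public class Decoder {

    public static List<List<String>> decode(String text, int[] tokenPred, int[] sentPred) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int tokenStart = -1;
        for (int i = 0; i < text.length(); i++) {
            if (tokenPred[i] == 1) tokenStart = i;        // token start event
            if (tokenPred[i] == 2 && tokenStart >= 0) {   // token end event
                current.add(text.substring(tokenStart, i + 1));
                tokenStart = -1;
            }
            if (sentPred[i] == 1 && !current.isEmpty()) { // sentence break
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) sentences.add(current);   // flush trailing sentence
        return sentences;
    }
}
```

A production version would also have to handle noisy predictions (e.g. a “token end” with no preceding “token start”), which this sketch only partially covers.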
Once the framework I’ve used gets support for native graph exports, this network may be used as part of a bigger graph directly, without any Java glue. But for now I’m quite okay with this approach, since it gives me the breaker/tokenizer performance I need in single-core execution.
So, 6 steps later, I have something I can use for the next steps of my project. No doubt this model will eventually be significantly improved, but for now it’s OK.