Text Is Sexy

This is about written text stored in any digital form that has been or can be rendered human-readable on a screen, and that was originally composed by a person. Such text can either be input directly by the person themselves using any available means, or be inferred from a medium carrying the original message (such as printed text on a sheet of paper that is later scanned, or speech that is automatically transcribed to text).

But what is text, and does it really concern me?

It is generally accepted in linguistic and evolutionary theory that any known writing system is long predated by the speech it records. Once made, this realization, however obvious it should have been, leaves a long trail of consequences for the way we reason about written communication. If we consider a phonetic, alphabetic writing system, in oversimplified terms any single character that can be mapped to a sound serves to represent an infinite number of variations of that sound as produced by any speaker of the language it belongs to, at any time the writing system has been in use. Sounds in isolation are rarely carriers of any message, but the morphemes they form are already mapped to meanings of various degrees of abstraction in the knowledge shared across all speakers of that language. Morphemes in turn are the building blocks of words, many of which map to very concrete ideas, objects or phenomena in the natural world. For many modern languages, words correspond to the sequences of characters we refer to as tokens on the printed page or on the screen, separated by white space and/or punctuation. With words we build phrases that are then combined into clauses and sentences, which is what written text consists of. Each of the foregoing statements is a naive generalization over a number of linguistic, behavioural and psychological theories and can very easily be challenged by any undergraduate student of linguistics (just ask them, should you chance upon one, what a word is, and you’ll see what I mean). Still, for the sake of argument, I’d like to treat written text in any given language as a conventional single representation of an infinite number of possible variations in spoken production.
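To make the “token” notion concrete, here is a toy sketch in Python of the naive “separated by white space and/or punctuation” view (the example sentences are invented, and a real tokenizer would be far more careful):

```python
def naive_tokenize(text):
    # The naive view: split on whitespace, then strip leading and
    # trailing punctuation. Nothing more sophisticated is attempted.
    tokens = []
    for chunk in text.split():
        token = chunk.strip(".,;:!?\"'()")
        if token:
            tokens.append(token)
    return tokens

print(naive_tokenize("The cat sat on the mat."))
# ['The', 'cat', 'sat', 'on', 'the', 'mat']

# The undergraduate's objection in action: is "isn't" one word or two,
# and is "New York" really two tokens?
print(naive_tokenize("Don't forget: New York isn't one token."))
# ["Don't", 'forget', 'New', 'York', "isn't", 'one', 'token']
```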

In software engineering we can rarely afford to raise our heads above the sea of complexity involved in building even a simple website that processes some sort of text input, to look beyond what goes literally from the keyboard to the database, and start philosophizing about language. Developers are typically not concerned with what a user actually writes or what a document transmitted as part of a transaction actually contains (unless they need to erect safeguards against, say, various types of injection attacks, or they are working on some smart content-mining feature, in which case they may already be in a frame of mind not unlike the one I’ve hinted at). In the general case, however, any feature directly dependent on the actual textual input will be pushed to the end of your project, when, as a rule, all the time and money allocated will have long been spent. Of course, such features may have been discussed in broader terms at various stages of the project, with the occasional mention of the great machine learning / natural language processing / artificial intelligence tech our company has developed and will kindly contribute to the project.

In truth, without access to the actual content to be processed by the system we’re building, there is little the implementation team can bring to the table other than a general-purpose data pipeline. Besides, our agile project management would hate to see us spend time building features that may never be needed, come the actual content. Therefore, if not central to the project, any “smart” feature half-promised in the hope of winning the contract is likely to be put together in haste and with little regard for the actual content it will depend on in production. And because of that it is hardly likely to perform to the client’s satisfaction. If there’s enough trust left to our credit at this stage, we might get away with a change request after the go-live date, which might once again end up on the back burner since the client is already live and we may have moved on to other projects and new clients. This seemingly dead end can hardly be blamed on any of the stakeholders; it is just the way of modern software development. While there can be no recipe for success in our field, let alone in developing data-driven features, in the case of human language encoded as written text an abstract framework of attitudes and precautions can be adopted to mitigate the effect of textual data not receiving proper attention while the actual software development takes place.

Your data, your rules

What’s common to any piece of text produced by a human is that it will bear traces of the producer’s own person, even in the strictest of contexts. Then, as long as the one analyzing the text by automated means is not the one producing the message, it will soon emerge that every major assumption the analyst makes breaks down at the n-th example, where n is not a very large number. Simple things such as the presence of HTML tags or other formatting markup, spelling and punctuation, the use of auto-formatting, hyphenation, indentation and whitespace, or 7-bit character encodings, to name just a few, may cause tangible disturbance to the delivery of a new business system operating on textual content, unless special care has been taken to address most of the idiosyncrasies inherent in the client’s data. In a sense, the system configured to “read” the text may very well fail to understand what the author “meant” when typing a given sequence of characters on their device or applying some formatting we didn’t really expect.
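A minimal sketch of the kind of defensive normalization this implies, in Python with only the standard library; the clean-up steps shown (markup stripping, Unicode and whitespace normalization) are illustrative, not a complete treatment:

```python
import html
import re
import unicodedata

def normalize_surface(text):
    """Flatten some common surface idiosyncrasies before any analysis."""
    # Decode HTML entities (&amp;, &nbsp;, ...) left behind by rich-text editors.
    text = html.unescape(text)
    # Drop markup tags naively; a real pipeline would use a proper HTML parser.
    text = re.sub(r"<[^>]+>", " ", text)
    # Normalize Unicode so visually identical characters compare equal.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace (tabs, newlines, non-breaking spaces).
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(normalize_surface("Please&nbsp;find the  <b>invoice</b>\nattached."))
# 'Please find the invoice attached.'
```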

One might argue that, in the age of self-driving cars and talking machines, having some variety in the textual content shouldn’t deter a modern business system from being smart about such text. Well, let’s pause and think about what it is that makes systems text-smart.

The dreaded rules

Unless we’re at a Google kind of place, or building the coolest startup from scratch, chances are that machine learning is not our management’s weapon of choice. For a business system provider there are a few good reasons for that, such as the ability to explain to the client how things work, and the need to ensure deterministic outcomes that won’t aggravate someone who is just trying to do their job on the computer your client gave them. Rules seem to go down well with senior management, too, as even they get to have a say in what would work in their organization. That is not the case when they are presented with the prospect of having to think of a document as a list of a million real numbers (most of them zero!) being fed to a black box that outputs just a handful of real numbers on the other end, telling them what the document is about or how angry the person writing it must have been.
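That list of mostly-zero numbers is just a bag-of-words vector. A toy illustration in Python (the five-word vocabulary and the message are invented; a real vocabulary would indeed run into the hundreds of thousands of words):

```python
from collections import Counter

# A toy stand-in for the million-word vocabulary: each document becomes
# a vector with one position per vocabulary word.
vocabulary = ["invoice", "payment", "angry", "refund", "thanks"]

def to_bow_vector(text):
    counts = Counter(text.lower().split())
    return [counts.get(word, 0) for word in vocabulary]

print(to_bow_vector("please refund my payment refund"))
# [0, 1, 0, 2, 0] -- mostly zeros even with only five dimensions
```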

Rules applied to text come in many different forms: from the simple if .. then .. else .. to dedicated pattern matching in the form of regular expressions or stacked finite-state automata, to name but a few. The idea of having a clean set of rules to do what we want, in a fashion similar to what an HTML parser does for a web page or a compiler does for a piece of code, is truly noble. However, while code is typically written in a regular “language”, one that can only be produced by a finite set of rules and conversely lends itself to interpretation by applying a subset of those rules, natural language is a biological phenomenon long predating the idea of feeding it to a limited machine for parsing. For a parser of a programming language the choice is fairly easy: if no possible parse tree can be inferred from the text, the code does not compute, and a syntax error is thrown, since no “meaning” (or instruction) can be derived from it. But for a reader of human language any text can carry some meaning, thanks to our incredible ability to adapt to errors and fill in the gaps in communication. Still, when processing human language automatically, we can’t do much better than to try to infer a highly likely parse tree from a sequence of tokens. Such a parse tree must be well formed within the constraints of the grammar rules for the given language. Depending on the parsing technology, any input that is not strictly grammatical will either fail to produce a parse, or will result in some highly likely parse (according to our model) describing a grammatical sequence (such grammaticality, again, being fully at our parser’s discretion) that may or may not correspond to the interpretation of a human reader. Given that we’re typically not guarding against ungrammatical input in a business system, any inherently “ungrammatical” input is likely to fly under the radar of any rules engine we throw at it.
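As a concrete illustration of such a rule and its brittleness, here is a hypothetical routing rule in Python (the pattern, the labels and the message formats are all invented):

```python
import re

# A handcrafted rule: "an invoice reference looks like INV- followed by digits".
INVOICE_RULE = re.compile(r"\bINV-(\d+)\b")

def route(message):
    match = INVOICE_RULE.search(message)
    if match:
        return ("invoicing", match.group(1))
    return ("general", None)

print(route("Question about INV-10293, please advise."))
# ('invoicing', '10293')

# Perfectly understandable to a human, invisible to the rule:
print(route("Question about inv 10293, please advise."))
# ('general', None)
```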

One way to account for possible variations in a rules engine would be to try to capture the typical deviations from the strict rules and relax those a bit, to allow them to fire also when, say, the user typed invoce instead of invoice. If we have an overview of the typical erroneous inputs, this should be doable. However, any errors we consider will be the idiosyncrasies of a limited number of people submitting such input to the business system. Any new user might introduce a totally new way of breaking the rule, at which point we would have to go back and revise the existing rules to accommodate their errors as well. As we go along, we will be evolving the rules engine from one that allows no variation whatsoever into one allowing an ever increasing amount of uncertainty. To make this work in practice we can’t stick to a fully deterministic output either, as a growing number of rules will start firing at the same time on the same input, requiring a judge to determine the most appropriate interpretation out of many possible ones, which is where automation starts to break down. To avoid that, we may decide to introduce some measure of confidence or likelihood for each interpretation, based on prior observations and context, allowing us to choose the best-scoring interpretation. When doing so we should strive not to assign unnecessary preference to any single interpretation, since that would be pretty much the same as applying a strict rule. In formal terms, that would mean reducing the bias introduced by hard-bounded rules while increasing the variance of the decision-making mechanism. Bias and variance are precisely the quantities one trades off when training a decision-making algorithm on observed data.
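A sketch of one such relaxation, using Python’s standard difflib as a stand-in for a proper edit-distance measure (the term list and the 0.8 threshold are invented; the threshold is exactly the bias/variance dial described above):

```python
from difflib import SequenceMatcher

KNOWN_TERMS = ["invoice", "payment", "refund"]

def fuzzy_match(token, threshold=0.8):
    """Return the best-scoring known term above the threshold, or None.

    Lowering the threshold admits more variation (less bias, more
    variance); raising it back approaches a strict rule.
    """
    best_term, best_score = None, threshold
    for term in KNOWN_TERMS:
        score = SequenceMatcher(None, token.lower(), term).ratio()
        if score > best_score:
            best_term, best_score = term, score
    return best_term, best_score

print(fuzzy_match("invoce"))  # ('invoice', ~0.92): the typo still fires the rule
print(fuzzy_match("advice"))  # (None, 0.8): nothing beat the threshold
```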

Wait, but I thought rules were rules and machine learning was magic

Well, no. Anything a machine learns from some data and can later perform on unseen data can be mapped to a pattern-matching rule of arbitrary complexity. Just as, armed with enough patience, one can write a monster of a regular expression to detect every occurrence of an action performed by Anna Karenina on any of the 800+ pages of the book, a machine can be trained to do the same from a sufficient number of varied examples of such patterns occurring in similar texts. What the machine comes up with will be its own set of rules, infinitely more complex than the one we designed by hand, but still one that does pattern matching over sequences of strings. Machine-learning algorithms can be debugged to explain the decisions taken at every step before the final outcome, so we can gain insight into the rules applied (although this is hardly feasible for a deep-learning algorithm, where the human-readable representation of the data is lost already at the input layer). If we do so, we might be surprised to find that the rules the machine came up with make very little sense in human terms. Yet the outcome of applying those rules is usually comparable to the outcome of applying any handcrafted rule.
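The same idea with a learned model, assuming scikit-learn is available (the four training examples are invented and absurdly few; any linear classifier over character n-grams would illustrate the point):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented training set; a real one would need orders of magnitude more.
texts = ["please pay invoice 42", "invoce overdue, second reminder",
         "lunch on friday?", "are you coming to the party"]
labels = ["invoicing", "invoicing", "chitchat", "chitchat"]

# Character n-grams make the learned "rule" robust to typos like "invoce".
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(),
)
model.fit(texts, labels)

print(model.predict(["inovice attached"]))  # likely ['invoicing']
# The learned weights over thousands of n-grams are the rule -- just not
# one a human would ever have written down.
```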

With this realization in hand we may be able to convince management that it’s not such a bad idea to support the decisions in our business system in a data-driven fashion, and spare ourselves the labour of devising rules by hand that would never capture all the variation in any larger number of examples we have to work with. Assuming we have already amassed a database in which human associates have identified the patterns we’re interested in across previous transactions, this should be a piece of cake. Lately I’ve been revisiting Philipp Koehn’s Statistical Machine Translation (an excellent study of the state of the art in phrase-based SMT and beyond, on the eve of the deep-learning revolution in NLP). What strikes me time and time again is the clean-room nature of the examples used. The mathematics behind translating the German Er geht ja nicht nach Hause to He does not go home is mind-blowing, whether it’s computed within the erstwhile state-of-the-art framework of phrase-based statistical machine translation or that of large-scale discriminative models. Yet it can only serve as an abstraction for the underlying set of rules we’re taught in school as long as the input truly makes sense within that set of rules. Any single spelling mistake, omission or duplication in the input may throw the system into a completely new state, depending on the bias/variance balance of the model. For typos or unconventional use of punctuation, the words affected would likely be treated as out-of-vocabulary (OOV) words. Depending on the algorithms involved, the system would typically either fall back on some special treatment for those cases, or their distribution would already have been accounted for in training by only considering the n most frequent words in the training data and treating the rest as OOVs. Neither of these may be a good approximation for a spelling mistake in a really frequent vocabulary word, which is not that uncommon in business systems, in particular those in use across borders and languages.
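The vocabulary truncation just mentioned, in a minimal Python sketch (the corpus is invented and the cutoff of three words absurdly small, to make the effect visible):

```python
from collections import Counter

def build_vocabulary(corpus_tokens, n_most_frequent):
    # Keep only the n most frequent words; everything else becomes <OOV>.
    counts = Counter(corpus_tokens)
    return {word for word, _ in counts.most_common(n_most_frequent)}

def encode(tokens, vocabulary):
    return [t if t in vocabulary else "<OOV>" for t in tokens]

corpus = "the invoice is the invoice the order is late".split()
vocab = build_vocabulary(corpus, n_most_frequent=3)  # {'the', 'invoice', 'is'}

# A typo in a frequent word gets the same blunt treatment as a rare word:
print(encode("the invoce is wrong".split(), vocab))
# ['the', '<OOV>', 'is', '<OOV>'] -- "invoce" is lost, even though "invoice"
# is among the most frequent words in the corpus
```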

One does not need to excel at spelling and grammar to do one’s job and communicate effectively with clients. This is especially true of organizations where the shared language is not native to a sizeable part of the workforce. Raw text input generated in such organizations can present real challenges for any automation relying on the presence of specific signals in the text. Such signals are either inferred by handcrafted rules corresponding to our own understanding of the laws of language (such understanding assumed to be correct), or learned from the presence of specific features in training examples, most of which follow those same laws. Ungrammatical examples in the training data are either a minority that has little effect on the learned parameters or, if a majority, will typically exhibit such a wide range of errors that hardly any useful signal can be picked up from them. Then, when presented with an unseen ungrammatical example, the system may lack the evidence necessary to fire any meaningful rule, since the new example may be “wrong” in a completely novel manner. This property of free-text input derives directly from the definition of written text as a single abstraction over an infinite number of possible variations of a given message originating as either speech or thought. Unless the laws of such abstraction are strictly applied, encoding the same message in the same written form twice could become elusive.

What can you do?

A case for standardization

I believe there is a need to ensure standardization across free-text inputs in the business system. Any approach to working with text will fall between two extremes: complete control over the input, either by removing free-text fields altogether or by allowing only input from a controlled subset of the language, or completely free input. The former ensures regularity and would allow you to apply your own handcrafted rules to trigger the desired outcomes, yet it will hardly go down well with your users in the year 2019 and counting. Any attempted automation based on the latter may break down completely. Think of the email reply suggestions you get from the world’s data-richest service, Google Mail: anything more specific than a Congratulations, I’m interested and the like is simply too dangerous to suggest, for fear of appearing silly. A mail service’s autoreply task is infinitely harder than that of a business system operating within the strict context of your line of business, which may trigger a limited number of outcomes (such as escalating an issue, notifying the correct party or redirecting to the answer of a specific FAQ, to give some random examples). You may not even intend to use plain-text user input to trigger any hand-offs in your system automatically, and would instead like to keep it as data supporting human decisions after the fact. Even so, you would be wise to be able to run simple aggregate functions on the data rather than having to read through each message to gather insight. Either way you’ll be better served by text that is as close as possible to a standard surface form, and that uses the accepted terminology for your line of business, as the sketch below illustrates. To ensure the best outcomes in such settings, I’d like to propose something of a middle ground between fully controlled domain-specific regularized languages and uninhibited social-media-style input.
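Once inputs gravitate towards a standard form, even the most trivial aggregation becomes informative (the ticket texts below are invented):

```python
from collections import Counter

# Invented free-text tickets after normalization at the source.
tickets = [
    "invoice overdue",
    "invoice overdue",
    "refund requested",
    "invoice overdue",
]

# A trivial aggregate: which issues dominate the inbox?
print(Counter(tickets).most_common())
# [('invoice overdue', 3), ('refund requested', 1)]
# With unnormalized input ("invoce over-due!!", "Invoice OVERDUE...") the
# same query degenerates into one bucket per message.
```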

Motivating users to do the right thing

If we’re not to impose any hard restrictions on textual input, maybe we could consider what it is that would otherwise incentivize users to adhere to a standard norm of writing. When we learn to write, this generally happens in a classroom environment with a more or less defined system of rewards and penalties. Random deviations from the taught norm, if not penalized outright, are highlighted and corrected, while the absence of any teacher’s marks on your paper, with a “Well done!” token of encouragement at the bottom ensuring the work has actually been reviewed, is an unmistakable sign of success.

Think of how password strength control has evolved lately. The dry “the password must contain this and this, and be this long…” is but a bad memory of the 90s. Many modern services will instead colour your input accordingly, so you get a true intuition of whether you’re doing okay, and will thus nudge you to choose a password that keeps you in the green. Given a good language model, which takes little more to train than the normalized free-text input one already has on record, a similar tool could be deployed to indicate whether new inputs conform sufficiently to the model and can consequently be interpreted as actionable. Maybe a language model alone wouldn’t suffice, so the business system would have to be augmented with further data-driven checks, such as fuzzy terminology matches. This, combined with autocomplete anywhere you can offer it, should definitely pull any free-text input towards a standard form that would lend itself more readily to data-driven analysis of any flavour.
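A sketch of the password-strength analogy applied to text, assuming nothing fancier than a bigram language model trained on one’s normalized historical inputs (the history, the smoothing constant and the thresholds are all invented; a production system would use a far stronger model):

```python
import math
from collections import Counter

# Invented stand-in for the normalized inputs already on record.
history = [
    "invoice overdue please advise",
    "refund requested for invoice",
    "invoice attached please confirm",
]

def train_bigrams(sentences):
    bigrams, unigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return bigrams, unigrams

BIGRAMS, UNIGRAMS = train_bigrams(history)

def conformance(text, alpha=0.1):
    """Average per-token log-probability under the bigram model.

    Higher means closer to the texts on record; the green/amber/red
    cut-offs would be tuned on real data, not hard-coded.
    """
    tokens = ["<s>"] + text.split()
    vocab_size = len(UNIGRAMS) + 1
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Additive smoothing so unseen bigrams don't zero everything out.
        p = (BIGRAMS[(prev, cur)] + alpha) / (UNIGRAMS[prev] + alpha * vocab_size)
        logp += math.log(p)
    return logp / max(len(tokens) - 1, 1)

print(conformance("invoice overdue please advise"))  # in-model: high score
print(conformance("yo sth weird w/ da bill???"))     # off-model: much lower
```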

Who’s responsible

At present, all this might seem irrelevant to business systems where input is being collected in free form for some reason but no data-driven techniques have been required to date. That does not mean next year’s digital transformation initiative won’t identify such a business system as ripe for optimization applying the latest in data-driven insights. It is not a matter of if but of when this will happen, and when it happens it is better to be prepared than to expect the project team to pull it off by some twist of data-science magic, just given access to your stash of invaluable human input. Because actual text content likely won’t be the focus of any major software development effort, you, as the business-system owner, have to help yourself by minimizing the need for the implementation team to clean your text. Besides, you’re sitting closest to the originator of any such text, both in space and time, so it should be easiest for you to figure out what precautions to take in this respect. It is infinitely more difficult for a third party, not familiar with your line of business or your organization’s lingo, to standardize your textual data long after the fact.

Concluding remarks

The title of this post is credited to a former colleague for whom I have the greatest respect. Since I have not obtained their permission to use their name in a personal blog, I’d rather not, until they ask me to. The context was automatic extraction of text from tables in printed/scanned PDF files. Anyone who’s ever peeked into the raw text extracted from a PDF file will immediately understand the complexity of the issue: what is really appealing to the human eye on a well-typeset white sheet of paper can be utterly confusing to a machine treating any unstructured input strictly as a linear sequence of characters. To me, the same goes for just about any free-text input. Just as no one expected to have to make it easier for a printed table to be translated back into a map-like data structure, regular users won’t be striving to give you the sexy kind of input: a well-formed paragraph with uniform, standard use of spacing and newline characters, punctuation and capitalization, one that would look as good in your database as when printed on a fast-fashion item of clothing. Yet this is the kind of text “smart” algorithms are trained on, so unless you’ve taken care to bring your input closer to that norm at the source, chances are the success rate of such algorithms on your data will suffer. Besides, investing in proper text normalization at the source, even without an immediate need to do so, would benefit not only the latest in machine learning fads but any kind of initiative to collect insight from your text data, both today and 10 years from now.