I’ve taken longer than I normally would to respond to some recent news stories about AI “outperforming humans” in reading comprehension “for the first time”. Partly because I can’t help the wave of annoyance that fills me when I see articles so obviously designed to instil panic and/or awe in the reader without any detail, but also because I feel it’s important to do some primary research before refuting anything1. The initial story broke that an AI created by Alibaba had met2 the human threshold on the Stanford Question Answering Dataset (SQuAD), followed closely by Microsoft outperforming Alibaba and (slightly) exceeding the human score. Always a safe bet for sensationalism, mainstream media pounced on the results to announce that millions of jobs are at risk… So what’s really going on?
A few sites have started to debunk the sensationalism with what appears to be some copy-paste journalism, and I’m not sure which came first. They all seem to be making the same points: SQuAD is biased towards machines, using minimally paid humans for whom English may not be a first language as a benchmark may not be appropriate, and SQuAD high scores are not representative of natural reading.
Let’s understand SQuAD first, and why it is biased towards machines, from the original paper. The dataset was created by crowdsourcing questions on paragraphs of text from Wikipedia and then asking a different set of people to answer the questions by highlighting the answer in the text. The questions could be free form but the answers were fixed to the displayed text. This is a slightly unnatural way of asking questions, as it forces phrasing that directly follows from the text rather than comprehension. Since AI is very good at searching for words quickly, it has a natural advantage in finding the correct part of the paragraph and searching outwards from that point for the answer it needs. The human accuracy was produced by merging the crowdsourced results, taking a third as the prediction and the remaining two thirds as the ground truth. This is a measure of the quality of the data set and not a good representation of human reading. Since the AI is measured against the human standard (and also trained against the human standard), inaccuracies in this standard make the AI results less reliable.
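To make the scoring concrete, here’s a minimal Python sketch of the token-overlap F1 that SQuAD-style evaluation uses, simplified (the official script also normalises punctuation and articles, which I’ve left out). Scoring one annotator’s answer against the remaining annotators’ answers, as described above, is roughly how the human benchmark figure is produced. The example texts are my own illustration:

```python
from collections import Counter

def token_f1(prediction, ground_truth):
    """Token-overlap F1 between a predicted span and one reference span."""
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction, references):
    """Score against the best-matching reference, since SQuAD collects
    multiple crowdsourced answers per question."""
    return max(token_f1(prediction, ref) for ref in references)

# Treat one person's highlighted span as the "prediction" and the
# others' spans as "ground truth" -- my own toy example.
references = ["the Norman conquest of England", "Norman conquest"]
print(best_f1("the Norman conquest", references))  # prints roughly 0.8
```

Note that a perfectly sensible human paraphrase that shares no tokens with the highlighted span scores zero, which is one reason this metric rewards extraction over comprehension.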
When you look at the breakdown of question types, they segment into simple “what”, “when”, “where”, “who”, “how” and “how many” categories, so while getting these scores is not a trivial task, it doesn’t require any deep level of comprehension.
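To illustrate why such questions reward word-matching, here’s a toy baseline, entirely my own sketch and nothing like the real Alibaba or Microsoft systems: pick the sentence sharing the most content words with the question, which is where the answer span almost always sits.

```python
# Words that carry little content for matching -- my own rough list.
STOPWORDS = {"what", "when", "where", "who", "how", "many",
             "the", "a", "is", "was", "were", "did", "in", "of"}

def naive_answer_sentence(question, paragraph):
    """Return the sentence sharing the most content words with the
    question: crude lexical matching, not comprehension."""
    q_words = {w.strip("?.,!").lower() for w in question.split()} - STOPWORDS
    best, best_score = "", 0
    for sentence in paragraph.split("."):
        words = {w.strip("?.,!").lower() for w in sentence.split()}
        score = len(words & q_words)
        if score > best_score:
            best, best_score = sentence.strip(), score
    return best

paragraph = ("The tower was completed in 1889. "
             "It stands on the Champ de Mars in Paris.")
print(naive_answer_sentence("When was the tower completed?", paragraph))
# -> The tower was completed in 1889
```

Even this crude matcher lands on the right sentence, and from there a real system only has to search outwards for a date; no understanding of towers or construction is needed.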
A further problem is the wondrous ambiguity of language. While there are rules for language, we break them all the time. Our spoken phrasing differs from our formal writing, and differs again from our informal writing, particularly when condensed into snappy short forms on the internet. The Wikipedia articles on which the questions were based were the top 1000 pages, which will have converged over time into very clear, formal English4. The human brain is fantastic at finding meaning in language, even if we have to keep seemingly unrelated items in our heads to disambiguate “that”, “it” and “they” in follow-on sentences. Machines are currently really bad at this. While some lexicon substitution was applied to the data set, and some ambiguity was introduced with questions and answers spanning multiple sentences, this was relatively straightforward to counter.
To be able to understand language as we do, you can’t just look at sentence structure. You need the context of the words, so when you see a word like “tears” you know whether I’m talking about crying or ripping something. You need to know what’s usual and unusual for an object, so in “do not chain bicycles to these railings as they may be removed” you know that “they” means the bicycles, as railings rarely move. British English5 is wonderfully varied. We have some atrociously complex spelling and grammar rules. Most of the time we don’t even know why we speak the way we do.
The order of adjectives – rules we follow without knowing… pic.twitter.com/24NQtyht0y
— S J Watson (@SJ_Watson) January 2, 2017
Neither the Alibaba nor the Microsoft results indicate that AI can read natural language better than a human, but as information retrieval systems for getting information quickly out of a large volume of text, these are pretty good advances. We’ll see AI assistants, both voice and text driven, getting better. I’m yet to see a huge innovation in this area, but it’s great to see regular gradual improvements. As with any public data set with a leaderboard, how the system performs with real questions and answers and evolving language is far more interesting to me.
We’re still a long way off logical reasoning based on text6 and the ability to infer correctly when language is ambiguous, or understand when questions really do need clarification.
Thinking on it, maybe we should set a new standard using the GMAT3 – if an AI can pass the GMAT then we will have made a good step towards artificial comprehension.
- As usual, I implore you not to go spreading things around, even from trusted sources, without checking out the facts yourself. ↩
- Statistically rather than looking at the exact score. ↩
- Graduate Management Admission Test, used as a standard for entry onto an MBA course. I’ve taken it; it was quite fun. ↩
- Mainly because there is a whole subculture of people who want editing points and will pick up on anything that needs correction… ↩
- I only make the distinction here because I can’t comment from experience on other variants of English. ↩
- But then I could name a lot of humans who are equally far away from that point as well 🙂 ↩