The day started with a great intro from Jana Eggers, with a positive message about nurturing this AI baby that is being created rather than the doomsday scenario that is regularly spouted. Ours is a collaborative discipline spanning academia and industry, and we can focus on how we use this technology for good.
First speaker of the day was Hugo Larochelle from Twitter. As you can imagine, with 310 million active users and 300M tweets per day, they have a lot of data. He opened with the point that deep learning requires large amounts of compute and large amounts of data – both of these are now satisfied. The deep learning community has embraced openness and papers are regularly put up on arXiv, freely available to anyone who wants to read them1. There are open source libraries so you don’t have to code from first principles any more, and the whole discipline is mature in terms of diversity of ideas and discussion. All of this leads to a field with very fast growth. The requirement now is for data scientists to understand the problems and how to use the libraries rather than write the libraries themselves – we need to represent the data at its basic level. The Twitter Cortex group includes Whetlab and they’re focussing on classification and captioning of some of the rich media posted to Twitter. There was an impressive demo of animal videos and sports – is a tweet about a specific thing, and can it be directed to the correct users? Hugo had a lovely visualisation of Twitter communities – what things people talk about – and showed these connections down to the individual accounts. What was fantastic was that Twitter are open sourcing their distributed learning code2. Hugo’s vision of the future included the automation of machine learning – creating the architecture automatically so that the focus is on the interaction between the experimenter and the experiment (paper link): what is the asymptotic performance of a model? Can we pick the best candidates and train only those? He also postulated on lifelong learning and discussed infinite restricted Boltzmann machines (paper link), which can grow capacity through hidden layers as they learn, and suggested that maybe we should teach AI in a human fashion rather than training single tasks to expert levels…
The second talk was from Daniel McDuff from Affectiva. The company has moved on since he presented last year. Emotions affect how we live our lives and communicate, and understanding them can help us design devices that improve our lives3. Manual labelling of videos for emotions is time consuming, and different cultures vary in expressiveness. Using an opt-in app, Affectiva were able to gather large amounts of data (hundreds of thousands of videos worldwide). Even a smile can be ambiguous, but CNNs perform as well as hand-tuned recognition for smiles, and for gender recognition CNNs outperform hand-tuned networks. Generally people have neutral expressions, and some emotions are very tricky – identifying sentimentality, or the feeling of being informed, are good examples of this. Applications for the technology could include detecting whether someone is active and engaged or just passively watching – which could be great feedback for distance learning, for example – or even detecting depression.
The keynote was from Yoshua Bengio with a general talk on Deep Learning for AI. Devices are becoming more intelligent and computers need knowledge to make good decisions. With classical AI, the rules were fixed and the machines were spoon-fed with limited data. As humans, our knowledge is not formalised and is expressed in words – we have a lot of implicit knowledge. How do we get AI to gain “intuition” from data mining? The data alone is not enough – we need to get below the data to the underlying patterns. Deep learning is better than other machine learning techniques here because it can capture high level abstractions. Face recognition is now at near human levels, built up through recognition of edges, then smaller concepts (e.g. eyes) and then full faces. AI won’t learn from random functions – neural networks work because they have the right mathematical properties. After some history of the advances in deep learning, Yoshua played a video of the Nvidia self-driving car – initially it made a lot of mistakes and the human had to take over to avoid critical accidents; only a few months later, the same technology had advanced enough to be safe on the public road. He went on to discuss bidirectional RNNs, which give faster training and better decision making – particularly useful in understanding images and generating English descriptions.
This has advanced a long way. By identifying the subject of the image and related actions and objects, and placing these in context, grammatically correct sentences can be constructed: e.g. the Frisbee can be identified as the object of interest in the image, it is close to a woman, and the scene can be identified to give the sentence “A woman is throwing a Frisbee in a park”. This does not always work and some amusing bad examples were shown – though, like everything in this field, the bad examples will not stay around for long. A further demo showed voice recognition and answering questions based on an image: where is the cat? What is the cat standing on? Yoshua made the point that any two year old understands basic physics through observation, and we need to give the same reasoning to machines. Also, we are generating a lot of data and some of the compute needs to be done locally, so GPU hardware needs to be cheap and energy efficient. For high level abstraction, unsupervised learning is key – machines need to discover the abstractions from the data themselves. Yoshua’s book will be out soon4.
After a quick coffee break, Tony Jebara from Netflix gave a talk on double-cover inference in deep belief networks. Netflix are keen to get the right content to their subscribers – the catalogue is huge and growing – what should be promoted to each individual, and even what should be dropped from the catalogue? The popularity of items is personalised based on each user’s viewing history – not only what they’ve watched but what time of day they’ve watched it. This influences not only the order that items are presented within a row but also the order of the rows themselves. Users are categorised into profiles based on similar viewing histories and presented with the right content – and the right image to identify that content – at the right time. Machine learning is used to refine searches, select the catalogue, and even to smooth the sign-up process for new subscribers with no history. The methods have changed over time from a simple 2D matrix of title and popularity to a temporal approach with feedback to reinforce the decisions and probabilities. They achieve this with positive and negative feedback links in the deep belief network. However, frustrated cycles can easily emerge, causing oscillation problems where the network will not converge. To resolve this, if oscillations occur, they take the network and remove the repulsive links, then make a complete copy of the network and add repulsive links only between clone nodes – this results in a more stable network without frustrated cycles.
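As I understood the construction, it can be sketched as a simple graph transformation (the node/edge representation below is my own toy illustration, not Netflix’s actual code):

```python
# Sketch of the double-cover trick: duplicate every node, keep attractive
# links within each copy, and route repulsive links only between copies.
# Edges are (u, v, sign): sign=+1 attractive, sign=-1 repulsive.
def double_cover(nodes, edges):
    # Two clones of every node: (n, 0) and (n, 1).
    new_nodes = [(n, c) for n in nodes for c in (0, 1)]
    new_edges = []
    for u, v, sign in edges:
        if sign > 0:
            # Attractive links stay within each copy.
            new_edges.append(((u, 0), (v, 0), +1))
            new_edges.append(((u, 1), (v, 1), +1))
        else:
            # Repulsive links only connect a node to the other copy's clone,
            # which breaks the frustrated cycles that caused oscillation.
            new_edges.append(((u, 0), (v, 1), -1))
            new_edges.append(((u, 1), (v, 0), -1))
    return new_nodes, new_edges
```

The key property is that no repulsive link ever closes a cycle within a single copy, which is why the doubled network settles where the original oscillated.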
Next up, Honglak Lee talked about disentangled representations – the need to tease apart the factors of variation, closely related to generalisation of learning. How can we recover the latent factors of variation? For faces, for example, we can consider identity, pose and image features, and can model the interaction between these factors. With this model, Lee’s group have been able to hallucinate changes to facial expressions and change the viewpoint from static 2D images. With weakly supervised learning they also managed to create 3D transformations from 2D images, demonstrated with chair rotations. The same technology can be used for frame prediction, and Lee demonstrated this with video games such as Pac-Man and Q*bert, as well as inferring the relation between two images and using it to manipulate an unknown image, demonstrated with game sprites. Further work included text-to-image synthesis and manipulating images as the content of the text changed (e.g. the colour of a bird in the picture). All very interesting work.
The next section focussed on speech recognition, where there are some really challenging problems. John Hershey from Mitsubishi discussed the cocktail party problem – how to focus on a specific voice in a sea of voices to determine whether the conversation is interesting. The vision analogy would be a large number of images superimposed with transparency – all the waveforms combine and cause confusion. For image labelling, CNNs are well suited – there are nice constraints: contiguous local objects, one object per pixel, a distinct set of object classes, and you don’t have to distinguish sub-types unless a specific problem requires it. For the cocktail party problem you need to segment multiple instances of the same class – like each car in a picture as a distinct item – which is not easy. So why is this easy for humans? The spectrum is complex due to the combination of sources, and a single “pixel” can change over time. At Mitsubishi they are using mask-based speech separation with bidirectional LSTM RNNs. Classification alone doesn’t solve the problem, so they have added clustering. John showed a demonstration with two voices and the output with different techniques applied, culminating in their latest version, where the single voice could be heard clearly. They have also done this for three voices and I can imagine that this could have great applications for various assistant applications in noisy environments.
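The clustering step, as I understood it, assigns each time-frequency bin of the mixture spectrogram to a speaker by clustering learned per-bin embeddings; the cluster assignments then become the separation masks. A toy sketch (my own simplification – the embeddings would really come from the BLSTM, not be hand-set, and `cluster_bins` is my hypothetical helper):

```python
import random

def cluster_bins(embeddings, k=2, iters=10, seed=0):
    """Toy k-means over per-bin embedding vectors: bins belonging to the
    same speaker end up in the same cluster, giving a binary mask."""
    rng = random.Random(seed)
    centres = rng.sample(embeddings, k)
    assign = [0] * len(embeddings)
    for _ in range(iters):
        # Assign each bin to its nearest centre (squared Euclidean distance).
        for i, e in enumerate(embeddings):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(e, centres[c])),
            )
        # Recompute each centre as the mean of its member bins.
        for c in range(k):
            members = [embeddings[i] for i in range(len(embeddings)) if assign[i] == c]
            if members:
                centres[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign  # assign[i] == c means bin i is masked to source c
```

Applying cluster c’s mask to the mixture spectrogram then recovers (approximately) speaker c’s signal – and note that k can be raised to three voices without changing the model.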
Spyros Matsoukas was next to speak on the speech recognition problem, discussing the technology behind Amazon Echo and Alexa. There are two sorts of products – those where you initiate listening through a button and those that are always listening, waiting for a wake word. Echo has 7 microphones so it can overcome challenging acoustic conditions – once the direction of the activation sound has been identified, this can be used to focus on the correct voice. Echo allows information to be streamed, responds to questions (via Alexa) and connects to other devices. There is echo cancellation to ignore Echo’s own output while listening for wake words. Amazon are constantly expanding the capabilities by adding extra applications in the cloud. For example, in “Alexa, what’s the weather?”, Alexa is the wake word and is discounted; the rest of the sentence is sent to the cloud. The first step is speech recognition to convert the waveform to a form that can be passed through the NLP platform. At this step, inferences are added – in our example, no location was given so Alexa assumes the local weather. Once understood, the request is passed to the correct skill (e.g. the weather app) to get the results, which are sent back with a text-to-speech directive. While this may seem straightforward, deep learning is used at each step – accents, homophones and ambiguously pronounced words can all cause problems. It’s no surprise that Amazon’s deep learning uses the AWS platform, with a distributed approach to share the training load over multiple hosts so that the limit is the bandwidth and not the processing. To keep bandwidth small they employ some interesting techniques (paper link): initially the gradients are passed through 1-bit quantisation, but the residual is kept for the next pass so the information is not lost; then a gradient threshold is applied, everything below it is set to 0, and the result is compressed. This gives a 1000x compaction and allows Echo to remain personalised to each household.
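The residual trick is the clever part: anything not transmitted this pass is carried forward and added to the next gradient, so nothing is permanently lost. A minimal sketch of the idea as I understood it (my own toy code, not Amazon’s – real implementations pack the sent values into bit vectors):

```python
def compress(grads, residual, threshold=1.0):
    """Toy threshold-based gradient compression with error feedback.

    Returns a sparse message {index: +/-threshold} to transmit, plus the
    new residual to add back in on the next pass."""
    sent = {}
    new_residual = []
    for i, g in enumerate(grads):
        total = g + residual[i]  # error feedback: include what we skipped last time
        if total >= threshold:
            sent[i] = threshold            # transmit ~1 bit of sign information
            new_residual.append(total - threshold)
        elif total <= -threshold:
            sent[i] = -threshold
            new_residual.append(total + threshold)
        else:
            new_residual.append(total)     # below threshold: send nothing, keep it all
    return sent, new_residual
```

Because most gradient elements fall below the threshold on any given pass, the transmitted message is tiny, which is where the claimed compaction comes from.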
Amazon have open-sourced their deep learning software DSSTNE (pronounced Destiny5).
Andrew McCallum has been using deep learning to determine an open schema, allowing questions to be asked while still retaining logical reasoning and structure in the data. He initially took job openings from companies to create a database and used this to predict trends. This work led into a knowledge base where unstructured data was processed to extract entities (authors, relations, locations etc.) and add structure before adding it to the knowledge base. Schemas are typically hand designed and can be too fine, too broad, or incomplete. For example, “affiliated with”, “studied at” and “professor at” are all associated with a university but do not have the same implication. A limited vector space can be used to create an open schema for entities and relations, filling in the gaps with logical inference. In the example he gave, they trained on a large knowledge base containing cycles of relations – by removing one of the links they could get the model to infer the final link. Once trained, the model could infer the schema with no human labelling. This has some great potential for removing the bottleneck in training and I look forward to reading a paper on this.
Next was Nathan Wilson from Nara Logics discussing the biological foundations for deep learning. This is a topic close to my heart, as my own doctoral research was on computational models of biological neurons, their adaptation in response to different neurotransmitters, and resilience to network disruption. Nathan’s talk drew parallels between how biological neurons have inspired deep learning techniques, what we can see in the brain that is not yet used in computation, and which features of good deep learning might predict how things happen in biological neurons. I’m afraid I didn’t take many notes for this talk as it was all a warm comfortable feeling for me, but I encourage you to check out Nara Logics for more detail on what they do.
Facebook’s Adam Lerer continued the biological theme with a great talk on learning physical intuition by example. Toddlers learn the balancing of blocks very early without needing to be “programmed” with an understanding of physics – they can spot precarious situations and determine what happens next through experience of the real world. Could this be achieved with AI? They built a physics engine and used it to predict the fall of randomly stacked blocks. Can deep learning networks learn physics from observation of videos? Yes! But how generalised can it be? They used a game engine to create synthetic data and recorded real videos as a test set. The required output was a decision on whether the blocks would fall and, if so, where they would land. Using a mask-based network, PhysNet, they split the prediction into a binary output for the fall and a location mask for the final positions. With synthetic data, PhysNet outperforms humans by far; with real data, PhysNet only slightly underperforms compared to humans. They also tested on a different number of blocks than was used for training and found that although performance degrades, it still outperformed humans on the task.
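The ground truth the network has to learn is just centre-of-mass physics. A deliberately simplified 2D checker for the binary “will it fall?” output (my own sketch of the rule the physics engine encodes, ignoring friction and assuming equal-mass blocks – PhysNet itself sees only images):

```python
def tower_falls(blocks):
    """Toy stability check for a stack of 2-D blocks, listed bottom first.

    Each block is (x_centre, width). The tower falls if, for any block,
    the centre of mass of everything above it lies outside that block's
    top face - the intuition toddlers pick up by observation."""
    for i in range(len(blocks) - 1):
        above = blocks[i + 1:]
        com = sum(x for x, _ in above) / len(above)  # equal-mass assumption
        x, w = blocks[i]
        if not (x - w / 2 <= com <= x + w / 2):
            return True
    return False
```

PhysNet has to recover exactly this decision boundary (plus the landing-position mask) from pixels alone, which is why matching human performance on real videos is impressive.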
After a quick coffee break we had a panel focussing on making deep learning as impactful as possible in the near term. There was general consensus that momentum was building for the application of deep learning to the life sciences, particularly medicine. In robotics, the rewards are sparse and labelled data is lacking, so data set generators are needed. Collaboration with academia is key. The panel suggested that large scale simulations would be necessary for tasks that cannot be trained in the real world. We also need to understand the limitations of the technology – there is much more human-machine interaction and the challenge is to make that interface as easy as possible to allow humans to do their jobs faster. A question left hanging was whether we would ever trust a machine all the way through to medical diagnosis…6. Deep learning is almost commoditised, although there are problems where the number of features is far larger than the available sample set (e.g. the human genome), so abstractions are required. Right now, the big prize is getting the right data to solve the problems, but data cannot solve all problems. There could be something better than deep learning in the future…
The next speaker was Urs Koster from Nervana Systems, who have created their own optimised hardware as a cloud service, with their own environment (Neon) to create deep learning models. Results were impressive – 4x faster than Caffe for training and 2x faster than Nvidia’s Titan X. Dennard scaling has ended, and for the last 10 years the only performance increase for GPUs has come from adding transistors. The bottleneck is the communication on and off the chip, so they added multipliers in local memory and created a custom interconnect in a 3D torus. The latest hardware they are about to launch is 5x faster than Nvidia’s, but with the launch of the new 1080 GPU in the last few days I wonder if this advantage will stay?
Key to putting deep learning on local devices is power consumption. Where you cannot afford the latency (or do not want to send data) to the cloud for processing, the local device needs to be powerful enough to manage the decision making while still being energy efficient to avoid battery drain. Vivienne Sze and Yu-Hsin Chen from MIT presented their work on building energy-efficient accelerators for deep learning. Using an example of image recognition, they showed that each layer in the network requires large amounts of computation and data transmission – efficiency requires parallelising operations and reusing data and filters. Distributing the registers and control to each ALU also provides large increases in efficiency: a DRAM access is 200x more energy expensive than having the data in local memory. Zero compression can save a further 45% of power. MIT have created the Eyeriss DCNN accelerator system, which can run AlexNet at 35 fps for 278 mW, and this can be dropped below 100 mW by halving the frame rate. At full frame rate the energy use is not far off normal hardware requirements, showing that there is still much further to go, but this might be the beginning of the next revolution in chip design.
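The zero-compression saving works because post-ReLU activations are mostly zeros, so runs of them can be collapsed into single tokens and skipped entirely. A toy run-length sketch of the idea (my own illustration – Eyeriss does this in hardware, not Python):

```python
def compress_zeros(values):
    """Run-length encode zeros in an activation stream: one ("Z", n)
    token replaces n zeros, saving the bandwidth (and energy) of
    moving and multiplying them."""
    out = []
    i = 0
    while i < len(values):
        if values[i] == 0:
            run = 0
            while i < len(values) and values[i] == 0:
                run += 1
                i += 1
            out.append(("Z", run))       # whole run of zeros in one token
        else:
            out.append(("V", values[i]))  # non-zero value passed through
            i += 1
    return out
```

The sparser the activations, the bigger the win – which is why the saving shows up as a power figure rather than a fixed compression ratio.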
This ended the first day; with another great line-up of speakers for day 2, there was a lot to think about.
- I’m a big advocate that science should be freely available to everyone and not hidden behind pay barriers – this archive is a fantastic resource – go get lost in the rabbit hole for a few days reading it 🙂 ↩
- link will be added when I have it. ↩
- Last year he was adamant that we should be able to hide our emotions from computers if we wanted to. I’d be curious to know whether he still believes this, but sadly there wasn’t time for my question ↩
- link to follow ↩
- I feel like I want to make a joke from The Core here but won’t… ↩
- and given the healthcare and litigation system in the US no doubt that question wouldn’t be resolved until everyone knows who is liable if the machine gets it wrong… ↩