Machine Learning on Source Code (#MLonCode) is an emerging domain of Machine Learning that takes source code as input data. Vadim tells the story of capturing code naturalness, one of the core concepts in MLonCode, through identifier (class, function, variable name) embeddings. Embedding millions of identifiers is challenging; watch how this problem was solved.
This description does not follow the actual video word for word; this script was used as a reference: http://vmarkovtsev.github.io/dotai-2018-paris/script.txt
After the talk, a few people asked: what is the killer feature of representation learning on an explicit co-occurrence matrix compared to regular word2vec? It is the ability to scale with the vocabulary size rather than with the dataset size. One can choose the N most frequent items by cropping the co-occurrence matrix and train Swivel on them. In our experience, word2vec is preferable for relatively small datasets (say, up to 50 GB); beyond that, Swivel takes over.
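The cropping step above can be sketched as follows. This is a minimal illustration, not the actual Swivel pipeline: the function name, the sliding-window co-occurrence definition, and the window size are all illustrative assumptions. The point is that the matrix passed to training is N x N regardless of how large the corpus was.

```python
from collections import Counter

def build_cropped_cooccurrence(sequences, top_n, window=2):
    """Count token co-occurrences within a sliding window, keeping only
    the top_n most frequent tokens (i.e. cropping the matrix).

    sequences: iterable of token lists (e.g. identifiers per file).
    Returns (vocab, matrix) where matrix is top_n x top_n.
    """
    # Rank tokens by corpus frequency and keep the top N.
    freq = Counter(tok for seq in sequences for tok in seq)
    vocab = [tok for tok, _ in freq.most_common(top_n)]
    index = {tok: i for i, tok in enumerate(vocab)}

    # Accumulate co-occurrence counts only for retained tokens,
    # so memory scales with top_n**2, not with the dataset size.
    matrix = [[0] * len(vocab) for _ in vocab]
    for seq in sequences:
        for i, tok in enumerate(seq):
            if tok not in index:
                continue
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i and seq[j] in index:
                    matrix[index[tok]][index[seq[j]]] += 1
    return vocab, matrix

sequences = [
    ["get", "user", "name"],
    ["get", "user", "id"],
    ["set", "user", "name"],
]
vocab, matrix = build_cropped_cooccurrence(sequences, top_n=3, window=1)
```

Here "user" appears three times and tops the cropped vocabulary; rare tokens like "id" and "set" never enter the matrix at all.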