Tokenization—the least interesting #NLProc topic? Hell no! We, members of the @BigScienceW tokenization group, are proud to present:
✨Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP✨ arxiv.org/abs/2112.10508
What's in it? [1/10]

We start by examining the theoretical and linguistic foundations of trying to identify discrete units of language (§2), leading into "old-school" tokenization, these days more often called pretokenization (§3). Are words really obvious? Oh no they aren't... [2/10]
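Because "just split on spaces" sounds deceptively simple, here is a minimal Python sketch (my own illustration, not code from the survey) of naive pretokenization, showing why words aren't obvious: contractions and unsegmented scripts already break the rule.

# Toy pretokenizers (my own illustration, not from the paper).
import re

def whitespace_pretokenize(text: str) -> list[str]:
    # Naive rule: a "word" is anything between runs of whitespace.
    return text.split()

def rule_based_pretokenize(text: str) -> list[str]:
    # Slightly smarter rule: peel punctuation off into separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(whitespace_pretokenize("Don't tokenizers rock?"))
# ["Don't", 'tokenizers', 'rock?']  -- punctuation sticks to the words
print(rule_based_pretokenize("Don't tokenizers rock?"))
# ['Don', "'", 't', 'tokenizers', 'rock', '?']  -- is "Don't" 1, 2, or 3 units?
print(whitespace_pretokenize("自然言語処理"))
# ['自然言語処理']  -- Japanese has no spaces, so the whole phrase is one "word"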
I finally watched all the talks I wanted to, ended up importing 56 papers into my bib, and now present to you:
🎉 My 13 favorite papers (sorted alphabetically) at #EMNLP2020! 🔥
My first #ICML2020 was different from my n-th #acl2020nlp; despite that, or perhaps because of it, I did try to look for interesting papers that I could relate to but that might still teach me something new!
Papers, in roughly chronological order---each with a short summary :) [1/42]
“How Good is the Bayes Posterior in Deep Neural Networks Really?” (Florian Wenzel/@flwenz, Kevin Roth, @BasVeeling, Jakub Świątkowski, Linh Tran, @s_mandt, @JasperSnoek, @TimSalimans, @RJenatton, Sebastian Nowozin)
With @iclr_conf #ICLR2020 over and a bit of sleep under my belt, I'd like to give my short summary of a truly great event, and offer a list of the papers I enjoyed seeing (for those who are into that kind of thing).
In general, I feel lucky to live in a time when we have venues like these, full of really interesting papers at the intersection of NLP and ML (and other fields, but that's what I personally am most into, so my experience is biased).