Performance Evaluation in Machine Learning: The Good, The Bad, The Ugly and The Way Forward
"This paper gives an overview of some ways in which our understanding of performance evaluation measures for machine-learned classifiers has improved over the last twenty years. I also highlight a range of areas where this understanding is still lacking, leading to ill-advised practices in classifier evaluation. This suggests that in order to make further progress we need to develop a proper measurement theory of machine learning. I then demonstrate by example what such a measurement theory might look like and what kinds of new results it would entail. Finally, I argue that key properties such as classification ability and data set difficulty are unlikely to be directly observable, suggesting the need for latent-variable models and causal inference."
machine-learning  evaluation  measurement 
8 days ago by arsyed
[1901.11373] Learning and Evaluating General Linguistic Intelligence
We define general linguistic intelligence as the ability to reuse previously acquired knowledge about a language's lexicon, syntax, semantics, and pragmatic conventions to adapt to new tasks quickly. Using this definition, we analyze state-of-the-art natural language understanding models and conduct an extensive empirical investigation to evaluate them against these criteria through a series of experiments that assess the task-independence of the knowledge being acquired by the learning process. In addition to task performance, we propose a new evaluation metric based on an online encoding of the test data that quantifies how quickly an existing agent (model) learns a new task. Our results show that while the field has made impressive progress in terms of model architectures that generalize to many tasks, these models still require a lot of in-domain training examples (e.g., for fine tuning, training task-specific modules), and are prone to catastrophic forgetting. Moreover, we find that far from solving general tasks (e.g., document question answering), our models are overfitting to the quirks of particular datasets (e.g., SQuAD). We discuss missing components and conjecture on how to make progress toward general linguistic intelligence.
evaluation  nlp  nlu 
9 days ago by arsyed
Questions for a new technology.
"Given that coordination and communication swamp all other costs in modern software development it is a pressing area to invest in, especially as your team scales."
development  evaluation  management  technology  business  questions 
9 days ago by garrettc
Questions for a new technology. | Kellan Elliott-McCrea
Good questions to ask yourself or your team before jumping on The New Thing. Like Dr. Wave’s metaphor of a nail in the head: some things are more painful to change than to live with (and constant change is even worse) so adopting the new thing must be done judiciously.
technology  adoption  evaluation 
12 days ago by dlkinney
My alma mater is seeking a consultant (based anywhere) to conduct an evaluation of its…
evaluation  from twitter
13 days ago by sdp
1. What problem are we trying to solve?
2. How could we solve the problem with our current tech stack?
3. Are we clear on what new costs we are taking on with the new technology?
4. What about our current stack makes solving this problem in a cost-effective manner difficult?
5. If this new tech is a replacement for something we currently do, are we committed to moving everything to this new technology in the future?
6. Who do we know and trust who uses this tech? Have we talked to them about it? What did they say about it? What don’t they like about it?
7. What’s a low risk way to get started?
8. Have you gotten a mixed discipline group of senior folks together and thrashed out each of the above points? Where is that documented?
via:swillison  advise  evaluation  software 
14 days ago by leeomara
"7GUIs defines seven tasks that represent typical challenges in GUI programming. In addition, 7GUIs provides a recommended set of evaluation dimensions."
programming  guis  frameworks  evaluation  dopost 
20 days ago by niksilver
[1805.01070] What you can cram into a single vector: Probing sentence embeddings for linguistic properties
Although much effort has recently been devoted to training high-quality sentence embeddings, we still have a poor understanding of what they are capturing. "Downstream" tasks, often based on sentence classification, are commonly used to evaluate the quality of sentence representations. The complexity of the tasks makes it however difficult to infer what kind of information is present in the representations. We introduce here 10 probing tasks designed to capture simple linguistic features of sentences, and we use them to study embeddings generated by three different encoders trained in eight distinct ways, uncovering intriguing properties of both encoders and training methods.
embeddings  evaluation 
27 days ago by foodbaby

