evaluation   9060

Performance Evaluation in Machine Learning: The Good, The Bad, The Ugly and The Way Forward
"This paper gives an overview of some ways in which our understanding of performance evaluation measures for machine-learned classifiers has improved over the last twenty years. I also highlight a range of areas where this understanding is still lacking, leading to ill-advised practices in classifier evaluation. This suggests that in order to make further progress we need to develop a proper measurement theory of machine learning. I then demonstrate by example what such a measurement theory might look like and what kinds of new results it would entail. Finally, I argue that key properties such as classification ability and data set difficulty are unlikely to be directly observable, suggesting the need for latent-variable models and causal inference."
machine-learning  evaluation  measurement 
8 days ago by arsyed
[1901.11373] Learning and Evaluating General Linguistic Intelligence
We define general linguistic intelligence as the ability to reuse previously acquired knowledge about a language's lexicon, syntax, semantics, and pragmatic conventions to adapt to new tasks quickly. Using this definition, we analyze state-of-the-art natural language understanding models and conduct an extensive empirical investigation to evaluate them against these criteria through a series of experiments that assess the task-independence of the knowledge being acquired by the learning process. In addition to task performance, we propose a new evaluation metric based on an online encoding of the test data that quantifies how quickly an existing agent (model) learns a new task. Our results show that while the field has made impressive progress in terms of model architectures that generalize to many tasks, these models still require a lot of in-domain training examples (e.g., for fine tuning, training task-specific modules), and are prone to catastrophic forgetting. Moreover, we find that far from solving general tasks (e.g., document question answering), our models are overfitting to the quirks of particular datasets (e.g., SQuAD). We discuss missing components and conjecture on how to make progress toward general linguistic intelligence.
evaluation  nlp  nlu 
9 days ago by arsyed
Questions for a new technology.
"Given that coordination and communication swamp all other costs in modern software development it is a pressing area to invest in, especially as your team scales."
development  evaluation  management  technology  business  questions 
9 days ago by garrettc
Questions for a new technology. | Kellan Elliott-McCrea
Good questions to ask yourself or your team before jumping on The New Thing. Like Dr. Wave’s metaphor of a nail in the head: some things are more painful to change than to live with (and constant change is even worse), so the new thing should be adopted judiciously.
technology  adoption  evaluation 
12 days ago by dlkinney
Twitter
My alma mater is seeking a consultant (based anywhere) to conduct an evaluation of its…
evaluation  from twitter
13 days ago by sdp
Questions for a new technology. | Kellan Elliott-McCrea
1. What problem are we trying to solve?
2. How could we solve the problem with our current tech stack?
3. Are we clear on what new costs we are taking on with the new technology?
4. What about our current stack makes solving this problem in a cost-effective manner difficult?
5. If this new tech is a replacement for something we currently do, are we committed to moving everything to this new technology in the future?
6. Who do we know and trust who uses this tech? Have we talked to them about it? What did they say about it? What don’t they like about it?
7. What’s a low risk way to get started?
8. Have you gotten a mixed discipline group of senior folks together and thrashed out each of the above points? Where is that documented?
via:swillison  advise  evaluation  software 
14 days ago by leeomara
7GUIs
"7GUIs defines seven tasks that represent typical challenges in GUI programming. In addition, 7GUIs provides a recommended set of evaluation dimensions."
programming  guis  frameworks  evaluation  dopost 
20 days ago by niksilver
[1805.01070] What you can cram into a single vector: Probing sentence embeddings for linguistic properties
Although much effort has recently been devoted to training high-quality sentence embeddings, we still have a poor understanding of what they are capturing. "Downstream" tasks, often based on sentence classification, are commonly used to evaluate the quality of sentence representations. The complexity of the tasks makes it however difficult to infer what kind of information is present in the representations. We introduce here 10 probing tasks designed to capture simple linguistic features of sentences, and we use them to study embeddings generated by three different encoders trained in eight distinct ways, uncovering intriguing properties of both encoders and training methods.
embeddings  evaluation 
27 days ago by foodbaby
