[1810.13409] You May Not Need Attention
In NMT, how far can we get without attention and without separate encoding and decoding? To answer that question, we introduce a recurrent neural translation model that does not use attention and does not have a separate encoder and decoder. Our eager translation model is low-latency, writing target tokens as soon as it reads the first source token, and uses constant memory during decoding. It performs on par with the standard attention-based model of Bahdanau et al. (2014), and better on long sentences.
attention 
10 days ago
[1811.03600] Measuring the Effects of Data Parallelism on Neural Network Training
Recent hardware developments have made unprecedented amounts of data parallelism available for accelerating neural network training. Among the simplest ways to harness next-generation accelerators is to increase the batch size in standard mini-batch neural network training algorithms. In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured in the number of steps necessary to reach a goal out-of-sample error. Eventually, increasing the batch size will no longer reduce the number of training steps required, but the exact relationship between the batch size and how many training steps are necessary is of critical importance to practitioners, researchers, and hardware designers alike. We study how this relationship varies with the training algorithm, model, and dataset and find extremely large variation between workloads. Along the way, we reconcile disagreements in the literature on whether batch size affects model quality. Finally, we discuss the implications of our results for efforts to train neural networks much faster in the future.
batch-size 
11 days ago