NLP's ImageNet moment has arrived
Word2vec and related methods are shallow approaches that trade expressivity for efficiency. Using word embeddings is like initializing a computer vision model with pretrained representations that only encode edges: they will be helpful for many tasks, but they fail to capture higher-level information that might be even more useful. A model initialized with word embeddings needs to learn from scratch not only to disambiguate words, but also to derive meaning from a sequence of words. This is the core aspect of language understanding, and it requires modeling complex language phenomena such as compositionality, polysemy, anaphora, long-term dependencies, agreement, negation, and many more. It should thus come as no surprise that NLP models initialized with these shallow representations still require a huge number of examples to achieve good performance.

In NLP, models are typically a lot shallower than their CV counterparts. Analysis of features has thus mostly focused on the first embedding layer, and little work has investigated the properties of higher layers for transfer learning. Let us consider the datasets that are large enough, fulfilling desideratum #1. Given the current state of NLP, there are several contenders.

Language modeling (LM) aims to predict the next word given its previous words. Existing benchmark datasets consist of up to 1B words, but as the task is unsupervised, any number of words can be used for training. The popular WikiText-2 dataset, for instance, consists of Wikipedia articles.
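As a toy illustration of the objective, here is a count-based bigram model standing in for a neural LM (the corpus is made up; a real LM conditions on the full history, not just one previous word):

```python
from collections import Counter, defaultdict

def train_bigram_lm(corpus):
    """Count word bigrams so we can predict the next word from the current one."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequently observed next word (None if `word` is unseen)."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

corpus = ["the cat sat on the mat", "the cat ran"]
lm = train_bigram_lm(corpus)
print(predict_next(lm, "the"))  # "cat" (seen twice after "the", vs. "mat" once)
```

Because the supervision signal is just "the next word", any raw text provides training data for free.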

In light of this step change, it is very likely that in a year’s time NLP practitioners will download pretrained language models rather than pretrained word embeddings for use in their own models, similarly to how pre-trained ImageNet models are the starting point for most CV projects nowadays.
nlp  deeplearning 
1 hour ago
It was raining in the data center
“Although the actual paths of fiber-optic cables are considered state and company secrets, it is not unlikely that most or all of the Facebook facility’s data runs along this route. In The Prehistory of the Cloud, Tung-Hui Hu describes the origin of private data service with telecommunications giant Sprint (Southern Pacific Railroad Internal Network), which sold excess fiber-optic bandwidth along train lines to consumers beginning in 1978. He goes on to state in the same text, that “virtually all traffic on the US Internet runs across the same routes established in the 19th century”.”
military  geography  usa  oregon  internet  infrastructure  facebook 
4 days ago
The Key to Everything | by Freeman Dyson | The New York Review of Books
Freeman Dyson MAY 10, 2018 ISSUE
Scale: The Universal Laws of Growth, Innovation, Sustainability, and the Pace of Life in Organisms, Cities, Economies, and Companies
by Geoffrey West
Penguin, 479 pp., $30.00
maths  astronomy  science  review  biology  complexity  book 
4 days ago
David Foster Wallace on John McCain, 2000 Rolling Stone Story – Rolling Stone
By all means stay home if you want, but don’t bullshit yourself that you’re not voting. In reality, there is no such thing as not voting: you either vote by voting, or you vote by staying home and tacitly doubling the value of some Diehard’s vote.
politics  usa 
5 days ago
[Easy Chair] | Forget About It, by Corey Robin | Harper's Magazine
“Ever since the 2016 presidential election, we’ve been warned against normalizing Trump. That fear of normalization misstates the problem, though. It’s never the immediate present, no matter how bad, that gets normalized — it’s the not-so-distant past. Because judgments of the American experiment obey a strict economy, in which every critique demands an outlay of creed and every censure of the present is paid for with a rehabilitation of the past, any rejection of the now requires a normalization of the then.”

“Whenever I said this, people got angry with me. They still do. For months, now years, I puzzled over that anger. My wife explained it to me recently: in making the case for continuity between past and present, I sound complacent about the now. I sound like I’m saying that nothing is wrong with Trump, that everything will work out. I thought I was giving people a steadying anchor, a sense that they — we — had faced this threat before, a sense that this is the right-wing monster we’ve been fighting all along, since Nixon and Reagan and George W. Bush. Turns out I was removing their ballast, setting them afloat in the intermittent and inconstant air.”
politics  usa  republican  history 
26 days ago
Traffic Jam? Blame 'Induced Demand.' - CityLab
In urbanism, “induced demand” refers to the idea that increasing roadway capacity encourages more people to drive, thus failing to improve congestion.
Since the concept was introduced in the 1960s, numerous academic studies have demonstrated the existence of ID.
But some economists argue that the effects of ID are overstated, or outweighed by the benefits of greater automobility.
Few federal, state, and local departments of transportation are thought to adequately account for ID in their long-term planning.

Many departments of transportation are instead touting the benefits of toll lanes, a more au courant form of roadway capacity expansion.

Such pricing tools can help mitigate induced demand, but these, too, come with their own negative externalities. Tolls and ever-elusive congestion pricing schemes have been criticized as a regressive form of taxation that is spread among high- and low-income drivers alike. The real solution to induced demand could be freeway removal—call it reduced demand—which has been shown to reduce auto traffic while also stimulating new development.
urbanism  transport 
26 days ago
The Annotated Transformer
The Transformer from “Attention is All You Need” has been on a lot of people’s minds over the last year. Besides producing major improvements in translation quality, it provides a new architecture for many other NLP tasks. The paper itself is very clearly written, but the conventional wisdom has been that it is quite difficult to implement correctly.

In this post I present an “annotated” version of the paper in the form of a line-by-line implementation. I have reordered and deleted some sections from the original paper and added comments throughout. This document itself is a working notebook, and should be a completely usable implementation. In total there are 400 lines of library code which can process 27,000 tokens per second on 4 GPUs.
nlp  deeplearning  pytorch 
26 days ago
How do we capture structure in relational data?
The key insight behind the DeepWalk algorithm is that random walks in graphs are a lot like sentences.

Grover and Leskovec (2016) generalize DeepWalk into the node2vec algorithm. Instead of “first-order” random walks that choose the next node based only on the current node, node2vec uses a family of “second-order” random walks that depend on both the current node and the one before it.
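A minimal sketch of one second-order step (graph as an adjacency dict; p and q are node2vec's return and in-out parameters; the alias-sampling tables of the real implementation are omitted):

```python
import random

def node2vec_step(graph, prev, curr, p, q):
    """One second-order step: the bias depends on both `prev` and `curr`."""
    neighbors = list(graph[curr])
    weights = []
    for nxt in neighbors:
        if nxt == prev:              # step back to the previous node
            weights.append(1.0 / p)
        elif nxt in graph[prev]:     # stay near prev (BFS-like, structural)
            weights.append(1.0)
        else:                        # move further away (DFS-like, exploratory)
            weights.append(1.0 / q)
    return random.choices(neighbors, weights=weights)[0]

def node2vec_walk(graph, start, length, p=1.0, q=1.0):
    """A walk whose nodes can then be treated like words in a sentence."""
    walk = [start, random.choice(list(graph[start]))]
    while len(walk) < length:
        walk.append(node2vec_step(graph, walk[-2], walk[-1], p, q))
    return walk

graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(node2vec_walk(graph, 0, 6, p=0.5, q=2.0))
```

With p = q = 1 this reduces to DeepWalk's first-order walk; skewing p and q trades off breadth-first (structural) against depth-first (community) neighborhoods.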

Under the structural hypothesis, nodes that serve similar structural functions — for example, nodes that act as a hub — are part of the same neighborhood due to their higher-order structural significance.

For instance, a user’s graph of friends on a social network can grow and shrink over time. We could apply node2vec, but there are two downsides.

It could be computationally expensive to run a new instance of node2vec every time the graph is modified.

Additionally, there is no guarantee that multiple applications of node2vec will produce similar or even comparable matrices.

Node2vec and DeepWalk produce summaries that are later analyzed with a machine learning technique. By contrast, graph convolutional networks (GCNs) present an end-to-end approach to structured learning.
graph  machinelearning  deeplearning 
29 days ago
Software 2.0 – Andrej Karpathy – Medium
It turns out that a large portion of real-world problems have the property that it is significantly easier to collect the data (or more generally, identify a desirable behavior) than to explicitly write the program. In these cases, the programmers will split into two teams. The 2.0 programmers manually curate, maintain, massage, clean and label datasets; each labeled example literally programs the final system because the dataset gets compiled into Software 2.0 code via the optimization. Meanwhile, the 1.0 programmers maintain the surrounding tools, analytics, visualizations, labeling interfaces, infrastructure, and the training code.

Software 2.0 is:

Computationally homogeneous. It is much easier to make various correctness/performance guarantees.

Simple to bake into silicon. As a corollary, since the instruction set of a neural network is relatively small, it is significantly easier to implement these networks much closer to silicon, e.g. with custom ASICs, neuromorphic chips, and so on.

Constant running time. Every iteration of a typical neural net forward pass takes exactly the same amount of FLOPS. There is zero variability based on the different execution paths your code could take through some sprawling C++ code base.
Constant memory use. Related to the above, there is no dynamically allocated memory anywhere so there is also little possibility of swapping to disk, or memory leaks that you have to hunt down in your code.

It is highly portable. A sequence of matrix multiplies is significantly easier to run on arbitrary computational configurations compared to classical binaries or scripts.
In Software 2.0 we can take our network, remove half of the channels, retrain, and there — it runs exactly at twice the speed and works a bit worse.

Finally, and most importantly, a neural network is a better piece of code than anything you or I can come up with in a large fraction of valuable verticals, which currently at the very least involve anything to do with images/video and sound/speech.

The 2.0 stack can fail in unintuitive and embarrassing ways, or worse, it can “silently fail”, e.g., by silently adopting biases in its training data, which are very difficult to properly analyze and examine when their sizes are easily in the millions in most cases.

Finally, we’re still discovering some of the peculiar properties of this stack. For instance, the existence of adversarial examples and attacks highlights the unintuitive nature of this stack.

When the network fails in some hard or rare cases, we do not fix those predictions by writing code, but by including more labeled examples of those cases. Who is going to develop the first Software 2.0 IDEs, which help with all of the workflows in accumulating, visualizing, cleaning, labeling, and sourcing datasets? Perhaps the IDE bubbles up images that the network suspects are mislabeled based on the per-example loss, or assists in labeling by seeding labels with predictions, or suggests useful examples to label based on the uncertainty of the network’s predictions.
machinelearning  workbench 
4 weeks ago
Intuitively Understanding Variational Autoencoders – Towards Data Science
For example, training an autoencoder on the MNIST dataset, and visualizing the encodings from a 2D latent space reveals the formation of distinct clusters. This makes sense, as distinct encodings for each image type makes it far easier for the decoder to decode them. This is fine if you’re just replicating the same images.

But when you’re building a generative model, you don’t want to replicate the same image you put in. You want to randomly sample from the latent space, or generate variations on an input image, from a continuous latent space.

Variational Autoencoders (VAEs) have one fundamentally unique property that separates them from vanilla autoencoders, and it is this property that makes them so useful for generative modeling: their latent spaces are, by design, continuous, allowing easy random sampling and interpolation.

It achieves this by doing something that seems rather surprising at first: making its encoder not output an encoding vector of size n, rather, outputting two vectors of size n: a vector of means, μ, and another vector of standard deviations, σ.
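The sampling step can be sketched in a few lines of NumPy (the log-variance parameterization below is a common implementation choice, not necessarily the article's exact notation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).

    The encoder outputs two size-n vectors (means and log-variances);
    drawing eps separately keeps sampling differentiable w.r.t. both.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

mu = np.zeros(2)        # toy encoder output: mean vector
log_var = np.zeros(2)   # log sigma^2 = 0, i.e. sigma = 1
z = sample_latent(mu, log_var)
print(z.shape)  # (2,)
```

Each input is thus encoded not as a point but as a small region of the latent space, which is what makes nearby encodings decode to similar outputs.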

What we ideally want are encodings, all of which are as close as possible to each other while still being distinct, allowing smooth interpolation, and enabling the construction of new samples.

In order to force this, we introduce the Kullback–Leibler divergence (KL divergence[2]) into the loss function. The KL divergence between two probability distributions simply measures how much they diverge from each other. Minimizing the KL divergence here means optimizing the probability distribution parameters (μ and σ) to closely resemble that of the target distribution.
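For a diagonal Gaussian encoder against a standard normal target, this KL term has a well-known closed form; a small sketch (log-variance parameterization assumed):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ) for diagonal Gaussians, in closed form."""
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

# Encodings that already match the prior incur zero penalty...
print(kl_to_standard_normal(np.zeros(2), np.zeros(2)))    # 0.0
# ...while clustering far from the origin is penalized quadratically.
print(kl_to_standard_normal(np.full(2, 3.0), np.zeros(2)))  # 9.0
```

In a VAE this term is simply added to the reconstruction loss, pulling every encoding toward the origin while the reconstruction term pushes them apart.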

Intuitively, this loss encourages the encoder to distribute all encodings (for all types of inputs, e.g. all MNIST numbers) evenly around the center of the latent space. If it tries to “cheat” by clustering them apart into specific regions, away from the origin, it will be penalized.

Now, using purely KL loss results in a latent space where encodings are placed densely and randomly near its center, with little regard for similarity among nearby encodings. The decoder finds it impossible to decode anything meaningful from this space, simply because there really isn’t any meaning.

Optimizing the two together, however, results in the generation of a latent space which maintains the similarity of nearby encodings on the local scale via clustering, yet globally, is very densely packed near the latent space origin (compare the axes with the original).
4 weeks ago
The philosophical argument for using ROC curves – Luke Oakden-Rayner
Prevalence is the ratio of positive to negative examples in the data. To remove prevalence from consideration, we simply need to not compare these two groups; we need to look at positives and negatives separately from each other.

The ROC curve achieves this by plotting sensitivity on the Y-axis and specificity on the X-axis.

Sensitivity is the ratio of true positives (positive cases correctly identified as positive by the decision maker) to the total number of positive cases in the data. So you can see, it only looks at positives.

Specificity has the same property, being the ratio of true negatives to the total number of negatives. Only the negatives matter.

The ROC curve shows the probability of making either sort of error (false positive or false negative) as a curved trade-off. As I mentioned in the last post, this means that a better decision maker will have a curve up and to the left of a worse one.

I like to call this the expertise of a decision maker because it seems to match our intuition of how human expertise is distributed*. Inexperienced humans seem to occur lower and to the right of more experienced humans, in any given task. Equally experienced humans seem to define a curve in ROC space, because while they operate at different thresholds they have the same overall capability.

This property can therefore be measured very nicely by the area under the curve (AUC). The AUC roughly describes the total distance the curve is in the up-left direction, across every possible threshold. This means that AUC is invariant to prevalence, and also invariant to threshold.
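This view of AUC also has a rank interpretation: the probability that a random positive scores higher than a random negative. A brute-force sketch (O(n²), for illustration only):

```python
def auc(scores_pos, scores_neg):
    """AUC as P(random positive outranks random negative); ties count half.

    It depends only on the ordering of the scores, not on any threshold,
    and the two groups are compared pairwise, so prevalence drops out.
    """
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

print(auc([0.9, 0.8, 0.7], [0.6, 0.4]))  # 1.0: perfect separation
print(auc([0.9, 0.3], [0.6, 0.4]))       # 0.5: no better than chance
```

Duplicating every negative (changing prevalence) leaves the result unchanged, which is exactly the invariance claimed above.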

These two ROC curves, A and B, have the same area under the curve. But if you are picking a threshold, you want to know where the steepest and flattest parts of the curve start and stop. As the source of the above picture states, curve A is good for ruling in a disease. This is because you want low false positives, and curve A is very steep at the bottom left, meaning you can achieve a decent sensitivity while maintaining a very low FPR. Curve B is good at ruling out a disease, because you can have very high sensitivity while having moderate specificity.

A PR curve plots precision on the y-axis and recall on the x-axis. The first thing to recognise here is that ROC curves and PR curves contain the same points – a PR curve is just a non-linear transformation of the ROC curve. They contain the same information; all that differs is how interpretable they are.

The problem with this is that we no longer isolate expertise, since one of the metrics varies with prevalence. This is best shown by looking at the area under the curve: while ROC curves are monotonic (meaning they always go up and right, never turning down or left at any region), PR curves are not. Thus the curves don’t have this clear relationship where moving up+left = expertise. Instead they can look like pretty much anything.
statistics  machinelearning 
4 weeks ago
Planet parts: Global data streams
Near-realtime Earth observation resources
earth  climatechange  weather  data 
4 weeks ago
Dark Motives and Elective Use of Brainteaser Interview Questions - Highhouse - - Applied Psychology - Wiley Online Library
Brainteaser interview questions such as “Estimate how many windows are in New York” are just one example of aggressive interviewer behaviour that lacks evidence for validity and is unsettling to job applicants. This research attempts to shed light on the motives behind such behaviour by examining the relation between dark‐side traits and the perceived appropriateness of brainteaser interview questions. A representative sample of working adults (n = 736) was presented with a list of interview questions that were either traditional (e.g., “Are you a good listener?”), behavioural (e.g., “Tell me about a time when you failed”), or brainteaser in nature. Results of a multiple regression, controlling for interviewing experience and sex, showed that narcissism and sadism explained the likelihood of using brainteasers in an interview. A subsequent bifactor analysis showed that these dark traits shared a callousness general factor. A second longitudinal study of employed adults with hiring experience demonstrated that perspective‐taking partially mediated the relationship between this general factor and the perceived helpfulness and abusiveness of brainteaser interview questions. These results suggest that a callous indifference and a lack of perspective‐taking may underlie abusive behaviour in the employment interview.
tech  interview 
4 weeks ago
Apple Differential Privacy Technical Overview
Apple uses local differential privacy to help protect the privacy of user activity in a given time period, while still gaining insight that improves the intelligence and usability of such features as:
• QuickType suggestions
• Emoji suggestions
• Lookup Hints
• Safari Energy Draining Domains
• Safari Autoplay Intent Detection (macOS High Sierra)
• Safari Crashing Domains (iOS 11)
• Health Type Usage (iOS 10.2)
5 weeks ago
Differential privacy, part 3: Extraordinary claims require extraordinary scrutiny - Access Now
Using a custom system, called RAPPOR, Google collects probabilistic information about Chrome users’ browsing patterns. With the system, the browser converts strings of characters (like URLs) to short strings of bits using a hash function, then adds probabilistic noise and reports the results to Google. Once Google has collected hundreds of thousands of these private hashed strings, they can learn which strings are most common without knowing any one person’s real string. As described in our first post, this is a locally private system, and Google acts as an untrusted aggregator.

Finally, it’s important to reiterate that differential privacy is a specific, extraordinarily strict mathematical standard. It should never become the single metric by which privacy is assessed. In most cases, it’s above and beyond what’s necessary to keep data “private” in the conventional sense, and for many tasks it’s impossible to build differentially private systems that do anything useful. Companies should try to embrace differential privacy for the right problems, but when they make extraordinary claims about their systems, they must expect to be held to extraordinary standards.
5 weeks ago
Differential privacy, part 2: It’s complicated - Access Now
Apple has incorporated local differential privacy into its operating systems in order to figure out which emojis users substitute for words, which Spotlight links they click on, and get basic statistics from its Health app. Uber has developed a globally private system to allow its engineers to study large-scale ride patterns without touching raw user data. And Google has incorporated local privacy into Chrome to report a bevy of statistics, including which websites cause the browser to crash most often.

Even when lots of users are involved, differential privacy isn’t a silver bullet. It only really works to answer simple questions, with answers that can be expressed as a single number or as counts of a few possibilities. Think of political polls: pollsters ask yes/no or multiple-choice questions and use them to get approximate results which are expressed in terms of uncertainty (e.g. 48% ± 2). Differentially private results have similar margins of error, determined by epsilon. For more complex data, such methods usually add so much noise that the results are pretty much useless. A differentially private photo would be a meaningless slab of randomly colored pixels, and a private audio file would be indistinguishable from radio static. Differential privacy is not, and never will be, a replacement for good, old-fashioned strong encryption.

For example, Apple built local privacy into MacOS, which means the OS should protect users’ data from Apple itself. However, according to researchers, the MacOS implementation actually grants Apple a fresh privacy budget every single day. That practice allows Apple — the untrusted party — to accumulate more information about each user on each subsequent day. With every set of responses, the company can become more certain about the true nature of each user’s data.

Second is the issue of collusion. Suppose Mrs. Alice, a teacher, has a private set of student grades, and grants Bob and Betty privacy budgets of ϵ = 10 each to query it (remember, ϵ measures how private / noisy the data are). Both of them can make the same set of queries independently, using up their own privacy budgets. However, if the two collude, and Bob shares his answers with Betty or vice versa, the total privacy loss in Mrs. Alice’s system can jump to ϵ = 20, which is less private.
5 weeks ago
Understanding differential privacy and why it matters for digital rights - Access Now
In a globally private system, one trusted party, whom we’ll call the curator — like Alice, above — has access to raw, private data from lots of different people. She does analysis on the raw data and adds noise to answers after the fact. For example, suppose Alice is recast as a hospital administrator. Bob, a researcher, wants to know how many patients have the new Human Stigmatization Virus (HSV), a disease whose symptoms include inexplicable social marginalization. Alice uses her records to count the real number. To apply global privacy, she chooses a number at random (using a probability distribution, like the Laplacian, which both parties know). She adds the random “noise” to the real count, and tells Bob the noisy sum. The number Bob gets is likely to be very close to the real answer. Still, even if Bob knows the HSV status of all but one of the patients in the hospital, it is mathematically impossible for him to learn whether any particular patient is sick from Alice’s answer.
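The hospital example can be sketched in a couple of lines (toy numbers; a counting query has sensitivity 1, so Laplace noise with scale 1/ϵ is the standard choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def private_count(true_count, epsilon):
    """Global model: the trusted curator computes the real count,
    then releases it with Laplace(1/epsilon) noise added."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Alice's real count of HSV patients stays hidden; Bob only ever
# sees a noisy answer, close to but not equal to the truth.
print(private_count(127, epsilon=0.5))
```

Any single answer is likely within a few counts of the truth, yet the noise is calibrated so that one patient's presence or absence cannot be inferred from it.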

With local privacy, there is no trusted party; each person is responsible for adding noise to their own data before they share it. It’s as though each person is a curator in charge of their own private database. Usually, a locally private system involves an untrusted party (let’s call them the aggregator) who collects data from a big group of people at once. Imagine Bob the researcher has turned his attention to politics. He surveys everyone in his town, asking, “Are you or have you ever been a member of the Communist party?” To protect their privacy, he has each participant flip a coin in secret. If their coin is heads, they tell the truth, if it’s tails, they flip again, and let that coin decide their answer for them (heads = yes, tails = no). On average, half of the participants will tell the truth; the other half will give random answers. Each participant can plausibly deny that their response was truthful, so their privacy is protected. Even so, with enough answers, Bob can accurately estimate the portion of his community who support the dictatorship of the proletariat. This technique, known as “random response,” is an example of local privacy in action.
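The coin-flip protocol is easy to simulate, and the aggregator can invert the noise in aggregate (a toy simulation; the 30% base rate is made up):

```python
import random

def randomized_response(truth):
    """Each respondent flips a coin: heads, answer honestly;
    tails, flip again and report that second coin instead."""
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

def estimate_true_rate(answers):
    """Invert the noise: P(yes) = 0.5 * p_true + 0.25, so p_true = 2 * P(yes) - 0.5."""
    p_yes = sum(answers) / len(answers)
    return 2 * p_yes - 0.5

random.seed(0)
population = [True] * 300 + [False] * 700   # 30%真 answers would deanonymize; here 30% are "yes"
answers = [randomized_response(t) for t in population]
print(round(estimate_true_rate(answers), 2))  # close to 0.30
```

No single answer reveals anything about its author, yet the population-level estimate converges on the true rate as the sample grows.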

Differentially private systems are assessed by a single value, represented by the Greek letter epsilon (ϵ). ϵ is a measure of how private, and how noisy, a data release is. Higher values of ϵ indicate more accurate, less private answers; low-ϵ systems give highly random answers that don’t let would-be attackers learn much at all. One of differential privacy’s great successes is that it reduces the essential trade-off in privacy-preserving data analysis — accuracy vs. privacy — to a single number.

Privacy degrades with repeated queries, and epsilons add up. If Bob makes the same private query with ϵ = 1 twice and receives two different estimates, it’s as if he’s made a single query with a loss of ϵ = 2. This is because he can average the answers together to get a more accurate, less privacy-preserving estimate. Systems can address this with a privacy “budget”: an absolute limit on the privacy loss that any individual or group is allowed to accrue. Private data curators have to be diligent about tracking who queries them and what they ask.
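The averaging attack is a one-liner to demonstrate (toy numbers, using the standard Laplace mechanism for a sensitivity-1 query):

```python
import numpy as np

rng = np.random.default_rng(1)

def laplace_query(true_value, epsilon):
    """One epsilon-DP answer to a sensitivity-1 query."""
    return true_value + rng.laplace(scale=1.0 / epsilon)

# Two independent epsilon = 1 answers to the same query...
a1 = laplace_query(100, epsilon=1.0)
a2 = laplace_query(100, epsilon=1.0)
# ...average into one estimate whose noise variance is halved:
# the total privacy loss behaves like a single epsilon = 2 query.
print((a1 + a2) / 2)
```

This is why a budget has to cap the *sum* of epsilons across all of a querier's requests, not each request individually.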

Unfortunately, there’s not much consensus about what values of ϵ are actually “private enough.” Most experts agree that values between 0 and 1 are very good, values above 10 are not, and values between 1 and 10 are various degrees of “better than nothing.” Furthermore, the parameter ϵ is exponential: by one measure, a system with ϵ = 1 is almost three times more private than ϵ = 2, and over 8,000 times more private than ϵ = 10. Apple was allegedly using privacy budgets as high as ϵ = 14 per day, with unbounded privacy loss over the long term.
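The exponential claim is easy to check, since (by the measure the article alludes to) ϵ bounds the worst-case probability ratio by e^ϵ:

```python
import math

# Differential privacy bounds how much any one record can shift the
# probability of an output by a factor of e^epsilon, so the guarantee
# degrades exponentially as epsilon grows:
print(math.exp(2) / math.exp(1))    # ~2.7: eps = 2 is almost 3x weaker than eps = 1
print(math.exp(10) / math.exp(1))   # ~8103: eps = 10 is over 8,000x weaker
```

This is why "ϵ = 14 per day" is so far outside the range most experts consider meaningful.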
5 weeks ago
Differential Privacy for Dummies – Roberto Agostino Vitillo's Blog
Informal but mathematical introduction (with further reading).

Differential privacy formalizes the idea that a query should not reveal whether any one person is present in a dataset, much less what their data are. Imagine two otherwise identical datasets, one with your information in it, and one without it. Differential Privacy ensures that the probability that a query will produce a given result is nearly the same whether it’s conducted on the first or second dataset.

A powerful property of differential privacy is that mechanisms can easily be composed. These require the key assumption that the mechanisms operate independently given the data.

The first mechanism we will look into is “randomized response”, a technique developed in the sixties by social scientists to collect data about embarrassing or illegal behavior. The study participants have to answer a yes-no question in secret using the following mechanism M_R(d, α, β):
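A generalized randomized response can be sketched as follows; the parameterization (α = probability of answering truthfully, β = probability of reporting “yes” when lying) is an assumption for illustration, not necessarily the post's exact definition:

```python
import random

def m_r(d, alpha, beta):
    """Randomized response M_R(d, alpha, beta): report the true bit d with
    probability alpha; otherwise report 1 with probability beta
    (assumed parameterization, for illustration)."""
    if random.random() < alpha:
        return d
    return 1 if random.random() < beta else 0

# With alpha = beta = 0.5 this is the classic coin-flip protocol:
# P(report 1 | d = 1) = 0.75 and P(report 1 | d = 0) = 0.25.
random.seed(0)
reports = [m_r(1, 0.5, 0.5) for _ in range(10000)]
print(sum(reports) / len(reports))  # close to 0.75
```

The further α is from 1, the noisier (and more private) each individual report becomes.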

The Laplace mechanism is used to privatize a numeric query. For simplicity we are going to assume that we are only interested in counting queries f, i.e. queries that count individuals, hence we can make the assumption that adding or removing an individual will affect the result of the query by at most 1.
5 weeks ago
Prototype Pattern
The Prototype Pattern is not as convenient as the mechanisms available in the Python language, but this clever simplification made it much easier for the Gang of Four to accomplish parametrized object creation in some of the underpowered Object Oriented languages that were popular last century.
patterns  oop 
5 weeks ago
Why the Future of Machine Learning is Tiny « Pete Warden's blog
Mostly about inference (rather than training), but lots of useful back-of-the-envelope arguments for the particular significance of edge computing in ML

The overall thing to take away from these figures is that processors and sensors can scale their power usage down to microwatt ranges (for example Qualcomm’s Glance vision chip, even energy-harvesting CCDs, or microphones that consume just hundreds of microwatts) but displays and especially radios are constrained to much higher consumption, with even low-power wifi and bluetooth using tens of milliwatts when active. The physics of moving data around just seems to require a lot of energy. There seems to be a rule that the energy an operation takes is proportional to how far you have to send the bits. CPUs and sensors send bits a few millimeters, which is cheap; radio sends them meters or more, which is expensive. I don’t see this relationship fundamentally changing, even as technology improves overall. In fact, I expect the relative gap between the cost of compute and radio to get even wider, because I see more opportunities to reduce computing power usage.

A few years ago I talked to some engineers working on micro-satellites capturing imagery. Their problem was that they were essentially using phone cameras, which are capable of capturing HD video, but they only had a small amount of memory on the satellite to store the results, and only a limited amount of bandwidth every few hours to download to the base stations on Earth. I realized that we face the same problem almost everywhere we deploy sensors. Even in-home cameras are limited by the bandwidth of wifi and broadband connections. My favorite example of this was a friend whose December ISP usage was dramatically higher than the rest of the year, and when he drilled down it was because his blinking Christmas lights caused the video stream compression ratio to drop dramatically, since so many more frames had differences!
machinelearning  hardware 
5 weeks ago
Data protectionism: the growing menace to global business | Financial Times
Scania is well used to its vehicles being delayed at border crossings by officious customs officers and laborious paperwork. Yet these days, the Swedish truck company’s business is hindered as much by international obstructions to its data as roadblocks on its lorries.

As a Scania truck is driven through the EU, a small box sends diagnostic data — speed, fuel use, engine performance, even driving technique — to the company’s headquarters in Sweden. The information adds to a vast international database that helps owners manage the servicing of their fleet and Scania improve the manufacturing of the next generation of vehicles.

“The world is moving towards an autonomous, electrified transport system, and that needs data,” says Hakan Schildt of Scania’s connected services operation. “Transport is becoming a data business.”

In China, however, which severely restricts international transfers of data, the company incurs extra costs setting up local data storage and segregating some of the information from the rest of its operations. Many countries are imposing similar, if less drastic, restrictions. “We are having to regionalise a lot of our operations and set up local data processing,” Mr Schildt says. “National legislation is always changing.”

Many EU countries have curbs on moving personal data even to other member states. Studies for the Global Commission on Internet Governance, an independent research project, estimate that current constraints — such as restrictions on moving data on banking, gambling and tax records — reduce EU GDP by half a per cent.

In China, the champion data localiser, restrictions are even more severe.

China’s Great Firewall has long blocked most foreign web applications, and a cyber security law passed in 2016 also imposed rules against exporting personal information, forcing companies including Apple and LinkedIn to hold information on Chinese users on local servers. Beijing has also given itself a variety of powers to block the export of “important data” on grounds of reducing vaguely defined economic, scientific or technological risks to national security or the public interest.

While the US often tries to export its product standards in trade diplomacy, the EU tends to write rules for itself and let the gravity of its huge market pull other economies into its regulatory orbit. Businesses faced with multiple regulatory regimes will tend to work to the highest standard, known widely as the “Brussels effect”.

The EU appears to be trying something similar with data. It is exporting digital governance not through reciprocal deals but by unilaterally bestowing an “adequacy” recognition on trading partners before allowing them to transfer data. The agreements acknowledge that their partners’ rules, or at least the practices of their companies, meet EU standards. Companies such as Facebook have promised to follow GDPR throughout their global operations as the price of operating in Europe.
politics  privacy  data  federatedlearning  law  eu  china 
5 weeks ago
yes, but did it work? evaluating variational inference
After the accepted ICML papers were announced, I went through the list hunting for relevant work. I've decided it's a better use of my time to read papers that have been accepted somewhere, rather than drowning under the firehose of my arXiv RSS feed. This paper ticked two boxes: variational inference, and knowing if it worked. It also ticked a third, secret box of "titles that make it sound like the paper will have been written in a casual, conversational style, eschewing the tradition of appearing smarter by obfuscating the point".
statistics  machinelearning  probabilisticprogramming  academia  arxiv 
5 weeks ago
Guidelines For Ab Testing
Some good advice buried in here (along with the flawed belief that Bayesian methods are more complicated and therefore implicitly less robust than “traditional” frequentist approaches).
statistics  abtesting 
6 weeks ago
Shipping Software Should Not Be Scary
“In other words, make deploy tools a first class citizen of your technical toolset. Make the work prestigious and valued — even aspirational. If you do performance reviews, recognize the impact there.”
programming  production 
6 weeks ago
We're Not a Bootcamp – Launch School – Medium
If the bootcamp model is an approach that emphasizes high intensity over a short period, then the mastery-based learning model is its counterpart. In mastery-based learning (MBL), there are no schedules, no deadlines, and no promises of quick return on investment. Rather, MBL calls for a learner to stick with a topic until she has mastered it. The idea is that sustained, medium-intensity effort over a long period of time results in deeper roots than a very high-intensity effort over a short period of time. Learners are not just encouraged to delve deeply into concepts, but they are required to do so before they move on. There is no pressure to advance simply because your cohort is advancing.

MBL, as a pedagogy, can be applied to any field of learning, but it is particularly potent in fields where the fundamentals are both vital and cumulative, meaning that they necessarily build on top of one another in a structured fashion. Programming is one of these fields. In an MBL approach, if a learner is having trouble working with loops, they do not move on to learning about algorithms.

Imagine, for example, an accountant who hasn’t mastered basic math skills. What kind of job can they expect? Current market conditions and demand for accountants might mean that they can get a job, but it probably won’t be a very interesting one. If all you can reliably do is manipulate spreadsheets and enter data into Turbotax then that’s all your job is going to let you do. There is an equivalent in the programming world, which I call the “minimal viable developer.” It is a developer who knows a framework or two and has memorized how to set up certain tools, but who doesn’t really have the depth to do much more. In other words, you can get a developer job with relatively little knowledge, but it won’t be a very good one. Moreover, when market conditions change you may not even be able to get a low-end job.
teaching  programming 
6 weeks ago
Reader, Come Home by Maryanne Wolf, reviewed.
Wolf resolved to allot a set period every day to reread a novel she had loved as a young woman, Hermann Hesse’s Magister Ludi. It was exactly the sort of demanding text she’d once reveled in. But now she discovered to her dismay that she could not bear it. “I hated the book,” she writes. “I hated the whole so-called experiment.” She had to force herself to wrangle the novel’s “unnecessarily difficult words and sentences whose snakelike constructions obfuscated, rather than illuminated, meaning for me.” The narrative action struck her as intolerably slow. She had, she concluded, “changed in ways I would never have predicted. I now read on the surface and very quickly; in fact, I read too fast to comprehend deeper levels, which forced me constantly to go back and reread the same sentence over and over with increasing frustration.” She had lost the “cognitive patience” that once sustained her in reading such books. She blamed the internet.

Wolf refers back to a famous story from Phaedrus, in which Socrates cautioned against literacy, arguing that knowledge is not fixed but the product of a dialogue between the speaker and listener—that the great weakness of a text is that you can’t ask it questions and make it justify its conclusions. Wolf uses the story to point out the futility of rejecting a powerful new communication technology like the book, but she doesn’t seem to have noticed that the internet more closely resembles Socrates’ ideal than the printed page does.
books  reading 
6 weeks ago
Challenges in Maintaining A Big Tent for Software Freedom - Conservancy Blog - Software Freedom Conservancy
It means we have a big tent for software freedom, and we sometimes stand under it with people whose behavior we despise. The value we have is our ability to stand with them under the tent, and tell them: “while I respect your right to share and improve that software, I find the task you're doing with the software deplorable.” That's the message I deliver to any ICE agent who used Free Software while forcibly separating parents from their children.
Ethics  opensource  politics  usa 
6 weeks ago
A Road to Common Lisp / Steve Losh
I’ve gotten a bunch of emails asking for advice on how to learn Common Lisp in the present day. I decided to write down all the advice I’ve been giving through email and social media posts in the hopes that someone might find it useful.
programming  lisp 
6 weeks ago
📚The Current Best of Universal Word Embeddings and Sentence Embeddings
Concise clear review of the story so far in sentence embeddings, with particular focus on 2017/18 developments (universal encoders, multitask)
machinelearning  nlp 
6 weeks ago
Why I Enjoy Blogging - In Pursuit of Laziness
Blogging helps cement my understanding of things!
It’s really fun to revisit old posts!
It lets me exercise a different headspace!
Blogging lets me be lazy!
It’s okay if folks have written about it before!
I kinda feel it’s my duty to?
6 weeks ago
To All the Posts I’ve Blogged Before
Writing Helps Me Think
Writing Helps Me Remember
Writing Deduplicates Labor
Blogs Don’t Need to Be Unique
Blogs Don’t Need to Be Theses
6 weeks ago
The Law of Leaky Abstractions
“Code generation tools which pretend to abstract out something, like all abstractions, leak, and the only way to deal with the leaks competently is to learn about how the abstractions work and what they are abstracting. So the abstractions save us time working, but they don’t save us time learning.”

“And all this means that paradoxically, even as we have higher and higher level programming tools with better and better abstractions, becoming a proficient programmer is getting harder and harder.”
6 weeks ago
How (and why) to create a good validation set · fast.ai
The underlying idea is that:

• the training set is used to train a given model
• the validation set is used to choose between models (for instance, does a random forest or a neural net work better for your problem? do you want a random forest with 40 trees or 50 trees?)
• the test set tells you how you’ve done. If you’ve tried out a lot of different models, you may get one that does well on your validation set just by chance, and having a test set helps make sure that is not the case.

A key property of the validation and test sets is that they must be representative of the new data you will see in the future.

When is a random subset not good enough?

If your data is a time series, choosing a random subset of the data will be both too easy (you can look at the data both before and after the dates you are trying to predict) and not representative of most business use cases (where you are using historical data to build a model for use in the future).
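A time-based split can be sketched in a few lines of plain Python; the cutoff fraction, field layout, and dates here are illustrative, not from the article:

```python
from datetime import date

# Toy dataset: (observation_date, features, target) records.
records = [(date(2018, 1, d), {"x": d}, d % 2) for d in range(1, 29)]

# Sort chronologically, then hold out the most recent 25% as validation.
records.sort(key=lambda r: r[0])
cutoff = int(len(records) * 0.75)
train, valid = records[:cutoff], records[cutoff:]

# Every validation date is strictly later than every training date,
# so the model never peeks at the future it is asked to predict.
assert max(r[0] for r in train) < min(r[0] for r in valid)
```

The same idea generalizes to any split: reserve the most recent slice, never a random one, when the model will be used on future data.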

You also need to think about what ways the data you will be making predictions for in production may be qualitatively different from the data you have to train your model with. In the Kaggle distracted driver competition, the independent data are pictures of drivers at the wheel of a car, and the dependent variable is a category such as texting, eating, or safely looking ahead. The test data consists of people that weren’t used in the training set.

A similar dynamic was at work in the Kaggle fisheries competition to identify the species of fish caught by fishing boats in order to reduce illegal fishing of endangered populations. The test set consisted of boats that didn’t appear in the training data. This means that you’d want your validation set to include boats that are not in the training set.
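Holding out whole boats (or drivers) rather than individual images can be sketched as a group-aware split; the boat IDs, image names, and 30% holdout fraction below are made up for illustration:

```python
import random

# Toy dataset: each image is labelled with the boat it was taken on.
images = [{"img": f"img_{i}.jpg", "boat": f"boat_{i % 7}"} for i in range(100)]

# Split on unique boat IDs, not on individual images: shuffle the
# boats and reserve roughly 30% of them for validation.
boats = sorted({im["boat"] for im in images})
random.Random(0).shuffle(boats)
n_valid = max(1, int(len(boats) * 0.3))
valid_boats = set(boats[:n_valid])

train = [im for im in images if im["boat"] not in valid_boats]
valid = [im for im in images if im["boat"] in valid_boats]

# No boat appears on both sides of the split, mirroring the test set.
assert not ({im["boat"] for im in train} & {im["boat"] for im in valid})
```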

The dangers of cross-validation

For example, for a 3-fold cross validation, the data is divided into 3 sets: A, B, and C. A model is first trained on A and B combined as the training set, and evaluated on the validation set C. Next, a model is trained on A and C combined as the training set, and evaluated on validation set B. And so on, with the model performance from the 3 folds being averaged in the end.

However, the problem with cross-validation is that it is rarely applicable to real world problems, for all the reasons described in the above sections. Cross-validation only works in the same cases where you can randomly shuffle your data to choose a validation set.
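The A/B/C procedure above can be sketched as a small index generator (plain Python, no library assumed):

```python
# Minimal k-fold cross-validation indices, mirroring the A/B/C
# description: each fold serves exactly once as the validation set.
def kfold_indices(n, k=3):
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        valid = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, valid

# Nine examples, three folds: A = [0,1,2], B = [3,4,5], C = [6,7,8].
splits = list(kfold_indices(9, k=3))
```

Note that this, like most library implementations, assigns contiguous index blocks to folds; whether to shuffle first is exactly the judgment call the article warns about.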
7 weeks ago
The presence prison – Signal v. Noise
Truth is, there are hardly any good reasons to know if someone’s available or away at any given moment. If you truly need something from someone, ask them. If they respond, then you have what you needed. If they don’t, it’s not because they’re ignoring you — it’s because they’re busy. Respect that! Assume people are focused on their own work.

Are there exceptions? Of course. It might be good to know who’s around in a true emergency, but 1% occasions like that shouldn’t drive policy 99% of the time. And there are times where certain teams need to make sure someone’s around so there are no gaps in customer service coverage, but those are specialized cases best handled by communication, not an ambiguous colored dot next to someone’s name.
7 weeks ago
Encrypt your Machine Learning – corti.ai – Medium
Assume your customers are unable to give you their data for privacy or security reasons. This means that if you want to apply your models to their data, you have to bring the model to them. But if sharing your valuable model is impossible or you are limited by privacy concerns, encrypting your model might be an option. You can train your model, encrypt it, and send it to your customers. In order for the customer to actually use the prediction, you have to provide them with a decryption service.

The first fully homomorphic algorithm was incredibly slow, taking 100 trillion times as long to perform calculations on encrypted data as on plaintext.⁶

IBM has sped things up considerably, making calculations on a 16-core server over two million times faster than past systems.

For the smallest parameter set, the time required for a homomorphic multiplication of ciphertexts was measured to be 3.461 milliseconds.

Computational requirements are not the only concern — we also have to consider the size of the encrypted data or model.

In conclusion, the encrypted data is one to three orders of magnitude larger than the unencrypted data. The exact factor depends on what is considered a natural representation of the data in its raw form.
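To make “computing on ciphertexts” concrete, here is a toy additively homomorphic scheme in the spirit of Paillier. The primes are far too small to be secure, and this is only additively homomorphic, not the fully homomorphic schemes the article benchmarks:

```python
import math
import random

# Toy Paillier-style keypair with insecurely small primes (demo only).
p, q = 293, 433
n = p * q
n2 = n * n
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
mu = pow(lam, -1, n)                               # valid since g = n + 1

def encrypt(m, rng=random.Random(0)):
    # c = (n+1)^m * r^n mod n^2, for random r coprime to n.
    while True:
        r = rng.randrange(2, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    # L(x) = (x - 1) / n recovers m from c^lambda mod n^2.
    x = pow(c, lam, n2)
    return ((x - 1) // n) * mu % n

# Multiplying ciphertexts adds the underlying plaintexts.
c_sum = (encrypt(12) * encrypt(30)) % n2
```

Note the ciphertext expansion the excerpt mentions is visible even here: plaintexts live mod n but ciphertexts live mod n², roughly doubling the bit-length before any real-world key sizes are involved.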
machinelearning  cryptography  security  federatedlearning 
7 weeks ago
Privacy and machine learning: two unexpected allies? | cleverhans-blog
Machine learning algorithms work by studying a lot of data and updating their parameters to encode the relationships in that data. Ideally, we would like the parameters of these machine learning models to encode general patterns (‘‘patients who smoke are more likely to have heart disease’’) rather than facts about specific training examples (“Jane Smith has heart disease”). Unfortunately, machine learning algorithms do not learn to ignore these specifics by default.

We use a version of differential privacy which requires that the probability of learning any particular set of parameters stays roughly the same if we change a single training example in the training set. This could mean adding a training example, removing one, or changing the values within one. The intuition is that if a single patient (Jane Smith) does not affect the outcome of learning, then that patient’s records cannot be memorized and her privacy is respected. In the rest of this post, we often refer to this probability as the privacy budget.

We achieve differential privacy when the adversary is not able to distinguish the answers produced by the randomized algorithm based on the data of two of the three users from the answers returned by the same algorithm based on the data of all three users.

Our PATE approach at providing differential privacy to machine learning is based on a simple intuition: if two different classifiers, trained on two different datasets with no training examples in common, agree on how to classify a new input example, then that decision does not reveal information about any single training example.

Differential privacy is in fact well aligned with the goals of machine learning. For instance, memorizing a particular training point—like the medical record of Jane Smith—during learning is a violation of privacy. It is also a form of overfitting and harms the model’s generalization performance for patients whose medical records are similar to Jane’s. Moreover, differential privacy implies some form of stability (but the opposite is not necessarily true).
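The PATE aggregation step can be sketched as noisy plurality voting; the teacher count, noise scale, and labels below are illustrative, not PATE's exact published parameters:

```python
import math
import random
from collections import Counter

def noisy_aggregate(teacher_labels, n_classes, epsilon, rng=random.Random(0)):
    """Return the class with the most teacher votes after adding
    Laplace noise of scale 2/epsilon to each per-class vote count."""
    votes = Counter(teacher_labels)
    def laplace(scale):
        # Sample Laplace noise via the inverse-CDF method.
        u = rng.random() - 0.5
        return -scale * (1 if u >= 0 else -1) * math.log(1 - 2 * abs(u))
    noisy = {c: votes.get(c, 0) + laplace(2 / epsilon) for c in range(n_classes)}
    return max(noisy, key=noisy.get)

# 100 teachers agree overwhelmingly, so the noisy vote almost surely
# matches the consensus, while any single teacher (trained on a
# disjoint data partition) has only a bounded influence on the outcome.
labels = [1] * 97 + [0, 2, 2]
consensus = noisy_aggregate(labels, n_classes=3, epsilon=1.0)
```

When the teachers disagree, the noise matters: the returned label becomes genuinely randomized, which is exactly where the privacy protection comes from.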
machinelearning  Privacy  differentialprivacy  federatedlearning 
7 weeks ago