Deduplicating files in Public Git Archive · source{d} blog
This summer, we announced the release of Public Git Archive (PGA), a dataset with 3 TB of Git data from the most starred repositories on GitHub. Now it’s time to describe how we tried to deduplicate files in the latest revision of the repositories in PGA using our research project for code deduplication, src-d/apollo. Before diving deep, let’s quickly see why we created it. To the best of our knowledge, the only effort to detect code clones at massive scale has been made by Lopes et al., who leveraged a huge corpus of over 428 million files in 4 languages to map code clones on GitHub (the DéjàVu project). They relied on syntactic features, i.e. identifiers (my_list, your_list, …) and literals (if, for, …), to compute the similarity between a pair of files. PGA has fewer files in the latest (HEAD) revision, 54 million, and we did not want to give our readers a DéjàVu by repeating the same analysis. So we aimed at something different: detecting not only copy-paste between files, but also involuntary rewrites of the same abstractions. To that end, we extracted and used semantic features from Universal Abstract Syntax Trees.
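To make the idea of similarity from syntactic features concrete, here is a minimal sketch that compares two files by the Jaccard similarity of their identifier-like tokens. It is an illustration only: the regular expression, the helper names `identifiers` and `jaccard`, and the choice of Jaccard similarity are assumptions made for this example, not the method used by DéjàVu or by src-d/apollo.

```python
import re
from typing import Set

# Assumption: an "identifier" is any token matching this pattern.
# Note that it also matches language keywords such as `for` and `in`.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def identifiers(source: str) -> Set[str]:
    """Return the set of identifier-like tokens found in the source text."""
    return set(IDENTIFIER.findall(source))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # two empty feature sets are treated as identical
    return len(a & b) / len(a | b)


# Two toy "files" that differ only in the name of one variable.
file_a = "my_list = [1, 2]\nfor item in my_list:\n    print(item)\n"
file_b = "your_list = [1, 2]\nfor item in your_list:\n    print(item)\n"

similarity = jaccard(identifiers(file_a), identifiers(file_b))
```

In this toy case the two files share four of six distinct tokens, so a renamed variable alone lowers the score noticeably. That sensitivity to naming is one reason purely syntactic features can miss rewrites of the same abstraction.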
Functional Programming Principles in Scala | Coursera
Functional Programming Principles in Scala from École Polytechnique Fédérale de Lausanne. Functional programming is becoming increasingly widespread in industry. This trend is driven by the adoption of Scala as the main programming language for ...
Decision Tables • Hillel Wayne
A decision table is a means of concisely representing branching and conditional computations. In the most basic form, you have some columns that represent the “inputs” as booleans and some columns that represent outputs and effects. It looks like this:
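The table from the original article is not reproduced here. As a stand-in, the sketch below encodes a small decision table as a Python dictionary keyed by tuples of boolean inputs. The inputs (`is_member`, `has_coupon`), the discount values, and the function names are all hypothetical and chosen purely for illustration.

```python
from itertools import product

# Hypothetical decision table: each key is one row of boolean inputs
# (is_member, has_coupon); each value is the output for that row.
DECISION_TABLE = {
    (True, True): 20,
    (True, False): 10,
    (False, True): 5,
    (False, False): 0,
}


def discount_percent(is_member: bool, has_coupon: bool) -> int:
    """Look up the output for one combination of inputs."""
    return DECISION_TABLE[(is_member, has_coupon)]


def is_complete(table: dict, n_inputs: int) -> bool:
    """Check that every combination of boolean inputs has a row."""
    return all(
        combo in table for combo in product([True, False], repeat=n_inputs)
    )
```

Writing the branches as data rather than nested `if` statements makes it easy to check mechanically that no combination of inputs has been forgotten, which is what `is_complete` does here.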
