jm + data   134

GraphQL
a query language for APIs and a runtime for fulfilling those queries with your existing data. GraphQL provides a complete and understandable description of the data in your API, gives clients the power to ask for exactly what they need and nothing more, makes it easier to evolve APIs over time, and enables powerful developer tools.


Now being used by Facebook and GitHub -- looks quite interesting.
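To make "ask for exactly what they need" concrete, here is a minimal sketch of the JSON body a client would typically POST to a GraphQL endpoint. The `repository` type and its `name` and `stars` fields are a hypothetical schema, not taken from any real API; the point is only that the client names the fields it wants.

```python
import json

def build_graphql_payload(query, variables=None):
    """Serialize a GraphQL query and its variables into a JSON request body."""
    return json.dumps({"query": query, "variables": variables or {}})

# Hypothetical schema: a `repository` type exposing `name` and `stars`.
QUERY = """
query ($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    name
    stars
  }
}
"""

# Only the two requested fields would come back, nothing more.
payload = build_graphql_payload(QUERY, {"owner": "example", "name": "demo"})
```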
apis  data  github  facebook  graphql  languages  types 
yesterday by jm
Seeking medical abortions online is safe and effective, study finds | World news | The Guardian
Of the 1,636 women who were sent the drugs between the start of 2010 and the end of 2012, the team were able to analyse self-reported data from 1,000 individuals who confirmed taking the pills. All were less than 10 weeks pregnant.

The results reveal that almost 95% of the women successfully ended their pregnancy without the need for surgical intervention. None of the women died, although seven women required a blood transfusion and 26 needed antibiotics.
Of the 93 women who experienced symptoms for which the advice was to seek medical attention, 95% did so, going to a hospital or clinic.

“When we talk about self-sought, self-induced abortion, people think about coat hangers or they think about tables in back alleys,” said Aiken. “But I think this research really shows that in 2017 self-sourced abortion is a network of people helping and supporting each other through what’s really a safe and effective process in the comfort of their own homes, and I think is a huge step forward in public health.”
health  medicine  abortion  pro-choice  data  women-on-web  ireland  law  repealthe8th 
10 days ago by jm
The great British Brexit robbery: how our democracy was hijacked | Technology | The Guardian

A map shown to the Observer showing the many places in the world where SCL and Cambridge Analytica have worked includes Russia, Lithuania, Latvia, Ukraine, Iran and Moldova. Multiple Cambridge Analytica sources have revealed other links to Russia, including trips to the country, meetings with executives from Russian state-owned companies, and references by SCL employees to working for Russian entities.

Article 50 has been triggered. AggregateIQ is outside British jurisdiction. The Electoral Commission is powerless. And another election, with these same rules, is just a month away. It is not that the authorities don’t know there is cause for concern. The Observer has learned that the Crown Prosecution Service did appoint a special prosecutor to assess whether there was a case for a criminal investigation into whether campaign finance laws were broken. The CPS referred it back to the electoral commission. Someone close to the intelligence select committee tells me that “work is being done” on potential Russian interference in the referendum.

Gavin Millar, a QC and expert in electoral law, described the situation as “highly disturbing”. He believes the only way to find the truth would be to hold a public inquiry. But a government would need to call it. A government that has just triggered an election specifically to shore up its power base. An election designed to set us into permanent alignment with Trump’s America. [....]

This isn’t about Remain or Leave. It goes far beyond party politics. It’s about the first step into a brave, new, increasingly undemocratic world.
elections  brexit  trump  cambridge-analytica  aggregateiq  scary  analytics  data  targeting  scl  ukip  democracy  grim-meathook-future 
19 days ago by jm
'Mathwashing,' Facebook and the zeitgeist of data worship
Fred Benenson: Mathwashing can be thought of as using math terms (algorithm, model, etc.) to paper over a more subjective reality. For example, a lot of people believed Facebook was using an unbiased algorithm to determine its trending topics, even if Facebook had previously admitted that humans were involved in the process.
maths  math  mathwashing  data  big-data  algorithms  machine-learning  bias  facebook  fred-benenson 
5 weeks ago by jm
Tad
'A Desktop Viewer App for Tabular Data' -- pivot CSV data easily; works well with large files; free, from Antony Courtney
dataviz  osx  csv  data  pivot-tables  analysis  desktop 
7 weeks ago by jm
pachyderm
'Containerized Data Analytics':
There are two bold new ideas in Pachyderm:

Containers as the core processing primitive
Version Control for data

These ideas lead directly to a system that's much more powerful, flexible and easy to use.

To process data, you simply create a containerized program which reads and writes to the local filesystem. You can use any tools you want because it's all just going in a container! Pachyderm will take your container and inject data into it. We'll then automatically replicate your container, showing each copy a different chunk of data. With this technique, Pachyderm can scale any code you write to process up to petabytes of data (Example: distributed grep).

Pachyderm also version controls all data using a commit-based distributed filesystem (PFS), similar to what git does with code. Version control for data has far reaching consequences in a distributed filesystem. You get the full history of your data, can track changes and diffs, collaborate with teammates, and if anything goes wrong you can revert the entire cluster with one click!

Version control is also very synergistic with our containerized processing engine. Pachyderm understands how your data changes and thus, as new data is ingested, can run your workload on the diff of the data rather than the whole thing. This means that there's no difference between a batched job and a streaming job, the same code will work for both!
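A rough sketch of the kind of "containerized program which reads and writes to the local filesystem" described above, using the distributed grep example. The directory arguments stand in for wherever the platform mounts input and output data; those mount points, and the function name, are assumptions for illustration rather than anything taken from Pachyderm's documentation.

```python
import os
import re

def grep_directory(pattern, input_dir, output_dir):
    """Write lines matching `pattern` from each input file to a same-named output file."""
    regex = re.compile(pattern)
    os.makedirs(output_dir, exist_ok=True)
    matched = 0
    for name in sorted(os.listdir(input_dir)):
        src = os.path.join(input_dir, name)
        if not os.path.isfile(src):
            continue
        # Each container copy only sees the chunk of files it was given.
        with open(src, encoding="utf-8", errors="replace") as fin:
            hits = [line for line in fin if regex.search(line)]
        if hits:
            with open(os.path.join(output_dir, name), "w", encoding="utf-8") as fout:
                fout.writelines(hits)
            matched += len(hits)
    return matched
```

Because the program only touches local paths, the same code runs unchanged whether it is handed the whole dataset or one chunk of it.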
analytics  data  containers  golang  pachyderm  tools  data-science  docker  version-control 
february 2017 by jm
Data from pacemaker used to arrest man for arson, insurance fraud
Compton has medical conditions which include an artificial heart linked to an external pump. According to court documents, a cardiologist said that "it is highly improbable Mr. Compton would have been able to collect, pack and remove the number of items from the house, exit his bedroom window and carry numerous large and heavy items to the front of his residence during the short period of time he has indicated due to his medical conditions."

After US law enforcement caught wind of this peculiar element to the story, police were able to secure a search warrant and collect the pacemaker's electronic records to scrutinize his heart rate, the demand on the pacemaker and heart rhythms prior to and at the time of the incident.
pacemakers  health  medicine  privacy  data  arson  insurance  fraud  heart 
february 2017 by jm
The Rise of the Data Engineer
Interesting article proposing a new discipline, focused on the data warehouse, from Maxime Beauchemin (creator and main committer on Apache Airflow and Airbnb’s Superset)
data-engineering  engineering  coding  data  big-data  airbnb  maxime-beauchemin  data-warehouse 
january 2017 by jm
Sankey diagram - Wikipedia
'a specific type of flow diagram, in which the width of the arrows is shown proportionally to the flow quantity. Sankey diagrams put a visual emphasis on the major transfers or flows within a system. They are helpful in locating dominant contributions to an overall flow. Often, Sankey diagrams show conserved quantities within defined system boundaries. [....]

One of the most famous Sankey diagrams is Charles Minard's Map of Napoleon's Russian Campaign of 1812. It is a flow map, overlaying a Sankey diagram onto a geographical map.'
sankey  diagrams  dataviz  data  viz 
january 2017 by jm
Falsehoods Programmers Believe About CSVs
Much of my professional work for the last 10+ years has revolved around handling, importing and exporting CSV files. CSV files are frustratingly misunderstood, abused, and most of all underspecified. While RFC4180 exists, it is far from definitive and goes largely ignored.

Partially as a companion piece to my recent post about how CSV is an encoding nightmare, and partially an expression of frustration, I've decided to make a list of falsehoods programmers believe about CSVs. I recommend my previous post for more in-depth coverage of the pains of CSV encodings and how the default tooling (Excel) will ruin your day.


(via Tony Finch)
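Two of the classic falsehoods ("one line is one record", "splitting on commas is enough") are easy to demonstrate. A small sketch, with made-up sample data, comparing the standard library parser against naive splitting:

```python
import csv
import io

# Sample data invented for illustration: a quoted comma and a quoted newline.
RAW = 'id,comment\r\n1,"Hello, world"\r\n2,"line one\nline two"\r\n'

def parse_csv(text):
    """Parse CSV text with the standard library, honouring quoted fields."""
    # newline="" leaves line endings untouched so quoted newlines survive.
    return list(csv.reader(io.StringIO(text, newline="")))

rows = parse_csv(RAW)

# The naive approach sees an extra "record" and splits a field in two.
naive = [line.split(",") for line in RAW.splitlines()]
```

This only covers quoting; it says nothing about encodings, which is the other half of the pain.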
via:fanf  csv  excel  programming  coding  apis  data  encoding  transfer  falsehoods  fail  rfc4180 
january 2017 by jm
How a Machine Learns Prejudice - Scientific American
Agreed, this is a big issue.
If artificial intelligence takes over our lives, it probably won’t involve humans battling an army of robots that relentlessly apply Spock-like logic as they physically enslave us. Instead, the machine-learning algorithms that already let AI programs recommend a movie you’d like or recognize your friend’s face in a photo will likely be the same ones that one day deny you a loan, lead the police to your neighborhood or tell your doctor you need to go on a diet. And since humans create these algorithms, they're just as prone to biases that could lead to bad decisions—and worse outcomes.
These biases create some immediate concerns about our increasing reliance on artificially intelligent technology, as any AI system designed by humans to be absolutely "neutral" could still reinforce humans’ prejudicial thinking instead of seeing through it.
prejudice  bias  machine-learning  ml  data  training  race  racism  google  facebook 
january 2017 by jm
Reproducible research: Stripe’s approach to data science
This is intriguing -- using Jupyter notebooks to embody data analysis work and ensure it's reproducible, which brings better rigour in much the same way that unit tests improve coding. I must try this.
Reproducibility makes data science at Stripe feel like working on GitHub, where anyone can obtain and extend others’ work. Instead of islands of analysis, we share our research in a central repository of knowledge. This makes it dramatically easier for anyone on our team to work with our data science research, encouraging independent exploration.

We approach our analyses with the same rigor we apply to production code: our reports feel more like finished products, research is fleshed out and easy to understand, and there are clear programmatic steps from start to finish for every analysis.
stripe  coding  data-science  reproducability  science  jupyter  notebooks  analysis  data  experiments 
november 2016 by jm
The Fall of BIG DATA – arg min blog
Strongly agreed with this -- particularly the second of the three major failures, specifically:
Our community has developed remarkably effective tools to microtarget advertisements. But if you use ad models to deliver news, that’s propaganda. And just because we didn’t intend to spread rampant misinformation doesn’t mean we are not responsible.
big-data  analytics  data-science  statistics  us-politics  trump  data  science  propaganda  facebook  silicon-valley 
november 2016 by jm
seriot.ch - Parsing JSON is a Minefield 💣
Crockford chose not to version [the] JSON definition: 'Probably the boldest design decision I made was to not put a version number on JSON so there is no mechanism for revising it. We are stuck with JSON: whatever it is in its current form, that’s it.' Yet JSON is defined in at least six different documents.
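A small illustration of why "whatever it is" matters in practice: Python's standard `json` module, by default, silently keeps the last of a repeated key and accepts `NaN`, neither of which every parser agrees on. The sketch below uses the module's `object_pairs_hook` and `parse_constant` hooks to build a stricter loader; the helper names are my own.

```python
import json

def reject_duplicates(pairs):
    """object_pairs_hook that raises on repeated keys instead of keeping the last."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen

def reject_constant(name):
    """parse_constant hook: refuse NaN, Infinity and -Infinity."""
    raise ValueError(f"non-standard constant: {name}")

def strict_loads(text):
    """Parse JSON, failing on duplicate keys and non-standard numeric constants."""
    return json.loads(text, object_pairs_hook=reject_duplicates,
                      parse_constant=reject_constant)

# The default behaviour keeps the last value for a repeated key.
lenient = json.loads('{"a": 1, "a": 2}')
```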


"Boldest". ffs. :facepalm:
bold  courage  json  parsing  coding  data  formats  interchange  fail  standards  confusion 
october 2016 by jm
Individual children's details passed to Home Office for immigration purposes | UK news | The Guardian
The UK's version of the POD database project was used by the Home Office to track immigrants for various reasons -- in other words, exactly the reasons why parents will choose not to provide that data
parents  databases  data  pod  uk  home-office  education  schools 
october 2016 by jm
Osso
"A modern standard for event-oriented data". Avro schema, events have time and type, schema is external and not part of the Avro stream.

'a modern standard for representing event-oriented data in high-throughput operational systems. It uses existing open standards for schema definition and serialization, but adds semantic meaning and definition to make integration between systems easy, while still being size- and processing-efficient.

An Osso event is largely use case agnostic, and can represent a log message, stack trace, metric sample, user action taken, ad display or click, generic HTTP event, or otherwise. Every event has a set of common fields as well as optional key/value attributes that are typically event type-specific.'
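The "common fields plus optional key/value attributes" shape can be sketched as below. The field names (`event_type`, `timestamp_ms`, `attributes`) are hypothetical and not taken from the Osso schema, and no Avro serialization is attempted here.

```python
from dataclasses import dataclass, field
import time

@dataclass
class Event:
    """Hypothetical event shape: common fields plus free-form attributes."""
    event_type: str                                  # e.g. "log", "metric", "http"
    timestamp_ms: int                                # when the event happened
    attributes: dict = field(default_factory=dict)   # type-specific key/values

def make_event(event_type, attributes=None, now_ms=None):
    """Build an event, stamping it with the current time unless one is given."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return Event(event_type, now_ms, dict(attributes or {}))
```

Keeping the type-specific data in a generic map is what lets one schema cover log lines, metrics and clicks alike.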
osso  events  schema  data  interchange  formats  cep  event-processing  architecture 
september 2016 by jm
Engineering Intelligence Through Data Visualization at Uber
bloody hell, Uber have a 15-person dataviz team. More money than sense! The resulting output is pretty though
data  dataviz  visualization  webgl  uber  mapping 
august 2016 by jm
Prepaid Data SIM Card Wiki
awesome resource.
This WIKI collects information about prepaid (or PAYG) mobile phone plans from all over the world. Not just any plans though, they must include good data rates, perfect for smartphone travellers, as well as tablet or mobile modem users.
data  mobile  travel  sim  prepaid  payg 
august 2016 by jm
MRI software bugs could upend years of research - The Register
In their paper at PNAS, they write: “the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results.”

For example, a bug that's been sitting in a package called 3dClustSim for 15 years, fixed in May 2015, produced bad results (3dClustSim is part of the AFNI suite; the others are SPM and FSL). That's not a gentle nudge that some results might be overstated: it's more like making a bonfire of thousands of scientific papers.

Further: “Our results suggest that the principal cause of the invalid cluster inferences is spatial autocorrelation functions that do not follow the assumed Gaussian shape”.

The researchers used published fMRI results, and along the way they swipe the fMRI community for their “lamentable archiving and data-sharing practices” that prevent most of the discipline's body of work being re-analysed.
fmri  science  mri  statistics  cluster-inference  autocorrelation  data  papers  medicine  false-positives  fps  neuroimaging 
july 2016 by jm
Self-driving cars: overlooking data privacy is a car crash waiting to happen
Interesting point -- self-driving cars are likely to be awash in telemetry data, "phoned home"
self-driving  cars  vehicles  law  data  privacy  data-privacy  surveillance 
july 2016 by jm
Public preferences for electronic health data storage, access, and sharing – evidence from a pan-European survey | Journal of the American Medical Informatics Association
Results: We obtained 20 882 survey responses (94 606 preferences) from 27 EU member countries. Respondents recognized the benefits of storing electronic health information, with 75.5%, 63.9%, and 58.9% agreeing that storage was important for improving treatment quality, preventing epidemics, and reducing delays, respectively. Concerns about different levels of access by third parties were expressed by 48.9% to 60.6% of respondents. On average, compared to devices or systems that only store basic health status information, respondents preferred devices that also store identification data (coefficient/relative preference 95% CI = 0.04 [0.00 to 0.08], P = 0.034) and information on lifelong health conditions (coefficient = 0.13 [0.08 to 0.18], P < 0.001), but there was no evidence of this for devices with information on sensitive health conditions such as mental and sexual health and addictions (coefficient = −0.03 [−0.09 to 0.02], P = 0.24). Respondents were averse to their immediate family (coefficient = −0.05 [−0.05 to −0.01], P = 0.011) and home care nurses (coefficient = −0.06 [−0.11 to −0.02], P = 0.004) viewing this data, and strongly averse to health insurance companies (coefficient = −0.43 [−0.52 to −0.34], P < 0.001), private sector pharmaceutical companies (coefficient = −0.82 [−0.99 to −0.64], P < 0.001), and academic researchers (coefficient = −0.53 [−0.66 to −0.40], P < 0.001) viewing the data.

Conclusions: Storing more detailed electronic health data was generally preferred, but respondents were averse to wider access to and sharing of this information. When developing frameworks for the use of electronic health data, policy makers should consider approaches that both highlight the benefits to the individual and minimize the perception of privacy risks.


Via Antoin.
privacy  data  medicine  health  healthcare  papers  via:antoin 
april 2016 by jm
ZIP SIM
Prepaid talk+text+data or data-only mobile SIM cards, delivered to your home or hotel, prior to visiting the US. great service for temporary US business visits
visiting  us  usa  zip-sim  sims  mobile-phones  travel  phones  mobile  travelling  data 
april 2016 by jm
Health of purebred vs mixed breed dogs: the actual data - The Institute of Canine Biology

This study found that purebred dogs have a significantly greater risk of developing many of the hereditary disorders examined in this study. No, mixed breed dogs are not ALWAYS healthier than purebreds; and also, purebreds are not "as healthy" as mixed breed dogs. The results of this study will surprise nobody who understands the basics of Mendelian inheritance. Breeding related animals increases the expression of genetic disorders caused by recessive mutations, and it also increases the probability of producing offspring that will inherit the assortment of genes responsible for a polygenic disorder. 


In conclusion, go mutts.
dogs  breeding  genetics  hereditary-disorders  science  inheritance  recessive-mutation  data 
march 2016 by jm
There’s Something Fishy About The Other Nefertiti
The last possibility and reigning theory is that Ms. Badri and Mr. Nelles' elusive hacker partners are literally real hackers who stole a copy of the high resolution scan from the Museum’s servers. A high resolution scan must exist as a high res 3D printed replica is already available for sale online. Museum officials have dismissed the Other Nefertiti model as “of minor quality”, but that’s not what we are seeing in this highly detailed scan. Perhaps the file was obtained from someone involved in printing the reproduction, or it was a scan made of the reproduction? Indeed, the common belief in online 3D Printing community chatter is that the Kinect “story” is a fabrication to hide the fact that the model was actually stolen data from a commercial high quality scan. If the artists were behind a server hack, the legal ramifications for them are much more serious than scanning the object, which has few, if any, legal precedents.
art  history  3d-printing  3d  nefertiti  heists  copyright  data  kinect 
march 2016 by jm
TeleGeography Submarine Cable Map 2015
Gorgeously-illustrated retro map of modern-day submarine cables. Prints available for $150 (via Conor Delaney)
via:conor-delaney  data  internet  maps  cables  world  telegeography  mapping  retro 
march 2016 by jm
Microsoft warns of risks to Irish operation in US search warrant case

“Our concern is that if we lose the case more countries across Europe or elsewhere are going to be concerned about having their data in Ireland,” Mr Smith said, after testifying before the House judiciary committee.
Asked what would happen to its Irish unit if the company loses the case or doesn’t convince Congress to pass updated legislation governing cross-border data held by American companies, the Microsoft executive said: “We’ll certainly face a new set of risks that we don’t face today.”
He added that the issue could be resolved by an executive order by the White House or through international negotiations between the Irish Government or the European Union and the US.
microsoft  data  privacy  us-politics  surveillance  usa 
february 2016 by jm
Lasers reveal 'lost' Roman roads
UK open data success story, via Tony Finch:
This LIDAR data bonanza has proved particularly helpful to archaeologists seeking to map Roman roads that have been ‘lost’, some for thousands of years. Their discoveries are giving clues to a neglected chapter in the history of Roman Britain: the roads built to help Rome’s legions conquer and control northern England.
uk  government  lidar  open-data  data  roman  history  mapping  geodata 
february 2016 by jm
Plotly
Online chart maker for CSV and Excel data; make charts and dashboards online. One really nice feature is that charts made this way get permalinks, and can be easily inlined as PNGs or HTML5 divs. (See https://www.vividcortex.com/blog/analyzing-sparks-mpp-scalability-with-the-usl for an example.)
data  javascript  python  tools  visualization  dataviz  charts  graphing  web  plotly  plots  graphs 
january 2016 by jm
Roads to Rome
'At least for Europe it is obvious: All roads lead to Rome! You can reach the eternal city on almost 500.000 routes from all across the continent. Which road would you take?
To approach one of the biggest unsolved quests of mobility, the first question we asked ourselves was: Where do you start, when you want to know every road to Rome? We aligned starting points in a 26.503.452 km² grid covering all of Europe. Every cell of this grid contains the starting point to one of our journeys to Rome.
Now that we have our 486.713 starting points we need to find out how we could reach Rome as our destination. For this we created an algorithm that calculates one route for every trip. The more often a single street segment is used, the stronger it is drawn on the map. The maps that come out of this project sit somewhere between information visualization and data art, unveiling mobility at a very large scale.'

Beautiful! Decent-sized prints available for 26 euros too.
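The "more often a segment is used, the stronger it is drawn" step can be sketched as a simple tally over routes. This is only an illustration of the idea, not the project's code: routes are assumed to be lists of node names, segments are treated as undirected, and the linear width scaling is my own choice.

```python
from collections import Counter

def segment_weights(routes):
    """Count how many routes traverse each undirected street segment."""
    counts = Counter()
    for route in routes:
        # Consecutive node pairs along a route are its segments.
        for a, b in zip(route, route[1:]):
            counts[frozenset((a, b))] += 1
    return counts

def line_width(count, max_count, max_width=5.0):
    """Scale a segment's usage count linearly to a drawing width."""
    return max_width * count / max_count
```

With hundreds of thousands of routes, a logarithmic scale would probably read better on a map than the linear one used here.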
to-get  tobuy  rome  mapping  data  maps  europe  art 
december 2015 by jm
One of the Largest Hacks Yet Exposes Data on Hundreds of Thousands of Kids | Motherboard
VTech got hacked, and millions of parents and 200,000 kids had their privacy breached as a result. Bottom line is summed up by this quote from one affected parent:
“Why do you need know my address, why do you need to know all this information just so I can download a couple of free books for my kid on this silly pad thing? Why did they have all this information?”


Quite. Better off simply not to have the data in the first place!
vtech  privacy  data-protection  data  hacks 
november 2015 by jm
No Harm, No Fowl: Chicken Farm Inappropriate Choice for Data Disposal
That’s a lesson that Spruce Manor Special Care Home in Saskatchewan had to learn the hard way (as surprising as that might sound). As a trustee with custody of personal health information, Spruce Manor was required under section 17(2) of the Saskatchewan Health Information Protection Act to dispose of its patient records in a way that protected patient privacy. So, when Spruce Manor chose a chicken farm for the job, it found itself the subject of an investigation by the Saskatchewan Information and Privacy Commissioner.  In what is probably one of the least surprising findings ever, the commissioner wrote in his final report that “I recommend that Spruce Manor […] no longer use [a] chicken farm to destroy records”, and then for good measure added “I find using a chicken farm to destroy records unacceptable.”
data  law  privacy  funny  chickens  farming  via:pinboard  data-protection  health  medical-records 
november 2015 by jm
Open-sourcing PalDB, a lightweight companion for storing side data
a new LinkedIn open source data store, for write-once/read-mainly side data; written in Java, Apache licensed.

RocksDB discussion: https://www.facebook.com/groups/rocksdb.dev/permalink/834956096602906/
linkedin  open-source  storage  side-data  data  config  paldb  java  apache  databases 
october 2015 by jm
England opens up 11TB of LiDAR data covering the entire country as open data
All 11 terabytes of our LIDAR data (that’s roughly equivalent to 2,750,000 MP3 songs) will eventually be available through our new Open LIDAR portal under an Open Government Licence, allowing it to be used for any purpose. We hope that by giving free access to our data, businesses and local communities will develop innovative solutions to benefit the environment, grow our thriving rural economy, and boost our world-leading food and farming industry. The possibilities are endless and we hope that making LIDAR data open will be a catalyst for new ideas and innovation.


Are you reading, Ordnance Survey Ireland?
data  maps  uk  lidar  mapping  geodata  open-data  ogl 
october 2015 by jm
After Bara: All your (Data)base are belong to us
Sounds like the CJEU's Bara decision may cause problems for the Irish government's wilful data-sharing:
Articles 10, 11 and 13 of Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995, on the protection of individuals with regard to the processing of personal data and on the free movement of such data, must be interpreted as precluding national measures, such as those at issue in the main proceedings, which allow a public administrative body of a Member State to transfer personal data to another public administrative body and their subsequent processing, without the data subjects having been informed of that transfer or processing.
data  databases  bara  cjeu  eu  law  privacy  data-protection 
october 2015 by jm
The Totally Managed Analytics Pipeline: Segment, Lambda, and Dynamo
notable mainly for the details of Terraform support for Lambda: that's a significant improvement to Lambda's production-readiness
aws  pipelines  data  streaming  lambda  dynamodb  analytics  terraform  ops 
october 2015 by jm
5 takeaways from the death of safe harbor – POLITICO
Reacting to the ruling, the [EC] stressed that data transfers between the U.S. and Europe can continue on the basis of other legal mechanisms.

A lot rides on what steps the Commission and national data protection supervisors take in response. “It is crucial for legal certainty that the EC sends a clear signal,” said Nauwelaerts.

That could involve providing a timeline for concluding an agreement with U.S. authorities, together with a commitment from national data protection authorities not to block data transfers while negotiations are on-going, he explained.
safe-harbor  data  privacy  eu  ec  snowden  law  us 
october 2015 by jm
Elasticsearch and data loss
"@alexbfree @ThijsFeryn [ElasticSearch is] fine as long as data loss is acceptable. https://aphyr.com/posts/317-call-me-maybe-elasticsearch . We lose ~1% of all writes on average."
elasticsearch  data-loss  reliability  data  search  aphyr  jepsen  testing  distributed-systems  ops 
october 2015 by jm
EU court adviser: data-share deal with U.S. is invalid | Reuters
The Safe Harbor agreement does not do enough to protect EU citizens' private information when it reaches the United States, Yves Bot, Advocate General at the European Court of Justice (ECJ), said. While his opinions are not binding, they tend to be followed by the court's judges, who are currently considering a complaint about the system in the wake of revelations from ex-National Security Agency contractor Edward Snowden of mass U.S. government surveillance.
safe-harbor  law  eu  ec  ecj  snowden  surveillance  privacy  us  data  max-schrems 
september 2015 by jm
Miller
'like sed, awk, cut, join, and sort for name-indexed data such as CSV'


Written in "modern C" with zero runtime dependencies. Looks great
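For a feel of what "name-indexed" buys you, here is a sketch of a sort-then-cut over named columns. It is not Miller itself and makes no claims about Miller's command syntax; the function name and sample columns are invented, and values are compared as strings.

```python
import csv
import io

def sort_and_cut(csv_text, sort_key, keep):
    """Sort records by a named column, then keep only the named columns."""
    records = list(csv.DictReader(io.StringIO(csv_text)))
    # Columns are addressed by header name, never by position.
    records.sort(key=lambda rec: rec[sort_key])
    return [{name: rec[name] for name in keep} for rec in records]
```

Addressing columns by name means the pipeline keeps working when someone reorders or adds columns upstream, which positional tools like cut cannot promise.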
cli  csv  unix  miller  tsv  data  tools 
august 2015 by jm
The world beyond batch: Streaming 101 - O'Reilly Media
To summarize, in this post I’ve:

Clarified terminology, specifically narrowing the definition of “streaming” to apply to execution engines only, while using more descriptive terms like unbounded data and approximate/speculative results for distinct concepts often categorized under the “streaming” umbrella.

Assessed the relative capabilities of well-designed batch and streaming systems, positing that streaming is in fact a strict superset of batch, and that notions like the Lambda Architecture, which are predicated on streaming being inferior to batch, are destined for retirement as streaming systems mature.

Proposed two high-level concepts necessary for streaming systems to both catch up to and ultimately surpass batch, those being correctness and tools for reasoning about time, respectively.

Established the important differences between event time and processing time, characterized the difficulties those differences impose when analyzing data in the context of when they occurred, and proposed a shift in approach away from notions of completeness and toward simply adapting to changes in data over time.

Looked at the major data processing approaches in common use today for bounded and unbounded data, via both batch and streaming engines, roughly categorizing the unbounded approaches into: time-agnostic, approximation, windowing by processing time, and windowing by event time.
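Of the approaches listed, windowing by event time is the one that most benefits from a concrete picture. A minimal sketch, assuming events are `(event_time_seconds, payload)` tuples that may arrive in any order; fixed windows are used for simplicity, and lateness or watermarks are not handled.

```python
from collections import defaultdict

def window_counts(events, window_seconds):
    """Count events per fixed window, keyed by the window's start in event time."""
    counts = defaultdict(int)
    for event_time, _payload in events:
        # The window is chosen by when the event happened, not when it arrived.
        window_start = (event_time // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)
```

Windowing by processing time would instead bucket on arrival time, so the same out-of-order input could give different counts from run to run.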
streaming  batch  big-data  lambda-architecture  dataflow  event-processing  cep  millwheel  data  data-processing 
august 2015 by jm
minimaxir/big-list-of-naughty-strings
Late to this one -- a nice list of bad input (Unicode zero-width spaces, etc) for testing
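A quick example of why zero-width characters belong in a test suite: Python's `str.strip()` does not treat them as whitespace, so a "blank" check built on it can be fooled. The helper names and the particular set of code points below are my own choices, not taken from the list.

```python
# A few zero-width code points: ZWSP, ZWNJ, ZWJ and the BOM / ZWNBSP.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def strip_zero_width(text):
    """Remove zero-width characters that survive an ordinary str.strip()."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)

def is_effectively_blank(text):
    """True when nothing visible remains after removing zero-width characters."""
    return strip_zero_width(text).strip() == ""
```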
testing  strings  text  data  unicode  utf-8  tests  input  corrupt 
august 2015 by jm
"Customer data is a liability, not an asset."
Great turn of phrase from Matthew Green (@matthew_d_green). Emin Gün Sirer adds some detail: "well, an asset with bounded value, and an unbounded liability"
data  privacy  data-protection  ashleymadison  hacks  security  liability 
july 2015 by jm
Government forum to discuss increasing use of personal data
Mr Murphy said it was the Government’s objective for Ireland to be a leader on data protection and data-related issues.
The members of the forum include Data Protection Commissioner Helen Dixon, John Barron, chief technology officer with the Revenue Commissioners, Seamus Carroll, head of civil law reform division at the Department of Justice and Tim Duggan, assistant secretary with the Department of Social Protection.
Gary Davis, director of privacy and law enforcement requests with Apple, is also on the forum. Mr Davis is a former deputy data protection commissioner in Ireland.
There are also representatives from Google, Twitter, LinkedIn and Facebook, from the IDA, the Law Society and the National Statistics Board.
Chair of Digital Rights Ireland Dr TJ McIntyre and Dr Eoin O’Dell, associate professor, School of Law, Trinity College Dublin are also on the voluntary forum.
ireland  government  dri  law  privacy  data  data-protection  dpc 
july 2015 by jm
Google Photos - Can I get out?
what's the export policy for Google's new Photos service? pretty good, it turns out
google  export  data  google-photos  photos  archive  history  storage 
june 2015 by jm
Elements of Scale: Composing and Scaling Data Platforms
Great, encyclopedic blog post rounding up common architectural and algorithmic patterns used in scalable data platforms. Cut out and keep!
architecture  storage  databases  data  big-data  scaling  scalability  ben-stopford  cqrs  druid  parquet  columnar-stores  lambda-architecture 
may 2015 by jm
streamtools: a graphical tool for working with streams of data | nytlabs
Visual programming, Yahoo! Pipes style, back again:
we have created streamtools – a new, open source project by The New York Times R&D Lab which provides a general purpose, graphical tool for dealing with streams of data. It provides a vocabulary of operations that can be connected together to create live data processing systems without the need for programming or complicated infrastructure. These systems are assembled using a visual interface that affords both immediate understanding and live manipulation of the system.


via Aman
via:akohli  streaming  data  nytimes  visual-programming  coding 
may 2015 by jm
Call me maybe: Elasticsearch 1.5.0
tl;dr: Elasticsearch still hoses data integrity on partition, badly
elasticsearch  reliability  data  storage  safety  jepsen  testing  aphyr  partition  network-partitions  cap 
may 2015 by jm
In the privacy of your own home
I didn't know about this:
Last spring, as 41,000 runners made their way through the streets of Dublin in the city’s Women’s Mini Marathon, an unassuming redheaded man by the name of Candid Wueest stood on the sidelines with a scanner. He had built it in a couple of hours with $75 worth of parts, and he was using it to surreptitiously pick up data from activity trackers worn on the runners’ wrists. During the race, Wueest managed to collect personal info from 563 racers, including their names, addresses, and passwords, as well as the unique IDs of the devices they were carrying.
dublin  candid-wueest  privacy  data  marathon  running  iot  activity-trackers 
may 2015 by jm
Ask the Decoder: Did I sign up for a global sleep study?
How meaningful is this corporate data science, anyway? Given the tech-savvy people in the Bay Area, Jawbone likely had a very dense sample of Jawbone wearers to draw from for its Napa earthquake analysis. That allowed it to look at proximity to the epicenter of the earthquake from location information.

Jawbone boasts its sample population of roughly “1 million Up wearers who track their sleep using Up by Jawbone.” But when looking into patterns county by county in the U.S., Jawbone states, it takes certain statistical liberties to show granularity while accounting for places where there may not be many Jawbone users.

So while Jawbone data can show us interesting things about sleep patterns across a very large population, we have to remember how selective that population is. Jawbone wearers are people who can afford a $129 wearable fitness gadget and the smartphone or computer to interact with the output from the device.

Jawbone is sharing what it learns with the public, but think of all the public health interests or other third parties that might be interested in other research questions from a large scale data set. Yet this data is not collected with scientific processes and controls and is not treated with the rigor and scrutiny that a scientific study requires.

Jawbone and other fitness trackers don’t give us the option to use their devices while opting out of contributing to the anonymous data sets they publish. Maybe that ought to change.
jawbone  privacy  data-protection  anonymization  aggregation  data  medicine  health  earthquakes  statistics  iot  wearables 
march 2015 by jm
VividCortex uses K-Means Clustering to discover related metrics
After selecting an interesting spike in a metric, the algorithm can automate picking out a selection of other metrics which spiked at the same time. I can see that being pretty damn useful
metrics  k-means-clustering  clustering  algorithms  discovery  similarity  vividcortex  analysis  data 
march 2015 by jm
Can we have medical privacy, cloud computing and genomics all at the same time?
Today sees the publication of a report I [Ross Anderson] helped to write for the Nuffield Bioethics Council on what happens to medical ethics in a world of cloud-based medical records and pervasive genomics.

As the information we gave to our doctors in private to help them treat us is now collected and treated as an industrial raw material, there has been scandal after scandal. From failures of anonymisation through unethical sales to the care.data catastrophe, things just seem to get worse. Where is it all going, and what must a medical data user do to behave ethically?

We put forward four principles. First, respect persons; do not treat their confidential data like it were coal or bauxite. Second, respect established human-rights and data-protection law, rather than trying to find ways round it. Third, consult people who’ll be affected or who have morally relevant interests. And fourth, tell them what you’ve done – including errors and security breaches.
ethics  medicine  health  data  care.data  privacy  healthcare  ross-anderson  genomics  data-protection  human-rights 
february 2015 by jm
Excellent example of failed "anonymisation" of a dataset
Fred Logue notes how this failed Mayo TD Michelle Mulherin:
From recent reports it now appears that the Department of Education is discussing anonymisation of the Primary Online Database with the Data Protection Commissioner. Well someone should ask Mayo TD Michelle Mulherin how anonymisation is working for her.

The Sunday Times reports that Ms Mulherin was the only TD in the Irish parliament on the dates when expensive phone calls were made to a mobile number in Kenya. The details of the calls were released under the Freedom of Information Act in an “anonymised” database. While it must be said the fact that Ms Mulherin was the only TD present on those occasions does not prove she made the calls – the reporting in the press is now raising the possibility that it was her.

From a data protection point of view this is a perfect example of the difficulty with anonymisation. Data protection rules apply to personal data which is defined as data relating to a living individual who is or can be identified from the data or from the data in conjunction with other information. Anonymisation is often cited as a means for processing data outside the scope of data protection law but as Ms Mulherin has discovered individuals can be identified using supposedly anonymised data when analysed in conjunction with other data.

In the case of the mysterious calls to Kenya even though the released information was “anonymised” to protect the privacy of public representatives, the phone log used in combination with the attendance record of public representatives and information on social media was sufficient to identify individuals and at least raise evidence of association between individuals and certain phone calls. While this may be well and good in terms of accounting for abuses of the phone service it also has worrying implications for the ability of public representatives to conduct their business in private.

The bottom line is that anonymisation is very difficult if not impossible as Ms Mulherin has learned to her cost. It certainly is a lot more complex than simply removing names and other identifying features from a single dataset. The more data that there is and the more diverse the sources the greater the risk that individuals can be identified from supposedly anonymised datasets.
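The linkage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the attendance records, dates and member labels below are hypothetical placeholders, not data from the FOI release, and `narrow_candidates` is a made-up helper name.

```python
def narrow_candidates(attendance_by_date, call_dates):
    """Intersect the sets of people present on every date a call was logged."""
    candidates = None
    for date in call_dates:
        present = set(attendance_by_date.get(date, set()))
        # First date seeds the candidate set; later dates can only shrink it.
        candidates = present if candidates is None else candidates & present
    return candidates if candidates is not None else set()


# Hypothetical second dataset: who was present on which day.
attendance = {
    "day-1": {"member-a", "member-b", "member-c"},
    "day-2": {"member-b", "member-c"},
    "day-3": {"member-c", "member-d"},
}

# Hypothetical "anonymised" call log: dates only, no names.
call_log_dates = ["day-1", "day-2", "day-3"]

print(narrow_candidates(attendance, call_log_dates))
```

Neither dataset names a caller by itself; the intersection across dates is what narrows the candidates, which is the point the quoted post makes about combining sources.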
data  anonymisation  fred-logue  ireland  michelle-mulherin  tds  kenya  data-protection  privacy 
january 2015 by jm
Facette
Really nice time series dashboarding app. Might consider replacing graphitus with this...
time-series  data  visualisation  graphs  ops  dashboards  facette 
january 2015 by jm
Kimono
'Turn websites into structured APIs from your browser in seconds' -- next-generation web scraping, recommended by conoro
via:conoro  scraping  web  http  kimono  rss  json  csv  data 
january 2015 by jm
When data gets creepy: the secrets we don’t realise we’re giving away | Technology | The Guardian
Very good article around the privacy implications of derived and inferred aggregate metadata from Ben Goldacre.
We are entering an age – which we should welcome with open arms – when patients will finally have access to their own full medical records online. So suddenly we have a new problem. One day, you log in to your medical records, and there’s a new entry on your file: “Likely to die in the next year.” We spend a lot of time teaching medical students to be skilful around breaking bad news. A box ticked on your medical records is not empathic communication. Would we hide the box? Is that ethical? Or are “derived variables” such as these, on a medical record, something doctors should share like anything else?
advertising  ethics  privacy  security  law  data  aggregation  metadata  ben-goldacre 
december 2014 by jm
Only 10% of serious cycling injuries in Ireland were recorded by Gardai
The Bedford Report for the HSE in 2011 showed that only approximately 10% of serious injuries (with hospital admission to a bed) incurred by cyclists in road traffic collisions were recorded by Gardai. If a cyclist is knocked off his/her bike from impact with a motorised vehicle that is a potential criminal offence if serious injury results. Cyclists expect all such RTCs to be properly and fully investigated and recorded with appropriate follow-up. That clearly is not happening at present. Acute hospitals need to document all admission cases arising from cyclist RTCs and inform the Gardai of them.
garda  police  ireland  cycling  injuries  accidents  reporting  data  bedford-report  hse  hospital 
november 2014 by jm
The problem of managing schemas
Good post on the pain of using CSV/JSON as a data interchange format:
eventually, the schema changes. Someone refactors the code generating the JSON and moves fields around, perhaps renaming a few fields. The DBA added new columns to a MySQL table and this reflects in the CSVs dumped from the table. Now all those applications and scripts must be modified to handle both file formats. And since schema changes happen frequently, and often without warning, this results in both ugly and unmaintainable code, and in grumpy developers who are tired of having to modify their scripts again and again.
schema  json  avro  protobuf  csv  data-formats  interchange  data  hadoop  files  file-formats 
november 2014 by jm
Madhumita Venkataramanan: My identity for sale (Wired UK)
If the data aggregators know everything about you -- including biometric data, healthcare history, where you live, where you work, what you do at the weekend, what medicines you take, etc. -- and can track you as an individual, does it really matter that they don't know your _name_? They legally track, and sell, everything else.
As the data we generate about ourselves continues to grow exponentially, brokers and aggregators are moving on from real-time profiling -- they're cross-linking data sets to predict our future behaviour. Decisions about what we see and buy and sign up for aren't made by us any more; they were made long before. The aggregate of what's been collected about us previously -- which is near impossible for us to see in its entirety -- defines us to companies we've never met. What I am giving up without consent, then, is not just my anonymity, but also my right to self-determination and free choice. All I get to keep is my name.
wired  privacy  data-aggregation  identity-theft  future  grim  biometrics  opt-out  healthcare  data  data-protection  tracking 
november 2014 by jm
Riding with the Stars: Passenger Privacy in the NYC Taxicab Dataset
A practical demo of "differential privacy" -- allowing public data dumps to happen without leaking privacy, using Laplace noise addition
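A minimal sketch of Laplace noise addition for a counting query, in Python. It is not taken from the linked post: the function names, the `epsilon` values and the example count are assumptions for illustration. A count changes by at most 1 when one person is added or removed, so the sensitivity is assumed to be 1 and the noise scale is `sensitivity / epsilon`.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution centred on zero.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random()
true_pickups = 12  # hypothetical number of pickups at one location

# Smaller epsilon means more noise and stronger privacy.
for epsilon in (0.1, 1.0, 10.0):
    print(epsilon, round(noisy_count(true_pickups, epsilon, rng), 2))
```

The published figure is the noisy one; the trade-off is that small counts become unreliable, which is exactly what hides an individual trip.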
differential-privacy  privacy  leaks  public-data  open-data  data  nyc  taxis  laplace  noise  randomness 
september 2014 by jm
"Pitfalls of Object Oriented Programming", SCEE R&D
Good presentation discussing "data-oriented programming" -- the concept of optimizing memory access speed by laying out large data in a columnar format in RAM, rather than naively in the default layout that OOP design suggests
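A rough sketch of the two layouts, in Python rather than the C++ of the presentation. The `Particle` example and all names are hypothetical. Python hides real cache behaviour, so this only shows the shape of the idea: one object per record versus one contiguous typed array per field.

```python
from array import array


class Particle:
    """Row-style layout: every record is its own heap object."""

    def __init__(self, x, y, mass):
        self.x = x
        self.y = y
        self.mass = mass


class ParticleColumns:
    """Column-style layout: one contiguous array of doubles per field."""

    def __init__(self):
        self.x = array("d")
        self.y = array("d")
        self.mass = array("d")

    def append(self, x, y, mass):
        self.x.append(x)
        self.y.append(y)
        self.mass.append(mass)

    def __len__(self):
        return len(self.mass)


def total_mass_rows(particles):
    # Visits every object even though only one field is needed.
    return sum(p.mass for p in particles)


def total_mass_columns(columns):
    # Scans a single contiguous buffer that holds only the masses.
    return sum(columns.mass)
```

In a language with control over memory layout, the second form keeps the data a loop actually reads packed together, which is the access pattern the presentation argues for.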
columnar  ram  memory  optimization  coding  c++  oop  data-oriented-programming  data  cache  performance 
july 2014 by jm
173 million 2013 NYC taxi rides shared on BigQuery : bigquery
Interesting! (a) there's a subreddit for Google BigQuery, with links to interesting data sets, like this one; (b) the entire 173-million-row dataset for NYC taxi rides in 2013 is available for querying; and (c) the tip percentage histogram is cool.
datasets  bigquery  sql  google  nyc  new-york  taxis  data  big-data  histograms  tipping 
july 2014 by jm
Questioning the Lambda Architecture
Jay Kreps (Kafka, Samza) with a thought-provoking post on the batch/stream-processing dichotomy
jay-kreps  toread  architecture  data  stream-processing  batch  hadoop  storm  lambda-architecture 
july 2014 by jm
NYC generates hash-anonymised data dump, which gets reversed
There are about 1000*26**3 = 21952000 or 22M possible medallion numbers. So, by calculating the md5 hashes of all these numbers (only 24M!), one can completely deanonymise the entire data. Modern computers are fast: so fast that computing the 24M hashes took less than 2 minutes.


(via Bruce Schneier)

The better fix is an HMAC (see http://benlog.com/2008/06/19/dont-hash-secrets/ ), or just to assign opaque IDs instead of hashing.
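A small Python sketch of both halves of this, using only `hashlib` and `hmac` from the standard library. The ID format here (one letter plus two digits, 2,600 values) is a deliberately tiny, hypothetical stand-in for the real medallion formats, and the function names and key are made up. As a side note, 1000*26**3 evaluates to 17,576,000, so the quoted totals are best read as rough estimates; the attack works the same way at either size.

```python
import hashlib
import hmac
import string


def md5_hex(text):
    """Unkeyed hash, as in the reversed data dump."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def candidate_ids():
    """Yield every ID of the hypothetical form: one letter, two digits."""
    for letter in string.ascii_uppercase:
        for number in range(100):
            yield f"{letter}{number:02d}"


def build_reverse_table():
    """Hash the whole ID space once, giving a hash -> ID lookup table."""
    return {md5_hex(candidate): candidate for candidate in candidate_ids()}


def keyed_pseudonym(identifier, secret_key):
    """HMAC-SHA256 of the ID; useless to an attacker without the key."""
    return hmac.new(secret_key, identifier.encode("ascii"), hashlib.sha256).hexdigest()


table = build_reverse_table()
leaked = md5_hex("Q42")          # what the published dump would contain
print(table[leaked])             # the plain hash is reversed by lookup

secret = b"hypothetical-key-kept-out-of-the-dump"
print(keyed_pseudonym("Q42", secret) in table)
```

Enumerating the ID space no longer helps once the pseudonym depends on a secret key, provided the key itself never ships with the data.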
hashing  sha1  md5  bruce-schneier  anonymization  deanonymization  security  new-york  nyc  taxis  data  big-data  hmac  keyed-hashing  salting 
june 2014 by jm
The MtGox 500
'On March 9th a group posted a data leak, which included the trading history of all MtGox users from April 2011 to November 2013. The graphs below explore the trade behaviors of the 500 highest volume MtGox users from the leaked data set. These are the Bitcoin barons, wealthy speculators, dueling algorithms, greater fools, and many more who took bitcoin to the moon.'
dataviz  stamen  bitcoin  data  leaks  mtgox  greater-fools 
march 2014 by jm
Analyzing Citibike Usage
Abe Stanway crunches the stats on Citibike usage in NYC, compared to the weather data from Wunderground.
data  correlation  statistics  citibike  cycling  nyc  data-science  weather 
march 2014 by jm
Health privacy: formal complaint to ICO
'Light Blue Touchpaper' notes:
Three NGOs have lodged a formal complaint to the Information Commissioner about the fact that PA Consulting uploaded over a decade of UK hospital records to a US-based cloud service. This appears to have involved serious breaches of the UK Data Protection Act 1998 and of multiple NHS regulations about the security of personal health information.


Let's see if ICO can ever do anything useful.... not holding my breath
ico  privacy  data-protection  dpa  nhs  health  data  ross-anderson 
march 2014 by jm
Care.data is in chaos. It breaks my heart | Ben Goldacre
There are people in my profession who think they can ignore this problem. Some are murmuring that this mess is like MMR, a public misunderstanding to be corrected with better PR. They are wrong: it's like nuclear power. Medical data, rarefied and condensed, presents huge power to do good, but it also presents huge risks. When leaked, it cannot be unleaked; when lost, public trust will take decades to regain.

This breaks my heart. I love big medical datasets, I work on them in my day job, and I can think of a hundred life-saving uses for better ones. But patients' medical records contain secrets, and we owe them our highest protection. Where we use them – and we have used them, as researchers, for decades without a leak – this must be done safely, accountably, and transparently. New primary legislation, governing who has access to what, must be written: but that's not enough. We also need vicious penalties for anyone leaking medical records; and HSCIC needs to regain trust, by releasing all documentation on all past releases, urgently. Care.data needs to work: in medicine, data saves lives.
hscic  nhs  care.data  data  privacy  data-protection  medicine  hospitals  pr 
march 2014 by jm
Welcome to Algorithmic Prison - Bill Davidow - The Atlantic
"Computer says no", taken to the next level.
Even if an algorithmic prisoner knows he is in a prison, he may not know who his jailer is. Is he unable to get a loan because of a corrupted file at Experian or Equifax? Or could it be TransUnion? His bank could even have its own algorithms to determine a consumer’s creditworthiness. Just think of the needle-in-a-haystack effort consumers must undertake if they are forced to investigate dozens of consumer-reporting companies, looking for the one that threw them behind algorithmic bars. Now imagine a future that contains hundreds of such companies. A prisoner might not have any idea as to what type of behavior got him sentenced to a jail term. Is he on an enhanced screening list at an airport because of a trip he made to an unstable country, a post on his Facebook page, or a phone call to a friend who has a suspected terrorist friend?
privacy  data  big-data  algorithms  machine-learning  equifax  experian  consumer  society  bill-davidow 
february 2014 by jm
Hospital records of all NHS patients sold to insurers - Telegraph
The 274-page report describes the NHS Hospital Episode Statistics as a “valuable data source in developing pricing assumptions for 'critical illness’ cover.”
It says that by combining hospital data with socio-economic profiles, experts were able to better calculate the likelihood of conditions, with “amazingly” clear forecasts possible for certain diseases, in particular lung cancer.
Phil Booth, from privacy campaign group medConfidential, said: “The language in the document is extraordinary; this isn’t about patients, this is about exploiting a market. Of course any commercial organisation will focus on making a profit – the question is why is the NHS prepared to hand this data over?”
nhs  privacy  data  insurance  uk  politics  data-protection 
february 2014 by jm
Realtime water level data across Ireland
Some very nice Dygraph-based time-series graphs in here, along with open CSV data. Good job!
open-data  water-levels  time-series  data  rivers  ireland  csv 
february 2014 by jm
How to invoke section 4 of the Data Protection Acts in Ireland
One weird trick to get your personal data (in any format) from any random organisation, for only EUR6.35 and up to 40 days' wait! Good to know.
Hospitals and doctors’ offices in Ireland will give a person their medical records if they ask for them. Mostly. Eventually. When they get to it. And, sometimes, if you pay them over €100 (for a large file).

But, like so much else in the legal world, there is a set of magic words you can incant to place a 40 day deadline on the delivery of your papers and limit the cost to €6.35 -- you invoke the Data Protection Acts data access request procedure.
data-protection  privacy  data-retention  dpa-section-4  data  ireland  medical  law  dpa 
february 2014 by jm
How to Name a Baby
some good data (and graphs) on baby names (via Ruth)
via:ruth  babies  naming  graphs  dataviz  data  usa  names 
january 2014 by jm
UK NHS will soon require GPs pass confidential medical data to third parties
Specifically, unanonymised, confidential, patient-identifying data, for purposes of "admin, healthcare planning, and research", to be held indefinitely, via the HSCIC. Opt-outs may be requested, however
opt-out  privacy  medical  data  healthcare  nhs  uk  data-privacy  data-protection 
january 2014 by jm
Factual/drake
a simple-to-use, extensible, text-based data workflow tool that organizes command execution around data and its dependencies. Data processing steps are defined along with their inputs and outputs and Drake automatically resolves their dependencies. [...] Drake is similar to GNU Make, but designed especially for data workflow management. It has HDFS [and S3] support, allows multiple inputs and outputs, and includes a host of features designed to help you bring sanity to your otherwise chaotic data processing workflows.


Via Nelson. Looks interesting, although I'd like to see more features around retries, single-executor locking, parallelism, alerting/metrics, and unattended cron-like operation -- those are always the hard part when I wind up coding up a data pump.
make  data  data-pump  drake  via:nelson  pipelines  workflow 
november 2013 by jm