big-data   3490

airflow + ETL + MPP SQL DB + redash/superset
3 days ago by FTS
Big Data Surveillance: The Case of Policing - Sarah Brayne, 2017
Tracks the intertwined growth of surveillance and "big data" within the Los Angeles Police Department through interviews and observation. Identifies five areas of transformation:

1. discretionary risk assessments -> quantified risk assessment: data that is input to these assessments includes the results of "field interviews," creating a feedback loop since "high risk" individuals are stopped for interviews more frequently, but each police contact adds to a person's risk score

2. reactive/explanatory analysis -> predictive analysis (e.g. PredPol)

3. query-based -> alert-based systems: the main example here is Palantir, which allows you to set alerts based on individuals, locations, or other characteristics present in real-time (often high-frequency) databases. Queries themselves become data, as the fact that someone has been searched by other officers using the system can itself be flagged

4. lower database inclusion thresholds: law enforcement databases have expanded beyond individuals who have direct contact with police through arrests or stops. They now include data collected during stop-and-frisk and risk-based field interviews; field interview data can include information on who else was with the person of interest, even though those people had no direct police contact; and automated license plate readers (ALPR) collect data constantly

5. integration of different data systems -- merging data across data stores and creating unique identifiers across systems, which governments might be interested in for the purpose of improving service delivery, also transforms the nature of surveillance -- interviewees rave about their Palantir software, which lets them see everything in one place. In addition to data from public agencies including law enforcement, social services, health/mental health services, and child/family services, the paper also mentions Palantir's constant inclusion of new data sources -- repossession/collections agencies, social media, foreclosure, electronic toll data, utility bills, pay parking lots, fast food call data, university camera feeds, rebate data . . . "In some instances, it is simply easier for law enforcement to purchase privately collected data than to rely on in-house data because there are fewer constitutional protections . . .". Much of the newly integrated data suffers from related types of inclusion bias (e.g. your chances of appearing in stop-and-frisk data differ based on race and class, the same is true for usage of social, health, and family services, and even the placement and usage of ALPRs is based on measured crime rates), so that in all, these systems come to define and mark a population as suspicious (the only responses to queries will be people already in the data in some way).
surveillance  machine-learning  big-data  police  police-data  palantir  lapd  los-angeles 
5 days ago by tarakc02
Fast generalised linear models by database sampling and one-step polishing: Journal of Computational and Graphical Statistics: Vol 0, No ja
"In this note, I show how to fit a generalised linear model to N observations on p variables stored in a relational database, using one sampling query and one aggregation query, as long as N^(1/2+δ) observations can be stored in memory, for some δ>0. The resulting estimator is fully efficient and asymptotically equivalent to the maximum likelihood estimator, and so its variance can be estimated from the Fisher information in the usual way. A proof-of-concept implementation uses R with MonetDB and with SQLite, and could easily be adapted to other popular databases. I illustrate the approach with examples of taxi-trip data in New York City and factors related to car colour in New Zealand."
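The two-query idea in the abstract can be sketched roughly as follows. This is a minimal illustration in Python with the standard-library sqlite3 module, not the paper's R implementation or the dbglm package. It assumes a logistic model with one covariate, a hypothetical table `obs(x, y)`, and a user-defined SQL function `expit` registered from Python; all of these names are invented for the example. The sample is fitted in memory by Fisher scoring, then a single aggregation query over the full table supplies the score and information sums for one further scoring step.

```python
import math
import random
import sqlite3

# Aggregation query: score and Fisher information sums at a given (b0, b1).
# The table "obs" and the SQL function "expit" are assumptions of this sketch.
AGG_SQL = """
SELECT SUM(y - mu), SUM((y - mu) * x),
       SUM(mu * (1 - mu)), SUM(mu * (1 - mu) * x), SUM(mu * (1 - mu) * x * x)
FROM (SELECT x, y, expit(? + ? * x) AS mu FROM obs)
"""


def expit(z):
    """Logistic function, the inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def newton_step(beta, agg):
    """One Fisher scoring update from the five aggregated sums."""
    u0, u1, a, b, c = agg  # score (u0, u1); information [[a, b], [b, c]]
    det = a * c - b * b
    return (beta[0] + (c * u0 - b * u1) / det,
            beta[1] + (a * u1 - b * u0) / det)


def aggregate(rows, beta):
    """The same sums as AGG_SQL, computed in Python for in-memory rows."""
    u0 = u1 = a = b = c = 0.0
    for x, y in rows:
        mu = expit(beta[0] + beta[1] * x)
        w = mu * (1.0 - mu)
        u0 += y - mu
        u1 += (y - mu) * x
        a += w
        b += w * x
        c += w * x * x
    return u0, u1, a, b, c


def fit_in_memory(rows, iters=25):
    """Iterate Fisher scoring on rows that fit in memory."""
    beta = (0.0, 0.0)
    for _ in range(iters):
        beta = newton_step(beta, aggregate(rows, beta))
    return beta


def one_step_glm(conn, sample_size):
    """Query 1 draws a random sample; query 2 aggregates over the full table."""
    sample = conn.execute(
        "SELECT x, y FROM obs ORDER BY RANDOM() LIMIT ?", (sample_size,)
    ).fetchall()
    start = fit_in_memory(sample)                 # estimate from the sample only
    agg = conn.execute(AGG_SQL, start).fetchone() # full-data score and information
    return newton_step(start, agg)                # one polishing step


def make_demo_db(n_rows=20000, seed=1):
    """Build an in-memory table of simulated data (true coefficients -0.5, 1.0)."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.create_function("expit", 1, expit)
    conn.execute("CREATE TABLE obs (x REAL, y INTEGER)")
    rows = []
    for _ in range(n_rows):
        x = rng.gauss(0.0, 1.0)
        y = 1 if rng.random() < expit(-0.5 + 1.0 * x) else 0
        rows.append((x, y))
    conn.executemany("INSERT INTO obs VALUES (?, ?)", rows)
    return conn
```

Calling `one_step_glm(make_demo_db(), 2000)` returns an (intercept, slope) pair. In this sketch only the 2,000 sampled rows are ever held in Python; the full table is touched once, inside the database, by the aggregation query.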
glm  big-data  fisher-scoring  R  dbglm 
7 days ago by arsyed
Kedro - Python library for building robust production-ready data and analytics pipelines
A workflow development tool that helps you build data pipelines that are robust, scalable, deployable, reproducible and versioned.
Python  big-data  machine-learning  opensource  workflow 
7 days ago by liqweed
Apache Hadoop 2.9.2 – HDFS Architecture
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project. The project URL is
storage  big-data  hadoop  file-system  HDFS 
8 days ago by kOoLiNuS
ntop - High Performance Network Monitoring Solutions based on Open Source and Commodity Hardware.
Packet Capture - Wire-speed packet capture/transmission using commodity hardware with PF_RING. Zero-Copy packet distribution across threads, applications, Virtual Machines. Libpcap support for seamless integration with legacy applications.

Traffic Recording - 10 Gbit and above lossless network traffic recording with n2disk. Industry standard PCAP file format. On-the-fly indexing to quickly retrieve interesting packets using fast-BPF and time interval. Precise traffic replay with disk2n.

Network Probe - nProbe: extensible NetFlow v5/v9/IPFIX probe with plugins support for L7 content inspection. nProbe Cento: up to 100 Gbit NetFlow, traffic classification, and packet shunting for IDS/packet-to-disk acceleration.

Traffic Analysis - High-speed web-based traffic analysis and flow collection using ntopng. Persistent traffic statistics in RRD format. Layer 7 analysis by leveraging on nDPI, an Open Source DPI framework.
networking  monitoring  big-data  analytics  security  cool-tools 
22 days ago by liqweed
BigDAMA - Big Data Analytics for Network Traffic Monitoring and Analysis
The complexity of the Internet has dramatically increased in the last few years, making it more important and challenging to design scalable Network Traffic Monitoring and Analysis applications and tools. The Big-DAMA project is conceiving novel scalable techniques and big-data frameworks capable of analyzing both online network traffic data streams and offline massive traffic datasets. The team is exploring scalable online and offline data mining and machine learning-based techniques to monitor and characterize extremely large network traffic datasets.
networking  big-data  analytics  machine-learning 
22 days ago by liqweed
