huge-data-the-biggest   28

How much would it cost to crawl 1 billion sites using rented AWS servers/bandwidth? - Quora
The best way IMHO to do such a crawl would be to recruit a group of say 100-1000 of your friends, and their friends, and write a simple distributed app running in background on their machines, when they sit idle or are lightly used. This way you will be amortizing their monthly broadband bills, with their monthly quotas (e.g. Comcast 250GB) largely unused anyway. I would think that you can get dozens of Mbps of cross bandwidth in such a network, which could do the job in a matter of months.

BTW, if you really meant 1 billion sites, as opposed to pages, multiply the above bills by 100x (average number of pages per site).
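
A rough sanity check of the "matter of months" estimate above, as a minimal sketch. The 50 KB average page size and the 50 Mbps aggregate bandwidth are assumptions chosen for illustration, not figures from the answer, and `crawl_days` is a hypothetical helper name.

```python
def crawl_days(pages, avg_page_bytes, bandwidth_mbps):
    """Days needed to download `pages` pages at a sustained aggregate bandwidth."""
    total_bits = pages * avg_page_bytes * 8           # bytes -> bits
    seconds = total_bits / (bandwidth_mbps * 1_000_000)
    return seconds / 86_400                           # seconds per day

# Assumed inputs (not from the answer): 50 KB per page, 50 Mbps aggregate.
pages_estimate = crawl_days(1_000_000_000, 50_000, 50)

# "Sites" rather than pages: the answer's 100x pages-per-site multiplier.
sites_estimate = crawl_days(100 * 1_000_000_000, 50_000, 50)

print(f"1 billion pages: {pages_estimate:.0f} days")
print(f"1 billion sites: {sites_estimate / 365:.0f} years")
```

Under these assumptions a billion pages takes roughly three months, and the time scales linearly with the page count, so the 100x multiplier for sites pushes it to decades unless the bandwidth grows accordingly.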


There is no need for you to crawl. Someone has already done the job for you. Common Crawl is a periodic crawl of the internet, and the results are stored in Amazon S3. You can directly use the results without any charge for any kind of analysis you want to do.
q-n-a  qra  quixotic  programming  engineering  search  minimum-viable  internet  web  huge-data-the-biggest  howto  init  advice  money  cost-benefit  strategy  scaling-tech  system-design 
10 weeks ago by nhaliday
The United States achieved a 2.0 percent average annual growth rate of real GDP per capita between 1891 and 2007. This paper predicts that growth in the 25 to 40 years after 2007 will be much slower, particularly for the great majority of the population. Future growth will be 1.3 percent per annum for labor productivity in the total economy, 0.9 percent for output per capita, 0.4 percent for real income per capita of the bottom 99 percent of the income distribution, and 0.2 percent for the real disposable income of that group.

The primary cause of this growth slowdown is a set of four headwinds, all of them widely recognized and uncontroversial. Demographic shifts will reduce hours worked per capita, due not just to the retirement of the baby boom generation but also as a result of an exit from the labor force both of youth and prime-age adults. Educational attainment, a central driver of growth over the past century, stagnates at a plateau as the U.S. sinks lower in the world league tables of high school and college completion rates. Inequality continues to increase, resulting in real income growth for the bottom 99 percent of the income distribution that is fully half a point per year below the average growth of all incomes. A projected long-term increase in the ratio of debt to GDP at all levels of government will inevitably lead to more rapid growth in tax revenues and/or slower growth in transfer payments at some point within the next several decades.

There is no need to forecast any slowdown in the pace of future innovation for this gloomy forecast to come true, because that slowdown already occurred four decades ago. In the eight decades before 1972 labor productivity grew at an average rate 0.8 percent per year faster than in the four decades since 1972. While no forecast of a future slowdown of innovation is needed, skepticism is offered here, particularly about the techno-optimists who currently believe that we are at a point of inflection leading to faster technological change. The paper offers several historical examples showing that the future of technology can be forecast 50 or even 100 years in advance and assesses widely discussed innovations anticipated to occur over the next few decades, including medical research, small robots, 3-D printing, big data, driverless vehicles, and oil-gas fracking.

keep in mind, "the world is just atoms" and I think I know some things that Robert J Gordon doesn't
pdf  study  economics  growth-econ  prediction  big-picture  cliometrics  technology  innovation  stagnation  malaise  🎩  econ-productivity  history  mostly-modern  econ-metrics  demographics  education  inequality  monetary-fiscal  debt  government  labor  pessimism  🔬  stylized-facts  huge-data-the-biggest  zero-positive-sum  usa  automation  winner-take-all  murray  energy-resources  the-world-is-just-atoms  trends  current-events  broad-econ  info-dynamics  chart  nihil  zeitgeist  rot  the-bones  cjones-like  speedometer  whiggish-hegelian  flux-stasis  mokyr-allen-mccloskey  microfoundations 
march 2017 by nhaliday
Information Processing: Big, complicated data sets
This Times article profiles Nick Patterson, a mathematician whose career wandered from cryptography, to finance (7 years at Renaissance) and finally to bioinformatics. “I’m a data guy,” Dr. Patterson said. “What I know about is how to analyze big, complicated data sets.”

If you're a smart guy looking for something to do, there are 3 huge computational problems staring you in the face, for which the data is readily accessible.

1) human genome: 3 GB of data in a single genome; most data freely available on the Web (e.g., Hapmap stores patterns of sequence variation). Got a hypothesis about deep human history (evolution)? Test it yourself...

2) market prediction: every market tick available at zero or minimal subscription-service cost. Can you model short term movements? It's never been cheaper to build and test your model!

3) internet search: about 10^3 Terabytes of data (admittedly, a barrier to entry for an individual, but not for a startup). Can you come up with a better way to index or search it? What about peripheral problems like language translation or picture or video search?

The biggest barrier to entry is, of course, brainpower and a few years (a decade?) of concentrated learning. But the necessary books are all in the library :-)

Patterson has worked in 2 of the 3 areas listed above! Substituting crypto for internet search is understandable given his age, our cold war history, etc.
hsu  scitariat  quotes  links  news  org:rec  profile  giants  stories  huge-data-the-biggest  genomics  bioinformatics  finance  crypto  history  britain  interdisciplinary  the-trenches  🔬  questions  genetics  dataset  search  web  internet  scale  commentary  apollonian-dionysian  magnitude  examples  open-problems  big-surf  markets  securities  ORFE  nitty-gritty  quixotic  google  startups  ideas  measure  space-complexity  minimum-viable 
february 2017 by nhaliday
China invents the digital totalitarian state | The Economist
PROGRAMMING CHINA: The Communist Party’s autonomic approach to managing state security:
- The Chinese Communist Party (CCP) has developed a form of authoritarianism that cannot be measured through traditional political scales like reform versus retrenchment. This version of authoritarianism involves both “hard” and “soft” authoritarian methods that constantly act together.
- To describe the social management process, this paper introduces a new analytical framework called China’s “Autonomic Nervous System” (ANS). This approach explains China’s social management process through a complex systems engineering framework. This framework mirrors the CCP’s Leninist way of thinking.
- The framework describes four key parts of social management, visualized through ANS’s “self-configuring,” “self-healing,” “self-optimizing” and “self-protecting” objectives.

China's Social Credit System: An Evolving Practice of Control:
The Chinese government is not the only entity that has access to millions of faces + identifying information. So do Google, Facebook, Instagram, and anyone who has scraped information from similar social networks (e.g., US security services, hackers, etc.).

In light of such ML capabilities it seems clear that anti-ship ballistic missiles can easily target a carrier during the final maneuver phase of descent, using optical or infrared sensors (let alone radar).
China goes all-in on technology the US is afraid to do right.
US won't learn its lesson in time for CRISPR or AI.
Artificial intelligence is developing fast in China. But is it likely to enable the suppression of freedoms? One of China's most successful investors, Neil Shen, has a short answer to that question. Also, Chinese AI companies now have the potential to overtake their Western rivals -- we explain why. Anne McElvoy hosts with The Economist's AI expert, Tom Standage

the dude just stonewalls when asked at 7:50, completely zipped lips
What you’re looking at above is the work of SenseTime, a Chinese computer vision startup. The software in question, called SenseVideo, is a visual scenario analytics system. Basically, it can analyse video footage to pinpoint whether moving objects are humans, cars, or other entities. It’s even sophisticated enough to detect gender, clothing, and the type of vehicle it’s looking at, all in real time.

Even China’s Backwater Cities Are Going Smart:
remember that tweet with the ML readout of Chinese surveillance cameras? Get ready for the future (via @triviumchina)

XI praised the organization and promised to help it beef up its operations (China
- "China will 'help ... 100 developing countries build or upgrade communication systems and crime labs in the next five years'"
- "The Chinese government will establish an international law enforcement institute under the Ministry of Public Security which will train 20,000 police for developing nations in the coming five years"

The Chinese connection to the Zimbabwe 'coup':

China to create national name-and-shame system for ‘deadbeat borrowers’:
Anyone who fails to repay a bank loan will be blacklisted and have their personal details made public

China Snares Innocent and Guilty Alike to Build World’s Biggest DNA Database:
Police gather blood and saliva samples from many who aren’t criminals, including those who forget ID cards, write critically of the state or are just in the wrong place

Many of the ways Chinese police are collecting samples are impermissible in the U.S. In China, DNA saliva swabs or blood samples are routinely gathered from people detained for violations such as forgetting to carry identity cards or writing blogs critical of the state, according to documents from a national police DNA conference in September and official forensic journals.

Others aren’t suspected of any crime. Police target certain groups considered a higher risk to social stability. These include migrant workers and, in one city, coal miners and home renters, the documents show.


In parts of the country, law enforcement has stored DNA profiles with a subject’s other biometric information, including fingerprints, portraits and voice prints, the heads of the DNA program wrote in the Chinese journal Forensic Science and Technology last year. One provincial police force has floated plans to link the data to a person’s information such as online shopping records and entertainment habits, according to a paper presented at the national police DNA conference. Such high-tech files would create more sophisticated versions of paper dossiers that police have long relied on to keep tabs on citizens.

Marrying DNA profiles with real-time surveillance tools, such as monitoring online activity and cameras hooked to facial-recognition software, would help China’s ruling Communist Party develop an all-encompassing “digital totalitarian state,” says Xiao Qiang, adjunct professor at the University of California at Berkeley’s School of Information.


A teenage boy studying in one of the county’s high schools recalled that a policeman came into his class after lunch one day this spring and passed out the collection boxes. Male students were told to clean their mouths, spit into the boxes and place them into envelopes on which they had written their names.


Chinese police sometimes try to draw connections between ethnic background or place of origin and propensity for crime. Police officers in northwestern China’s Ningxia region studied data on local prisoners and noticed that a large number came from three towns. They decided to collect genetic material from boys and men from every clan to bolster the local DNA database, police said at the law-enforcement DNA conference in September.
China is certainly in the lead in the arena of digital-biometric monitoring. Particularly “interesting” is the proposal to merge DNA info with online behavioral profiling.
This is the thing I find the most disenchanting about the current political spectrum. It's all reheated ideas that are a century old, at least. Everyone wants to run our iPhone society with power structures dating to the abacus.
Thank God for the forward-thinking Chinese Communist Party and its high-tech social credit system!

The government thinks "social credit" will fix the country's lack of trust — and the public agrees.

To be Chinese today is to live in a society of distrust, where every opportunity is a potential con and every act of generosity a risk of exploitation. When old people fall on the street, it’s common that no one offers to help them up, afraid that they might be accused of pushing them in the first place and sued. The problem has grown steadily since the start of the country’s economic boom in the 1980s. But only recently has the deficit of social trust started to threaten not just individual lives, but the country’s economy and system of politics as a whole. The less people trust each other, the more the social pact that the government has with its citizens — of social stability and harmony in exchange for a lack of political rights — disintegrates.

All of which explains why Chinese state media has recently started to acknowledge the phenomenon — and why the government has started searching for solutions. But rather than promoting the organic return of traditional morality to reduce the gulf of distrust, the Chinese government has preferred to invest its energy in technological fixes. It’s now rolling out systems of data-driven “social credit” that will purportedly address the problem by tracking “good” and “bad” behavior, with rewards and punishments meted out accordingly. In the West, plans of this sort have tended to spark fears about the reach of the surveillance state. Yet in China, it’s being welcomed by a public fed up of not knowing who to trust.

It’s unsurprising that a system that promises to place a check on unfiltered power has proven popular — although it’s…
news  org:rec  org:biz  china  asia  institutions  government  anglosphere  privacy  civil-liberty  individualism-collectivism  org:anglo  technocracy  authoritarianism  managerial-state  intel  sinosphere  order-disorder  madisonian  orient  protocol  n-factor  internet  domestication  multi  commentary  hn  society  huge-data-the-biggest  unaffiliated  twitter  social  trust  hsu  scitariat  anonymity  computer-vision  gnon  🐸  leviathan  arms  oceans  sky  open-closed  alien-character  dirty-hands  backup  podcast  audio  interview  ai  antidemos  video  org:foreign  ratty  postrat  expansionism  developing-world  debt  corruption  anomie  organizing  dark-arts  alt-inst  org:lite  africa  orwellian  innovation  biotech  enhancement  GWAS  genetics  genomics  trends  education  crime  criminal-justice  criminology  journos-pundits  chart  consumerism  entertainment  within-group  urban-rural  geography  org:mag  modernity  flux-stasis  hmm  comparison  speedometer  reddit  discussion  ssc  mobile  futurism  absolute-relative  apple  scale  cohesion  cooperate-defect  coordinati 
january 2017 by nhaliday

