tf-idf   111


ezelikman/Context-Is-Everything: Official details for: [1803.08493] Context is Everything: Finding Meaning Statistically in Semantic Spaces
This paper introduces Contextual Salience (CoSal), a simple and explicit measure of a word's importance in context, offered as a more theoretically natural, practically simpler, and more accurate replacement for tf-idf. CoSal supports very small contexts (20 or more sentences), handles out-of-context words, and is easy to calculate. A word vector space generated with both bigram phrases and unigram tokens reveals that contextually significant words disproportionately define phrases. This relationship is applied to produce simple weighted bag-of-words sentence embeddings, which outperform SkipThought and the best models trained on unordered sentences on most tests in Facebook's SentEval, beat tf-idf on all available tests, and are generally comparable to the state of the art. The paper also applies CoSal to sentence and document summarization and to an improved, context-aware cosine distance. Building on the premise that unexpected words are important, CoSal is presented as an intuitive measure of contextual word importance.
ML  papers  tf-idf 
may 2018 by foodbaby
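The weighted bag-of-words idea in the abstract can be sketched in a few lines. This is not the paper's actual CoSal measure: plain Euclidean distance from the context centroid stands in for it here, on the premise that words far from the context's average are "unexpected" and thus important. The toy vectors and vocabulary are illustrative assumptions; in practice the vectors would come from a pretrained embedding model.

```python
import math
import random

random.seed(0)

# Toy word vectors; real ones would come from word2vec/GloVe (assumption).
DIM = 8
vocab = ["the", "cat", "sat", "on", "mat", "quantum", "entanglement"]
vectors = {w: [random.gauss(0, 1) for _ in range(DIM)] for w in vocab}

def centroid(words):
    """Mean vector of the context's words."""
    vs = [vectors[w] for w in words]
    return [sum(col) / len(vs) for col in zip(*vs)]

def salience(word, ctx_mean):
    """Distance from the context mean: 'unexpected' words score higher.
    (A simplified stand-in for CoSal, not the paper's formula.)"""
    return math.dist(vectors[word], ctx_mean)

def sentence_embedding(words, ctx_mean):
    """Salience-weighted bag of words: sum of weight * word vector."""
    emb = [0.0] * DIM
    for w in words:
        wgt = salience(w, ctx_mean)
        for i, x in enumerate(vectors[w]):
            emb[i] += wgt * x
    return emb
```

The design point is that the weighting is computed from the context itself, with no corpus-wide document frequencies, which is why such a scheme can work on contexts as small as a few dozen sentences.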
Super Fast String Matching in Python
Traditional approaches to string matching, such as the Jaro-Winkler or Levenshtein distance measures, are too slow for large datasets. Using TF-IDF with n-grams as terms to find similar strings transforms the problem into a sparse matrix multiplication, which is computationally much cheaper.
IFTTT  Pocket  fuzzy  ngrams  python  strings  tf-idf 
october 2017 by booyaa
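The technique in the entry above can be sketched with the standard library alone: represent each string as a TF-IDF vector over character n-grams, L2-normalize, and score pairs by dot product (done in bulk over all pairs, this is exactly a sparse matrix multiplication). The trigram size, the whitespace padding, and the sample names are illustrative assumptions.

```python
import math
from collections import Counter

def ngrams(s, n=3):
    """Character n-grams, padded so short strings still yield grams."""
    s = f" {s.lower()} "
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def tfidf_vectors(strings, n=3):
    """One unit-length TF-IDF vector (as a dict) per input string."""
    docs = [Counter(ngrams(s, n)) for s in strings]
    df = Counter()                       # document frequency per gram
    for d in docs:
        df.update(d.keys())
    total = len(docs)
    vecs = []
    for d in docs:
        v = {g: tf * math.log(total / df[g]) for g, tf in d.items()}
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        vecs.append({g: x / norm for g, x in v.items()})
    return vecs

def cosine(a, b):
    """Dot product of unit vectors = cosine similarity."""
    if len(b) < len(a):
        a, b = b, a                      # iterate over the smaller dict
    return sum(x * b.get(g, 0.0) for g, x in a.items())
```

For large datasets one would hold the vectors in a sparse matrix (e.g. SciPy CSR) and multiply it by its transpose instead of scoring pairs one at a time; the dict version above just makes the arithmetic explicit.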


related tags

!  630dh  @seniorproject  ai  algorithm  algorithms  algorythm  analyse  analysis  archive  archive4j  article  artificialintelligence  bagofwords  bm25  bookmarks_bar  burstiness  by:christopher-moody  by:karpathy  cf  classification  clojure  clustering  code  compsci  congress  connectionmachine  corpus  cosine-similarity  cosine  cosine_similarity  couchdb  coursera  cross-entropy  cypher  data-mining  data-science  data  data_mining  data_science  datamining  datawrangling  direct-indexing  dirichlet  distance  distributionalsemantics  document-classification  document  dom  elasticsearch  embeddings  esa  feature-extraction  ferret  finance  fisher-kernel  follow-up  frequency  fuzzy  gapped-q-grams  generator  gensim  gephi  gist  glimmer  go  google-patents  google  googlecode  gotchas  grand-unified-theory  graph  hashing  hashtag  history  howto  html  i2icf  idf  ifttt  impotant  indexer  indexing  information-retrieval  information_retrieval  inverse_document_frequency  ir  java  journalism  karen-spärck-jones  language-model  language  latent-dirichlet-allocation  latent-semantic-indexing  later  lda  learning  lib  libs  linguistics  lsa  lsi  lucene  lyft  machine-learning  machine  machine_learning  machinelearning  mapping  math  mathematics  mg4j  mining  ml  module  mooc  my:mblondel  n-gram  n-grams  naive-bayes  natural  neo4j  network  ngram  ngrams  nlp  nltk  opensource  oss  overview  package  pagerank  papers  parser  parsing  plsa  pocket  probability  processing  programming  propublica  python  qz  r  rank  ranking  recipe  refresh:1  regexp  relations  relativeness  rio20  risk  ruby  sax  scala  scikit  seam-carving  search  searching  semantic  sentiment  sentimentanalysis  seo  sidf  similarity  slides  sna  solr  sourcecode  spam  speech  spider  statistics  strings  sublinear  t-sne  tag  term  text-sim  text  text_analysis  text_minig  textanalysis  textfeatureextraction  textmining  tf_idf  tfidf  theory  time-series  tokens  
tools  topic  topic_model  topical  toplaywith  tsne  tutorial  tw  twitter  vector  vector_space  vector_space_model  video  visualization  vsm  web  wikipedia  word  word2vec  wordnet  words  主题模型 (topic model)  余弦相似度 (cosine similarity)  向量空间模型 (vector space model)  推荐系统 (recommender system)  文本分析 (text analysis)  文本相似度 (text similarity)  文档相似度 (document similarity)  浅层语义分析 (latent semantic analysis)  浅层语义索引 (latent semantic indexing)  自然语言处理 (natural language processing)  课程图谱 (CourseGraph)
