4 Ways to Compute Sentence Similarity

timger, published 2019-07-31 10:09 / 2492 views

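Summary: This article covers four ways to measure how similar two sentences are: edit distance, the Jaccard coefficient (used to compare the similarity and difference between finite sample sets), TF cosine similarity (the cosine of the angle between two term-frequency vectors), and a BERT model.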

Edit Distance

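Edit distance is the minimum number of edits (single-character insertions, deletions, or substitutions) needed to turn one string into the other. The more edits required, the larger the distance and the less related the two strings are. For example, converting "xiaoming" into "xiamin" takes two steps:

Remove 'o'

Remove 'g'

So the edit count, i.e. the distance, is 2.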

!pip install distance

import distance

def edit_distance(s1, s2):
    return distance.levenshtein(s1, s2)

s1 = "xiaoming"
s2 = "xiamin"
print("Distance: " + str(edit_distance(s1, s2)))
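
To make the definition concrete, here is a minimal pure-Python sketch of the standard dynamic-programming approach to Levenshtein distance. It is not the implementation used inside the distance package; the function name levenshtein is chosen here only for illustration.

```python
def levenshtein(s1, s2):
    """Minimum number of single-character insertions, deletions and substitutions."""
    # prev[j] is the distance between the previous prefix of s1 and the first j chars of s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute c1 with c2, or keep a match
        prev = curr
    return prev[-1]


print(levenshtein("xiaoming", "xiamin"))  # 2
```

Only two rows of the table are kept at a time, so memory grows with the length of the second string rather than with the product of both lengths.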

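Jaccard Coefficient

The Jaccard coefficient is used to compare the similarity and difference between finite sample sets. The larger the Jaccard coefficient, the more similar the samples. It is computed as the size of the intersection of the two samples divided by the size of their union.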

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np


def jaccard_similarity(s1, s2):
    def add_space(s):
        return " ".join(list(s))
    
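    # Insert a space between characters so each character becomes a token
    s1, s2 = add_space(s1), add_space(s2)
    # Convert to a TF (term-frequency) matrix
    cv = CountVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # Intersection: element-wise minimum of the two count vectors
    numerator = np.sum(np.min(vectors, axis=0))
    # Union: element-wise maximum of the two count vectors
    denominator = np.sum(np.max(vectors, axis=0))
    # Compute the Jaccard coefficient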
    return 1.0 * numerator / denominator


s1 = "你在干啥呢"
s2 = "你在干什么呢"
print(jaccard_similarity(s1, s2))
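
As a cross-check on the "intersection divided by union" definition, the same idea can be sketched with plain Python sets of characters. This is an illustrative variant, not the code above: the name jaccard_by_sets is hypothetical, and unlike the count-based version it ignores how many times a character repeats. Returning 1.0 for two empty strings is a convention assumed here.

```python
def jaccard_by_sets(s1, s2):
    """Jaccard coefficient over the sets of characters in two strings."""
    a, b = set(s1), set(s2)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty strings count as identical
    return len(a & b) / len(union)


# The two sample sentences share 4 characters out of 7 distinct ones.
print(jaccard_by_sets("你在干啥呢", "你在干什么呢"))  # 4/7, about 0.5714
```

Because neither sample sentence repeats a character, the set-based value here matches what the count-based version computes for the same pair.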

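TF Computation

This method computes the similarity of two vectors in a matrix, that is, the cosine of the angle between the two vectors.

Formula: cosθ = a·b / (|a| × |b|)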

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from scipy.linalg import norm

def tf_similarity(s1, s2):
    def add_space(s):
        return " ".join(list(s))
    
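    # Insert a space between characters so each character becomes a token
    s1, s2 = add_space(s1), add_space(s2)
    # Convert to a TF (term-frequency) matrix
    cv = CountVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # Compute the cosine of the angle between the two TF vectors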
    return np.dot(vectors[0], vectors[1]) / (norm(vectors[0]) * norm(vectors[1]))


s1 = "你在干啥呢"
s2 = "你在干什么呢"
print(tf_similarity(s1, s2))
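
The cosine formula can also be worked through by hand using character counts. The sketch below uses only the standard library; cosine_by_counts is a hypothetical name, and it assumes the strings are non-empty and contain no whitespace (the CountVectorizer version above would drop whitespace, while this one would count it).

```python
import math
from collections import Counter


def cosine_by_counts(s1, s2):
    """Cosine of the angle between the character-count vectors of two strings."""
    c1, c2 = Counter(s1), Counter(s2)
    # a·b: only characters present in both strings contribute to the dot product
    dot = sum(c1[ch] * c2[ch] for ch in c1.keys() & c2.keys())
    # |a| and |b|: Euclidean norms of the two count vectors
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)


# Shared characters: 4; norms: sqrt(5) and sqrt(6); so cosθ = 4 / sqrt(30).
print(cosine_by_counts("你在干啥呢", "你在干什么呢"))  # about 0.7303
```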

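Advanced Model: BERT

For BERT's internal structure, see the article "From word2vec to BERT"; this article covers only the code. You can either download the BERT model source code or use it through TF-Hub; here we download the source code.
First, download the source code from GitHub, then download Google's pretrained model; we choose BERT-Base, Chinese.

After downloading the pretrained model, unzip it.

[Figure: file structure of the unzipped pretrained model]

vocab.txt is the vocabulary used for Chinese text during training, and bert_config.json holds parameters that can optionally be adjusted when training BERT. The remaining files contain the model structure, weights, and so on.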

Preparing the Dataset

Modifying the processor

class MoveProcessor(DataProcessor):
  """Processor for the move data set."""

  def get_train_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_dev_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

  def get_test_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

  def get_labels(self):
    """See base class."""
    return ["0", "1"]

  @classmethod
  def _read_tsv(cls, input_file, quotechar=None):
    """Reads a tab separated value file."""
    with tf.gfile.Open(input_file, "r") as f:
      reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
      lines = []
      for line in reader:
        lines.append(line)
      return lines

  def _create_examples(self, lines, set_type):
    """Creates examples for the training, dev and test sets."""
    examples = []
    for (i, line) in enumerate(lines):
      guid = "%s-%s" % (set_type, i)
      if set_type == "test":
        # Test rows hold only the text; "0" is a placeholder label.
        text_a = tokenization.convert_to_unicode(line[0])
        label = "0"
      else:
        # Train/dev rows: column 0 is the label, column 1 is the text.
        text_a = tokenization.convert_to_unicode(line[1])
        label = tokenization.convert_to_unicode(line[0])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples
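
Judging from _create_examples above, train.tsv and dev.tsv carry the label in the first column and the text in the second, while test.tsv carries only the text. The sketch below writes and reads back toy files in that layout. The rows, the labels assigned to them, and the helper names write_tsv and read_tsv are all made up for illustration; read_tsv is a TensorFlow-free stand-in for _read_tsv, not the BERT code itself.

```python
import csv
import os
import tempfile


def write_tsv(path, rows):
    """Write each row as one tab-separated line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write("\t".join(row) + "\n")


def read_tsv(path):
    """Read a tab-separated file into a list of rows, with no quote handling."""
    with open(path, "r", encoding="utf-8", newline="") as f:
        return list(csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE))


data_dir = tempfile.mkdtemp()
# Hypothetical toy rows: (label, text) for training, (text,) for testing.
write_tsv(os.path.join(data_dir, "train.tsv"),
          [("1", "你在干啥呢"), ("0", "你在干什么呢")])
write_tsv(os.path.join(data_dir, "test.tsv"), [("你在干啥呢",)])

train_lines = read_tsv(os.path.join(data_dir, "train.tsv"))
test_lines = read_tsv(os.path.join(data_dir, "test.tsv"))
print(train_lines[0])  # label first, then text
print(test_lines[0])   # text only
```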

Modifying the processor dictionary

def main(_):
  tf.logging.set_verbosity(tf.logging.INFO)

  processors = {
      "cola": ColaProcessor,
      "mnli": MnliProcessor,
      "mrpc": MrpcProcessor,
      "xnli": XnliProcessor,
      "setest": MoveProcessor,  # entry added for our data set
  }
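
The dictionary key is what ties the --task_name flag to the new class. The sketch below imitates that lookup in isolation, assuming the script lowercases the task name before the lookup. The MoveProcessor here is a stub standing in for the real class, and pick_processor is a hypothetical helper, not a function in run_classifier.py.

```python
class MoveProcessor:
    """Stub standing in for the MoveProcessor class defined earlier."""

    def get_labels(self):
        return ["0", "1"]


processors = {"setest": MoveProcessor}


def pick_processor(task_name):
    """Look up a processor class by task name and instantiate it."""
    key = task_name.lower()  # assumed: the lookup is case-insensitive
    if key not in processors:
        raise ValueError("Task not found: %s" % key)
    return processors[key]()


print(pick_processor("SeTest").get_labels())  # ['0', '1']
```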

Training the BERT Model

export BERT_BASE_DIR=/Users/xiaomingtai/Downloads/chinese_L-12_H-768_A-12
export MY_DATASET=/Users/xiaomingtai/Downloads/bert_model
python run_classifier.py \
  --data_dir=$MY_DATASET \
  --task_name=setest \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --output_dir=/Users/xiaomingtai/Downloads/ber_model_output/ \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=16 \
  --eval_batch_size=8 \
  --predict_batch_size=2 \
  --learning_rate=5e-5 \
  --num_train_epochs=3.0
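
With --do_predict=true, the prediction step is assumed here to write a test_results.tsv file into the output directory, one line per test example, holding the tab-separated class probabilities in the order returned by get_labels. Under that assumption, the sketch below turns each line into a predicted label. The helper name read_predictions and the demo probabilities are invented for illustration and are not results from the article.

```python
import os
import tempfile


def read_predictions(path, labels=("0", "1")):
    """Return (label, probability) of the most likely class for each line."""
    results = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            probs = [float(p) for p in line.split("\t")]
            best = max(range(len(probs)), key=probs.__getitem__)
            results.append((labels[best], probs[best]))
    return results


# Made-up probability file, only to exercise the parser.
demo = os.path.join(tempfile.mkdtemp(), "test_results.tsv")
with open(demo, "w", encoding="utf-8") as f:
    f.write("0.9\t0.1\n0.2\t0.8\n")

print(read_predictions(demo))  # [('0', 0.9), ('1', 0.8)]
```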

BERT Model Training Results


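Copyright of this article belongs to the author. Please do not repost without permission. If this article violates any rules, you can contact the administrator to have it removed.

When reposting, please cite this article's address: http://specialneedsforspecialkids.com/yun/43331.html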
