Feature extraction
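For the detailed explanation, follow the link to the official documentation.
Here I only give a rough summary of the APIs I have used or studied.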
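Loading features from dicts
This makes it easy to extract features from data held as dicts. For example, if each record has a city field that takes one of three different cities, that field can be one-hot encoded.
This is done with the DictVectorizer class.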
>>> measurements = [
...     {"city": "Dubai", "temperature": 33.},
...     {"city": "London", "temperature": 12.},
...     {"city": "San Fransisco", "temperature": 18.},
... ]
>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()
array([[ 1.,  0.,  0., 33.],
       [ 0.,  1.,  0., 12.],
       [ 0.,  0.,  1., 18.]])
>>> # On scikit-learn 1.2 and later this method is get_feature_names_out()
>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
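The official site then gives another usage example, built around pos_window. I have not worked with part-of-speech features myself. At first I thought the example was saying that this approach breaks down here because the matrix contains so many zeros, but on a closer read I no longer think so. I hope someone can clear this up for me.
The English below is quoted from the original documentation.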
For example, suppose that we have a first algorithm that extracts Part of Speech (PoS) tags that we want to use as complementary tags for training a sequence classifier (e.g. a chunker). The following dict could be such a window of features extracted around the word ‘sat’ in the sentence ‘The cat sat on the mat.’:
>>> pos_window = [
...     {
...         "word-2": "the",
...         "pos-2": "DT",
...         "word-1": "cat",
...         "pos-1": "NN",
...         "word+1": "on",
...         "pos+1": "PP",
...     },
...     # in a real application one would extract many such dictionaries
... ]
This description can be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier (maybe after being piped into a text.TfidfTransformer for normalization):
>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> pos_vectorized
<1x6 sparse matrix of type '<... 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[ 1.,  1.,  1.,  1.,  1.,  1.]])
>>> vec.get_feature_names()
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']
As you can imagine, if one extracts such a context around each individual word of a corpus of documents the resulting matrix will be very wide (many one-hot-features) with most of them being valued to zero most of the time. So as to make the resulting data structure able to fit in memory the DictVectorizer class uses a scipy.sparse matrix by default instead of a numpy.ndarray.
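The point about sparse output can be checked directly. The following is a minimal sketch, assuming a few made-up context windows (the `windows` list and its tags are hypothetical, not taken from the documentation). It compares the default sparse result with the dense array obtained via `sparse=False`.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer

# Hypothetical context windows, shaped like pos_window but invented for illustration.
windows = [
    {"word-1": "cat", "pos-1": "NN", "word+1": "on", "pos+1": "PP"},
    {"word-1": "the", "pos-1": "DT", "word+1": "sat", "pos+1": "VBD"},
    {"word-1": "on", "pos-1": "PP", "word+1": "mat", "pos+1": "NN"},
]

# Default behaviour: a scipy.sparse matrix that stores only non-zero entries.
X_sparse = DictVectorizer().fit_transform(windows)

# sparse=False asks for a plain numpy.ndarray instead.
X_dense = DictVectorizer(sparse=False).fit_transform(windows)

print(sparse.issparse(X_sparse), type(X_dense).__name__)
print(X_sparse.shape, X_sparse.nnz)  # total cells vs. entries actually stored
print(np.array_equal(X_sparse.toarray(), X_dense))
```

Every distinct key=value pair becomes its own column, so the column count grows with the vocabulary while each row keeps only a handful of non-zero entries.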
I will skip this part for now and move on.
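Feature hashing
The FeatureHasher class vectorizes quickly and with low memory use. The technique behind it is called feature hashing. I have not worked with it much yet, so I will not go into detail.
It is based on MurmurHash, which is fairly well known and which I have come across before. Because of a limitation in scipy.sparse, the maximum number of features is
$$2^{31}-1$$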
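Since the text does not show FeatureHasher in use, here is a minimal sketch, assuming small hand-written dicts of token counts (the `token_counts` data and the choice `n_features=16` are illustrative only).

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical token-count dicts; the keys are feature names, the values are numbers.
token_counts = [
    {"dog": 1, "cat": 2, "elephant": 4},
    {"dog": 2, "run": 5},
]

# n_features fixes the output width up front; no vocabulary is learned or stored.
hasher = FeatureHasher(n_features=16, input_type="dict")

# transform() works without fit() because the hasher is stateless.
X = hasher.transform(token_counts)
X_again = hasher.transform(token_counts)

print(X.shape)
print((X.toarray() == X_again.toarray()).all())
```

The trade-off is that the mapping cannot be inverted: different feature names may collide in the same column, and a smaller `n_features` makes collisions more likely.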
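Text feature extraction
Vectorization means turning a collection of text documents into numerical feature vectors. This particular strategy is also called "Bag of words" or "Bag of n-grams", and it completely ignores where words appear relative to each other in the document.
The first class to introduce is CountVectorizer.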
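To make the point about ignoring word position concrete, here is a small sketch, assuming two invented sentences that contain the same words in a different order (the `docs` list is hypothetical). It also shows that n-grams recover some local ordering.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents with identical words but different word order.
docs = ["the cat chased the dog", "the dog chased the cat"]

# Bag of words (unigrams): word order is discarded.
unigram = CountVectorizer()
X1 = unigram.fit_transform(docs).toarray()
same_unigram_rows = bool((X1[0] == X1[1]).all())

# Bag of 2-grams: adjacent word pairs keep a little of the ordering.
bigram = CountVectorizer(ngram_range=(2, 2))
X2 = bigram.fit_transform(docs).toarray()
same_bigram_rows = bool((X2[0] == X2[1]).all())

print(sorted(unigram.vocabulary_), same_unigram_rows)
print(sorted(bigram.vocabulary_), same_bigram_rows)
```

With unigrams the two sentences are indistinguishable, while the bigram vocabulary contains "cat chased" for one sentence and "dog chased" for the other, so the rows differ.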
>>> from sklearn.feature_extraction.text import CountVectorizer
It takes a lot of parameters.
>>> vectorizer = CountVectorizer(min_df=1)
>>> vectorizer
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
        dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
Let's try it out briefly.
>>> corpus = [
...     "This is the first document.",
...     "This is the second second document.",
...     "And the third one.",
...     "Is this the first document?",
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X
<4x9 sparse matrix of type '<... 'numpy.int64'>'
    with 19 stored elements in Compressed Sparse ... format>
The result:
>>> vectorizer.get_feature_names() == (
...     ["and", "document", "first", "is", "one",
...      "second", "the", "third", "this"])
True
>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
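As you can see, each column counts how often one vocabulary word occurs in a document. Strictly speaking this is a count matrix rather than a one-hot encoding (note the 2 in the second row), and raw counts on their own are generally not very practical.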
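The difference between counting occurrences and recording mere presence can be seen with the `binary` parameter of CountVectorizer. This is a small sketch that reuses the four-sentence corpus from above; the variable names are my own.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Default: each cell is the number of times the word occurs in the document.
counts = CountVectorizer().fit_transform(corpus).toarray()

# binary=True: each cell is 1 if the word occurs at all, else 0.
presence = CountVectorizer(binary=True).fit_transform(corpus).toarray()

print(counts[1])    # the column for "second" holds 2
print(presence[1])  # the same column is clipped to 1
```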
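Tf-idf weighting does somewhat better. I will not explain tf-idf here, since the principle is simple.
Here is an example. counts holds word occurrence counts that have already been computed, for just three words.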
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> # smooth_idf=False is the setting that reproduces the values shown below
>>> transformer = TfidfTransformer(smooth_idf=False)
>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
...
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf
<6x3 sparse matrix of type '<... 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse ... format>
>>> tfidf.toarray()
array([[ 0.81940995,  0.        ,  0.57320793],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.47330339,  0.88089948,  0.        ],
       [ 0.58149261,  0.        ,  0.81355169]])
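For readers who want to see where these numbers come from, the matrix can be reproduced by hand with NumPy. This is a sketch under one assumption: the non-smoothed weighting idf = ln(n / df) + 1 followed by L2 normalization of each row, which is the variant that matches the values above. scikit-learn's default setting (`smooth_idf=True`) gives different numbers.

```python
import numpy as np

counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]], dtype=float)

n_docs = counts.shape[0]

# Document frequency: in how many documents each term appears at least once.
df = (counts > 0).sum(axis=0)

# Non-smoothed inverse document frequency (assumed variant, see the lead-in).
idf = np.log(n_docs / df) + 1.0

# Weight the raw counts, then scale every row to unit Euclidean length.
weighted = counts * idf
tfidf_manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.round(tfidf_manual, 8))
```

The first term appears in all six documents, so its idf is exactly 1. The second and third terms are rarer, which is why they dominate the rows where they occur.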
In an actual project it is used like this.
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer(min_df=1)
>>> vectorizer.fit_transform(corpus)
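TfidfVectorizer combines the two steps covered so far, counting and then tf-idf weighting. The sketch below checks that equivalence on the same corpus, with default parameters on both sides; the variable names are my own.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# One step: raw text straight to tf-idf weights.
one_step = TfidfVectorizer().fit_transform(corpus)

# Two steps: token counts first, then tf-idf weighting of those counts.
word_counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(word_counts)

print(one_step.shape)
print(np.allclose(one_step.toarray(), two_step.toarray()))
```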
The hashing trick is used to scale to large datasets. I have not used it, but it looks impressive.
The above vectorization scheme is simple but the fact that it holds an in-memory mapping from the string tokens to the integer feature indices (the vocabulary_ attribute) causes several problems when dealing with large datasets:
the larger the corpus, the larger the vocabulary will grow and hence the memory use too,
fitting requires the allocation of intermediate data structures of size proportional to that of the original dataset.
building the word-mapping requires a full pass over the dataset hence it is not possible to fit text classifiers in a strictly online manner.
pickling and un-pickling vectorizers with a large vocabulary_ can be very slow (typically much slower than pickling / un-pickling flat data structures such as a NumPy array of the same size),
it is not easily possible to split the vectorization work into concurrent sub tasks as the vocabulary_ attribute would have to be a shared state with a fine grained synchronization barrier: the mapping from token string to feature index is dependent on ordering of the first occurrence of each token hence would have to be shared, potentially harming the concurrent workers’ performance to the point of making them slower than the sequential variant.
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> hv = HashingVectorizer(n_features=10)
>>> hv.transform(corpus)
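The problems listed above mostly come from the stored vocabulary. The sketch below shows how HashingVectorizer sidesteps it: there is no fit step, so batches can be transformed independently and still land in the same columns. The batch split and `n_features=2**10` are arbitrary choices for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import HashingVectorizer

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# No vocabulary is learned, so the output width must be chosen in advance.
hv = HashingVectorizer(n_features=2**10)

# Transform two batches separately, as separate workers or a stream would.
X_a = hv.transform(corpus[:2])
X_b = hv.transform(corpus[2:])

# Transform everything in one go for comparison.
X_all = hv.transform(corpus)

stacked = sparse.vstack([X_a, X_b])
print(X_all.shape)
print(np.allclose(stacked.toarray(), X_all.toarray()))
print(hasattr(hv, "vocabulary_"))  # nothing to pickle or share between workers
```

The price is that column indices can no longer be mapped back to the original tokens, and distinct tokens may collide when `n_features` is small.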
When reposting, please cite this article's address: http://specialneedsforspecialkids.com/yun/40663.html