featuretools.encode_features

featuretools.encode_features(feature_matrix, features, top_n=10, include_unknown=True, to_encode=None, inplace=False, drop_first=False, verbose=False)

Encode categorical features.

Parameters:
  • feature_matrix (pd.DataFrame) – DataFrame of features.

  • features (list[PrimitiveBase]) – Feature definitions in feature_matrix.

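  • top_n (int or dict[string -> int]) – Number of top values to include. If a dict[string -> int] is passed, each key is a feature name and each value is the number of top values to include for that feature. A feature whose name is not in the dictionary uses the default value of 10.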

  • include_unknown (bool) – Add a feature encoding the unknown class. Defaults to True.

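  • to_encode (list[str]) – List of feature names to encode. Features not in this list are left unencoded in the output matrix. Defaults to encoding all necessary features.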

  • inplace (bool) – Encode feature_matrix in place. Defaults to False.

  • drop_first (bool) – Whether to get k-1 dummies out of k categorical levels by removing the first level. Defaults to False.

  • verbose (bool) – Print progress information. Defaults to False.
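
The interaction between top_n, include_unknown, and drop_first can be easier to follow as code. The sketch below is plain Python, not the featuretools implementation: the helper names resolve_top_n and encoded_column_names and the sample data are hypothetical, and the drop_first behavior (keep one fewer level, but at least one) is an assumption chosen to agree with the example outputs on this page.

```python
from collections import Counter

DEFAULT_TOP_N = 10  # default stated in the top_n parameter description


def resolve_top_n(top_n, feature_name):
    """Return how many top values to keep for one feature (hypothetical helper)."""
    if isinstance(top_n, dict):
        # Features missing from the dict fall back to the default.
        return top_n.get(feature_name, DEFAULT_TOP_N)
    return top_n


def encoded_column_names(feature_name, values, top_n=DEFAULT_TOP_N,
                         include_unknown=True, drop_first=False):
    """List the column names a top-n one-hot encoding would produce."""
    n = resolve_top_n(top_n, feature_name)
    counts = Counter(v for v in values if v is not None)
    if drop_first:
        # Assumption: encode one fewer level, but never fewer than one.
        n = max(min(n, len(counts)) - 1, 1)
    # most_common orders by frequency, most frequent first.
    top_values = [value for value, _ in counts.most_common(n)]
    names = [f"{feature_name} = {value}" for value in top_values]
    if include_unknown:
        names.append(f"{feature_name} is unknown")
    return names


# Hypothetical sample column, not taken from the library's data.
product_ids = ["coke zero", "coke zero", "coke zero", "car", "car", "toothpaste"]

print(encoded_column_names("product_id", product_ids))
print(encoded_column_names("product_id", product_ids, top_n={"product_id": 2}))
print(encoded_column_names("product_id", product_ids, drop_first=True))
```

With this sample data, the first call lists all three values plus the unknown column, while the second and third calls each list only the two most frequent values plus the unknown column.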

Returns:

encoded feature_matrix, encoded features

Return type:

(pd.DataFrame, list)

Examples

In [1]: f1 = ft.Feature(es["log"].ww["product_id"])

In [2]: f2 = ft.Feature(es["log"].ww["purchased"])

In [3]: f3 = ft.Feature(es["log"].ww["value"])

In [4]: features = [f1, f2, f3]

In [5]: ids = [0, 1, 2, 3, 4, 5]

In [6]: feature_matrix = ft.calculate_feature_matrix(features, es,
   ...:                                              instance_ids=ids)
   ...: 

In [7]: fm_encoded, f_encoded = ft.encode_features(feature_matrix,
   ...:                                            features)
   ...: 

In [8]: f_encoded
Out[8]: 
[<Feature: product_id = coke zero>,
 <Feature: product_id = car>,
 <Feature: product_id = toothpaste>,
 <Feature: product_id is unknown>,
 <Feature: purchased>,
 <Feature: value>]

In [9]: fm_encoded, f_encoded = ft.encode_features(feature_matrix,
   ...:                                            features, top_n=2)
   ...: 

In [10]: f_encoded
Out[10]: 
[<Feature: product_id = coke zero>,
 <Feature: product_id = car>,
 <Feature: product_id is unknown>,
 <Feature: purchased>,
 <Feature: value>]

In [11]: fm_encoded, f_encoded = ft.encode_features(feature_matrix, features,
   ....:                                            include_unknown=False)
   ....: 

In [12]: f_encoded
Out[12]: 
[<Feature: product_id = coke zero>,
 <Feature: product_id = car>,
 <Feature: product_id = toothpaste>,
 <Feature: purchased>,
 <Feature: value>]

In [13]: fm_encoded, f_encoded = ft.encode_features(feature_matrix, features,
   ....:                                            to_encode=['purchased'])
   ....: 

In [14]: f_encoded
Out[14]: [<Feature: product_id>, <Feature: purchased>, <Feature: value>]

In [15]: fm_encoded, f_encoded = ft.encode_features(feature_matrix, features,
   ....:                                            drop_first=True)
   ....: 

In [16]: f_encoded
Out[16]: 
[<Feature: product_id = coke zero>,
 <Feature: product_id = car>,
 <Feature: product_id is unknown>,
 <Feature: purchased>,
 <Feature: value>]
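
The outputs above list only the encoded feature definitions. As a rough illustration of what the matching columns in the encoded matrix might hold, the following plain-Python sketch builds one boolean column per kept value plus an "is unknown" column. The function one_hot_encode and the sample values are hypothetical, and the assumption that the unknown column marks every value outside the kept set is not confirmed by this page.

```python
def one_hot_encode(feature_name, values, top_values, include_unknown=True):
    """Return {column name: list of booleans}, one boolean per input row."""
    columns = {
        f"{feature_name} = {top}": [value == top for value in values]
        for top in top_values
    }
    if include_unknown:
        # Assumption: a row is "unknown" when its value is not a kept value.
        columns[f"{feature_name} is unknown"] = [
            value not in top_values for value in values
        ]
    return columns


# Hypothetical rows; only the two kept values get their own columns.
values = ["coke zero", "car", "toothpaste", "coke zero"]
encoded = one_hot_encode("product_id", values, top_values=["coke zero", "car"])

for name, column in encoded.items():
    print(name, column)
```

Here the "toothpaste" row is the only one flagged in the "product_id is unknown" column, because it falls outside the two kept values.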