Generating Feature Descriptions

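As features become more complicated, their names can become harder to understand. Both the describe_feature function and the graph_feature function can help explain what a feature is and the steps Featuretools took to generate it. In addition, describe_feature can be given custom definitions and templates to improve the resulting descriptions.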

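By default, describe_feature uses the existing column and DataFrame names, together with the default primitive description templates, to generate feature descriptions.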

[2]:
feature_defs[9]
[2]:
<Feature: MONTH(birthday)>
[3]:
ft.describe_feature(feature_defs[9])
[3]:
'The month of the "birthday".'
[4]:
feature_defs[14]
[4]:
<Feature: MODE(sessions.MODE(transactions.product_id))>
[5]:
ft.describe_feature(feature_defs[14])
[5]:
'The most frequently occurring value of the most frequently occurring value of the "product_id" of all instances of "transactions" for each "session_id" in "sessions" of all instances of "sessions" for each "customer_id" in "customers".'
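
The nesting in the description above can be pictured with a small toy model. The sketch below is not how Featuretools is implemented; the `Column` and `Aggregation` classes and the `describe` helper are hypothetical names introduced only to show how a stacked feature's description can be composed from the descriptions of its inputs.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Column:
    """A raw column, described by its quoted name."""
    name: str


@dataclass
class Aggregation:
    """A toy aggregation feature (hypothetical, not a Featuretools class)."""
    template: str                       # string template with one "{}" slot
    base: Union[Column, "Aggregation"]  # the input being aggregated
    child: str                          # DataFrame whose rows are aggregated
    parent_key: str                     # column the rows are grouped by
    parent: str                         # DataFrame the result belongs to


def describe(node):
    """Recursively build a lowercase description fragment."""
    if isinstance(node, Column):
        return f'the "{node.name}"'
    inner = describe(node.base)  # describe the input first
    return (
        f"{node.template.format(inner)} "
        f'of all instances of "{node.child}" '
        f'for each "{node.parent_key}" in "{node.parent}"'
    )


def to_sentence(node):
    """Capitalize the first letter and add a closing period."""
    text = describe(node)
    return text[0].upper() + text[1:] + "."


MODE_TEMPLATE = "the most frequently occurring value of {}"
inner = Aggregation(MODE_TEMPLATE, Column("product_id"),
                    "transactions", "session_id", "sessions")
outer = Aggregation(MODE_TEMPLATE, inner,
                    "sessions", "customer_id", "customers")
print(to_sentence(outer))
```

Each level of stacking wraps the previous fragment in another template and another grouping clause, which is why deep features produce long sentences and why custom descriptions for intermediate features can shorten them.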

Improving Descriptions

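While the default descriptions are helpful, they can be improved further by providing custom definitions of columns and features, and by providing alternative templates for primitive descriptions.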

Feature Descriptions

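Custom feature definitions are used in the description in place of the automatically generated text. This can be used to better explain what a ColumnSchema or feature is, or to provide descriptions that draw on the user's existing knowledge of the data or domain.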

[6]:
feature_descriptions = {"customers: join_date": "the date the customer joined"}

ft.describe_feature(feature_defs[9], feature_descriptions=feature_descriptions)
[6]:
'The month of the "birthday".'

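The intent here is to replace the column name "join_date" with a more descriptive definition of what that column represents in the dataset. Note that the feature shown above, MONTH(birthday), does not use join_date, so its description is unchanged; the custom text appears only in descriptions of features built on join_date. Descriptions can also be set directly on a column of a DataFrame by going through the Woodwork typing information to access the description attribute present on each ColumnSchema.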

[7]:
join_date_column_schema = es["customers"].ww.columns["join_date"]
join_date_column_schema.description = "the date the customer joined"

es["customers"].ww.columns["join_date"].description
[7]:
'the date the customer joined'
[8]:
feature = ft.TransformFeature(es["customers"].ww["join_date"], ft.primitives.Hour)
feature
[8]:
<Feature: HOUR(join_date)>
[9]:
ft.describe_feature(feature)
[9]:
'The hour value of the date the customer joined.'
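Note: When setting a description on a column of a DataFrame as shown above, avoid setting it through ``df.ww[col_name].ww.description``. Using ``df.ww[col_name]`` creates an entirely new Series object that is unrelated to the EntitySet from which feature descriptions are built. As a result, setting the description in any way other than through the ``columns`` attribute will not set it in a way that propagates to feature descriptions.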

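A description must be set on the DataFrame column before the feature is created in order for it to propagate. Note that if a description is both set directly on a column and passed to describe_feature through the feature_descriptions parameter, the description in feature_descriptions takes precedence.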

Feature descriptions can also be provided for generated features.

[10]:
feature_descriptions = {
    "sessions: SUM(transactions.amount)": "the total transaction amount for a session"
}

feature_defs[14]
[10]:
<Feature: MODE(sessions.MODE(transactions.product_id))>
[11]:
ft.describe_feature(feature_defs[14], feature_descriptions=feature_descriptions)
[11]:
'The most frequently occurring value of the most frequently occurring value of the "product_id" of all instances of "transactions" for each "session_id" in "sessions" of all instances of "sessions" for each "customer_id" in "customers".'

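Here, we create and pass in a custom description of the intermediate feature SUM(transactions.amount). A feature built on top of it, such as MEAN(sessions.SUM(transactions.amount)), would use the custom description in place of the automatically generated one. The feature shown above, MODE(sessions.MODE(transactions.product_id)), does not build on SUM(transactions.amount), so its description is unchanged. Feature descriptions are passed in as a dictionary whose keys are either the feature object itself or the unique feature name in the form "[dataframe_name]: [feature_name]", as shown above, and whose values are the custom descriptions.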
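
As a rough illustration of the key format and of the precedence rule stated earlier (the feature_descriptions argument wins over a description set on the column), here is a minimal sketch in plain Python. The helpers `description_key` and `resolve_description`, and the sample column description, are hypothetical and are not part of the Featuretools API.

```python
def description_key(dataframe_name, feature_name):
    """Build a key in the "[dataframe_name]: [feature_name]" form."""
    return f"{dataframe_name}: {feature_name}"


def resolve_description(key, generated, feature_descriptions=None,
                        column_description=None):
    """Pick a description: argument first, then column-level, then generated."""
    feature_descriptions = feature_descriptions or {}
    if key in feature_descriptions:
        return feature_descriptions[key]
    if column_description is not None:
        return column_description
    return generated


key = description_key("customers", "join_date")
chosen = resolve_description(
    key,
    generated='the "join_date"',
    feature_descriptions={key: "the date the customer joined"},
    column_description="the customer's sign-up date",  # made-up column-level text
)
print(chosen)  # the date the customer joined
```

The same lookup idea applies to generated features: a key such as "sessions: SUM(transactions.amount)" only has an effect when the feature being described is, or is built on, that feature.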

Primitive Templates

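Primitive descriptions are generated from primitive templates. By default, these templates are defined by the description_template attribute on the primitive. A primitive without a template falls back to its name attribute if one is defined, or to its class name if not. Primitive description templates are string templates that take the input feature descriptions as positional arguments. The defaults can be overridden by mapping primitive instances or primitive names to custom templates and passing the mapping to describe_feature through the primitive_templates argument.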

[12]:
primitive_templates = {"sum": "the total of {}"}

feature_defs[6]
[12]:
<Feature: SUM(transactions.amount)>
[13]:
ft.describe_feature(feature_defs[6], primitive_templates=primitive_templates)
[13]:
'The total of the "amount" of all instances of "transactions" for each "customer_id" in "customers".'

In this example, we override the default template 'the sum of {}' with our custom template 'the total of {}'. The description uses the custom template instead of the default.
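
To make the idea of "string templates that take input descriptions as positional arguments" concrete, the following sketch fills a template with Python's `str.format`. It is an illustration only: `DEFAULT_TEMPLATES` and `render_primitive` are hypothetical names, and the real library may resolve and fill templates differently.

```python
# Hypothetical default, using the template text quoted in this section.
DEFAULT_TEMPLATES = {"sum": "the sum of {}"}


def render_primitive(name, input_descriptions, primitive_templates=None):
    """Fill the template for `name`, letting custom templates override defaults."""
    templates = {**DEFAULT_TEMPLATES, **(primitive_templates or {})}
    # Each "{}" slot receives one input description, in order.
    return templates[name].format(*input_descriptions)


default_text = render_primitive("sum", ['the "amount"'])
custom_text = render_primitive("sum", ['the "amount"'],
                               {"sum": "the total of {}"})
print(default_text)  # the sum of the "amount"
print(custom_text)   # the total of the "amount"
```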

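Multi-output primitives can use a list of primitive description templates to distinguish the generic multi-output feature description from the descriptions of individual feature slices. The first template in the list always describes the generic, overall feature. If only one other template is provided, it is used for all slices. The slice number, converted to its "nth" form, is available through the nth_slice keyword.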

[14]:
feature = feature_defs[5]
feature
[14]:
<Feature: N_MOST_COMMON(transactions.product_id)>
[15]:
primitive_templates = {
    "n_most_common": [
        "the 3 most common elements of {}",  # generic multi-output feature
        "the {nth_slice} most common element of {}",  # template for each slice
    ]
}

ft.describe_feature(feature, primitive_templates=primitive_templates)
[15]:
'The 3 most common elements of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'

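Notice that the multi-output feature is described with the first template. Each slice of this feature uses the second template, the slice template.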

[16]:
ft.describe_feature(feature[0], primitive_templates=primitive_templates)
[16]:
'The 1st most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
[17]:
ft.describe_feature(feature[1], primitive_templates=primitive_templates)
[17]:
'The 2nd most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
[18]:
ft.describe_feature(feature[2], primitive_templates=primitive_templates)
[18]:
'The 3rd most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
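
The slice descriptions above combine a positional slot with the nth_slice keyword. A minimal sketch of that substitution follows, assuming zero-based slice indices as in `feature[0]`; the `to_ordinal` helper is hypothetical and only approximates whatever conversion the library performs.

```python
def to_ordinal(n):
    """Convert a positive integer to its "nth" form, e.g. 1 -> "1st"."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th, ...
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


slice_template = "the {nth_slice} most common element of {}"

descriptions = [
    # Slice indices start at 0, while ordinals start at "1st".
    slice_template.format('the "product_id"', nth_slice=to_ordinal(i + 1))
    for i in range(3)
]
for text in descriptions:
    print(text)
```

Mixing an automatically numbered `{}` field with a named `{nth_slice}` field is valid in `str.format`, which is what lets one template serve every slice.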

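Alternatively, instead of supplying a single template for all slices, a template can be provided for each slice to customize the output further. Note that in this case every slice must have its own template.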

[19]:
primitive_templates = {
    "n_most_common": [
        "the 3 most common elements of {}",
        "the most common element of {}",
        "the second most common element of {}",
        "the third most common element of {}",
    ]
}

ft.describe_feature(feature, primitive_templates=primitive_templates)
[19]:
'The 3 most common elements of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
[20]:
ft.describe_feature(feature[0], primitive_templates=primitive_templates)
[20]:
'The most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
[21]:
ft.describe_feature(feature[1], primitive_templates=primitive_templates)
[21]:
'The second most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
[22]:
ft.describe_feature(feature[2], primitive_templates=primitive_templates)
[22]:
'The third most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
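
The selection rules for a template list can be summarized in a few lines. This is a sketch under the rules stated in this section, not the library's code: `select_template` is a hypothetical helper, and raising `IndexError` for a missing per-slice template is an assumption about how such a case might be reported.

```python
def select_template(templates, slice_num=None):
    """Choose a template from a multi-output template list.

    templates[0] describes the whole feature; the rest describe slices.
    """
    generic, *slice_templates = templates
    if slice_num is None:
        return generic  # describing the overall multi-output feature
    if len(slice_templates) == 1:
        return slice_templates[0]  # one shared template for all slices
    if slice_num >= len(slice_templates):
        # Assumed behavior: every slice must have its own template.
        raise IndexError("each slice needs its own template")
    return slice_templates[slice_num]


shared = [
    "the 3 most common elements of {}",
    "the {nth_slice} most common element of {}",
]
per_slice = [
    "the 3 most common elements of {}",
    "the most common element of {}",
    "the second most common element of {}",
    "the third most common element of {}",
]
print(select_template(shared))        # generic template
print(select_template(shared, 2))     # shared slice template
print(select_template(per_slice, 1))  # template for the second slice
```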

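Custom feature descriptions and primitive templates can also be defined separately in a JSON file and passed to the describe_feature function with the metadata_file keyword argument. Descriptions passed in directly through the feature_descriptions and primitive_templates keyword arguments take precedence over any provided in the JSON metadata file.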
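
The section does not show the layout of the JSON file, so the sketch below assumes one: two top-level keys named "feature_descriptions" and "primitive_templates", mirroring the keyword arguments. That layout, the file name, and the helpers `load_metadata` and `merge_with_precedence` are assumptions made for illustration; check the library documentation for the actual format. The part the sketch is meant to show is the precedence rule, where directly passed values override values from the file.

```python
import json
import os
import tempfile


def load_metadata(path):
    """Read descriptions and templates from a JSON file (assumed layout)."""
    with open(path) as f:
        metadata = json.load(f)
    return (
        metadata.get("feature_descriptions", {}),
        metadata.get("primitive_templates", {}),
    )


def merge_with_precedence(file_values, argument_values=None):
    """Later entries win, so directly passed values override file values."""
    return {**file_values, **(argument_values or {})}


metadata = {
    "feature_descriptions": {
        "customers: join_date": "the date the customer joined"
    },
    "primitive_templates": {"sum": "the sum total of {}"},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "description_metadata.json")  # hypothetical name
    with open(path, "w") as f:
        json.dump(metadata, f)
    file_descriptions, file_templates = load_metadata(path)

descriptions = merge_with_precedence(file_descriptions)
templates = merge_with_precedence(file_templates, {"sum": "the total of {}"})
print(descriptions["customers: join_date"])  # from the file
print(templates["sum"])                      # from the direct argument
```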