Advanced Custom Primitives Guide

[1]:
import re

import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, NaturalLanguage

import featuretools as ft
from featuretools.primitives import TransformPrimitive
from featuretools.tests.testing_utils import make_ecommerce_entityset

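Primitives with Additional Arguments

Some features require more advanced calculations than others. Advanced features usually need additional arguments to produce the desired value. With custom primitives, you can use primitive arguments to create such features.

String Count Example

In this example, you will learn how to create custom primitives that accept additional arguments. You will create a primitive that counts the number of times a specific string value occurs in a text.

First, derive a new transform primitive class using TransformPrimitive as the base. The primitive takes a text column as input and returns a numeric column as output, so set the input type to a Woodwork ColumnSchema with the logical type NaturalLanguage and the return type to a Woodwork ColumnSchema with the semantic tag 'numeric'. The specific string value is the additional argument, so define it as a keyword argument in __init__. Then, override get_function to return the primitive function that will calculate the feature.

Featuretools primitives use Woodwork's ColumnSchema to control the input and return types of a primitive's columns. For more information about using the Woodwork type system in Featuretools, see the Woodwork Typing in Featuretools guide.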

[2]:
class StringCount(TransformPrimitive):
    """Count the number of times the string value occurs."""

    name = "string_count"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def __init__(self, string=None):
        self.string = string

    def get_function(self):
        def string_count(column):
            assert self.string is not None, "string to count needs to be defined"
            # this is a naive implementation used for clarity
            counts = [text.lower().count(self.string) for text in column]
            return counts

        return string_count
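
The counting logic can be tried on its own, without Featuretools. The following is a minimal sketch under stated assumptions: `make_string_count` is a hypothetical stand-in for the `__init__` plus `get_function` pair above, and plain lists replace the column. It also illustrates two consequences of the naive implementation: the text is lowercased but the search string is not, and `str.count` counts non-overlapping substrings rather than whole words.

```python
def make_string_count(string):
    """Hypothetical stand-in: capture `string` in a closure, as get_function does."""

    def string_count(column):
        # Same naive logic as the primitive: lowercase the text, count substrings.
        return [text.lower().count(string) for text in column]

    return string_count


comments = ["The cat and the hat", "Another day", "No match here"]

count_the = make_string_count("the")
print(count_the(comments))  # [2, 1, 0]: "Another" contains "the" as a substring

# A search string with upper case letters never matches the lowercased text.
count_capital = make_string_count("The")
print(count_capital(comments))  # [0, 0, 0]
```

If case-insensitive matching is the intent, lowercasing the search string once in `__init__` would be one way to keep both sides consistent.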

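Now you have a reusable primitive that works with different string values. For example, you can create features based on the number of times the word "the" appears in a text. Create an instance of the primitive with the string value "the" and pass it to DFS to generate the features. The feature names automatically reflect the primitive's string value.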
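
The way the argument shows up in feature names can be sketched with plain string formatting. The helper below is hypothetical and only mirrors the column names used in this guide; it is not the Featuretools naming API.

```python
def string_count_feature_name(agg, dataframe, column, string):
    """Hypothetical helper mirroring the naming pattern used in this guide."""
    return f"{agg.upper()}({dataframe}.STRING_COUNT({column}, string={string}))"


names = [
    string_count_feature_name(agg, "log", "comments", "the")
    for agg in ("std", "sum", "mean")
]
for name in names:
    print(name)
# STD(log.STRING_COUNT(comments, string=the))
# SUM(log.STRING_COUNT(comments, string=the))
# MEAN(log.STRING_COUNT(comments, string=the))
```

A list built this way can be used to select the matching columns from a feature matrix.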

[3]:
es = make_ecommerce_entityset()

feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=["sum", "mean", "std"],
    trans_primitives=[StringCount(string="the")],
)

feature_matrix[
    [
        "STD(log.STRING_COUNT(comments, string=the))",
        "SUM(log.STRING_COUNT(comments, string=the))",
        "MEAN(log.STRING_COUNT(comments, string=the))",
    ]
]
[3]:
STD(log.STRING_COUNT(comments, string=the)) SUM(log.STRING_COUNT(comments, string=the)) MEAN(log.STRING_COUNT(comments, string=the))
id
0 47.124304 209.0 41.80
1 36.509131 109.0 27.25
2 NaN 29.0 29.00
3 49.497475 70.0 35.00
4 0.000000 0.0 0.00
5 1.414214 4.0 2.00
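
The table can be cross-checked with a little arithmetic. Dividing SUM by MEAN recovers the number of log rows aggregated per session, and session 2 turns out to have exactly one row. That is consistent with its NaN: a sample standard deviation (n - 1 in the denominator) is undefined for a single value. The sketch below uses only the standard library; `sample_std` is a hypothetical helper, and session 4 is left out because its SUM and MEAN are both zero.

```python
import math
import statistics

# (SUM, MEAN) pairs copied from the table above, keyed by session id.
rows = {
    0: (209.0, 41.80),
    1: (109.0, 27.25),
    2: (29.0, 29.00),
    3: (70.0, 35.00),
    5: (4.0, 2.00),
}

# SUM / MEAN gives the number of log rows that were aggregated per session.
row_counts = {sid: round(total / mean) for sid, (total, mean) in rows.items()}
print(row_counts)  # {0: 5, 1: 4, 2: 1, 3: 2, 5: 2}


def sample_std(values):
    """Sample standard deviation; NaN when fewer than two values are given."""
    if len(values) < 2:
        return math.nan
    return statistics.stdev(values)


print(sample_std([29.0]))  # nan, matching session 2
# Two counts of 1 and 3 reproduce session 5 (SUM 4, MEAN 2, STD 1.414214).
print(round(sample_std([1.0, 3.0]), 6))  # 1.414214
```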

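Features with Multiple Outputs

Some calculations output more than one value. With custom primitives, you can take full advantage of these calculations by creating a feature for each output value.

Case Count Example

In this example, you will learn how to create custom primitives that output multiple features. You will create a primitive that outputs the counts of upper case and lower case letters in a text.

First, derive a new transform primitive class using TransformPrimitive as the base. The primitive takes a text column as input and returns two numeric columns as output, so set the input type to a Woodwork ColumnSchema with the logical type NaturalLanguage and the return type to a Woodwork ColumnSchema with the semantic tag 'numeric'. Because this primitive returns two columns, also set number_output_features to 2. Then, override get_function to return the primitive function that will calculate the features and return a list of columns.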

[4]:
class CaseCount(TransformPrimitive):
    """Return the count of upper case and lower case letters of a text."""

    name = "case_count"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    number_output_features = 2

    def get_function(self):
        def case_count(array):
            # this is a naive implementation used for clarity
            upper = np.array([len(re.findall("[A-Z]", i)) for i in array])
            lower = np.array([len(re.findall("[a-z]", i)) for i in array])
            return upper, lower

        return case_count
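
The same logic can be exercised on plain lists to see what the two outputs look like. This is a sketch under stated assumptions: lists replace NumPy arrays, and `case_count_unicode` is a hypothetical variant that is not part of the guide. It is included because the character classes `[A-Z]` and `[a-z]` only match ASCII letters.

```python
import re


def case_count(texts):
    """Same naive logic as the primitive, with plain lists instead of arrays."""
    upper = [len(re.findall("[A-Z]", text)) for text in texts]
    lower = [len(re.findall("[a-z]", text)) for text in texts]
    return upper, lower


def case_count_unicode(texts):
    """Hypothetical variant that relies on str.isupper / str.islower."""
    upper = [sum(ch.isupper() for ch in text) for text in texts]
    lower = [sum(ch.islower() for ch in text) for text in texts]
    return upper, lower


texts = ["Hello World", "\u00c9cole"]  # "\u00c9" is an accented capital E

print(case_count(texts))          # ([2, 0], [8, 4])
print(case_count_unicode(texts))  # ([2, 1], [8, 4])
```

The first element of the returned pair becomes output 0 and the second becomes output 1, so the order of the return statement determines which feature is which.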

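Now you have a primitive that outputs two columns. One column contains the count of upper case letters; the other contains the count of lower case letters. Pass the primitive to DFS to generate the features. By default, the feature names reflect the index of the output.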

[5]:
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=[],
    trans_primitives=[CaseCount],
)

feature_matrix[
    [
        "customers.CASE_COUNT(favorite_quote)[0]",
        "customers.CASE_COUNT(favorite_quote)[1]",
    ]
]
[5]:
customers.CASE_COUNT(favorite_quote)[0] customers.CASE_COUNT(favorite_quote)[1]
id
0 1.0 44.0
1 1.0 44.0
2 1.0 44.0
3 1.0 41.0
4 1.0 41.0
5 1.0 57.0
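
The index-based column names above can be reproduced with a short sketch. The helper is hypothetical and only mirrors the names shown in this guide; the mapping to "upper" and "lower" follows the order in which `case_count` returns its two arrays.

```python
def default_output_names(base_name, number_output_features):
    """Hypothetical sketch of index-based names for a multi-output feature."""
    return [f"{base_name}[{i}]" for i in range(number_output_features)]


names = default_output_names("customers.CASE_COUNT(favorite_quote)", 2)
print(names)
# ['customers.CASE_COUNT(favorite_quote)[0]', 'customers.CASE_COUNT(favorite_quote)[1]']

# Output 0 is the upper case count and output 1 the lower case count,
# because case_count returns (upper, lower).
meaning = dict(zip(names, ["upper case count", "lower case count"]))
print(meaning[names[0]])  # upper case count
```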

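Custom Naming for Multiple Outputs

When you create a primitive that outputs multiple features, you can also define custom naming for each of those features.

Hourly Sine and Cosine Example

In this example, you will learn how to apply custom naming to multiple outputs. You will create a primitive that outputs the sine and cosine of the hour.

First, derive a new transform primitive class using TransformPrimitive as the base. The primitive takes the time index as input and returns two numeric columns as output. Set the input type to a Woodwork ColumnSchema with the logical type Datetime and the semantic tag 'time_index'. Next, set the return type to a Woodwork ColumnSchema with the semantic tag 'numeric' and set number_output_features to 2. Then, override get_function to return the primitive function that will calculate the features and return a list of columns. Also, override generate_names to return a list of the feature names that you define.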

[6]:
class HourlySineAndCosine(TransformPrimitive):
    """Returns the sine and cosine of the hour."""

    name = "hourly_sine_and_cosine"
    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    number_output_features = 2

    def get_function(self):
        def hourly_sine_and_cosine(column):
            sine = np.sin(column.dt.hour)
            cosine = np.cos(column.dt.hour)
            return sine, cosine

        return hourly_sine_and_cosine

    def generate_names(self, base_feature_names):
        name = self.generate_name(base_feature_names)
        return f"{name}[sine]", f"{name}[cosine]"
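
The naming override can be tried in isolation. The class below is a hypothetical stand-in, not a Featuretools class, and its `generate_name` format (upper-cased primitive name followed by the input names) is an assumption chosen to match the column names used in this guide.

```python
class NamedOutputs:
    """Hypothetical stand-in for a primitive; not a Featuretools class."""

    name = "hourly_sine_and_cosine"

    def generate_name(self, base_feature_names):
        # Assumed format: upper-cased primitive name followed by its inputs.
        return f"{self.name.upper()}({', '.join(base_feature_names)})"

    def generate_names(self, base_feature_names):
        # One name per output, in the same order as the returned columns.
        name = self.generate_name(base_feature_names)
        return [f"{name}[sine]", f"{name}[cosine]"]


names = NamedOutputs().generate_names(["datetime"])
print(names)
# ['HOURLY_SINE_AND_COSINE(datetime)[sine]', 'HOURLY_SINE_AND_COSINE(datetime)[cosine]']
```

The names are matched to the outputs by position, so the order here should agree with the order in which the primitive function returns sine and cosine.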

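Now you have a primitive that outputs two columns. One column contains the sine of the hour; the other contains the cosine of the hour. Pass the primitive to DFS to generate the features. The feature names reflect the custom naming you defined.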

[7]:
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="log",
    agg_primitives=[],
    trans_primitives=[HourlySineAndCosine],
)

feature_matrix.head()[
    [
        "HOURLY_SINE_AND_COSINE(datetime)[sine]",
        "HOURLY_SINE_AND_COSINE(datetime)[cosine]",
    ]
]
[7]:
HOURLY_SINE_AND_COSINE(datetime)[sine] HOURLY_SINE_AND_COSINE(datetime)[cosine]
id
0 -0.544021 -0.839072
1 -0.544021 -0.839072
2 -0.544021 -0.839072
3 -0.544021 -0.839072
4 -0.544021 -0.839072
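
The values in the table are consistent with an hour of 10 passed to sine and cosine as radians, which is what `np.sin(column.dt.hour)` computes. The sketch below checks that with the standard library. It also shows a commonly used alternative that this guide does not use: scaling the hour by 2π/24 so that one full day maps onto one full circle. The function `cyclical_hour` and its `period` parameter are hypothetical.

```python
import math

hour = 10  # the hour that reproduces the values in the table above
print(round(math.sin(hour), 6), round(math.cos(hour), 6))  # -0.544021 -0.839072


def cyclical_hour(hour, period=24):
    """Hypothetical alternative: map the hour onto a circle with the given period."""
    angle = 2 * math.pi * hour / period
    return math.sin(angle), math.cos(angle)


# With the scaled version, hour 6 is a quarter turn: sine near 1, cosine near 0.
sine_6, cosine_6 = cyclical_hour(6)
print(round(sine_6, 6), round(abs(cosine_6), 6))  # 1.0 0.0
```

With the scaled form, hours 0 and 24 land on the same point, which keeps 23:00 and 00:00 close together in feature space.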