Advanced Custom Primitives Guide#

[1]:

import re

import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, NaturalLanguage

import featuretools as ft
from featuretools.primitives import TransformPrimitive
from featuretools.tests.testing_utils import make_ecommerce_entityset

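Primitives with Additional Arguments#

Some features require more advanced calculations than others, and these calculations often need additional arguments to produce the desired value. With custom primitives, you can use primitive arguments to create such features.

String Count Example#

In this example, you will learn how to create a custom primitive that accepts an additional argument. You will create a primitive that counts the number of times a specific string value occurs in a text.

First, derive a new transform primitive class using TransformPrimitive as a base. The primitive takes a text column as input and returns a numeric column as output, so set the input type to a Woodwork ColumnSchema with the logical type NaturalLanguage, and set the return type to a Woodwork ColumnSchema with the semantic tag 'numeric'. The specific string value is the additional argument, so define it as a keyword argument in __init__. Then, override get_function to return the primitive function that calculates the feature.

Featuretools primitives use Woodwork's ColumnSchema to control the input and return types of columns for the primitive. For more information about using the Woodwork typing system in Featuretools, see the Woodwork Typing in Featuretools guide.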

[2]:

class StringCount(TransformPrimitive):
    """Count the number of times the string value occurs."""

    name = "string_count"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def __init__(self, string=None):
        self.string = string

    def get_function(self):
        def string_count(column):
            assert self.string is not None, "string to count needs to be defined"
            # this is a naive implementation used for clarity; the text is
            # lowercased before counting, so self.string should be lowercase
            counts = [text.lower().count(self.string) for text in column]
            return counts

        return string_count
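
The counting logic can be tried out without Featuretools. The following is a minimal sketch, assuming a plain-Python stand-in class (the name StringCountSketch and the sample comments are hypothetical and not part of Featuretools) that mirrors the pattern above: __init__ stores the argument and get_function returns a closure that uses it. It also illustrates two consequences of the naive implementation: matches are substring matches, and an uppercase search string never matches because the text is lowercased first.

```python
class StringCountSketch:
    """Plain-Python stand-in for the StringCount primitive (hypothetical)."""

    def __init__(self, string=None):
        self.string = string

    def get_function(self):
        def string_count(column):
            if self.string is None:
                raise ValueError("string to count needs to be defined")
            # Same naive logic as above: lowercase the text, then count substrings.
            return [text.lower().count(self.string) for text in column]

        return string_count


# Hypothetical sample data standing in for a text column.
comments = ["The cat and the hat", "Another one", "THE THE the"]

count_the = StringCountSketch(string="the").get_function()
print(count_the(comments))  # [2, 1, 3] ("Another" contains "the")

# The search string itself is not lowercased, so "The" finds nothing.
count_capital = StringCountSketch(string="The").get_function()
print(count_capital(comments))  # [0, 0, 0]
```

Raising ValueError instead of using assert is a choice made for this sketch only; the primitive above uses assert.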

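Now you have a primitive that is reusable for different string values. For example, you can create features based on the number of times the word "the" appears in a text. Create an instance of the primitive with the string value "the" and pass it to DFS to generate the features. The feature names automatically reflect the string value of the primitive.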

[3]:

es = make_ecommerce_entityset()

feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=["sum", "mean", "std"],
    trans_primitives=[StringCount(string="the")],
)

feature_matrix[
    [
        "STD(log.STRING_COUNT(comments, string=the))",
        "SUM(log.STRING_COUNT(comments, string=the))",
        "MEAN(log.STRING_COUNT(comments, string=the))",
    ]
]

[3]:

	STD(log.STRING_COUNT(comments, string=the))	SUM(log.STRING_COUNT(comments, string=the))	MEAN(log.STRING_COUNT(comments, string=the))
id
0	47.124304	209.0	41.80
1	36.509131	109.0	27.25
2	NaN	29.0	29.00
3	49.497475	70.0	35.00
4	0.000000	0.0	0.00
5	1.414214	4.0	2.00

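Features with Multiple Outputs#

Some calculations output more than one value. With custom primitives, you can take full advantage of these calculations by creating a feature for each output value.

Case Count Example#

In this example, you will learn how to create a custom primitive that outputs multiple features. You will create a primitive that outputs the counts of upper case and lower case letters in a text.

First, derive a new transform primitive class using TransformPrimitive as a base. The primitive takes a text column as input and returns two numeric columns as output, so set the input type to a Woodwork ColumnSchema with the logical type NaturalLanguage, and set the return type to a Woodwork ColumnSchema with the semantic tag 'numeric'. Because this primitive returns two columns, also set number_output_features to 2. Then, override get_function to return the primitive function that calculates the features and returns the two columns.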

[4]:

class CaseCount(TransformPrimitive):
    """Return the count of upper case and lower case letters of a text."""

    name = "case_count"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    number_output_features = 2

    def get_function(self):
        def case_count(array):
            # this is a naive implementation used for clarity
            upper = np.array([len(re.findall("[A-Z]", i)) for i in array])
            lower = np.array([len(re.findall("[a-z]", i)) for i in array])
            return upper, lower

        return case_count
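
To see what the inner function computes, here is a small standalone sketch of the same counting logic. It is an illustration only: it uses plain lists instead of NumPy arrays so that it runs with the standard library alone, and the sample strings are made up. Note that the character classes [A-Z] and [a-z] match ASCII letters only, so an accented letter such as "É" is not counted.

```python
import re


def case_count(texts):
    """Return (upper_counts, lower_counts) for an iterable of strings."""
    # Same naive regex approach as the primitive above, ASCII letters only.
    upper = [len(re.findall("[A-Z]", text)) for text in texts]
    lower = [len(re.findall("[a-z]", text)) for text in texts]
    return upper, lower


# Hypothetical sample data standing in for a text column.
quotes = ["Hello World", "abc", "ÉCOLE 123"]

upper, lower = case_count(quotes)
print(upper)  # [2, 0, 4]  ("É" is not matched by [A-Z])
print(lower)  # [8, 3, 0]
```

The function returns one sequence per output feature, in a fixed order; that order is what the output index in the feature names refers to.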

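Now you have a primitive that outputs two columns. One column contains the count of upper case letters, and the other contains the count of lower case letters. Pass the primitive to DFS to generate the features. By default, the feature names reflect the index of the output.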

[5]:

feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=[],
    trans_primitives=[CaseCount],
)

feature_matrix[
    [
        "customers.CASE_COUNT(favorite_quote)[0]",
        "customers.CASE_COUNT(favorite_quote)[1]",
    ]
]

[5]:

	customers.CASE_COUNT(favorite_quote)[0]	customers.CASE_COUNT(favorite_quote)[1]
id
0	1.0	44.0
1	1.0	44.0
2	1.0	44.0
3	1.0	41.0
4	1.0	41.0
5	1.0	57.0
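
The default naming visible in the column headers above, a base name followed by [0] and [1], can be mimicked with a short helper. This is a sketch of the naming convention as it appears in the output, not Featuretools' internal implementation; the function default_output_names is hypothetical.

```python
def default_output_names(base_name, number_output_features):
    """Append the output index to the base feature name (illustrative only)."""
    return [f"{base_name}[{i}]" for i in range(number_output_features)]


names = default_output_names("CASE_COUNT(favorite_quote)", 2)
print(names)
# ['CASE_COUNT(favorite_quote)[0]', 'CASE_COUNT(favorite_quote)[1]']
```

Index 0 corresponds to the first value returned by the primitive function (the upper case count) and index 1 to the second (the lower case count).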

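Custom Naming for Multiple Outputs#

When you create a primitive that outputs multiple features, you can also define a custom name for each of those features.

Hourly Sine and Cosine Example#

In this example, you will learn how to apply custom naming to multiple outputs. You will create a primitive that outputs the sine and cosine of the hour.

First, derive a new transform primitive class using TransformPrimitive as a base. The primitive takes the time index as input and returns two numeric columns as output. Set the input type to a Woodwork ColumnSchema with the logical type Datetime and the semantic tag 'time_index'. Next, set the return type to a Woodwork ColumnSchema with the semantic tag 'numeric' and set number_output_features to 2. Then, override get_function to return the primitive function that calculates the features and returns the two columns. In addition, override generate_names to return the feature names that you define.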

[6]:

class HourlySineAndCosine(TransformPrimitive):
    """Returns the sine and cosine of the hour."""

    name = "hourly_sine_and_cosine"
    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    number_output_features = 2

    def get_function(self):
        def hourly_sine_and_cosine(column):
            # np.sin and np.cos interpret the hour (0-23) as radians
            sine = np.sin(column.dt.hour)
            cosine = np.cos(column.dt.hour)
            return sine, cosine

        return hourly_sine_and_cosine

    def generate_names(self, base_feature_names):
        name = self.generate_name(base_feature_names)
        return f"{name}[sine]", f"{name}[cosine]"

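Now you have a primitive that outputs two columns. One column contains the sine of the hour, and the other contains the cosine of the hour. Pass the primitive to DFS to generate the features. The feature names reflect the custom naming you defined.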

[7]:

feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="log",
    agg_primitives=[],
    trans_primitives=[HourlySineAndCosine],
)

feature_matrix.head()[
    [
        "HOURLY_SINE_AND_COSINE(datetime)[sine]",
        "HOURLY_SINE_AND_COSINE(datetime)[cosine]",
    ]
]

[7]:

	HOURLY_SINE_AND_COSINE(datetime)[sine]	HOURLY_SINE_AND_COSINE(datetime)[cosine]
id
0	-0.544021	-0.839072
1	-0.544021	-0.839072
2	-0.544021	-0.839072
3	-0.544021	-0.839072
4	-0.544021	-0.839072
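
The values in the table can be reproduced by hand: they equal the sine and cosine of 10, which suggests these rows fall in hour 10, with the hour treated as an angle in radians. The sketch below checks this with the standard library and, as a separate variant that is not part of the example above, shows a scaled encoding that maps the 24-hour cycle onto one full turn. The function names and the period parameter are assumptions made for illustration.

```python
import math


def hour_sine_cosine(hour):
    """Sine and cosine of the raw hour value, as in the primitive above."""
    return math.sin(hour), math.cos(hour)


def cyclical_hour_sine_cosine(hour, period=24):
    """Variant (not the guide's method): scale the hour to one full turn."""
    angle = 2 * math.pi * hour / period
    return math.sin(angle), math.cos(angle)


sine, cosine = hour_sine_cosine(10)
print(round(sine, 6), round(cosine, 6))  # -0.544021 -0.839072

# With the scaled variant, hour 0 and hour 24 land on the same point.
start = cyclical_hour_sine_cosine(0)
end = cyclical_hour_sine_cosine(24)
print(math.isclose(start[0], end[0], abs_tol=1e-9))  # True
```

Whether the scaled variant is preferable depends on how the feature is used downstream; the primitive in this guide applies the functions to the unscaled hour.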
