Advanced Custom Primitives Guide
[1]:
import re
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, NaturalLanguage
import featuretools as ft
from featuretools.primitives import TransformPrimitive
from featuretools.tests.testing_utils import make_ecommerce_entityset
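Primitives with Additional Arguments
Some features require more advanced calculations than others. Advanced features often need additional arguments to produce the desired output. With custom primitives, you can use primitive arguments to create these advanced features.
String Count Example
In this example, you will learn how to create custom primitives that accept additional arguments. You will create a primitive that counts the number of times a specific string value occurs in a text.
First, derive a new transform primitive class using TransformPrimitive as the base. The primitive takes a text column as input and returns a numeric column as output, so set the input type to a Woodwork ColumnSchema with the logical type NaturalLanguage, and set the return type to a Woodwork ColumnSchema with the semantic tag 'numeric'. The specific string value is the additional argument, so define it as a keyword argument in __init__. Then, override get_function to return the primitive function that calculates the feature.
Featuretools primitives use Woodwork's ColumnSchema to control the input and return types of a primitive's columns. For more information about using the Woodwork type system in Featuretools, see the Woodwork Typing in Featuretools guide.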
[2]:
class StringCount(TransformPrimitive):
    """Count the number of times the string value occurs."""

    name = "string_count"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def __init__(self, string=None):
        self.string = string

    def get_function(self):
        def string_count(column):
            assert self.string is not None, "string to count needs to be defined"
            # this is a naive implementation used for clarity
            counts = [text.lower().count(self.string) for text in column]
            return counts

        return string_count
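To see what the inner function computes without running DFS, the counting logic can be tried on a few plain strings. The following is a minimal standalone sketch, not part of Featuretools: make_string_count and the sample comments are hypothetical names and data introduced only for illustration. It also shows two consequences of the naive implementation: because the text is lowercased before counting, the search string should itself be lowercase, and str.count matches substrings, so "the" is also found inside words such as "other".

```python
def make_string_count(string):
    """Hypothetical helper mirroring the function returned by get_function."""

    def string_count(column):
        # Lowercase each text, then count non-overlapping occurrences.
        return [text.lower().count(string) for text in column]

    return string_count


# Made-up sample data, for illustration only.
comments = ["The cat sat on the mat.", "Other themes", "THE END"]

lower_counts = make_string_count("the")(comments)
upper_counts = make_string_count("The")(comments)

print(lower_counts)  # [2, 2, 1]
print(upper_counts)  # [0, 0, 0]
```

If case-insensitive matching of the search string is wanted, one option is to lowercase the string once in __init__; that is a design choice, not something the primitive above does.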
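Now you have a reusable primitive that works with different string values. For example, you can create features based on the number of times the word "the" appears in a text. Create an instance of the primitive with the string value "the" and pass it to DFS to generate the features. The feature names automatically reflect the primitive's string value.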
[3]:
es = make_ecommerce_entityset()

feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=["sum", "mean", "std"],
    trans_primitives=[StringCount(string="the")],
)

feature_matrix[
    [
        "STD(log.STRING_COUNT(comments, string=the))",
        "SUM(log.STRING_COUNT(comments, string=the))",
        "MEAN(log.STRING_COUNT(comments, string=the))",
    ]
]
[3]:
id | STD(log.STRING_COUNT(comments, string=the)) | SUM(log.STRING_COUNT(comments, string=the)) | MEAN(log.STRING_COUNT(comments, string=the))
---|---|---|---
0 | 47.124304 | 209.0 | 41.80
1 | 36.509131 | 109.0 | 27.25
2 | NaN | 29.0 | 29.00
3 | 49.497475 | 70.0 | 35.00
4 | 0.000000 | 0.0 | 0.00
5 | 1.414214 | 4.0 | 2.00
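The NaN in the STD column for session 2 is expected rather than an error. The sum and mean are equal for that session, which suggests it has a single log row, and the sample standard deviation (which divides by n - 1) is undefined for a single value. The sketch below uses only the Python standard library; the per-row counts are hypothetical values chosen to be consistent with the table, not data read from the EntitySet.

```python
import statistics

# Hypothetical per-row counts consistent with the table's sum, mean, and std.
session_3 = [0, 70]  # sum 70, mean 35
session_5 = [1, 3]  # sum 4, mean 2
session_2 = [29]  # a single row

std_3 = statistics.stdev(session_3)  # sample standard deviation (n - 1)
std_5 = statistics.stdev(session_5)

try:
    std_2 = statistics.stdev(session_2)
except statistics.StatisticsError:
    # The standard library raises here; pandas reports NaN instead.
    std_2 = float("nan")

print(round(std_3, 6), round(std_5, 6), std_2)  # 49.497475 1.414214 nan
```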
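Features with Multiple Outputs
Some calculations output more than one value. With custom primitives, you can take full advantage of these calculations by creating a feature for each output value.
Case Count Example
In this example, you will learn how to create custom primitives that output multiple features. You will create a primitive that outputs the counts of upper case and lower case letters in a text.
First, derive a new transform primitive class using TransformPrimitive as the base. The primitive takes a text column as input and returns two numeric columns as output, so set the input type to a Woodwork ColumnSchema with the logical type NaturalLanguage, and set the return type to a Woodwork ColumnSchema with the semantic tag 'numeric'. Because this primitive returns two columns, also set number_output_features to 2. Then, override get_function to return the primitive function that calculates the features and returns the columns, one per output.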
[4]:
class CaseCount(TransformPrimitive):
    """Return the count of upper case and lower case letters of a text."""

    name = "case_count"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    number_output_features = 2

    def get_function(self):
        def case_count(array):
            # this is a naive implementation used for clarity
            upper = np.array([len(re.findall("[A-Z]", i)) for i in array])
            lower = np.array([len(re.findall("[a-z]", i)) for i in array])
            return upper, lower

        return case_count
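The counting logic can be checked in isolation on a few strings. The sketch below is a standalone approximation that uses only the standard library (plain lists instead of NumPy arrays); case_count_ascii, case_count_unicode, and the sample texts are hypothetical. It also illustrates a limitation of the naive patterns: [A-Z] and [a-z] match ASCII letters only, so accented letters are not counted, whereas str.isupper and str.islower are Unicode-aware.

```python
import re


def case_count_ascii(texts):
    # Same regular expressions as the primitive above: ASCII letters only.
    upper = [len(re.findall("[A-Z]", t)) for t in texts]
    lower = [len(re.findall("[a-z]", t)) for t in texts]
    return upper, lower


def case_count_unicode(texts):
    # Alternative that also counts non-ASCII cased letters.
    upper = [sum(ch.isupper() for ch in t) for t in texts]
    lower = [sum(ch.islower() for ch in t) for t in texts]
    return upper, lower


# Made-up sample data; "\u00c9" is an accented capital E.
texts = ["Hello World", "\u00c9clair"]

ascii_counts = case_count_ascii(texts)
unicode_counts = case_count_unicode(texts)

print(ascii_counts)  # ([2, 0], [8, 5])
print(unicode_counts)  # ([2, 1], [8, 5])
```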
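Now you have a primitive that outputs two columns: one contains the count of upper case letters and the other contains the count of lower case letters. Pass the primitive to DFS to generate the features. By default, the feature names reflect the index of each output.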
[5]:
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=[],
    trans_primitives=[CaseCount],
)

feature_matrix[
    [
        "customers.CASE_COUNT(favorite_quote)[0]",
        "customers.CASE_COUNT(favorite_quote)[1]",
    ]
]
[5]:
id | customers.CASE_COUNT(favorite_quote)[0] | customers.CASE_COUNT(favorite_quote)[1]
---|---|---
0 | 1.0 | 44.0
1 | 1.0 | 44.0
2 | 1.0 | 44.0
3 | 1.0 | 41.0
4 | 1.0 | 41.0
5 | 1.0 | 57.0
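Custom Naming for Multiple Outputs
When you create a primitive that outputs multiple features, you can also define a custom name for each of those features.
Hourly Sine and Cosine Example
In this example, you will learn how to apply custom naming to multiple outputs. You will create a primitive that outputs the sine and cosine of the hour.
First, derive a new transform primitive class using TransformPrimitive as the base. The primitive takes the time index as input and returns two numeric columns as output. Set the input type to a Woodwork ColumnSchema with the logical type Datetime and the semantic tag 'time_index'. Next, set the return type to a Woodwork ColumnSchema with the semantic tag 'numeric', and set number_output_features to 2. Then, override get_function to return the primitive function that calculates the features and returns the columns, one per output. Finally, override generate_names to return the feature names that you define, one per output.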
[6]:
class HourlySineAndCosine(TransformPrimitive):
    """Returns the sine and cosine of the hour."""

    name = "hourly_sine_and_cosine"
    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    number_output_features = 2

    def get_function(self):
        def hourly_sine_and_cosine(column):
            sine = np.sin(column.dt.hour)
            cosine = np.cos(column.dt.hour)
            return sine, cosine

        return hourly_sine_and_cosine

    def generate_names(self, base_feature_names):
        name = self.generate_name(base_feature_names)
        return f"{name}[sine]", f"{name}[cosine]"
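The effect of overriding generate_names can be seen by comparing the two naming patterns that appear in this guide's outputs: an index per output, and a descriptive label per output. The following standalone sketch only mimics those patterns with plain strings. indexed_names and labeled_names are hypothetical helpers, not Featuretools functions, and the base name is passed in directly rather than produced by generate_name.

```python
def indexed_names(base_name, number_output_features):
    # Index-style naming, as seen in the CaseCount output: name[0], name[1], ...
    return [f"{base_name}[{i}]" for i in range(number_output_features)]


def labeled_names(base_name, labels):
    # Custom naming: one descriptive label per output, in output order.
    return [f"{base_name}[{label}]" for label in labels]


base = "HOURLY_SINE_AND_COSINE(datetime)"
default = indexed_names(base, 2)
custom = labeled_names(base, ["sine", "cosine"])

print(default)
print(custom)
```

Whichever labels are used, their order should match the order of the columns returned by the primitive function, since each name is paired with the output at the same position.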
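Now you have a primitive that outputs two columns: one contains the sine of the hour and the other contains the cosine of the hour. Pass the primitive to DFS to generate the features. The feature names reflect the custom naming that you defined.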
[7]:
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="log",
    agg_primitives=[],
    trans_primitives=[HourlySineAndCosine],
)

feature_matrix.head()[
    [
        "HOURLY_SINE_AND_COSINE(datetime)[sine]",
        "HOURLY_SINE_AND_COSINE(datetime)[cosine]",
    ]
]
[7]:
id | HOURLY_SINE_AND_COSINE(datetime)[sine] | HOURLY_SINE_AND_COSINE(datetime)[cosine]
---|---|---
0 | -0.544021 | -0.839072
1 | -0.544021 | -0.839072
2 | -0.544021 | -0.839072
3 | -0.544021 | -0.839072
4 | -0.544021 | -0.839072
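The values in the table are what the sine and cosine functions return when the hour is used directly as an angle in radians; they are consistent with an hour of 10. A common variation, which is not what the primitive above does, scales the hour to a full circle (2π · hour / 24) so that hour 23 and hour 0 end up close together. The sketch below uses only the standard library; raw_hour and cyclic_hour are hypothetical helpers introduced for illustration.

```python
import math


def raw_hour(hour):
    # What the primitive above computes: the hour used directly as radians.
    return math.sin(hour), math.cos(hour)


def cyclic_hour(hour, period=24):
    # Hypothetical variant: map the hour onto one full turn of the circle.
    angle = 2 * math.pi * hour / period
    return math.sin(angle), math.cos(angle)


sine, cosine = raw_hour(10)
print(round(sine, 6), round(cosine, 6))  # -0.544021 -0.839072

# With the scaled variant, 23:00 and 00:00 are neighbors on the circle.
gap = math.dist(cyclic_hour(23), cyclic_hour(0))
print(round(gap, 3))  # 0.261
```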