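Feature primitives
Feature primitives are the building blocks of Featuretools. They define individual computations that can be applied to raw datasets to create new features. Because a primitive only constrains the input and output data types, it can be applied across datasets and stacked to create new calculations.
Why primitives?
The space of potential functions that humans use to create a feature is vast. By breaking common feature engineering calculations down into primitive components, we are able to capture the underlying structure of the features humans create today.
A primitive only constrains the input and output data types. This means primitives can be used to transfer calculations known in one domain to another. Consider a feature that data scientists often calculate for transactional or event log data: the average time between events. This feature is often a strong signal when predicting fraudulent behavior or future customer engagement.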
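As a point of comparison, the "average time between events" feature can be computed by hand with pandas. The sketch below is illustrative only: the events table, its column names, and its values are made up for this example and are not part of the Featuretools demo data.

```python
import pandas as pd

# Hypothetical event log: one row per event, keyed by customer.
events = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2, 2],
        "event_time": pd.to_datetime(
            [
                "2024-01-01 10:00",
                "2024-01-01 10:10",
                "2024-01-01 10:40",
                "2024-01-02 09:00",
                "2024-01-02 09:05",
            ]
        ),
    }
)

events = events.sort_values(["customer_id", "event_time"])

# Step 1 (transform-like): seconds since the same customer's previous event.
events["gap_seconds"] = (
    events.groupby("customer_id")["event_time"].diff().dt.total_seconds()
)

# Step 2 (aggregation-like): mean gap per customer; the NaN first gap is skipped.
mean_gap = events.groupby("customer_id")["gap_seconds"].mean()
print(mean_gap)
```

The per-row gap plays the role of a transform and the per-customer mean plays the role of an aggregation, which is the same two-step structure that stacked primitives express.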
DFS achieves the same feature by stacking two primitives, "time_since_previous" and "mean":
[1]:
import featuretools as ft
es = ft.demo.load_mock_customer(return_entityset=True)
feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean"],
    trans_primitives=["time_since_previous"],
    features_only=True,
)
feature_defs
[1]:
[<Feature: zip_code>,
<Feature: MEAN(transactions.amount)>,
<Feature: TIME_SINCE_PREVIOUS(join_date)>,
<Feature: MEAN(sessions.MEAN(transactions.amount))>,
<Feature: MEAN(sessions.TIME_SINCE_PREVIOUS(session_start))>]
Note
The primitive arguments to DFS (e.g. agg_primitives and trans_primitives in the example above) accept snake_case, camelCase, or TitleCase strings naming built-in Featuretools primitives (i.e. time_since_previous, timeSincePrevious, and TimeSincePrevious are all acceptable inputs).
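One way to picture why the three spellings are interchangeable is that each reduces to the same snake_case name. The helper below, to_snake_case, is a hypothetical illustration written for this page; it is not the Featuretools implementation.

```python
import re


def to_snake_case(name: str) -> str:
    """Hypothetical helper: convert camelCase or TitleCase to snake_case."""
    # Insert an underscore before every uppercase letter except the first
    # character, then lowercase the whole string.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


spellings = ["time_since_previous", "timeSincePrevious", "TimeSincePrevious"]
normalized = {to_snake_case(s) for s in spellings}
print(normalized)  # {'time_since_previous'}
```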
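Note
When dfs is called with features_only=True, only the feature definitions are returned as output. By default this parameter is set to False. Use it to quickly inspect the feature definitions before spending time calculating the feature matrix.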
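A second advantage of primitives is that they can be used to quickly enumerate many interesting features in a parameterized way. Deep Feature Synthesis uses this to get several different ways of summarizing the time since the previous event.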
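The enumeration can be pictured as a simple cross of aggregation primitives with one transformed column. The sketch below only builds the feature names that appear in the DFS output in this section; it does not call Featuretools, and the string formatting is an assumption made for illustration rather than how DFS generates names internally.

```python
# Aggregation primitives to stack on top of one transform primitive.
agg_primitives = ["mean", "max", "min", "std", "skew"]

# The transformed column that every aggregation is applied to.
base_feature = "sessions.TIME_SINCE_PREVIOUS(session_start)"

# One candidate feature per aggregation primitive.
feature_names = [f"{agg.upper()}({base_feature})" for agg in agg_primitives]

for name in feature_names:
    print(name)
```

Adding one more aggregation primitive to the list adds one more candidate feature without any new feature-specific code.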
[2]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "max", "min", "std", "skew"],
    trans_primitives=["time_since_previous"],
)
feature_matrix[
    [
        "MEAN(sessions.TIME_SINCE_PREVIOUS(session_start))",
        "MAX(sessions.TIME_SINCE_PREVIOUS(session_start))",
        "MIN(sessions.TIME_SINCE_PREVIOUS(session_start))",
        "STD(sessions.TIME_SINCE_PREVIOUS(session_start))",
        "SKEW(sessions.TIME_SINCE_PREVIOUS(session_start))",
    ]
]
[2]:
| customer_id | MEAN(sessions.TIME_SINCE_PREVIOUS(session_start)) | MAX(sessions.TIME_SINCE_PREVIOUS(session_start)) | MIN(sessions.TIME_SINCE_PREVIOUS(session_start)) | STD(sessions.TIME_SINCE_PREVIOUS(session_start)) | SKEW(sessions.TIME_SINCE_PREVIOUS(session_start)) |
|---|---|---|---|---|---|
| 5 | 1007.500000 | 1170.0 | 715.0 | 157.884451 | -1.507217 |
| 4 | 999.375000 | 1625.0 | 650.0 | 308.688904 | 1.065177 |
| 1 | 966.875000 | 1170.0 | 715.0 | 171.754341 | -0.254557 |
| 3 | 888.333333 | 1170.0 | 650.0 | 177.613813 | 0.434581 |
| 2 | 725.833333 | 975.0 | 520.0 | 194.638554 | 0.162631 |
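Aggregation vs. transform primitives
In the example above, we use two types of primitives.
Aggregation primitives: These primitives take related instances as input and output a single value. They are applied across a parent-child relationship in an EntitySet. E.g. "count", "sum", "avg_time_between".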
![Feature lineage graph for COUNT(sessions): the sessions dataframe is grouped by customer_id and the COUNT aggregation produces COUNT(sessions) on the customers (target) dataframe](../_images/graphviz-d1a3a423db929f367290a82ec9b3384e7b997683.png)
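In plain pandas terms, the COUNT(sessions) feature shown in the graph amounts to a group-by on the child dataframe. This is a rough sketch with a made-up sessions table, not the code Featuretools runs.

```python
import pandas as pd

# Hypothetical child dataframe: one row per session.
sessions = pd.DataFrame(
    {"session_id": [1, 2, 3, 4, 5], "customer_id": [1, 2, 1, 3, 1]}
)

# Aggregation: the child rows of each customer collapse to a single value.
count_sessions = sessions.groupby("customer_id")["session_id"].count()
count_sessions.name = "COUNT(sessions)"
print(count_sessions)
```

The result has one row per parent (customer), regardless of how many child rows (sessions) each parent has.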
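Transform primitives: These primitives take one or more columns from a dataframe as input and output a new column for that dataframe. They are applied to a single dataframe. E.g. "hour", "time_since_previous", "absolute".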
![Feature lineage graph for TIME_SINCE_PREVIOUS(join_date): the join_date column of the customers (target) dataframe passes through the TIME_SINCE_PREVIOUS transform to produce a new column on the same dataframe](../_images/graphviz-be1551bf7269baa497671ebc41e6f3f2a52051ef.png)
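By contrast with an aggregation, a transform keeps the number of rows unchanged. The pandas sketch below uses a made-up dataframe and hypothetical output column names to show two transform-style computations; it is an illustration, not the Featuretools implementation.

```python
import pandas as pd

# Hypothetical dataframe with a single datetime column, already in time order.
customers = pd.DataFrame(
    {
        "join_date": pd.to_datetime(
            ["2024-01-01 08:00", "2024-01-01 09:30", "2024-01-02 09:30"]
        )
    }
)

# Transform: every input row yields exactly one output row.
customers["hour"] = customers["join_date"].dt.hour

# Seconds since the previous row; the first row has no predecessor, so it is NaN.
customers["seconds_since_previous"] = (
    customers["join_date"].diff().dt.total_seconds()
)
print(customers)
```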
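The graphs above were generated using the graph_feature function. These feature lineage graphs help to show visually how primitives were stacked to generate a feature.
For a DataFrame that lists and describes each built-in primitive in Featuretools, call ft.list_primitives().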
[3]:
ft.list_primitives().head(5)
[3]:
| | name | type | description | valid_inputs | return_type |
|---|---|---|---|---|---|
| 0 | variance | aggregation | Calculates the variance of a list of numbers. | <ColumnSchema (Semantic Tags = ['numeric'])> | <ColumnSchema (Logical Type = Double) (Semanti... |
| 1 | num_unique | aggregation | Determines the number of distinct values, ignoring... | <ColumnSchema (Semantic Tags = ['category'])> | <ColumnSchema (Logical Type = IntegerNullable)... |
| 2 | num_peaks | aggregation | Determines the number of peaks in a list of numbers... | <ColumnSchema (Semantic Tags = ['numeric'])> | <ColumnSchema (Logical Type = Integer) (Semant... |
| 3 | first | aggregation | Determines the first value in a list. | <ColumnSchema> | None |
| 4 | mode | aggregation | Determines the most commonly repeated value. | <ColumnSchema (Semantic Tags = ['category'])> | None |
For a DataFrame of metrics that summarizes various properties and capabilities of all of the built-in primitives in Featuretools, call ft.summarize_primitives().
[4]:
ft.summarize_primitives()
[4]:
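| | Metric | Count |
|---|---|---|
| 0 | Total primitives | 203 |
| 1 | Aggregation primitives | 65 |
| 2 | Transform primitives | 138 |
| 3 | Unique input types | 23 |
| 4 | Unique output types | 22 |
| 5 | Uses multiple inputs | 50 |
| 6 | Uses multiple outputs | 2 |
| 7 | Uses external data | 1 |
| 8 | Controllable | 87 |
| 9 | Uses Address input | 0 |
| 10 | Uses Age input | 0 |
| 11 | Uses AgeFractional input | 0 |
| 12 | Uses AgeNullable input | 0 |
| 13 | Uses Boolean input | 18 |
| 14 | Uses BooleanNullable input | 12 |
| 15 | Uses Categorical input | 0 |
| 16 | Uses CountryCode input | 0 |
| 17 | Uses CurrencyCode input | 0 |
| 18 | Uses Datetime input | 68 |
| 19 | Uses Double input | 4 |
| 20 | Uses EmailAddress input | 2 |
| 21 | Uses Filepath input | 1 |
| 22 | Uses IPAddress input | 0 |
| 23 | Uses Integer input | 4 |
| 24 | Uses IntegerNullable input | 0 |
| 25 | Uses LatLong input | 6 |
| 26 | Uses NaturalLanguage input | 17 |
| 27 | Uses Ordinal input | 4 |
| 28 | Uses PersonFullName input | 3 |
| 29 | Uses PhoneNumber input | 0 |
| 30 | Uses PostalCode input | 2 |
| 31 | Uses SubRegionCode input | 0 |
| 32 | Uses Timedelta input | 0 |
| 33 | Uses URL input | 3 |
| 34 | Uses Unknown input | 0 |
| 35 | Uses numeric tag input | 87 |
| 36 | Uses category tag input | 11 |
| 37 | Uses index tag input | 1 |
| 38 | Uses time_index tag input | 29 |
| 39 | Uses date_of_birth tag input | 1 |
| 40 | Uses ignore tag input | 0 |
| 41 | Uses passthrough tag input | 0 |
| 42 | Uses foreign_key tag input | 1 |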
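Defining custom primitives
The library of primitives in Featuretools is constantly expanding. Users can define their own primitives using the APIs below. To define a primitive, a user will:
Specify the type of primitive: Aggregation or Transform
Define the input and output data types
Write a function in Python to do the calculation
Annotate with attributes to constrain how it is applied
Once a primitive is defined, it can be stacked with existing primitives to generate complex patterns. This allows primitives known to be important in one domain to transfer automatically to another.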
[5]:
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, NaturalLanguage
from featuretools.primitives import AggregationPrimitive, TransformPrimitive
from featuretools.tests.testing_utils import make_ecommerce_entityset
Simple custom primitives
[6]:
class Absolute(TransformPrimitive):
    name = "absolute"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def absolute(column):
            return abs(column)

        return absolute
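Above, we created a new transform primitive for use with Deep Feature Synthesis by deriving a new primitive class with TransformPrimitive as the base class and overriding get_function to return a function that calculates the feature. Additionally, we set the input data types that the primitive applies to and the return data type. Input and return data types are defined using a Woodwork ColumnSchema. A full guide on Woodwork logical types and semantic tags can be found in the Woodwork Understanding Logical Types and Semantic Tags guide.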
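Because the function returned by get_function is ordinary Python, its logic can be checked on its own before it is wired into DFS. The sketch below copies the inner function and assumes the column arrives as a pandas Series, which is an assumption of this example.

```python
import pandas as pd


def absolute(column):
    # abs() on a pandas Series is applied element-wise,
    # so the output has the same length as the input.
    return abs(column)


column = pd.Series([-1.5, 2.0, -3.0])
result = absolute(column)
print(result.tolist())  # [1.5, 2.0, 3.0]
```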
Similarly, we can make a new aggregation primitive using AggregationPrimitive.
[7]:
class Maximum(AggregationPrimitive):
    name = "maximum"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def maximum(column):
            return max(column)

        return maximum
Because we defined an aggregation primitive, the function takes in a list of values but returns only one.
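The many-in, one-out behavior can be checked standalone as well. The sketch assumes a pandas Series as input and also illustrates one caveat of the built-in max: its result depends on where a NaN appears, whereas Series.max skips missing values by default.

```python
import math

import pandas as pd


def maximum(column):
    # Many values in, a single value out.
    return max(column)


values = pd.Series([3.0, 7.5, 1.2])
result = maximum(values)
print(result)  # 7.5

# Caveat: every comparison against NaN is False, so a leading NaN
# is never replaced by the built-in max.
with_nan = pd.Series([float("nan"), 3.0, 7.5])
builtin_result = maximum(with_nan)
pandas_result = with_nan.max()  # pandas skips NaN by default
print(math.isnan(builtin_result), pandas_result)
```

Whether missing values matter depends on the data the primitive will see; this is a consideration for custom primitives in general rather than a statement about the example dataset.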
Now that we have defined two primitives, we can use them with the dfs function as if they were built-in primitives.
[8]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=[Maximum],
    trans_primitives=[Absolute],
    max_depth=2,
)
feature_matrix.head(5)[
    [
        "customers.MAXIMUM(transactions.amount)",
        "MAXIMUM(transactions.ABSOLUTE(amount))",
    ]
]
[8]:
| session_id | customers.MAXIMUM(transactions.amount) | MAXIMUM(transactions.ABSOLUTE(amount)) |
|---|---|---|
| 1 | 146.81 | 141.66 |
| 2 | 149.02 | 135.25 |
| 3 | 149.95 | 147.73 |
| 4 | 139.43 | 129.00 |
| 5 | 149.95 | 139.20 |
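Word count example
Here we define a transform primitive, WordCount, which counts the number of words in each row of an input and returns a list of those counts.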
[9]:
class WordCount(TransformPrimitive):
    """
    Counts the number of words in each row of the column. Returns a list
    of the counts for each row.
    """

    name = "word_count"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def word_count(column):
            word_counts = []
            for value in column:
                words = value.split(None)
                word_counts.append(len(words))
            return word_counts

        return word_count
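The inner word_count function can likewise be tried on plain strings. This sketch is a condensed version that uses a list comprehension; the sample sentences are made up, and missing values such as NaN are not handled, since they have no split method.

```python
def word_count(column):
    # str.split() with no arguments splits on any run of whitespace
    # and ignores leading and trailing whitespace.
    return [len(value.split()) for value in column]


comments = ["the quick brown fox", "hello", "  two   words  ", ""]
counts = word_count(comments)
print(counts)  # [4, 1, 2, 0]
```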
[10]:
es = make_ecommerce_entityset()
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=["sum", "mean", "std"],
    trans_primitives=[WordCount],
)
feature_matrix[
    [
        "customers.WORD_COUNT(favorite_quote)",
        "STD(log.WORD_COUNT(comments))",
        "SUM(log.WORD_COUNT(comments))",
        "MEAN(log.WORD_COUNT(comments))",
    ]
]
[10]:
| id | customers.WORD_COUNT(favorite_quote) | STD(log.WORD_COUNT(comments)) | SUM(log.WORD_COUNT(comments)) | MEAN(log.WORD_COUNT(comments)) |
|---|---|---|---|---|
| 0 | 9.0 | 540.436860 | 2500.0 | 500.0 |
| 1 | 9.0 | 583.702550 | 1732.0 | 433.0 |
| 2 | 9.0 | NaN | 246.0 | 246.0 |
| 3 | 6.0 | 883.883476 | 1256.0 | 628.0 |
| 4 | 6.0 | 0.000000 | 9.0 | 3.0 |
| 5 | 12.0 | 19.798990 | 68.0 | 34.0 |
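By also adding some aggregation primitives, Deep Feature Synthesis was able to make four new features from one new primitive.
Multiple input types
If a primitive requires multiple features as input, input_types has multiple elements. For example, [ColumnSchema(semantic_tags={'numeric'}), ColumnSchema(semantic_tags={'numeric'})] means the primitive requires two columns with the semantic tag numeric as input. Below is an example of a primitive that has multiple input features.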
[11]:
class MeanSunday(AggregationPrimitive):
    """
    Finds the mean of non-null values of a feature that occurred on Sundays
    """

    name = "mean_sunday"
    input_types = [
        ColumnSchema(semantic_tags={"numeric"}),
        ColumnSchema(logical_type=Datetime),
    ]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def mean_sunday(numeric, datetime):
            days = pd.DatetimeIndex(datetime).weekday.values
            df = pd.DataFrame({"numeric": numeric, "time": days})
            return df[df["time"] == 6]["numeric"].mean()

        return mean_sunday
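To see how the two inputs interact, the inner mean_sunday function can be run directly on two aligned pandas Series. The sample values and dates below are made up for illustration; pandas numbers weekdays from Monday (0) to Sunday (6), which is why the filter compares against 6.

```python
import pandas as pd


def mean_sunday(numeric, datetime):
    # Weekday of each timestamp: Monday is 0, Sunday is 6.
    days = pd.DatetimeIndex(datetime).weekday.values
    df = pd.DataFrame({"numeric": numeric, "time": days})
    # Keep only the Sunday rows, then average the numeric values.
    return df[df["time"] == 6]["numeric"].mean()


values = pd.Series([2.0, 10.0, 4.0])
# 2024-01-07 and 2024-01-14 are Sundays; 2024-01-08 is a Monday.
dates = pd.Series(pd.to_datetime(["2024-01-07", "2024-01-08", "2024-01-14"]))
sunday_mean = mean_sunday(values, dates)
print(sunday_mean)  # 3.0

# With no Sunday rows, the mean of an empty selection is NaN.
weekday_dates = pd.Series(pd.to_datetime(["2024-01-08", "2024-01-09"]))
no_sunday = mean_sunday(pd.Series([1.0, 2.0]), weekday_dates)
print(no_sunday)  # nan
```

The second call shows why a group with no Sunday records ends up with NaN for this feature.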
[12]:
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=[MeanSunday],
    trans_primitives=[],
    max_depth=1,
)
feature_matrix[
    [
        "MEAN_SUNDAY(log.value, datetime)",
        "MEAN_SUNDAY(log.value_2, datetime)",
    ]
]
[12]:
| id | MEAN_SUNDAY(log.value, datetime) | MEAN_SUNDAY(log.value_2, datetime) |
|---|---|---|
| 0 | NaN | NaN |
| 1 | NaN | NaN |
| 2 | NaN | NaN |
| 3 | 2.5 | 1.0 |
| 4 | 7.0 | 3.0 |
| 5 | NaN | NaN |