Deep Feature Synthesis
Deep Feature Synthesis (DFS) is an automated method for performing feature engineering on relational and temporal data.
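Input Data
Deep Feature Synthesis requires a structured dataset in order to perform feature engineering. To demonstrate the capabilities of DFS, we will use a mock customer transactions dataset.
Note
Before using DFS, it is recommended that you prepare your data as an EntitySet. See Representing Data with EntitySets to learn how.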
[1]:
import featuretools as ft
es = ft.demo.load_mock_customer(return_entityset=True)
es
[1]:
Entityset: transactions
DataFrames:
transactions [Rows: 500, Columns: 6]
products [Rows: 5, Columns: 3]
sessions [Rows: 35, Columns: 5]
customers [Rows: 5, Columns: 5]
Relationships:
transactions.product_id -> products.product_id
transactions.session_id -> sessions.session_id
sessions.customer_id -> customers.customer_id
Once the data is prepared as an EntitySet, we can automatically generate features for a target DataFrame, for example customers.
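The relationships listed in the EntitySet output determine which DataFrames can contribute features to a given target. As a rough illustration only, the sketch below stores those three child-to-parent links in a plain dictionary and walks them upward. The `relationships` dictionary and the `ancestors` helper are hypothetical names for this example and are not part of the Featuretools API.

```python
# Child-to-parent links, copied from the Relationships listing of the EntitySet.
# Key: (child DataFrame, foreign-key column); value: parent DataFrame.
relationships = {
    ("transactions", "product_id"): "products",
    ("transactions", "session_id"): "sessions",
    ("sessions", "customer_id"): "customers",
}


def ancestors(dataframe_name):
    """Return every DataFrame reachable by following child-to-parent links."""
    found = []
    frontier = [dataframe_name]
    while frontier:
        current = frontier.pop()
        for (child, _column), parent in relationships.items():
            if child == current and parent not in found:
                found.append(parent)
                frontier.append(parent)
    return found


for name in ("transactions", "sessions", "customers"):
    print(name, "->", ancestors(name))
```

Because transactions link to sessions and sessions link to customers, a feature for a customer can draw on both the customer's sessions and, through them, the customer's transactions.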
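Running DFS
Typically, without automated feature engineering, a data scientist would write code to aggregate data for a customer and apply different statistical functions, producing features that quantify the customer's behavior. In this example, an expert might be interested in features such as the total number of sessions or the month the customer signed up.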
DFS can generate these features when we specify customers as target_dataframe_name and "count" and "month" as primitives.
[2]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["count"],
    trans_primitives=["month"],
    max_depth=1,
)
feature_matrix
[2]:
customer_id | zip_code | COUNT(sessions) | MONTH(birthday) | MONTH(join_date)
---|---|---|---|---
5 | 60091 | 6 | 7 | 7
4 | 60091 | 8 | 8 | 4
1 | 60091 | 8 | 7 | 4
3 | 13244 | 6 | 11 | 8
2 | 13244 | 7 | 8 | 4
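In the example above, "count" is an aggregation primitive because it computes a single value from the many sessions related to one customer. "month" is called a transform primitive because it takes one value for a customer and transforms it into another.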
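To make the distinction concrete, here is a minimal sketch in plain Python of what the two kinds of primitive compute. It uses made-up toy rows, not the mock customer dataset, and it is not how Featuretools implements primitives internally.

```python
from collections import Counter
from datetime import date

# Toy data, invented for illustration.
customers = {
    1: {"join_date": date(2011, 4, 17)},
    2: {"join_date": date(2012, 8, 13)},
}
# Each session row points at its parent customer through customer_id.
sessions = [
    {"session_id": 10, "customer_id": 1},
    {"session_id": 11, "customer_id": 1},
    {"session_id": 12, "customer_id": 2},
]

# Aggregation: many child rows (sessions) collapse to one value per customer.
count_sessions = Counter(row["customer_id"] for row in sessions)

# Transform: one value per customer becomes one new value for that customer.
month_join_date = {cid: row["join_date"].month for cid, row in customers.items()}

for cid in customers:
    print(cid, count_sessions[cid], month_join_date[cid])
```

The aggregation needs the related sessions table to produce its value, whereas the transform only looks at a column that already sits on the customer row.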
Note
Feature primitives are a fundamental component of Featuretools. To learn more, read Feature primitives.
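Creating "Deep Features"
The name Deep Feature Synthesis comes from the algorithm's ability to stack primitives to generate more complex features. Each time a primitive is stacked, the "depth" of the feature increases. The max_depth parameter controls the maximum depth of the features returned by DFS. Let us try running DFS with max_depth=2.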
[3]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum", "mode"],
    trans_primitives=["month", "hour"],
    max_depth=2,
)
feature_matrix
[3]:
customer_id | zip_code | MODE(sessions.device) | MEAN(transactions.amount) | MODE(transactions.product_id) | SUM(transactions.amount) | HOUR(birthday) | HOUR(join_date) | MONTH(birthday) | MONTH(join_date) | MEAN(sessions.MEAN(transactions.amount)) | MEAN(sessions.SUM(transactions.amount)) | MODE(sessions.HOUR(session_start)) | MODE(sessions.MODE(transactions.product_id)) | MODE(sessions.MONTH(session_start)) | SUM(sessions.MEAN(transactions.amount)) | MODE(transactions.sessions.device)
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
5 | 60091 | mobile | 80.375443 | 5 | 6349.66 | 0 | 5 | 7 | 7 | 78.705187 | 1058.276667 | 0 | 3 | 1 | 472.231119 | mobile
4 | 60091 | mobile | 80.070459 | 2 | 8727.68 | 0 | 20 | 8 | 4 | 81.207189 | 1090.960000 | 1 | 1 | 1 | 649.657515 | mobile
1 | 60091 | mobile | 71.631905 | 4 | 9025.62 | 0 | 10 | 7 | 4 | 72.774140 | 1128.202500 | 6 | 4 | 1 | 582.193117 | mobile
3 | 13244 | desktop | 67.060430 | 1 | 6236.62 | 0 | 15 | 11 | 8 | 67.539577 | 1039.436667 | 5 | 1 | 1 | 405.237462 | desktop
2 | 13244 | desktop | 77.422366 | 4 | 7200.28 | 0 | 23 | 8 | 4 | 78.415122 | 1028.611429 | 3 | 3 | 1 | 548.905851 | desktop
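With a depth of 2, many features are generated from the supplied primitives. The algorithm that synthesizes these definitions is described in this paper. Let us examine one of the depth 2 features in the returned feature matrix.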
[4]:
feature_matrix[["MEAN(sessions.SUM(transactions.amount))"]]
[4]:
customer_id | MEAN(sessions.SUM(transactions.amount))
---|---
5 | 1058.276667
4 | 1090.960000
1 | 1128.202500
3 | 1039.436667
2 | 1028.611429
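For each customer, this feature first calculates the sum of all transaction amounts per session to get a total amount per session, then applies the mean to those totals across the customer's sessions to find the average amount spent per session. We call this a "deep feature" with a depth of 2.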
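The two stacked steps can be reproduced by hand. The sketch below is a plain-Python approximation on invented toy rows (not the mock customer data and not Featuretools internals): it sums amounts per session first, then averages the session totals per customer.

```python
from collections import defaultdict
from statistics import mean

# Toy rows, invented for illustration: (customer_id, session_id, amount)
transactions = [
    (1, "s1", 10.0),
    (1, "s1", 30.0),
    (1, "s2", 20.0),
    (2, "s3", 5.0),
    (2, "s3", 15.0),
]

# Depth 1: SUM(transactions.amount), one total per session.
session_totals = defaultdict(float)
session_owner = {}
for customer_id, session_id, amount in transactions:
    session_totals[session_id] += amount
    session_owner[session_id] = customer_id

# Depth 2: MEAN over each customer's session totals.
totals_by_customer = defaultdict(list)
for session_id, total in session_totals.items():
    totals_by_customer[session_owner[session_id]].append(total)

mean_session_total = {
    cid: mean(totals) for cid, totals in totals_by_customer.items()
}
print(mean_session_total)
```

On this toy data customer 1 has session totals of 40.0 and 20.0, so the stacked feature is 30.0, whereas the plain mean over that customer's three transactions would be 20.0. The stacked feature therefore carries different information from MEAN(transactions.amount).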
Let us look at another depth 2 feature, which calculates for each customer the most common hour of the day at which they start a session.
[5]:
feature_matrix[["MODE(sessions.HOUR(session_start))"]]
[5]:
customer_id | MODE(sessions.HOUR(session_start))
---|---
5 | 0
4 | 1
1 | 6
3 | 5
2 | 3
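For each customer, this feature calculates the hour of the day at which each of their sessions started, then uses the statistical function mode to identify the most common hour at which they started a session.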
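Here a transform primitive is stacked under an aggregation primitive. As a hedged illustration with invented timestamps, the same computation in plain Python looks like the sketch below. Tie-breaking in `statistics.mode` may differ from the behavior of the Featuretools primitive, so the toy data avoids ties.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mode

# Toy rows, invented for illustration: (customer_id, session_start)
sessions = [
    (1, datetime(2014, 1, 1, 6, 5)),
    (1, datetime(2014, 1, 1, 6, 40)),
    (1, datetime(2014, 1, 1, 9, 15)),
    (2, datetime(2014, 1, 1, 3, 0)),
]

# Transform step: HOUR(session_start) for every session.
hours_by_customer = defaultdict(list)
for customer_id, session_start in sessions:
    hours_by_customer[customer_id].append(session_start.hour)

# Aggregation step: MODE over each customer's session hours.
mode_hour = {cid: mode(hours) for cid, hours in hours_by_customer.items()}
print(mode_hour)
```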
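Stacking produces features that are more expressive than the individual primitives themselves. This enables the automatic creation of complex patterns for machine learning.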
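Note
You can graphically visualize the lineage of a feature by calling featuretools.graph_feature() on it. You can also generate an English description of a feature with featuretools.describe_feature(). See Generating Feature Descriptions for more details.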
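Changing the Target DataFrame
DFS is powerful because we can create a feature matrix for any DataFrame in our dataset. If we switch the target DataFrame to "sessions", we can synthesize features for each session instead of each customer. We can then use these features to predict the outcome of a session.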
[6]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=["mean", "sum", "mode"],
    trans_primitives=["month", "hour"],
    max_depth=2,
)
feature_matrix.head(5)
[6]:
session_id | customer_id | device | MEAN(transactions.amount) | MODE(transactions.product_id) | SUM(transactions.amount) | HOUR(session_start) | MONTH(session_start) | customers.zip_code | MODE(transactions.HOUR(transaction_time)) | MODE(transactions.MONTH(transaction_time)) | customers.MODE(sessions.device) | customers.MEAN(transactions.amount) | customers.MODE(transactions.product_id) | customers.SUM(transactions.amount) | customers.HOUR(birthday) | customers.HOUR(join_date) | customers.MONTH(birthday) | customers.MONTH(join_date)
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | 2 | desktop | 76.813125 | 3 | 1229.01 | 0 | 1 | 13244 | 0 | 1 | desktop | 77.422366 | 4 | 7200.28 | 0 | 23 | 8 | 4
2 | 5 | mobile | 74.696000 | 5 | 746.96 | 0 | 1 | 60091 | 0 | 1 | mobile | 80.375443 | 5 | 6349.66 | 0 | 5 | 7 | 7
3 | 4 | mobile | 88.600000 | 1 | 1329.00 | 0 | 1 | 60091 | 0 | 1 | mobile | 80.070459 | 2 | 8727.68 | 0 | 20 | 8 | 4
4 | 1 | mobile | 64.557200 | 5 | 1613.93 | 0 | 1 | 60091 | 0 | 1 | mobile | 71.631905 | 4 | 9025.62 | 0 | 10 | 7 | 4
5 | 4 | mobile | 70.638182 | 5 | 777.02 | 1 | 1 | 60091 | 1 | 1 | mobile | 80.070459 | 2 | 8727.68 | 0 | 20 | 8 | 4
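As we can see, DFS also builds deep features based on a parent DataFrame, in this case the customer of a particular session. For example, the feature below calculates the mean transaction amount of the session's customer.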
[7]:
feature_matrix[["customers.MEAN(transactions.amount)"]].head(5)
[7]:
session_id | customers.MEAN(transactions.amount)
---|---
1 | 77.422366
2 | 80.375443
3 | 80.070459
4 | 71.631905
5 | 80.070459
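Note that sessions 3 and 5 share the same value because both belong to customer 4. The sketch below, on invented toy rows and in plain Python rather than through Featuretools, shows the idea behind such a parent-based feature: aggregate at the customer level first, then copy each customer's value down to every one of that customer's sessions.

```python
from statistics import mean

# Toy rows, invented for illustration.
sessions = [(1, 2), (2, 5), (3, 2)]  # (session_id, customer_id)
transactions = [(1, 10.0), (1, 20.0), (2, 40.0), (3, 60.0)]  # (session_id, amount)

session_to_customer = dict(sessions)

# Step 1: customer-level aggregation, MEAN(transactions.amount) per customer.
amounts_by_customer = {}
for session_id, amount in transactions:
    customer_id = session_to_customer[session_id]
    amounts_by_customer.setdefault(customer_id, []).append(amount)
customer_mean = {cid: mean(vals) for cid, vals in amounts_by_customer.items()}

# Step 2: each session inherits the value of its parent customer.
session_feature = {sid: customer_mean[cid] for sid, cid in sessions}
print(session_feature)
```

Sessions 1 and 3 both belong to the toy customer 2, so they receive the same value, mirroring the repeated values in the output above.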
Improving Feature Output
To learn which parameters to change in DFS, read Tuning Deep Feature Synthesis.