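Deep Feature Synthesis#

Deep Feature Synthesis (DFS) is an automated method for performing feature engineering on relational and temporal data.

Input Data#

Deep Feature Synthesis requires structured datasets in order to perform feature engineering. To demonstrate the capabilities of DFS, we will use a mock customer transactions dataset.

Note

Before using DFS, it is recommended that you prepare your data as an EntitySet. See Representing Data with EntitySets to learn how to do this.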

[1]:

import featuretools as ft

es = ft.demo.load_mock_customer(return_entityset=True)
es

[1]:

Entityset: transactions
  DataFrames:
    transactions [Rows: 500, Columns: 6]
    products [Rows: 5, Columns: 3]
    sessions [Rows: 35, Columns: 5]
    customers [Rows: 5, Columns: 5]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id

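Once the data is prepared as an EntitySet, we are ready to automatically generate features for a target DataFrame, e.g. customers.

Running DFS#

Typically, without automated feature engineering, a data scientist would write code to aggregate data for a customer and apply different statistical functions, resulting in features that quantify the customer's behavior. In this example, an expert might be interested in features such as the total number of sessions or the month the customer signed up.

DFS can generate these features when we set target_dataframe_name to "customers" and pass "count" and "month" as primitives.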

[2]:

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["count"],
    trans_primitives=["month"],
    max_depth=1,
)
feature_matrix

[2]:

	zip_code	COUNT(sessions)	MONTH(birthday)	MONTH(join_date)
customer_id
5	60091	6	7	7
4	60091	8	8	4
1	60091	8	7	4
3	13244	6	11	8
2	13244	7	8	4

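In the example above, "count" is an aggregation primitive because it computes a single value from the many sessions related to one customer. "month" is called a transform primitive because it takes one value for a customer and transforms it into another.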
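To make the distinction concrete, the sketch below computes the same two kinds of feature by hand using only the Python standard library. It is an illustration of the idea, not how Featuretools computes features internally, and the toy rows and variable names are hypothetical rather than taken from the mock customer dataset.

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy rows, not the mock customer dataset used above.
customers = {
    1: {"join_date": date(2011, 4, 17)},
    2: {"join_date": date(2012, 8, 13)},
}
sessions = [
    {"session_id": 10, "customer_id": 1},
    {"session_id": 11, "customer_id": 1},
    {"session_id": 12, "customer_id": 2},
]

# Aggregation: many child rows (sessions) collapse to one value per customer.
session_counts = defaultdict(int)
for session in sessions:
    session_counts[session["customer_id"]] += 1

# Transform: one value on the customer's own row maps to one new value.
features = {
    customer_id: {
        "COUNT(sessions)": session_counts[customer_id],
        "MONTH(join_date)": row["join_date"].month,
    }
    for customer_id, row in customers.items()
}
print(features)
```

The aggregation needs the relationship between sessions and customers, while the transform only reads a column of the target rows.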

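Note

Feature primitives are a fundamental component of Featuretools. To learn more, read Feature primitives.

Creating "Deep Features"#

The name Deep Feature Synthesis comes from the algorithm's ability to stack primitives to generate more complex features. Each time we stack a primitive, we increase the "depth" of a feature. The max_depth parameter controls the maximum depth of the features returned by DFS. Let's try running DFS with max_depth=2.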

[3]:

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum", "mode"],
    trans_primitives=["month", "hour"],
    max_depth=2,
)
feature_matrix

[3]:

	zip_code	MODE(sessions.device)	MEAN(transactions.amount)	MODE(transactions.product_id)	SUM(transactions.amount)	HOUR(birthday)	HOUR(join_date)	MONTH(birthday)	MONTH(join_date)	MEAN(sessions.MEAN(transactions.amount))	MEAN(sessions.SUM(transactions.amount))	MODE(sessions.HOUR(session_start))	MODE(sessions.MODE(transactions.product_id))	MODE(sessions.MONTH(session_start))	SUM(sessions.MEAN(transactions.amount))	MODE(transactions.sessions.device)
customer_id
5	60091	mobile	80.375443	5	6349.66	0	5	7	7	78.705187	1058.276667	0	3	1	472.231119	mobile
4	60091	mobile	80.070459	2	8727.68	0	20	8	4	81.207189	1090.960000	1	1	1	649.657515	mobile
1	60091	mobile	71.631905	4	9025.62	0	10	7	4	72.774140	1128.202500	6	4	1	582.193117	mobile
3	13244	desktop	67.060430	1	6236.62	0	15	11	8	67.539577	1039.436667	5	1	1	405.237462	desktop
2	13244	desktop	77.422366	4	7200.28	0	23	8	4	78.415122	1028.611429	3	3	1	548.905851	desktop

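With a depth of 2, a number of features are generated using the supplied primitives. The algorithm that synthesizes these definitions is described in this paper. In the returned feature matrix, let's understand one of the depth 2 features.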

[4]:

feature_matrix[["MEAN(sessions.SUM(transactions.amount))"]]

[4]:

	MEAN(sessions.SUM(transactions.amount))
customer_id
5	1058.276667
4	1090.960000
1	1128.202500
3	1039.436667
2	1028.611429

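For each customer, this feature

calculates the sum of all transaction amounts per session to get the total amount per session,
then applies the mean to the total amounts across that customer's sessions to identify the average amount spent per session.

We call this feature a "deep feature" with a depth of 2.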
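The two stacked steps can be sketched by hand as follows. This is a minimal standard-library illustration under stated assumptions: the transaction rows and the session_owner mapping are hypothetical, and the code does not reflect how Featuretools evaluates features internally.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows: (session_id, amount) pairs and a session -> customer map.
transactions = [(10, 20.0), (10, 30.0), (11, 100.0), (12, 5.0), (12, 15.0)]
session_owner = {10: 1, 11: 1, 12: 2}

# Depth 1: SUM(transactions.amount) for each session.
session_totals = defaultdict(float)
for session_id, amount in transactions:
    session_totals[session_id] += amount

# Depth 2: MEAN of those per-session totals for each customer.
totals_by_customer = defaultdict(list)
for session_id, total in session_totals.items():
    totals_by_customer[session_owner[session_id]].append(total)

mean_session_total = {
    customer_id: mean(totals) for customer_id, totals in totals_by_customer.items()
}
print(mean_session_total)
```

In this sketch a session with no transactions never appears in session_totals, so it is simply skipped; how a real library treats such sessions may differ.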

Let's look at another depth 2 feature, which calculates for every customer the most common hour of the day when they start a session.

[5]:

feature_matrix[["MODE(sessions.HOUR(session_start))"]]

[5]:

	MODE(sessions.HOUR(session_start))
customer_id
5	0
4	1
1	6
3	5
2	3

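For each customer, this feature calculates

the hour of the day at which each of his or her sessions started, then
uses the statistical function mode to identify the most common hour at which he or she started a session.

Stacking results in features that are more expressive than the individual primitives themselves. This enables the automatic creation of complex patterns for machine learning.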
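Here a transform primitive is applied first and an aggregation primitive is stacked on top of it. The sketch below reproduces that order by hand with the standard library; the session rows are hypothetical, and tie-breaking in Counter.most_common (first value encountered wins) is an assumption of this sketch that may not match the behavior of the mode primitive.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical toy sessions, not the mock customer dataset used above.
sessions = [
    {"customer_id": 1, "session_start": datetime(2014, 1, 1, 6, 5)},
    {"customer_id": 1, "session_start": datetime(2014, 1, 2, 6, 40)},
    {"customer_id": 1, "session_start": datetime(2014, 1, 3, 9, 0)},
    {"customer_id": 2, "session_start": datetime(2014, 1, 1, 23, 15)},
]

# Transform step: HOUR(session_start) for every session.
hours_by_customer = defaultdict(list)
for session in sessions:
    hours_by_customer[session["customer_id"]].append(session["session_start"].hour)

# Aggregation step: MODE of those hours for each customer.
mode_start_hour = {
    customer_id: Counter(hours).most_common(1)[0][0]
    for customer_id, hours in hours_by_customer.items()
}
print(mode_start_hour)
```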

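Note

You can graphically visualize the lineage of a feature by calling featuretools.graph_feature() on it. You can also generate an English description of a feature with featuretools.describe_feature(). For more details, see Generating Feature Descriptions.

Changing Target DataFrame#

DFS is powerful because we can create a feature matrix for any DataFrame in our dataset. If we switch the target DataFrame to "sessions", we can synthesize features for each session instead of each customer. We can then use these features to predict the outcome of a session.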

[6]:

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=["mean", "sum", "mode"],
    trans_primitives=["month", "hour"],
    max_depth=2,
)
feature_matrix.head(5)

[6]:

	customer_id	device	MEAN(transactions.amount)	MODE(transactions.product_id)	SUM(transactions.amount)	HOUR(session_start)	MONTH(session_start)	customers.zip_code	MODE(transactions.HOUR(transaction_time))	MODE(transactions.MONTH(transaction_time))	customers.MODE(sessions.device)	customers.MEAN(transactions.amount)	customers.MODE(transactions.product_id)	customers.SUM(transactions.amount)	customers.HOUR(birthday)	customers.HOUR(join_date)	customers.MONTH(birthday)	customers.MONTH(join_date)
session_id
1	2	desktop	76.813125	3	1229.01	0	1	13244	0	1	desktop	77.422366	4	7200.28	0	23	8	4
2	5	mobile	74.696000	5	746.96	0	1	60091	0	1	mobile	80.375443	5	6349.66	0	5	7	7
3	4	mobile	88.600000	1	1329.00	0	1	60091	0	1	mobile	80.070459	2	8727.68	0	20	8	4
4	1	mobile	64.557200	5	1613.93	0	1	60091	0	1	mobile	71.631905	4	9025.62	0	10	7	4
5	4	mobile	70.638182	5	777.02	1	1	60091	1	1	mobile	80.070459	2	8727.68	0	20	8	4

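As we can see, DFS also builds deep features based on a parent DataFrame, in this case the customer of a particular session. For example, the feature below calculates the mean transaction amount of the session's customer.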

[7]:

feature_matrix[["customers.MEAN(transactions.amount)"]].head(5)

[7]:

	customers.MEAN(transactions.amount)
session_id
1	77.422366
2	80.375443
3	80.070459
4	71.631905
5	80.070459
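Sessions 3 and 5 show the same value because both belong to customer 4: the feature is computed once per customer and then copied down to each of that customer's sessions. A minimal hand-written sketch of that idea, using hypothetical toy data and not reflecting Featuretools internals, could look like this:

```python
from statistics import mean

# Hypothetical toy rows: (session_id, amount) pairs and a session -> customer map.
transactions = [(10, 20.0), (10, 40.0), (11, 90.0), (12, 10.0)]
session_owner = {10: 1, 11: 1, 12: 2}

# Parent-level aggregation: MEAN(transactions.amount) for each customer.
amounts_by_customer = {}
for session_id, amount in transactions:
    amounts_by_customer.setdefault(session_owner[session_id], []).append(amount)
customer_mean = {
    customer_id: mean(amounts) for customer_id, amounts in amounts_by_customer.items()
}

# Copy the parent's value onto every child session: customers.MEAN(transactions.amount).
session_features = {
    session_id: customer_mean[customer_id]
    for session_id, customer_id in session_owner.items()
}
print(session_features)
```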

Improving Feature Output#

To learn about the parameters to change in DFS, read Tuning Deep Feature Synthesis.
