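Tuning Deep Feature Synthesis

There are several parameters that can be tuned to change the output of DFS. We'll explore these parameters using the following transactions EntitySet.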

[1]:
import featuretools as ft

es = ft.demo.load_mock_customer(return_entityset=True)
es
[1]:
Entityset: transactions
  DataFrames:
    transactions [Rows: 500, Columns: 6]
    products [Rows: 5, Columns: 3]
    sessions [Rows: 35, Columns: 5]
    customers [Rows: 5, Columns: 5]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id

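Using “Seed Features”

Seed features are manually defined, problem-specific features that a user provides to DFS. Deep Feature Synthesis will then automatically stack new features on top of these features when it can.

By using seed features, we can include domain-specific knowledge in feature engineering automation. For the seed feature below, the domain knowledge may be that, for a specific retailer, a transaction above $125 would be considered an expensive purchase.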

[2]:
expensive_purchase = ft.Feature(es["transactions"].ww["amount"]) > 125

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["percent_true"],
    seed_features=[expensive_purchase],
)
feature_matrix[["PERCENT_TRUE(transactions.amount > 125)"]]
[2]:
PERCENT_TRUE(transactions.amount > 125)
customer_id
5 0.227848
4 0.220183
1 0.119048
3 0.182796
2 0.129032

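We can now see that the PERCENT_TRUE primitive was automatically applied to the boolean expensive_purchase feature from the transactions table. The resulting feature can be understood as the fraction of each customer's transactions that are considered expensive.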
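To make the meaning of this feature concrete, here is a minimal plain-Python sketch of the same calculation: a boolean flag per transaction, followed by the share of True flags per customer. The rows and the helper name `percent_true_by_customer` are hypothetical and are not part of Featuretools or the demo data.

```python
from collections import defaultdict

# Hypothetical (customer_id, amount) rows; these are NOT the demo data.
transactions = [
    (1, 140.0),
    (1, 20.0),
    (1, 75.5),
    (1, 130.0),
    (2, 99.0),
    (2, 126.0),
]


def percent_true_by_customer(rows, threshold=125):
    """Return {customer_id: fraction of transactions with amount > threshold}."""
    flags = defaultdict(list)
    for customer_id, amount in rows:
        # Seed feature: one boolean flag per transaction.
        flags[customer_id].append(amount > threshold)
    # Aggregation: share of True flags for each customer (a value in [0, 1]).
    return {cid: sum(vals) / len(vals) for cid, vals in flags.items()}


result = percent_true_by_customer(transactions)
print(result)
```

In this sketch the result is a fraction between 0 and 1, which matches the scale of the values in the feature matrix above.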

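Add “interesting” values to columns

Sometimes we want to create features that are conditioned on a second value before the calculation is performed. We call this extra filter a “where clause”. Where clauses are used in Deep Feature Synthesis by including primitives in the where_primitives parameter to DFS.

By default, where clauses are built using the interesting_values of a column.

Interesting values can be automatically determined and added for each DataFrame in a pandas EntitySet by calling es.add_interesting_values(). The example below instead sets them explicitly by passing a values dictionary for the sessions DataFrame.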

[3]:
values_dict = {"device": ["desktop", "mobile", "tablet"]}
es.add_interesting_values(dataframe_name="sessions", values=values_dict)

Interesting values are stored in the DataFrame's Woodwork typing information.

[4]:
es["sessions"].ww.columns["device"].metadata
[4]:
{'dataframe_name': 'sessions',
 'entityset_id': 'transactions',
 'interesting_values': ['desktop', 'mobile', 'tablet']}

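Now that interesting values are set for the device column in the sessions table, we can specify the aggregation primitives for which we want where clauses using the where_primitives parameter to DFS.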

[5]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["count", "avg_time_between"],
    where_primitives=["count", "avg_time_between"],
    trans_primitives=[],
)
feature_matrix
[5]:
zip_code AVG_TIME_BETWEEN(sessions.session_start) COUNT(sessions) AVG_TIME_BETWEEN(transactions.transaction_time) COUNT(transactions) AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet) AVG_TIME_BETWEEN(sessions.session_start WHERE device = desktop) AVG_TIME_BETWEEN(sessions.session_start WHERE device = mobile) COUNT(sessions WHERE device = tablet) COUNT(sessions WHERE device = desktop) ... AVG_TIME_BETWEEN(transactions.sessions.session_start) AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = desktop) AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = tablet) AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = mobile) AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = desktop) AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = tablet) AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = mobile) COUNT(transactions WHERE sessions.device = desktop) COUNT(transactions WHERE sessions.device = tablet) COUNT(transactions WHERE sessions.device = mobile)
customer_id
5 60091 5577.000000 6 363.333333 79 NaN 9685.0 13942.500000 1 2 ... 357.500000 345.892857 0.000000 796.714286 376.071429 65.000000 809.714286 29 14 36
4 60091 2516.428571 8 168.518519 109 NaN 4127.5 3336.666667 1 3 ... 163.101852 223.108108 0.000000 192.500000 238.918919 65.000000 206.250000 38 18 53
1 60091 3305.714286 8 192.920000 126 8807.5 7150.0 11570.000000 3 2 ... 185.120000 275.000000 419.404762 420.727273 302.500000 442.619048 438.454545 27 43 56
3 13244 5096.000000 6 287.554348 93 NaN 4745.0 NaN 1 4 ... 276.956522 233.360656 0.000000 0.000000 251.475410 65.000000 65.000000 62 15 16
2 13244 4907.500000 7 328.532609 93 5330.0 6890.0 1690.000000 2 3 ... 320.054348 417.575758 197.407407 56.333333 435.303030 226.296296 82.333333 34 28 31

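5 rows × 21 columns

Now, we have several new potentially useful features. Here are two of them that are built off of the where clause “where the device used was a tablet”: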

[6]:
feature_matrix[
    [
        "COUNT(sessions WHERE device = tablet)",
        "AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)",
    ]
]
[6]:
COUNT(sessions WHERE device = tablet) AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)
customer_id
5 1 NaN
4 1 NaN
1 3 8807.5
3 1 NaN
2 2 5330.0

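The first feature, COUNT(sessions WHERE device = tablet), can be understood as indicating how many sessions a customer completed on a tablet.

The second feature, AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet), calculates the average time between those sessions.

We can see that customers who had only 0 or 1 sessions on a tablet have NaN values for the average time between such sessions.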
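The behaviour of these two where-clause features, including the NaN result, can be sketched in plain Python. This is an illustration under stated assumptions rather than the Featuretools implementation: the session rows are made up, the helper names `count_where` and `avg_time_between_where` are hypothetical, and the result is assumed to be expressed in seconds.

```python
import math
from datetime import datetime

# Hypothetical sessions for a single customer: (device, session_start).
sessions = [
    ("tablet", datetime(2014, 1, 1, 0, 0, 0)),
    ("desktop", datetime(2014, 1, 1, 1, 0, 0)),
    ("tablet", datetime(2014, 1, 1, 2, 0, 0)),
    ("tablet", datetime(2014, 1, 1, 5, 0, 0)),
    ("mobile", datetime(2014, 1, 1, 6, 0, 0)),
]


def count_where(rows, device):
    """Count the rows that pass the filter `device == <value>`."""
    return sum(1 for d, _ in rows if d == device)


def avg_time_between_where(rows, device):
    """Mean gap in seconds between consecutive filtered timestamps."""
    times = sorted(t for d, t in rows if d == device)
    if len(times) < 2:
        # With 0 or 1 matching sessions there is no gap to average.
        return math.nan
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
    return sum(gaps) / len(gaps)


print(count_where(sessions, "tablet"), avg_time_between_where(sessions, "tablet"))
print(count_where(sessions, "mobile"), avg_time_between_where(sessions, "mobile"))
```

The filter is applied first and the aggregation runs only on the rows that remain, which is why a customer with a single tablet session has a count of 1 but no defined average gap.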

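Encoding categorical features

Machine learning algorithms typically expect all data to be numeric or to have a defined numeric representation, such as boolean values corresponding to 0 and 1. When Deep Feature Synthesis generates categorical features, we can encode them using Featuretools.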

[7]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mode"],
    trans_primitives=["time_since"],
    max_depth=1,
)

feature_matrix
[7]:
zip_code MODE(sessions.device) TIME_SINCE(birthday) TIME_SINCE(join_date)
customer_id
5 60091 mobile 1.255893e+09 4.363689e+08
4 60091 mobile 5.601134e+08 4.134201e+08
1 60091 mobile 9.412238e+08 4.126761e+08
3 13244 desktop 6.463406e+08 4.024632e+08
2 13244 desktop 1.191006e+09 3.811807e+08

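This feature matrix contains two columns that are categorical in nature: zip_code and MODE(sessions.device). We can use the feature matrix and feature definitions to encode these categorical values into boolean values. Featuretools provides functionality to apply one-hot encoding to the output of DFS.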

[8]:
feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)
feature_matrix_enc
[8]:
TIME_SINCE(birthday) TIME_SINCE(join_date) zip_code = 60091 zip_code = 13244 zip_code is unknown MODE(sessions.device) = mobile MODE(sessions.device) = desktop MODE(sessions.device) is unknown
customer_id
5 1.255893e+09 4.363689e+08 True False False True False False
4 5.601134e+08 4.134201e+08 True False False True False False
1 9.412238e+08 4.126761e+08 True False False True False False
3 6.463406e+08 4.024632e+08 False True False False True False
2 1.191006e+09 3.811807e+08 False True False False True False

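The returned feature matrix is now encoded in a way that is interpretable to machine learning algorithms. Notice how the columns that did not need encoding are still included. Additionally, we get a new set of feature definitions that contain the encoded values.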
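As a rough illustration of the column layout produced by one-hot encoding (one boolean column per category plus an "is unknown" column), here is a plain-Python sketch. The helper `one_hot` is hypothetical, and ordering categories by first appearance is a choice made for this sketch, not a statement about how Featuretools orders them.

```python
def one_hot(values, name):
    """Return (columns, categories) for a single categorical column."""
    # Collect the distinct non-missing values in order of first appearance.
    categories = list(dict.fromkeys(v for v in values if v is not None))
    # One boolean column per category, named like the encoded features above.
    columns = {f"{name} = {c}": [v == c for v in values] for c in categories}
    # Anything outside the known categories (including missing values) is unknown.
    columns[f"{name} is unknown"] = [v not in categories for v in values]
    return columns, categories


# Values copied from the MODE(sessions.device) column shown earlier.
devices = ["mobile", "mobile", "mobile", "desktop", "desktop"]
encoded, cats = one_hot(devices, "MODE(sessions.device)")

for column, flags in encoded.items():
    print(column, flags)
```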

[9]:
features_enc
[9]:
[<Feature: zip_code = 60091>,
 <Feature: zip_code = 13244>,
 <Feature: zip_code is unknown>,
 <Feature: MODE(sessions.device) = mobile>,
 <Feature: MODE(sessions.device) = desktop>,
 <Feature: MODE(sessions.device) is unknown>,
 <Feature: TIME_SINCE(birthday)>,
 <Feature: TIME_SINCE(join_date)>]

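These features can be used to calculate the same encoded values on new data. For more information on feature engineering in production, read the Deployment guide.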
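The reason reusing the encoded feature definitions matters is that the set of categories is fixed at encoding time, so new data yields the same columns even when it contains values that were not seen before. The following plain-Python sketch illustrates that idea only; `apply_encoding` is a hypothetical helper, the new values are made up, and treating unseen or missing values as "unknown" is an assumption of this sketch.

```python
def apply_encoding(values, name, categories):
    """Encode `values` using a fixed, previously learned list of categories."""
    columns = {f"{name} = {c}": [v == c for v in values] for c in categories}
    # Values that are missing or were never seen before fall into "is unknown".
    columns[f"{name} is unknown"] = [v not in categories for v in values]
    return columns


# Categories taken from the encoded feature definitions shown above.
train_categories = ["mobile", "desktop"]

# Hypothetical new data: "tablet" was not a category at encoding time.
new_devices = ["desktop", "tablet", None]
new_encoded = apply_encoding(new_devices, "MODE(sessions.device)", train_categories)

for column, flags in new_encoded.items():
    print(column, flags)
```

Because the column names depend only on the stored categories, a model trained on the original encoded matrix sees the same columns for the new rows.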