Tuning Deep Feature Synthesis
Several parameters can be tuned to change the output of DFS. We will explore them using the following transactions EntitySet.
[1]:
import featuretools as ft
es = ft.demo.load_mock_customer(return_entityset=True)
es
[1]:
Entityset: transactions
DataFrames:
transactions [Rows: 500, Columns: 6]
products [Rows: 5, Columns: 3]
sessions [Rows: 35, Columns: 5]
customers [Rows: 5, Columns: 5]
Relationships:
transactions.product_id -> products.product_id
transactions.session_id -> sessions.session_id
sessions.customer_id -> customers.customer_id
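Using “Seed Features”
Seed features are manually defined, problem-specific features that a user provides to DFS. Deep Feature Synthesis then automatically stacks new features on top of them wherever it can.
Seed features let us bring domain-specific knowledge into automated feature engineering. For the seed feature below, the domain knowledge might be that, for a particular retailer, a transaction above $125 counts as an expensive purchase.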
[2]:
expensive_purchase = ft.Feature(es["transactions"].ww["amount"]) > 125
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_dataframe_name="customers",
agg_primitives=["percent_true"],
seed_features=[expensive_purchase],
)
feature_matrix[["PERCENT_TRUE(transactions.amount > 125)"]]
[2]:
customer_id | PERCENT_TRUE(transactions.amount > 125)
---|---
5 | 0.227848
4 | 0.220183
1 | 0.119048
3 | 0.182796
2 | 0.129032
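We can now see that the PERCENT_TRUE primitive was automatically applied to the boolean expensive_purchase feature from the transactions table. The resulting feature can be read as the share of each customer's transactions that are considered expensive, expressed as a fraction between 0 and 1 (for example, 0.227848 is roughly 22.8%).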
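To make the meaning of this feature concrete, here is a minimal plain-Python sketch of the same calculation. It is not how Featuretools implements the primitive; the transactions list is hypothetical data, and the helper name percent_true_by_customer is made up for illustration.

```python
from collections import defaultdict

# Hypothetical (customer_id, amount) pairs; these are NOT the demo dataset.
transactions = [
    (1, 50.0), (1, 130.0), (1, 20.0), (1, 200.0),
    (2, 10.0), (2, 126.0),
    (3, 99.0),
]


def percent_true_by_customer(rows, threshold=125):
    """Return the fraction of each customer's amounts above the threshold."""
    flags = defaultdict(list)
    for customer_id, amount in rows:
        # Seed feature: one boolean per transaction.
        flags[customer_id].append(amount > threshold)
    # Aggregation: the mean of the booleans for each customer.
    return {cid: sum(vals) / len(vals) for cid, vals in flags.items()}


result = percent_true_by_customer(transactions)
print(result)
```

The seed feature supplies the per-transaction boolean, and the aggregation primitive only has to average it per customer. This is why a single manually defined feature is enough for DFS to build on.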
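Adding “Interesting” Values to Columns
Sometimes we want to create features that are conditioned on a second value before the calculation is performed. We call this extra filter a “where clause”. In Deep Feature Synthesis, where clauses are used by including primitives in the where_primitives parameter of DFS.
By default, where clauses are built using the interesting_values of a column.
Interesting values can be determined automatically and added for each DataFrame in a pandas EntitySet by calling es.add_interesting_values(). When the values are already known, they can instead be set for individual columns through the dataframe_name and values parameters: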
[3]:
values_dict = {"device": ["desktop", "mobile", "tablet"]}
es.add_interesting_values(dataframe_name="sessions", values=values_dict)
Interesting values are stored in the DataFrame's Woodwork typing information.
[4]:
es["sessions"].ww.columns["device"].metadata
[4]:
{'dataframe_name': 'sessions',
'entityset_id': 'transactions',
'interesting_values': ['desktop', 'mobile', 'tablet']}
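One way to picture how interesting values become where clauses is to pair every where primitive with every interesting value of a column. The sketch below only builds display names in the style that appears in DFS output; the function conditional_feature_names is hypothetical and is not part of Featuretools.

```python
# Interesting values as set on the sessions DataFrame above.
interesting_values = {"device": ["desktop", "mobile", "tablet"]}
# Primitives we would like to combine with where clauses (illustrative).
where_primitives = ["count"]


def conditional_feature_names(dataframe_name, primitives, values_by_column):
    """Build names such as 'COUNT(sessions WHERE device = tablet)'."""
    names = []
    for primitive in primitives:
        for column, values in values_by_column.items():
            for value in values:
                names.append(
                    f"{primitive.upper()}({dataframe_name} WHERE {column} = {value})"
                )
    return names


names = conditional_feature_names("sessions", where_primitives, interesting_values)
for name in names:
    print(name)
```

Each additional interesting value adds one conditional feature per where primitive, so keeping the list of interesting values short keeps the feature matrix manageable.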
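Now that interesting values are set for the device column in the sessions table, we can use the where_primitives parameter of DFS to specify which aggregation primitives should be applied with where clauses.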
[5]:
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_dataframe_name="customers",
agg_primitives=["count", "avg_time_between"],
where_primitives=["count", "avg_time_between"],
trans_primitives=[],
)
feature_matrix
[5]:
customer_id | zip_code | AVG_TIME_BETWEEN(sessions.session_start) | COUNT(sessions) | AVG_TIME_BETWEEN(transactions.transaction_time) | COUNT(transactions) | AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet) | AVG_TIME_BETWEEN(sessions.session_start WHERE device = desktop) | AVG_TIME_BETWEEN(sessions.session_start WHERE device = mobile) | COUNT(sessions WHERE device = tablet) | COUNT(sessions WHERE device = desktop) | ... | AVG_TIME_BETWEEN(transactions.sessions.session_start) | AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = desktop) | AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = tablet) | AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = mobile) | AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = desktop) | AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = tablet) | AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = mobile) | COUNT(transactions WHERE sessions.device = desktop) | COUNT(transactions WHERE sessions.device = tablet) | COUNT(transactions WHERE sessions.device = mobile)
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
5 | 60091 | 5577.000000 | 6 | 363.333333 | 79 | NaN | 9685.0 | 13942.500000 | 1 | 2 | ... | 357.500000 | 345.892857 | 0.000000 | 796.714286 | 376.071429 | 65.000000 | 809.714286 | 29 | 14 | 36
4 | 60091 | 2516.428571 | 8 | 168.518519 | 109 | NaN | 4127.5 | 3336.666667 | 1 | 3 | ... | 163.101852 | 223.108108 | 0.000000 | 192.500000 | 238.918919 | 65.000000 | 206.250000 | 38 | 18 | 53
1 | 60091 | 3305.714286 | 8 | 192.920000 | 126 | 8807.5 | 7150.0 | 11570.000000 | 3 | 2 | ... | 185.120000 | 275.000000 | 419.404762 | 420.727273 | 302.500000 | 442.619048 | 438.454545 | 27 | 43 | 56
3 | 13244 | 5096.000000 | 6 | 287.554348 | 93 | NaN | 4745.0 | NaN | 1 | 4 | ... | 276.956522 | 233.360656 | 0.000000 | 0.000000 | 251.475410 | 65.000000 | 65.000000 | 62 | 15 | 16
2 | 13244 | 4907.500000 | 7 | 328.532609 | 93 | 5330.0 | 6890.0 | 1690.000000 | 2 | 3 | ... | 320.054348 | 417.575758 | 197.407407 | 56.333333 | 435.303030 | 226.296296 | 82.333333 | 34 | 28 | 31
5 rows × 21 columns
We now have several new, potentially useful features. Here are two of them, built on the where clause “the device used was a tablet”:
[6]:
feature_matrix[
[
"COUNT(sessions WHERE device = tablet)",
"AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)",
]
]
[6]:
customer_id | COUNT(sessions WHERE device = tablet) | AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)
---|---|---
5 | 1 | NaN
4 | 1 | NaN
1 | 3 | 8807.5
3 | 1 | NaN
2 | 2 | 5330.0
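The first feature, COUNT(sessions WHERE device = tablet), can be read as the number of sessions a customer completed on a tablet.
The second feature, AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet), computes the average time between those sessions.
We can see that customers with only 0 or 1 tablet sessions have a NaN value for the average time between sessions, because at least two sessions are needed to measure a gap.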
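The two conditional features can be imitated in plain Python to show where the NaN values come from. This is a sketch on hypothetical session data, not the Featuretools implementation; count_where and avg_time_between_where are illustrative names, and the result is assumed to be expressed in seconds.

```python
from datetime import datetime
from math import isnan

# Hypothetical sessions: (customer_id, device, session_start).
sessions = [
    (1, "tablet", datetime(2014, 1, 1, 0, 0, 0)),
    (1, "tablet", datetime(2014, 1, 1, 1, 0, 0)),
    (1, "tablet", datetime(2014, 1, 1, 3, 0, 0)),
    (1, "mobile", datetime(2014, 1, 1, 4, 0, 0)),
    (2, "tablet", datetime(2014, 1, 1, 0, 30, 0)),
    (3, "desktop", datetime(2014, 1, 1, 2, 0, 0)),
]


def count_where(rows, customer_id, device):
    """Count a customer's sessions that match the where clause."""
    return sum(1 for cid, dev, _ in rows if cid == customer_id and dev == device)


def avg_time_between_where(rows, customer_id, device):
    """Mean gap in seconds between consecutive matching sessions, or NaN."""
    starts = sorted(
        start for cid, dev, start in rows if cid == customer_id and dev == device
    )
    if len(starts) < 2:
        # With fewer than two sessions there is no gap to average.
        return float("nan")
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(starts, starts[1:])]
    return sum(gaps) / len(gaps)


for customer_id in (1, 2, 3):
    print(
        customer_id,
        count_where(sessions, customer_id, "tablet"),
        avg_time_between_where(sessions, customer_id, "tablet"),
    )
```

The filter is applied before the aggregation, so a customer with many sessions overall can still end up with NaN if fewer than two of them match the where clause.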
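Encoding Categorical Features
Machine learning algorithms typically expect all data to be numeric, or to have a well-defined numeric representation (such as boolean values corresponding to 0 and 1). When Deep Feature Synthesis generates categorical features, we can encode them using Featuretools.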
[7]:
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_dataframe_name="customers",
agg_primitives=["mode"],
trans_primitives=["time_since"],
max_depth=1,
)
feature_matrix
[7]:
customer_id | zip_code | MODE(sessions.device) | TIME_SINCE(birthday) | TIME_SINCE(join_date)
---|---|---|---|---
5 | 60091 | mobile | 1.255893e+09 | 4.363689e+08
4 | 60091 | mobile | 5.601134e+08 | 4.134201e+08
1 | 60091 | mobile | 9.412238e+08 | 4.126761e+08
3 | 13244 | desktop | 6.463406e+08 | 4.024632e+08
2 | 13244 | desktop | 1.191006e+09 | 3.811807e+08
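This feature matrix contains two columns that are categorical in nature: zip_code and MODE(sessions.device). We can use the feature matrix and the feature definitions to encode these categorical values as booleans. Featuretools provides functionality to apply one-hot encoding to the output of DFS.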
[8]:
feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)
feature_matrix_enc
[8]:
customer_id | TIME_SINCE(birthday) | TIME_SINCE(join_date) | zip_code = 60091 | zip_code = 13244 | zip_code is unknown | MODE(sessions.device) = mobile | MODE(sessions.device) = desktop | MODE(sessions.device) is unknown
---|---|---|---|---|---|---|---|---
5 | 1.255893e+09 | 4.363689e+08 | True | False | False | True | False | False
4 | 5.601134e+08 | 4.134201e+08 | True | False | False | True | False | False
1 | 9.412238e+08 | 4.126761e+08 | True | False | False | True | False | False
3 | 6.463406e+08 | 4.024632e+08 | False | True | False | False | True | False
2 | 1.191006e+09 | 3.811807e+08 | False | True | False | False | True | False
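The returned feature matrix is now encoded in a way that machine learning algorithms can interpret. Note that columns that did not need encoding are still included. In addition, we get back a new set of feature definitions that contain the encoded values.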
[9]:
features_enc
[9]:
[<Feature: zip_code = 60091>,
<Feature: zip_code = 13244>,
<Feature: zip_code is unknown>,
<Feature: MODE(sessions.device) = mobile>,
<Feature: MODE(sessions.device) = desktop>,
<Feature: MODE(sessions.device) is unknown>,
<Feature: TIME_SINCE(birthday)>,
<Feature: TIME_SINCE(join_date)>]
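These features can be used to calculate the same encoded values on new data. For more information on feature engineering in production, read the Deployment guide.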
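The reason for keeping the encoded feature definitions is that the category list must stay fixed when new data arrives. The plain-Python sketch below illustrates the idea under the assumption that a value outside the known categories sets only the “is unknown” column; encode_device is a hypothetical helper, not a Featuretools function.

```python
# Categories fixed when the original feature matrix was encoded.
known_devices = ["mobile", "desktop"]


def encode_device(value, categories=known_devices):
    """One-hot encode a value against a fixed category list plus an unknown flag."""
    encoded = {f"MODE(sessions.device) = {c}": value == c for c in categories}
    # Assumption: anything outside the known categories is flagged as unknown.
    encoded["MODE(sessions.device) is unknown"] = value not in categories
    return encoded


# New data containing one known value and one value not seen before.
new_rows = ["desktop", "tablet"]
encoded_rows = [encode_device(value) for value in new_rows]
for value, row in zip(new_rows, encoded_rows):
    print(value, row)
```

Because the columns are derived from the stored category list rather than from the new data, the encoded matrix keeps the same shape as the one a model was trained on.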