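What is Featuretools?
Featuretools is a framework for automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning.
5 Minute Quick Start
The example below uses Deep Feature Synthesis (DFS) to perform automated feature engineering. Here we apply DFS to a multi-table dataset of timestamped customer transactions.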
[1]:
import featuretools as ft
Load Mock Data
[2]:
data = ft.demo.load_mock_customer()
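Prepare data
In this mock dataset, there are 3 DataFrames.
customers: unique customers who had sessions
sessions: unique sessions and their associated attributes
transactions: the events that occurred within each session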
[3]:
customers_df = data["customers"]
customers_df
[3]:
| | customer_id | zip_code | join_date | birthday |
|---|---|---|---|---|
| 0 | 1 | 60091 | 2011-04-17 10:48:33 | 1994-07-18 |
| 1 | 2 | 13244 | 2012-04-15 23:31:04 | 1986-08-18 |
| 2 | 3 | 13244 | 2011-08-13 15:42:34 | 2003-11-21 |
| 3 | 4 | 60091 | 2011-04-08 20:08:14 | 2006-08-15 |
| 4 | 5 | 60091 | 2010-07-17 05:27:50 | 1984-07-28 |
[4]:
sessions_df = data["sessions"]
sessions_df.sample(5)
[4]:
| | session_id | customer_id | device | session_start |
|---|---|---|---|---|
| 13 | 14 | 1 | tablet | 2014-01-01 03:28:00 |
| 6 | 7 | 3 | tablet | 2014-01-01 01:39:40 |
| 1 | 2 | 5 | mobile | 2014-01-01 00:17:20 |
| 28 | 29 | 1 | mobile | 2014-01-01 07:10:05 |
| 24 | 25 | 3 | desktop | 2014-01-01 05:59:40 |
[5]:
transactions_df = data["transactions"]
transactions_df.sample(5)
[5]:
| | transaction_id | session_id | transaction_time | product_id | amount |
|---|---|---|---|---|---|
| 74 | 417 | 5 | 2014-01-01 01:20:10 | 1 | 139.20 |
| 231 | 229 | 17 | 2014-01-01 04:10:15 | 2 | 90.79 |
| 434 | 127 | 31 | 2014-01-01 07:50:10 | 3 | 62.35 |
| 420 | 359 | 30 | 2014-01-01 07:35:00 | 3 | 72.70 |
| 54 | 249 | 4 | 2014-01-01 00:58:30 | 4 | 43.59 |
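First, we specify a dictionary containing all the DataFrames in our dataset. Each value is a tuple of the DataFrame, the name of its index column, and, if one exists, the name of its time index column.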
[6]:
dataframes = {
"customers": (customers_df, "customer_id"),
"sessions": (sessions_df, "session_id", "session_start"),
"transactions": (transactions_df, "transaction_id", "transaction_time"),
}
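Second, we specify how the DataFrames are related. When two DataFrames have a one-to-many relationship, we call the "one" DataFrame the "parent DataFrame". A relationship between a parent and a child is defined like this: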
(parent_dataframe, parent_column, child_dataframe, child_column)
In this dataset, we have two relationships:
[7]:
relationships = [
("sessions", "session_id", "transactions", "session_id"),
("customers", "customer_id", "sessions", "customer_id"),
]
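A relationship tuple only makes sense if every value in the child column also appears in the parent column. The sketch below checks this with plain Python on a few made-up rows. The `find_orphans` helper and the sample rows are hypothetical and exist only for illustration; they are not part of Featuretools.

```python
# Illustrative rows only; the real data comes from ft.demo.load_mock_customer().
tables = {
    "customers": [{"customer_id": 1}, {"customer_id": 2}],
    "sessions": [
        {"session_id": 1, "customer_id": 2},
        {"session_id": 2, "customer_id": 1},
        {"session_id": 3, "customer_id": 9},  # deliberately broken reference
    ],
}


def find_orphans(tables, relationship):
    """Return child-column values that have no matching parent row."""
    parent_name, parent_col, child_name, child_col = relationship
    parent_keys = {row[parent_col] for row in tables[parent_name]}
    child_keys = {row[child_col] for row in tables[child_name]}
    return sorted(child_keys - parent_keys)


orphans = find_orphans(
    tables, ("customers", "customer_id", "sessions", "customer_id")
)
print(orphans)  # [9]
```

Running a check like this before calling DFS can surface broken keys early, since child rows without a parent cannot contribute to any parent-level aggregation.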
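Note
To manage the setup of DataFrames and relationships, we recommend using the EntitySet class, which offers convenient APIs for managing data like this. See Representing Data with EntitySets for more information.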
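Run Deep Feature Synthesis
The minimal input to DFS is a dictionary of DataFrames, a list of relationships, and the name of the target DataFrame whose features we want to calculate. The output of DFS is a feature matrix and the corresponding list of feature definitions.
Let's first create a feature matrix for each customer in the data: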
[8]:
feature_matrix_customers, features_defs = ft.dfs(
dataframes=dataframes,
relationships=relationships,
target_dataframe_name="customers",
)
feature_matrix_customers
[8]:
| customer_id | zip_code | COUNT(sessions) | MODE(sessions.device) | NUM_UNIQUE(sessions.device) | COUNT(transactions) | MAX(transactions.amount) | MEAN(transactions.amount) | MIN(transactions.amount) | MODE(transactions.product_id) | NUM_UNIQUE(transactions.product_id) | ... | STD(sessions.SKEW(transactions.amount)) | STD(sessions.SUM(transactions.amount)) | SUM(sessions.MAX(transactions.amount)) | SUM(sessions.MEAN(transactions.amount)) | SUM(sessions.MIN(transactions.amount)) | SUM(sessions.NUM_UNIQUE(transactions.product_id)) | SUM(sessions.SKEW(transactions.amount)) | SUM(sessions.STD(transactions.amount)) | MODE(transactions.sessions.device) | NUM_UNIQUE(transactions.sessions.device) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60091 | 8 | mobile | 3 | 126 | 139.43 | 71.631905 | 5.81 | 4 | 5 | ... | 0.589386 | 279.510713 | 1057.97 | 582.193117 | 78.59 | 40.0 | -0.476122 | 312.745952 | mobile | 3 |
| 2 | 13244 | 7 | desktop | 3 | 93 | 146.81 | 77.422366 | 8.73 | 4 | 5 | ... | 0.509798 | 251.609234 | 931.63 | 548.905851 | 154.60 | 35.0 | -0.277640 | 258.700528 | desktop | 3 |
| 3 | 13244 | 6 | desktop | 3 | 93 | 149.15 | 67.060430 | 5.89 | 1 | 5 | ... | 0.429374 | 219.021420 | 847.63 | 405.237462 | 66.21 | 29.0 | 2.286086 | 257.299895 | desktop | 3 |
| 4 | 60091 | 8 | mobile | 3 | 109 | 149.95 | 80.070459 | 5.73 | 2 | 5 | ... | 0.387884 | 235.992478 | 1157.99 | 649.657515 | 131.51 | 37.0 | 0.002764 | 356.125829 | mobile | 3 |
| 5 | 60091 | 6 | mobile | 3 | 79 | 149.02 | 80.375443 | 7.55 | 5 | 5 | ... | 0.415426 | 402.775486 | 839.76 | 472.231119 | 86.49 | 30.0 | 0.014384 | 259.873954 | mobile | 3 |

5 rows × 75 columns
We now have dozens of new features that describe each customer's behavior.
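The column names in the feature matrix describe how each value is computed. An aggregation primitive such as COUNT or MEAN summarizes the child rows of each customer, and primitives can be stacked across relationships, as in SUM(sessions.MAX(transactions.amount)). The plain-Python sketch below reproduces three such features on made-up rows. It is meant only to illustrate the naming and is not how Featuretools computes them internally.

```python
from collections import defaultdict
from statistics import mean

# Illustrative rows only, not taken from the mock dataset.
sessions = [(1, 1), (2, 1), (3, 2)]  # (session_id, customer_id)
transactions = [  # (session_id, amount)
    (1, 10.0),
    (1, 30.0),
    (2, 20.0),
    (3, 5.0),
    (3, 15.0),
]

# Group transaction amounts by their parent session.
amounts_by_session = defaultdict(list)
for session_id, amount in transactions:
    amounts_by_session[session_id].append(amount)

# Group session ids by their parent customer.
sessions_by_customer = defaultdict(list)
for session_id, customer_id in sessions:
    sessions_by_customer[customer_id].append(session_id)

features = {}
for customer_id, session_ids in sessions_by_customer.items():
    all_amounts = [a for s in session_ids for a in amounts_by_session[s]]
    features[customer_id] = {
        # Depth 1: aggregate direct children (sessions).
        "COUNT(sessions)": len(session_ids),
        # Depth 1: aggregate all transactions reachable from the customer.
        "MEAN(transactions.amount)": mean(all_amounts),
        # Depth 2: MAX per session first, then SUM over the customer's sessions.
        "SUM(sessions.MAX(transactions.amount))": sum(
            max(amounts_by_session[s]) for s in session_ids
        ),
    }

for customer_id, row in sorted(features.items()):
    print(customer_id, row)
```

Reading a feature name from the inside out gives the order of operations: the innermost primitive is applied first, closest to the raw data.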
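Change target dataframe
One of the reasons DFS is so powerful is that it can create a feature matrix for any DataFrame in our dataset. For example, we can build features for sessions instead: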
[10]:
feature_matrix_sessions, features_defs = ft.dfs(
dataframes=dataframes, relationships=relationships, target_dataframe_name="sessions"
)
feature_matrix_sessions.head(5)
[10]:
| session_id | customer_id | device | COUNT(transactions) | MAX(transactions.amount) | MEAN(transactions.amount) | MIN(transactions.amount) | MODE(transactions.product_id) | NUM_UNIQUE(transactions.product_id) | SKEW(transactions.amount) | STD(transactions.amount) | ... | customers.STD(transactions.amount) | customers.SUM(transactions.amount) | customers.DAY(birthday) | customers.DAY(join_date) | customers.MONTH(birthday) | customers.MONTH(join_date) | customers.WEEKDAY(birthday) | customers.WEEKDAY(join_date) | customers.YEAR(birthday) | customers.YEAR(join_date) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | desktop | 16 | 141.66 | 76.813125 | 20.91 | 3 | 5 | 0.295458 | 41.600976 | ... | 37.705178 | 7200.28 | 18 | 15 | 8 | 4 | 0 | 6 | 1986 | 2012 |
| 2 | 5 | mobile | 10 | 135.25 | 74.696000 | 9.32 | 5 | 5 | -0.160550 | 45.893591 | ... | 44.095630 | 6349.66 | 28 | 17 | 7 | 7 | 5 | 5 | 1984 | 2010 |
| 3 | 4 | mobile | 15 | 147.73 | 88.600000 | 8.70 | 1 | 5 | -0.324012 | 46.240016 | ... | 45.068765 | 8727.68 | 15 | 8 | 8 | 4 | 1 | 4 | 2006 | 2011 |
| 4 | 1 | mobile | 25 | 129.00 | 64.557200 | 6.29 | 5 | 5 | 0.234349 | 40.187205 | ... | 40.442059 | 9025.62 | 18 | 17 | 7 | 4 | 0 | 6 | 1994 | 2011 |
| 5 | 4 | mobile | 11 | 139.20 | 70.638182 | 7.43 | 5 | 5 | 0.336381 | 48.918663 | ... | 45.068765 | 8727.68 | 15 | 8 | 8 | 4 | 1 | 4 | 2006 | 2011 |

5 rows × 44 columns
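Columns prefixed with `customers.` hold values looked up from each session's parent customer, for example a transform such as YEAR applied to join_date. The sketch below traces this for session 1, which belongs to customer 2 in the tables above. The dictionary layout and the `parent_features` helper are assumptions made for illustration, not Featuretools APIs.

```python
from datetime import datetime

# Customer 2's join_date, as shown in the customers table above.
customers = {2: {"join_date": datetime(2012, 4, 15, 23, 31, 4)}}
# Session 1 belongs to customer 2, as shown in the sessions feature matrix.
sessions = {1: {"customer_id": 2}}


def parent_features(session_id):
    """Apply date transforms to the join_date of the session's parent customer."""
    customer_id = sessions[session_id]["customer_id"]
    join_date = customers[customer_id]["join_date"]
    return {
        "customers.YEAR(join_date)": join_date.year,
        "customers.MONTH(join_date)": join_date.month,
        "customers.DAY(join_date)": join_date.day,
        # datetime.weekday() numbers Monday as 0 and Sunday as 6.
        "customers.WEEKDAY(join_date)": join_date.weekday(),
    }


row = parent_features(1)
print(row)
```

The resulting values (2012, 4, 15, and 6) agree with the join_date columns in the first row of the sessions feature matrix.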
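Understanding Feature Output
In general, Featuretools refers to generated features by their feature name. To make features easier to understand, Featuretools offers two additional tools, featuretools.graph_feature() and featuretools.describe_feature(), which help explain what a feature is and the steps Featuretools took to generate it. Let's look at this example feature: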
[11]:
feature = features_defs[18]
feature
[11]:
<Feature: MODE(transactions.WEEKDAY(transaction_time))>
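Feature lineage graphs
Feature lineage graphs visually walk through feature generation. Starting from the base data, they show step by step the primitives applied and the intermediate features generated along the way to the final feature.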
[12]:
ft.graph_feature(feature)
[12]:
![Feature lineage graph for MODE(transactions.WEEKDAY(transaction_time)): Step 1 applies the WEEKDAY transform to transaction_time in transactions; the result is grouped by session_id; Step 2 applies the MODE aggregation to produce the feature on the target DataFrame sessions](_images/graphviz-364590668b742a386d1baf58bf941ccd77c28ad8.png)
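The two steps in the lineage graph, a WEEKDAY transform followed by a MODE aggregation grouped by session_id, can be sketched in plain Python. The rows below are made up for illustration, and the code is a conceptual sketch rather than the Featuretools implementation.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Illustrative rows only: (session_id, transaction_time).
transactions = [
    (1, datetime(2014, 1, 1, 0, 0, 0)),
    (1, datetime(2014, 1, 1, 0, 5, 0)),
    (1, datetime(2014, 1, 2, 0, 0, 0)),
    (2, datetime(2014, 1, 3, 0, 0, 0)),
]

# Step 1 (transform): WEEKDAY of each transaction_time, with Monday as 0.
weekdays = [(session_id, ts.weekday()) for session_id, ts in transactions]

# Group the transformed values by session_id.
weekdays_by_session = defaultdict(list)
for session_id, weekday in weekdays:
    weekdays_by_session[session_id].append(weekday)

# Step 2 (aggregation): MODE, the most frequent weekday within each session.
mode_weekday = {
    session_id: Counter(values).most_common(1)[0][0]
    for session_id, values in weekdays_by_session.items()
}
print(mode_weekday)  # {1: 2, 2: 4}
```

Session 1 has two transactions on a Wednesday (2) and one on a Thursday (3), so its mode is 2; session 2 has a single Friday transaction, so its mode is 4.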
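Feature descriptions
Featuretools can also automatically generate English sentence descriptions of features. Feature descriptions help explain what a feature is, and they can be improved further by including manually defined custom definitions. See Generating Feature Descriptions for more details on how to customize automatically generated feature descriptions.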
[13]:
ft.describe_feature(feature)
[13]:
'The most frequently occurring value of the day of the week of the "transaction_time" of all instances of "transactions" for each "session_id" in "sessions".'