Representing Data with EntitySets#
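An EntitySet is a collection of dataframes and the relationships between them. EntitySets are useful for preparing raw, structured datasets for feature engineering. While many functions in Featuretools take dataframes and relationships as separate arguments, it is recommended to create an EntitySet so that you can more easily manipulate your data as needed.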
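The Raw Data#
Below are two tables of data (represented as Pandas DataFrames) related to customer transactions. The first is a merge of transactions, sessions, and customers, so the result looks like something you might see in a log file: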
[1]:
import featuretools as ft
data = ft.demo.load_mock_customer()
transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])
transactions_df.sample(10)
[1]:
(index) | transaction_id | session_id | transaction_time | product_id | amount | customer_id | device | session_start | zip_code | join_date | birthday |
---|---|---|---|---|---|---|---|---|---|---|---|
264 | 296 | 20 | 2014-01-01 04:46:00 | 5 | 53.22 | 5 | desktop | 2014-01-01 04:46:00 | 60091 | 2010-07-17 05:27:50 | 1984-07-28 |
19 | 74 | 2 | 2014-01-01 00:20:35 | 1 | 106.99 | 5 | mobile | 2014-01-01 00:17:20 | 60091 | 2010-07-17 05:27:50 | 1984-07-28 |
314 | 141 | 23 | 2014-01-01 05:40:10 | 5 | 128.26 | 3 | desktop | 2014-01-01 05:32:35 | 13244 | 2011-08-13 15:42:34 | 2003-11-21 |
290 | 236 | 21 | 2014-01-01 05:14:10 | 5 | 57.09 | 4 | desktop | 2014-01-01 05:02:15 | 60091 | 2011-04-08 20:08:14 | 2006-08-15 |
379 | 292 | 28 | 2014-01-01 06:50:35 | 1 | 133.71 | 5 | mobile | 2014-01-01 06:50:35 | 60091 | 2010-07-17 05:27:50 | 1984-07-28 |
335 | 482 | 25 | 2014-01-01 06:02:55 | 1 | 26.30 | 3 | desktop | 2014-01-01 05:59:40 | 13244 | 2011-08-13 15:42:34 | 2003-11-21 |
293 | 452 | 21 | 2014-01-01 05:17:25 | 5 | 69.62 | 4 | desktop | 2014-01-01 05:02:15 | 60091 | 2011-04-08 20:08:14 | 2006-08-15 |
271 | 169 | 20 | 2014-01-01 04:53:35 | 3 | 78.87 | 5 | desktop | 2014-01-01 04:46:00 | 60091 | 2010-07-17 05:27:50 | 1984-07-28 |
404 | 476 | 29 | 2014-01-01 07:17:40 | 4 | 11.62 | 1 | mobile | 2014-01-01 07:10:05 | 60091 | 2011-04-17 10:48:33 | 1994-07-18 |
179 | 72 | 12 | 2014-01-01 03:13:55 | 2 | 143.96 | 4 | desktop | 2014-01-01 03:04:10 | 60091 | 2011-04-08 20:08:14 | 2006-08-15 |
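The merge in cell [1] relies on pandas joining tables on the columns they share. A minimal sketch of the same idea follows, using tiny made-up tables (the values are illustrative and are not taken from the mock dataset) and naming the join keys explicitly:

```python
import pandas as pd

# Tiny, made-up stand-ins for the three tables (illustrative values only).
transactions = pd.DataFrame(
    {"transaction_id": [1, 2, 3], "session_id": [10, 10, 11], "amount": [5.0, 7.5, 2.0]}
)
sessions = pd.DataFrame(
    {"session_id": [10, 11], "customer_id": [100, 101], "device": ["desktop", "mobile"]}
)
customers = pd.DataFrame({"customer_id": [100, 101], "zip_code": ["60091", "13244"]})

# Join each child row to its parent; naming the key makes the intent explicit.
log_like = transactions.merge(sessions, on="session_id").merge(customers, on="customer_id")

# Every transaction row now repeats the attributes of its session and customer.
print(log_like)
```

The repeated session and customer values in each row are the redundancy that a denormalized, log-style table carries.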
The second dataframe is a list of the products involved in those transactions.
[2]:
products_df = data["products"]
products_df
[2]:
(index) | product_id | brand |
---|---|---|
0 | 1 | B |
1 | 2 | B |
2 | 3 | B |
3 | 4 | B |
4 | 5 | A |
Creating an EntitySet#
First, we initialize an EntitySet. If you would like to give it a name, you can optionally pass an id to the constructor.
[3]:
es = ft.EntitySet(id="customer_data")
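Adding dataframes#
To get started, we add the transactions dataframe to the EntitySet. In the call to add_dataframe, we specify three important parameters:
The index parameter specifies the column that uniquely identifies rows in the dataframe.
The time_index parameter tells Featuretools when the data was created.
The logical_types parameter indicates that "product_id" should be interpreted as a Categorical column, even though it is just an integer in the underlying data. The call below also types "zip_code" as PostalCode.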
[4]:
from woodwork.logical_types import Categorical, PostalCode
es = es.add_dataframe(
dataframe_name="transactions",
dataframe=transactions_df,
index="transaction_id",
time_index="transaction_time",
logical_types={
"product_id": Categorical,
"zip_code": PostalCode,
},
)
es
[4]:
Entityset: customer_data
DataFrames:
transactions [Rows: 500, Columns: 11]
Relationships:
No relationships
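Before calling add_dataframe, it can help to confirm that the columns you plan to pass match what the three parameters describe. The sketch below uses plain pandas on a small made-up dataframe; it is an illustrative pre-check, not something Featuretools requires you to run, and the variable names are assumptions.

```python
import pandas as pd

# Small made-up dataframe shaped like the transactions table (illustrative values).
df = pd.DataFrame(
    {
        "transaction_id": [1, 2, 3],
        "transaction_time": pd.to_datetime(
            ["2014-01-01 00:00:00", "2014-01-01 00:01:05", "2014-01-01 00:02:10"]
        ),
        "product_id": [5, 2, 5],
    }
)

# index: the column must identify each row uniquely.
index_ok = df["transaction_id"].is_unique

# time_index: here the column already holds parsed datetimes.
time_ok = pd.api.types.is_datetime64_any_dtype(df["transaction_time"])

# logical_types: pandas stores product_id as an integer, so nothing in the
# raw data marks it as a label rather than a quantity.
product_is_integer = pd.api.types.is_integer_dtype(df["product_id"])

print(index_ok, time_ok, product_is_integer)
```

The last check shows why the logical type has to be supplied by hand: the integer dtype alone cannot say that "product_id" is a category.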
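Note
You can also use the setter on the EntitySet object to add dataframes:
es["transactions"] = transactions_df
This uses the default implementation of add_dataframe. In particular:
If the dataframe does not have Woodwork initialized, the first column becomes the index column.
If the dataframe does not have Woodwork initialized, all columns are inferred by Woodwork.
If you need control over the time index column and logical types, initialize Woodwork before adding the dataframe.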
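Note
You can also display the structure of your EntitySet graphically by calling EntitySet.plot().
The add_dataframe method associates each column in the dataframe with a Woodwork logical type. Each logical type can have an associated standard semantic tag that helps define the column's data type. If you don't specify a logical type for a column, it is inferred from the underlying data. The logical types and semantic tags are listed in the dataframe's schema. For more information on working with logical types and semantic tags, see the Woodwork documentation.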
[5]:
es["transactions"].ww.schema
[5]:
Column | Logical Type | Semantic Tag(s) |
---|---|---|
transaction_id | Integer | ['index'] |
session_id | Integer | ['numeric'] |
transaction_time | Datetime | ['time_index'] |
product_id | Categorical | ['category'] |
amount | Double | ['numeric'] |
customer_id | Integer | ['numeric'] |
device | Categorical | ['category'] |
session_start | Datetime | [] |
zip_code | PostalCode | ['category'] |
join_date | Datetime | [] |
birthday | Datetime | [] |
Now, we can do the same thing with our products dataframe.
[6]:
es = es.add_dataframe(
dataframe_name="products", dataframe=products_df, index="product_id"
)
es
[6]:
Entityset: customer_data
DataFrames:
transactions [Rows: 500, Columns: 11]
products [Rows: 5, Columns: 2]
Relationships:
No relationships
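With two dataframes in our EntitySet, we can add a relationship between them.
Adding a Relationship#
We want to relate these two dataframes by the column called "product_id" in each dataframe. Each product has multiple transactions associated with it, so the products dataframe is called the parent dataframe, while the transactions dataframe is known as the child dataframe. When specifying a relationship, we need four parameters: the parent dataframe name, the parent column name, the child dataframe name, and the child column name. Note that each relationship must denote a one-to-many relationship, not a one-to-one or many-to-many relationship.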
[7]:
es = es.add_relationship("products", "product_id", "transactions", "product_id")
es
[7]:
Entityset: customer_data
DataFrames:
transactions [Rows: 500, Columns: 11]
products [Rows: 5, Columns: 2]
Relationships:
transactions.product_id -> products.product_id
Now, we see that the relationship has been added to our EntitySet.
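The one-to-many requirement can be checked on the raw tables before adding a relationship. The sketch below is a plain pandas check on made-up data; the helper is_one_to_many is hypothetical and not part of Featuretools. Its second condition (every child value has a matching parent) is a stricter data-quality assumption of this sketch, not something stated above.

```python
import pandas as pd


def is_one_to_many(parent_df, parent_col, child_df, child_col):
    """Hypothetical helper: True if parent keys are unique and every child key has a parent."""
    parent_keys_unique = parent_df[parent_col].is_unique
    children_have_parent = child_df[child_col].isin(parent_df[parent_col]).all()
    return bool(parent_keys_unique and children_have_parent)


# Made-up tables (illustrative values only).
products = pd.DataFrame({"product_id": [1, 2, 3], "brand": ["B", "B", "A"]})
transactions = pd.DataFrame(
    {"transaction_id": [10, 11, 12, 13], "product_id": [1, 1, 3, 2]}
)

# products is a valid parent: each product_id appears once there.
print(is_one_to_many(products, "product_id", transactions, "product_id"))

# Swapping the roles fails: product_id repeats in transactions,
# so transactions cannot serve as the parent side.
print(is_one_to_many(transactions, "product_id", products, "product_id"))
```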
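Creating a dataframe from an existing table#
When working with raw data, it is common to have enough information to justify creating new dataframes. To create a new dataframe and relationship for sessions, we "normalize" the transactions dataframe.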
[8]:
es = es.normalize_dataframe(
base_dataframe_name="transactions",
new_dataframe_name="sessions",
index="session_id",
make_time_index="session_start",
additional_columns=[
"device",
"customer_id",
"zip_code",
"session_start",
"join_date",
],
)
es
[8]:
Entityset: customer_data
DataFrames:
transactions [Rows: 500, Columns: 6]
products [Rows: 5, Columns: 2]
sessions [Rows: 35, Columns: 6]
Relationships:
transactions.product_id -> products.product_id
transactions.session_id -> sessions.session_id
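Looking at the output above, we see that this method performed two operations:
It created a new dataframe called "sessions" based on the "session_id" and "session_start" columns in "transactions".
It added a relationship connecting "transactions" and "sessions".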
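To make the first operation concrete, here is a rough pandas equivalent of the split on a small made-up table. This is only a sketch of the idea; it is not how normalize_dataframe is implemented, and it does not register any relationship or typing information.

```python
import pandas as pd

# Made-up denormalized table (illustrative values only).
transactions = pd.DataFrame(
    {
        "transaction_id": [1, 2, 3, 4],
        "session_id": [10, 10, 11, 11],
        "amount": [5.0, 7.5, 2.0, 9.0],
        "device": ["desktop", "desktop", "mobile", "mobile"],
        "session_start": pd.to_datetime(
            ["2014-01-01 00:00:00"] * 2 + ["2014-01-01 00:17:20"] * 2
        ),
    }
)

session_cols = ["device", "session_start"]

# New parent table: one row per session_id, carrying the per-session columns.
sessions = (
    transactions[["session_id"] + session_cols]
    .drop_duplicates(subset="session_id")
    .reset_index(drop=True)
)

# Child table: the moved columns are dropped; session_id stays as the link.
transactions_slim = transactions.drop(columns=session_cols)

print(sessions)
print(transactions_slim.columns.tolist())
```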
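If we look at the schemas of the transactions dataframe and the new sessions dataframe, we see two more operations that were performed automatically: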
[9]:
es["transactions"].ww.schema
[9]:
Column | Logical Type | Semantic Tag(s) |
---|---|---|
transaction_id | Integer | ['index'] |
session_id | Integer | ['numeric', 'foreign_key'] |
transaction_time | Datetime | ['time_index'] |
product_id | Categorical | ['foreign_key', 'category'] |
amount | Double | ['numeric'] |
birthday | Datetime | [] |
[10]:
es["sessions"].ww.schema
[10]:
Column | Logical Type | Semantic Tag(s) |
---|---|---|
session_id | Integer | ['index'] |
device | Categorical | ['category'] |
customer_id | Integer | ['numeric'] |
zip_code | PostalCode | ['category'] |
session_start | Datetime | ['time_index'] |
join_date | Datetime | [] |
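It removed "device", "customer_id", "zip_code", and "join_date" from "transactions" and created new columns in the sessions dataframe. This reduces redundant information, because those properties of a session don't change between transactions.
It moved "session_start" into the new sessions dataframe and marked it as the time index column to indicate the beginning of a session. If the base dataframe has a time index and make_time_index is not set, normalize_dataframe will create a time index for the new dataframe. In that case, it would create a new time index called "first_transactions_time" using the time of the first transaction of each session. If we don't want this time index to be created, we can set make_time_index=False.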
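The fallback time index described above is the earliest transaction time within each session. A minimal pandas sketch of that computation on made-up data follows; it illustrates the idea only and is not the library's implementation.

```python
import pandas as pd

# Made-up transactions (illustrative values only).
transactions = pd.DataFrame(
    {
        "transaction_id": [1, 2, 3, 4],
        "session_id": [10, 10, 11, 11],
        "transaction_time": pd.to_datetime(
            [
                "2014-01-01 00:00:00",
                "2014-01-01 00:01:05",
                "2014-01-01 00:17:20",
                "2014-01-01 00:18:25",
            ]
        ),
    }
)

# Earliest transaction time per session; the column name mirrors the one
# mentioned in the text.
first_time = (
    transactions.groupby("session_id")["transaction_time"]
    .min()
    .rename("first_transactions_time")
    .reset_index()
)
print(first_time)
```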
If we look at the dataframes, we can see what normalize_dataframe did to the actual data.
[11]:
es["sessions"].head(5)
[11]:
(index) | session_id | device | customer_id | zip_code | session_start | join_date |
---|---|---|---|---|---|---|
1 | 1 | desktop | 2 | 13244 | 2014-01-01 00:00:00 | 2012-04-15 23:31:04 |
2 | 2 | mobile | 5 | 60091 | 2014-01-01 00:17:20 | 2010-07-17 05:27:50 |
3 | 3 | mobile | 4 | 60091 | 2014-01-01 00:28:10 | 2011-04-08 20:08:14 |
4 | 4 | mobile | 1 | 60091 | 2014-01-01 00:44:25 | 2011-04-17 10:48:33 |
5 | 5 | mobile | 4 | 60091 | 2014-01-01 01:11:30 | 2011-04-08 20:08:14 |
[12]:
es["transactions"].head(5)
[12]:
(index) | transaction_id | session_id | transaction_time | product_id | amount | birthday |
---|---|---|---|---|---|---|
10 | 10 | 1 | 2014-01-01 00:00:00 | 5 | 127.64 | 1986-08-18 |
2 | 2 | 1 | 2014-01-01 00:01:05 | 2 | 109.48 | 1986-08-18 |
438 | 438 | 1 | 2014-01-01 00:02:10 | 3 | 95.06 | 1986-08-18 |
192 | 192 | 1 | 2014-01-01 00:03:15 | 4 | 78.92 | 1986-08-18 |
271 | 271 | 1 | 2014-01-01 00:04:20 | 3 | 31.54 | 1986-08-18 |
To finish preparing this dataset, create a "customers" dataframe using the same method call.
[13]:
es = es.normalize_dataframe(
base_dataframe_name="sessions",
new_dataframe_name="customers",
index="customer_id",
make_time_index="join_date",
additional_columns=["zip_code", "join_date"],
)
es
[13]:
Entityset: customer_data
DataFrames:
transactions [Rows: 500, Columns: 6]
products [Rows: 5, Columns: 2]
sessions [Rows: 35, Columns: 4]
customers [Rows: 5, Columns: 3]
Relationships:
transactions.product_id -> products.product_id
transactions.session_id -> sessions.session_id
sessions.customer_id -> customers.customer_id
Using the EntitySet#
Finally, we are ready to use this EntitySet with any functionality in Featuretools. For example, let's build a feature matrix for each product in our dataset.
[14]:
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="products")
feature_matrix
[14]:
product_id | COUNT(transactions) | MAX(transactions.amount) | MEAN(transactions.amount) | MIN(transactions.amount) | SKEW(transactions.amount) | STD(transactions.amount) | SUM(transactions.amount) | MODE(transactions.DAY(birthday)) | MODE(transactions.DAY(transaction_time)) | MODE(transactions.MONTH(birthday)) | ... | MODE(transactions.sessions.device) | NUM_UNIQUE(transactions.DAY(birthday)) | NUM_UNIQUE(transactions.DAY(transaction_time)) | NUM_UNIQUE(transactions.MONTH(birthday)) | NUM_UNIQUE(transactions.MONTH(transaction_time)) | NUM_UNIQUE(transactions.WEEKDAY(birthday)) | NUM_UNIQUE(transactions.WEEKDAY(transaction_time)) | NUM_UNIQUE(transactions.YEAR(birthday)) | NUM_UNIQUE(transactions.YEAR(transaction_time)) | NUM_UNIQUE(transactions.sessions.device) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 102 | 149.56 | 73.429314 | 6.84 | 0.125525 | 42.479989 | 7489.79 | 18 | 1 | 7 | ... | desktop | 4 | 1 | 3 | 1 | 4 | 1 | 5 | 1 | 3 |
2 | 92 | 149.95 | 76.319891 | 5.73 | 0.151934 | 46.336308 | 7021.43 | 18 | 1 | 8 | ... | desktop | 4 | 1 | 3 | 1 | 4 | 1 | 5 | 1 | 3 |
3 | 96 | 148.31 | 73.001250 | 5.89 | 0.223938 | 38.871405 | 7008.12 | 18 | 1 | 8 | ... | desktop | 4 | 1 | 3 | 1 | 4 | 1 | 5 | 1 | 3 |
4 | 106 | 146.46 | 76.311038 | 5.81 | -0.132077 | 42.492501 | 8088.97 | 18 | 1 | 7 | ... | desktop | 4 | 1 | 3 | 1 | 4 | 1 | 5 | 1 | 3 |
5 | 104 | 149.02 | 76.264904 | 5.91 | 0.098248 | 42.131902 | 7931.55 | 18 | 1 | 7 | ... | mobile | 4 | 1 | 3 | 1 | 4 | 1 | 5 | 1 | 3 |
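5 rows × 25 columns
As we can see, the features from DFS use the relational structure of our EntitySet. It is therefore important to think carefully about the dataframes that we create.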
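To see how feature names such as COUNT(transactions), MEAN(transactions.amount), and MODE(transactions.sessions.device) map onto the relationships, the sketch below computes rough pandas analogues on made-up data. The output column names are assumptions chosen for readability, and this is not how DFS computes features internally.

```python
import pandas as pd

# Made-up child and grandparent tables (illustrative values only).
transactions = pd.DataFrame(
    {
        "transaction_id": [1, 2, 3, 4, 5],
        "session_id": [10, 11, 11, 11, 12],
        "product_id": [1, 2, 1, 2, 1],
        "amount": [10.0, 20.0, 30.0, 40.0, 50.0],
    }
)
sessions = pd.DataFrame(
    {"session_id": [10, 11, 12], "device": ["desktop", "mobile", "desktop"]}
)

# Direct child aggregations, analogous to COUNT(transactions)
# and MEAN(transactions.amount) for each product.
per_product = transactions.groupby("product_id").agg(
    count_transactions=("transaction_id", "count"),
    mean_amount=("amount", "mean"),
)

# A two-hop feature, analogous to MODE(transactions.sessions.device):
# follow transactions -> sessions first, then aggregate per product.
with_device = transactions.merge(sessions, on="session_id")
per_product["mode_device"] = with_device.groupby("product_id")["device"].agg(
    lambda s: s.mode().iloc[0]
)
print(per_product)
```

The two-hop feature is only reachable because "transactions" links to "sessions"; without that relationship there would be no path from a product to a device.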