Representing Data with EntitySets#

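An EntitySet is a collection of dataframes and the relationships between them. EntitySets are useful for preparing raw, structured datasets for feature engineering. While many functions in Featuretools take dataframes and relationships as separate arguments, it is recommended to create an EntitySet so that you can more easily manipulate your data as needed.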

The raw data#

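Below are two tables of data (represented as Pandas DataFrames) related to customer transactions. The first is a merge of transactions, sessions, and customers, so the result looks like something you might see in a log file: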

[1]:
import featuretools as ft

data = ft.demo.load_mock_customer()
transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])

transactions_df.sample(10)
[1]:
transaction_id session_id transaction_time product_id amount customer_id device session_start zip_code join_date birthday
264 296 20 2014-01-01 04:46:00 5 53.22 5 desktop 2014-01-01 04:46:00 60091 2010-07-17 05:27:50 1984-07-28
19 74 2 2014-01-01 00:20:35 1 106.99 5 mobile 2014-01-01 00:17:20 60091 2010-07-17 05:27:50 1984-07-28
314 141 23 2014-01-01 05:40:10 5 128.26 3 desktop 2014-01-01 05:32:35 13244 2011-08-13 15:42:34 2003-11-21
290 236 21 2014-01-01 05:14:10 5 57.09 4 desktop 2014-01-01 05:02:15 60091 2011-04-08 20:08:14 2006-08-15
379 292 28 2014-01-01 06:50:35 1 133.71 5 mobile 2014-01-01 06:50:35 60091 2010-07-17 05:27:50 1984-07-28
335 482 25 2014-01-01 06:02:55 1 26.30 3 desktop 2014-01-01 05:59:40 13244 2011-08-13 15:42:34 2003-11-21
293 452 21 2014-01-01 05:17:25 5 69.62 4 desktop 2014-01-01 05:02:15 60091 2011-04-08 20:08:14 2006-08-15
271 169 20 2014-01-01 04:53:35 3 78.87 5 desktop 2014-01-01 04:46:00 60091 2010-07-17 05:27:50 1984-07-28
404 476 29 2014-01-01 07:17:40 4 11.62 1 mobile 2014-01-01 07:10:05 60091 2011-04-17 10:48:33 1994-07-18
179 72 12 2014-01-01 03:13:55 2 143.96 4 desktop 2014-01-01 03:04:10 60091 2011-04-08 20:08:14 2006-08-15

The second dataframe is a list of the products involved in those transactions.

[2]:
products_df = data["products"]
products_df
[2]:
product_id brand
0 1 B
1 2 B
2 3 B
3 4 B
4 5 A

Creating an EntitySet#

First, we initialize an EntitySet. If you'd like to give it a name, you can optionally provide an id to the constructor.

[3]:
es = ft.EntitySet(id="customer_data")

Adding dataframes#

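To get started, we add the transactions dataframe to the EntitySet. In the call to add_dataframe, we specify three important parameters:

  • The index parameter specifies the column that uniquely identifies rows in the dataframe.

  • The time_index parameter tells Featuretools when the data was created.

  • The logical_types parameter indicates that “product_id” should be interpreted as a Categorical column, even though it is just an integer in the underlying data. The same parameter marks “zip_code” as a PostalCode.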

[4]:
from woodwork.logical_types import Categorical, PostalCode

es = es.add_dataframe(
    dataframe_name="transactions",
    dataframe=transactions_df,
    index="transaction_id",
    time_index="transaction_time",
    logical_types={
        "product_id": Categorical,
        "zip_code": PostalCode,
    },
)

es
[4]:
Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 11]
  Relationships:
    No relationships
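
Before calling add_dataframe, it can help to confirm that the columns you plan to pass as index and time_index behave as expected. The following is a minimal pandas-only sketch on a small made-up table; the `toy_df` name and its values are illustrative assumptions, not the mock customer data, and none of this is part of Featuretools itself.

```python
import pandas as pd

# Hypothetical toy data standing in for a transactions table.
toy_df = pd.DataFrame(
    {
        "transaction_id": [1, 2, 3, 4],
        "transaction_time": [
            "2014-01-01 00:00:00",
            "2014-01-01 00:01:05",
            "2014-01-01 00:02:10",
            "2014-01-01 00:03:15",
        ],
        "product_id": [5, 2, 3, 5],
        "amount": [127.64, 109.48, 95.06, 78.92],
    }
)

# An index column must uniquely identify each row.
index_is_unique = toy_df["transaction_id"].is_unique

# A time index column should parse cleanly as datetimes.
# Passing an explicit format avoids per-element format inference.
toy_df["transaction_time"] = pd.to_datetime(
    toy_df["transaction_time"], format="%Y-%m-%d %H:%M:%S"
)
time_is_datetime = pd.api.types.is_datetime64_any_dtype(toy_df["transaction_time"])

# product_id is stored as an integer, but it names a product rather than
# measuring a quantity, which is why a categorical type is the better fit.
product_dtype = toy_df["product_id"].dtype

print(index_is_unique, time_is_datetime, product_dtype)
```

If the uniqueness check fails, the column is not a suitable index for the dataframe.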

Note

You can also use a setter on the EntitySet object to add dataframes:

es["transactions"] = transactions_df

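This uses the default behavior of add_dataframe, notably the following:

  • If the dataframe does not have Woodwork initialized, the first column will be used as the index column.

  • If the dataframe does not have Woodwork initialized, all column types will be inferred by Woodwork.

  • If you need control over the time index column and logical types, initialize Woodwork on the dataframe before adding it.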

Note

You can also display your EntitySet structure graphically by calling EntitySet.plot().

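The add_dataframe method associates each column in the dataframe with a Woodwork logical type. Each logical type can have an associated standard semantic tag that helps define the column's data type. If you don't specify a logical type for a column, it is inferred from the underlying data. The logical types and semantic tags are listed in the schema of the dataframe. For more information on working with logical types and semantic tags, see the Woodwork documentation.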

[5]:
es["transactions"].ww.schema
[5]:
Logical Type Semantic Tag(s)
transaction_id Integer ['index']
session_id Integer ['numeric']
transaction_time Datetime ['time_index']
product_id Categorical ['category']
amount Double ['numeric']
customer_id Integer ['numeric']
device Categorical ['category']
session_start Datetime []
zip_code PostalCode ['category']
join_date Datetime []
birthday Datetime []

Now, we can do the same thing with our products dataframe.

[6]:
es = es.add_dataframe(
    dataframe_name="products", dataframe=products_df, index="product_id"
)

es
[6]:
Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 11]
    products [Rows: 5, Columns: 2]
  Relationships:
    No relationships

With two dataframes in our EntitySet, we can add a relationship between them.

Adding a relationship#

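We want to relate these two dataframes by the column called “product_id” in each dataframe. Each product has multiple transactions associated with it, so the products dataframe is called the parent dataframe, while the transactions dataframe is known as the child dataframe. When specifying a relationship, we need four parameters: the parent dataframe name, the parent column name, the child dataframe name, and the child column name. Note that each relationship must denote a one-to-many relationship rather than one that is one-to-one or many-to-many.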

[7]:
es = es.add_relationship("products", "product_id", "transactions", "product_id")
es
[7]:
Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 11]
    products [Rows: 5, Columns: 2]
  Relationships:
    transactions.product_id -> products.product_id

Now, we see the relationship has been added to our EntitySet.
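
The one-to-many requirement can be sanity-checked with plain pandas before a relationship is added. The sketch below uses small made-up `parent` and `child` tables (hypothetical names and values): the parent key must be unique, while the child column may repeat it. Checking that every child value points at an existing parent row is an additional precaution suggested here, not something the text above states as a Featuretools requirement.

```python
import pandas as pd

# Hypothetical parent (products) and child (transactions) tables.
parent = pd.DataFrame({"product_id": [1, 2, 3], "brand": ["B", "B", "A"]})
child = pd.DataFrame(
    {"transaction_id": [10, 11, 12, 13], "product_id": [1, 1, 3, 2]}
)

# "One" side: each parent key appears exactly once.
parent_key_unique = parent["product_id"].is_unique

# Precaution: every child value references an existing parent row.
child_refs_valid = bool(child["product_id"].isin(parent["product_id"]).all())

# "Many" side: a parent key may appear in several child rows.
transactions_per_product = child.groupby("product_id").size()

print(parent_key_unique, child_refs_valid)
print(transactions_per_product)
```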

Creating a dataframe from an existing table#

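When working with raw data, it is common to have sufficient information to justify the creation of new dataframes. To create a new dataframe and relationship for sessions, we “normalize” the transactions dataframe.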

[8]:
es = es.normalize_dataframe(
    base_dataframe_name="transactions",
    new_dataframe_name="sessions",
    index="session_id",
    make_time_index="session_start",
    additional_columns=[
        "device",
        "customer_id",
        "zip_code",
        "session_start",
        "join_date",
    ],
)
es
[8]:
Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 6]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 6]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id

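Looking at the output above, we see this method did two operations:

  1. It created a new dataframe called “sessions” based on the “session_id” and “session_start” columns in “transactions”.

  2. It added a relationship connecting “transactions” and “sessions”.

If we look at the schema of the transactions dataframe and the new sessions dataframe, we see two more operations that were performed automatically: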

[9]:
es["transactions"].ww.schema
[9]:
Logical Type Semantic Tag(s)
transaction_id Integer ['index']
session_id Integer ['numeric', 'foreign_key']
transaction_time Datetime ['time_index']
product_id Categorical ['foreign_key', 'category']
amount Double ['numeric']
birthday Datetime []
[10]:
es["sessions"].ww.schema
[10]:
Logical Type Semantic Tag(s)
session_id Integer ['index']
device Categorical ['category']
customer_id Integer ['numeric']
zip_code PostalCode ['category']
session_start Datetime ['time_index']
join_date Datetime []
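
  1. It removed “device”, “customer_id”, “zip_code”, and “join_date” from “transactions” and created new columns in the sessions dataframe. This reduces redundant information, because those properties of a session don't change between transactions.

  2. It carried “session_start” over to the new sessions dataframe and marked it as the time index column, indicating when each session began. If the base dataframe has a time index and make_time_index is not set, normalize_dataframe creates a time index for the new dataframe. In this case, it would have created a new time index called “first_transactions_time” using the time of the first transaction of each session. If we don't want this time index to be created, we can set make_time_index=False.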
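
To make the idea of a default time index concrete, the following pandas sketch computes the time of the first transaction in each session on made-up data. It only illustrates the concept described above; it is not how Featuretools implements normalize_dataframe, and the `toy_transactions` table is hypothetical.

```python
import pandas as pd

# Hypothetical transactions belonging to two sessions.
toy_transactions = pd.DataFrame(
    {
        "transaction_id": [1, 2, 3, 4, 5],
        "session_id": [1, 1, 2, 2, 2],
        "transaction_time": pd.to_datetime(
            [
                "2014-01-01 00:00:00",
                "2014-01-01 00:01:05",
                "2014-01-01 00:17:20",
                "2014-01-01 00:18:25",
                "2014-01-01 00:19:30",
            ]
        ),
    }
)

# Earliest transaction time per session, named after the default
# time index mentioned in the text.
first_transaction_time = (
    toy_transactions.groupby("session_id")["transaction_time"]
    .min()
    .rename("first_transactions_time")
)

print(first_transaction_time)
```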

If we look at the dataframes, we can see what normalize_dataframe did to the actual data.

[11]:
es["sessions"].head(5)
[11]:
session_id device customer_id zip_code session_start join_date
1 1 desktop 2 13244 2014-01-01 00:00:00 2012-04-15 23:31:04
2 2 mobile 5 60091 2014-01-01 00:17:20 2010-07-17 05:27:50
3 3 mobile 4 60091 2014-01-01 00:28:10 2011-04-08 20:08:14
4 4 mobile 1 60091 2014-01-01 00:44:25 2011-04-17 10:48:33
5 5 mobile 4 60091 2014-01-01 01:11:30 2011-04-08 20:08:14
[12]:
es["transactions"].head(5)
[12]:
transaction_id session_id transaction_time product_id amount birthday
10 10 1 2014-01-01 00:00:00 5 127.64 1986-08-18
2 2 1 2014-01-01 00:01:05 2 109.48 1986-08-18
438 438 1 2014-01-01 00:02:10 3 95.06 1986-08-18
192 192 1 2014-01-01 00:03:15 4 78.92 1986-08-18
271 271 1 2014-01-01 00:04:20 3 31.54 1986-08-18
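
The split shown above can be approximated in plain pandas, which may help build intuition for what normalization does. This is a sketch under the assumption that session-level columns are constant within a session; the `flat` table is made up, and the code is not Featuretools' implementation.

```python
import pandas as pd

# Hypothetical denormalized table: "device" repeats for every
# transaction in the same session.
flat = pd.DataFrame(
    {
        "transaction_id": [1, 2, 3, 4],
        "session_id": [1, 1, 2, 2],
        "device": ["desktop", "desktop", "mobile", "mobile"],
        "amount": [127.64, 109.48, 95.06, 78.92],
    }
)

# New parent table: one row per session with the session-level columns.
sessions = (
    flat[["session_id", "device"]]
    .drop_duplicates(subset="session_id")
    .reset_index(drop=True)
)

# Child table keeps only the key that points at the parent.
transactions = flat.drop(columns=["device"])

# Joining on the key recovers the original information.
rejoined = (
    transactions.merge(sessions, on="session_id")
    .sort_values("transaction_id")
    .reset_index(drop=True)[list(flat.columns)]
)

print(sessions)
print(transactions)
```

Because nothing is lost in the split, joining the two tables on “session_id” reproduces the original rows.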

To finish preparing this dataset, create a “customers” dataframe using the same method call.

[13]:
es = es.normalize_dataframe(
    base_dataframe_name="sessions",
    new_dataframe_name="customers",
    index="customer_id",
    make_time_index="join_date",
    additional_columns=["zip_code", "join_date"],
)

es
[13]:
Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 6]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 4]
    customers [Rows: 5, Columns: 3]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id

Using the EntitySet#

Finally, we are ready to use this EntitySet with any functionality within Featuretools. For example, let's build a feature matrix for each product in our dataset.

[14]:
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="products")

feature_matrix
[14]:
COUNT(transactions) MAX(transactions.amount) MEAN(transactions.amount) MIN(transactions.amount) SKEW(transactions.amount) STD(transactions.amount) SUM(transactions.amount) MODE(transactions.DAY(birthday)) MODE(transactions.DAY(transaction_time)) MODE(transactions.MONTH(birthday)) ... MODE(transactions.sessions.device) NUM_UNIQUE(transactions.DAY(birthday)) NUM_UNIQUE(transactions.DAY(transaction_time)) NUM_UNIQUE(transactions.MONTH(birthday)) NUM_UNIQUE(transactions.MONTH(transaction_time)) NUM_UNIQUE(transactions.WEEKDAY(birthday)) NUM_UNIQUE(transactions.WEEKDAY(transaction_time)) NUM_UNIQUE(transactions.YEAR(birthday)) NUM_UNIQUE(transactions.YEAR(transaction_time)) NUM_UNIQUE(transactions.sessions.device)
product_id
1 102 149.56 73.429314 6.84 0.125525 42.479989 7489.79 18 1 7 ... desktop 4 1 3 1 4 1 5 1 3
2 92 149.95 76.319891 5.73 0.151934 46.336308 7021.43 18 1 8 ... desktop 4 1 3 1 4 1 5 1 3
3 96 148.31 73.001250 5.89 0.223938 38.871405 7008.12 18 1 8 ... desktop 4 1 3 1 4 1 5 1 3
4 106 146.46 76.311038 5.81 -0.132077 42.492501 8088.97 18 1 7 ... desktop 4 1 3 1 4 1 5 1 3
5 104 149.02 76.264904 5.91 0.098248 42.131902 7931.55 18 1 7 ... mobile 4 1 3 1 4 1 5 1 3

5 rows × 25 columns

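As we can see, the features from DFS use the relational structure of our EntitySet. Therefore, it is important to think carefully about the dataframes that we create.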
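
To see why the relationship structure matters, note that features such as COUNT(transactions) or MEAN(transactions.amount) are aggregations of child rows grouped by the parent key. The pandas sketch below reproduces that idea by hand on a made-up table; the data are hypothetical, and this is an analogy for a few of the aggregation features rather than a description of how DFS computes them.

```python
import pandas as pd

# Hypothetical child table with a key pointing at products.
toy_transactions = pd.DataFrame(
    {
        "transaction_id": [1, 2, 3, 4, 5],
        "product_id": [1, 1, 2, 2, 2],
        "amount": [10.0, 30.0, 5.0, 15.0, 25.0],
    }
)

# Aggregate child rows per parent key, one output row per product.
per_product = toy_transactions.groupby("product_id")["amount"].agg(
    ["count", "mean", "max"]
)

# Rename the columns to mirror the feature names in the matrix above.
per_product.columns = [
    "COUNT(transactions)",
    "MEAN(transactions.amount)",
    "MAX(transactions.amount)",
]

print(per_product)
```

Without the products-to-transactions relationship there would be no key to group on, so none of these per-product features could be defined.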