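Specifying Primitive Options
By default, DFS applies primitives across all dataframes and columns. This behavior can be changed through a few different parameters. Dataframes and columns can be ignored or included either for an entire DFS run or on a per-primitive basis, which gives finer control over the generated features and reduces run-time overhead.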
[1]:
import featuretools as ft
from featuretools.tests.testing_utils import make_ecommerce_entityset
es = make_ecommerce_entityset()
features_list = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mode"],
    trans_primitives=["weekday"],
    features_only=True,
)
features_list
[1]:
[<Feature: age>,
<Feature: région_id>,
<Feature: cohort>,
<Feature: loves_ice_cream>,
<Feature: cancel_reason>,
<Feature: engagement_level>,
<Feature: MODE(sessions.device_name)>,
<Feature: MODE(sessions.device_type)>,
<Feature: MODE(log.countrycode)>,
<Feature: MODE(log.priority_level)>,
<Feature: MODE(log.product_id)>,
<Feature: MODE(log.subregioncode)>,
<Feature: MODE(log.zipcode)>,
<Feature: WEEKDAY(birthday)>,
<Feature: WEEKDAY(cancel_date)>,
<Feature: WEEKDAY(signup_date)>,
<Feature: WEEKDAY(upgrade_date)>,
<Feature: cohorts.cohort_name>,
<Feature: régions.language>,
<Feature: MODE(sessions.MODE(log.countrycode))>,
<Feature: MODE(sessions.MODE(log.priority_level))>,
<Feature: MODE(sessions.MODE(log.product_id))>,
<Feature: MODE(sessions.MODE(log.subregioncode))>,
<Feature: MODE(sessions.MODE(log.zipcode))>,
<Feature: MODE(log.sessions.device_name)>,
<Feature: MODE(log.sessions.device_type)>,
<Feature: cohorts.MODE(customers.cancel_reason)>,
<Feature: cohorts.MODE(customers.engagement_level)>,
<Feature: cohorts.MODE(customers.région_id)>,
<Feature: cohorts.MODE(sessions.device_name)>,
<Feature: cohorts.MODE(sessions.device_type)>,
<Feature: cohorts.MODE(log.countrycode)>,
<Feature: cohorts.MODE(log.priority_level)>,
<Feature: cohorts.MODE(log.product_id)>,
<Feature: cohorts.MODE(log.subregioncode)>,
<Feature: cohorts.MODE(log.zipcode)>,
<Feature: cohorts.WEEKDAY(cohort_end)>,
<Feature: régions.MODE(customers.cancel_reason)>,
<Feature: régions.MODE(customers.engagement_level)>,
<Feature: régions.MODE(sessions.device_name)>,
<Feature: régions.MODE(sessions.device_type)>,
<Feature: régions.MODE(log.countrycode)>,
<Feature: régions.MODE(log.priority_level)>,
<Feature: régions.MODE(log.product_id)>,
<Feature: régions.MODE(log.subregioncode)>,
<Feature: régions.MODE(log.zipcode)>]
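Specifying Options for an Entire Run
The ignore_dataframes and ignore_columns parameters of DFS control which dataframes and columns should be ignored by all primitives. This is useful for ignoring columns or dataframes that are not relevant to the problem or that should otherwise not be included in the DFS run.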
[2]:
# ignore the 'log' and 'cohorts' dataframes entirely
# ignore the 'birthday' column in 'customers' and the 'device_name' column in 'sessions'
features_list = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mode"],
    trans_primitives=["weekday"],
    ignore_dataframes=["log", "cohorts"],
    ignore_columns={"sessions": ["device_name"], "customers": ["birthday"]},
    features_only=True,
)
features_list
[2]:
[<Feature: age>,
<Feature: région_id>,
<Feature: cohort>,
<Feature: loves_ice_cream>,
<Feature: cancel_reason>,
<Feature: engagement_level>,
<Feature: MODE(sessions.device_type)>,
<Feature: WEEKDAY(cancel_date)>,
<Feature: WEEKDAY(signup_date)>,
<Feature: WEEKDAY(upgrade_date)>,
<Feature: régions.language>,
<Feature: régions.MODE(customers.cancel_reason)>,
<Feature: régions.MODE(customers.engagement_level)>,
<Feature: régions.MODE(sessions.device_type)>]
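DFS completely ignores the log and cohorts dataframes when creating features. It also ignores the columns device_name and birthday in sessions and customers, respectively. However, both of these options can be overridden by individual primitive options in the primitive_options parameter.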
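As a quick sanity check on the two listings above, one could scan feature names for references to the ignored dataframes and columns. The sketch below works on plain strings copied from those listings. The helper references_ignored is hypothetical and inspects names only; it is not part of Featuretools and knows nothing about how DFS applies the options internally.

```python
import re


def references_ignored(name, ignore_dataframes, ignore_columns, target):
    """Return True if a feature name mentions an ignored dataframe or column."""
    for df in ignore_dataframes:
        # Dataframes appear in names as a "df." prefix, e.g. "log.zipcode".
        if re.search(rf"(?<!\w){re.escape(df)}\.", name):
            return True
    for df, columns in ignore_columns.items():
        for col in columns:
            qualified = rf"(?<!\w){re.escape(df)}\.{re.escape(col)}(?!\w)"
            # Columns of the target dataframe appear without a prefix.
            bare = rf"(?<![\w.]){re.escape(col)}(?!\w)"
            if re.search(qualified, name):
                return True
            if df == target and re.search(bare, name):
                return True
    return False


# A few names copied from the first listing (no options set) ...
before = [
    "MODE(log.zipcode)",
    "WEEKDAY(birthday)",
    "MODE(sessions.device_name)",
    "cohorts.cohort_name",
]
# ... and from the second listing (run-level options set).
after = [
    "cohort",
    "MODE(sessions.device_type)",
    "WEEKDAY(cancel_date)",
    "régions.language",
]

opts = dict(
    ignore_dataframes=["log", "cohorts"],
    ignore_columns={"sessions": ["device_name"], "customers": ["birthday"]},
    target="customers",
)
flagged_before = [n for n in before if references_ignored(n, **opts)]
flagged_after = [n for n in after if references_ignored(n, **opts)]
```

Every sampled name from the first listing is flagged and none from the second is, which matches what the run-level options are described as doing. Note that the customers column cohort is not confused with the cohorts dataframe, because the pattern requires the trailing dot.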
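Specifying for Individual Primitives
Options for individual primitives or groups of primitives are set through the primitive_options parameter of DFS. This parameter maps the desired options to specific primitives. When options conflict, options set at this level override options set at the entire-run level, and include options always take priority over their ignore counterparts.
Using the string name of a primitive or the primitive type applies the options to all primitives with that name. You can also set options for a specific instance of a primitive by using the primitive instance as a key in the primitive_options dictionary. Note, however, that specifying options for a specific instance causes that instance to ignore any options set for the generic primitive through entries keyed by the primitive name or class.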
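The instance-versus-name rule can be pictured with a toy lookup. This is only a model of the behavior described above, not Featuretools' own code: the Primitive class and the options_for function are hypothetical stand-ins, and class keys are left out for brevity.

```python
class Primitive:
    """Toy stand-in for a primitive; only the name matters here."""

    def __init__(self, name):
        self.name = name


def options_for(primitive, primitive_options):
    """Pick the options that apply to one primitive instance."""
    # An instance-specific entry wins outright and is NOT merged with
    # the entry keyed by the primitive's name.
    if primitive in primitive_options:
        return primitive_options[primitive]
    return primitive_options.get(primitive.name, {})


mode_a = Primitive("mode")
mode_b = Primitive("mode")
primitive_options = {
    "mode": {"ignore_dataframes": ["log"]},
    mode_b: {"include_columns": {"sessions": ["device_type"]}},
}

generic = options_for(mode_a, primitive_options)   # falls back to the "mode" entry
specific = options_for(mode_b, primitive_options)  # uses only its own entry
```

In this model, mode_b no longer ignores log, because its instance-level entry replaces the name-level entry rather than extending it. To keep both restrictions, they would need to be repeated in the instance's own options.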
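Specifying Dataframes for Individual Primitives
Which dataframes to include or ignore can also be specified for a single primitive or a group of primitives. Dataframes can be ignored using the ignore_dataframes option in primitive_options, while dataframes to explicitly include are set by the include_dataframes option. When include_dataframes is given, all dataframes not listed are ignored by the primitive. No columns from any excluded dataframe will be used to generate features with the given primitive.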
[3]:
# ignore the 'cohorts' and 'log' dataframes, but only for the primitive 'mode'
# include only the 'customers' dataframe for the primitives 'weekday' and 'day'
features_list = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mode"],
    trans_primitives=["weekday", "day"],
    primitive_options={
        "mode": {"ignore_dataframes": ["cohorts", "log"]},
        ("weekday", "day"): {"include_dataframes": ["customers"]},
    },
    features_only=True,
)
features_list
[3]:
[<Feature: age>,
<Feature: région_id>,
<Feature: cohort>,
<Feature: loves_ice_cream>,
<Feature: cancel_reason>,
<Feature: engagement_level>,
<Feature: MODE(sessions.device_name)>,
<Feature: MODE(sessions.device_type)>,
<Feature: DAY(birthday)>,
<Feature: DAY(cancel_date)>,
<Feature: DAY(signup_date)>,
<Feature: DAY(upgrade_date)>,
<Feature: WEEKDAY(birthday)>,
<Feature: WEEKDAY(cancel_date)>,
<Feature: WEEKDAY(signup_date)>,
<Feature: WEEKDAY(upgrade_date)>,
<Feature: cohorts.cohort_name>,
<Feature: régions.language>,
<Feature: cohorts.MODE(customers.cancel_reason)>,
<Feature: cohorts.MODE(customers.engagement_level)>,
<Feature: cohorts.MODE(customers.région_id)>,
<Feature: cohorts.MODE(sessions.device_name)>,
<Feature: cohorts.MODE(sessions.device_type)>,
<Feature: régions.MODE(customers.cancel_reason)>,
<Feature: régions.MODE(customers.engagement_level)>,
<Feature: régions.MODE(sessions.device_name)>,
<Feature: régions.MODE(sessions.device_type)>]
In this example, DFS uses only the customers dataframe for both weekday and day, and uses all dataframes except cohorts and log for mode.
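The include/ignore rule for dataframes can be written down as a small function. The sketch below is a simplified model under stated assumptions: usable_dataframes is a hypothetical helper, and all_dataframes lists only the dataframe names that appear in the listings of this section, which may not be the complete entity set.

```python
def usable_dataframes(all_dataframes, options):
    """Return the dataframes one primitive may draw input columns from."""
    include = options.get("include_dataframes")
    if include is not None:
        # include takes priority over ignore; anything not listed is left out.
        return [df for df in all_dataframes if df in include]
    ignore = options.get("ignore_dataframes", [])
    return [df for df in all_dataframes if df not in ignore]


# Assumption: only the dataframe names visible in this section's listings.
all_dataframes = ["customers", "sessions", "log", "cohorts", "régions", "products"]

mode_dfs = usable_dataframes(all_dataframes, {"ignore_dataframes": ["cohorts", "log"]})
weekday_dfs = usable_dataframes(all_dataframes, {"include_dataframes": ["customers"]})
```

The option restricts where a primitive's input columns come from. That is presumably why cohorts.MODE(customers.cancel_reason) still appears in the listing above: its input column belongs to customers, not to cohorts.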
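Specifying Columns for Individual Primitives
Specific columns can also be explicitly included or ignored for a primitive or group of primitives. Columns to ignore are set by the ignore_columns option, while columns to include are set by include_columns. When the include_columns option is set, no other columns from that dataframe will be used to generate features with the given primitive.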
[4]:
# Include only the columns 'product_id', 'zipcode', 'device_type', and 'cancel_reason' for 'mode'
# Ignore the columns 'signup_date' and 'cancel_date' for 'weekday'
features_list = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mode"],
    trans_primitives=["weekday"],
    primitive_options={
        "mode": {
            "include_columns": {
                "log": ["product_id", "zipcode"],
                "sessions": ["device_type"],
                "customers": ["cancel_reason"],
            }
        },
        "weekday": {"ignore_columns": {"customers": ["signup_date", "cancel_date"]}},
    },
    features_only=True,
)
features_list
[4]:
[<Feature: age>,
<Feature: région_id>,
<Feature: cohort>,
<Feature: loves_ice_cream>,
<Feature: cancel_reason>,
<Feature: engagement_level>,
<Feature: MODE(sessions.device_type)>,
<Feature: MODE(log.product_id)>,
<Feature: MODE(log.zipcode)>,
<Feature: WEEKDAY(birthday)>,
<Feature: WEEKDAY(upgrade_date)>,
<Feature: cohorts.cohort_name>,
<Feature: régions.language>,
<Feature: MODE(sessions.MODE(log.product_id))>,
<Feature: MODE(sessions.MODE(log.zipcode))>,
<Feature: MODE(log.sessions.device_type)>,
<Feature: cohorts.MODE(customers.cancel_reason)>,
<Feature: cohorts.MODE(sessions.device_type)>,
<Feature: cohorts.MODE(log.product_id)>,
<Feature: cohorts.MODE(log.zipcode)>,
<Feature: cohorts.WEEKDAY(cohort_end)>,
<Feature: régions.MODE(customers.cancel_reason)>,
<Feature: régions.MODE(sessions.device_type)>,
<Feature: régions.MODE(log.product_id)>,
<Feature: régions.MODE(log.zipcode)>]
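Here, mode uses only the columns product_id and zipcode from the dataframe log, device_type from the dataframe sessions, and cancel_reason from customers. For any other dataframe, mode uses all columns. The weekday primitive uses all columns in all dataframes except signup_date and cancel_date from the customers dataframe.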
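Column options are keyed by dataframe, so a dataframe that is not mentioned keeps all of its columns. The sketch below models that per-dataframe behavior. usable_columns is a hypothetical helper, and the column lists are limited to names that appear in this section's listings.

```python
def usable_columns(dataframe, columns, options):
    """Return the columns of one dataframe that a primitive may use."""
    include = options.get("include_columns", {})
    if dataframe in include:
        # Only the listed columns of this dataframe remain available.
        return [c for c in columns if c in include[dataframe]]
    ignore = options.get("ignore_columns", {}).get(dataframe, [])
    return [c for c in columns if c not in ignore]


mode_options = {
    "include_columns": {
        "log": ["product_id", "zipcode"],
        "sessions": ["device_type"],
        "customers": ["cancel_reason"],
    }
}
weekday_options = {"ignore_columns": {"customers": ["signup_date", "cancel_date"]}}

# Datetime columns of customers, as seen in the WEEKDAY features of the first listing.
customer_dates = ["birthday", "cancel_date", "signup_date", "upgrade_date"]

weekday_cols = usable_columns("customers", customer_dates, weekday_options)
session_cols = usable_columns("sessions", ["device_name", "device_type"], mode_options)
cohort_cols = usable_columns("cohorts", ["cohort_name", "cohort_end"], mode_options)
```

The result for weekday, birthday and upgrade_date, matches the two WEEKDAY features on customers columns in the listing above.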
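Specifying GroupBy Options
GroupBy Transform Primitives also have the additional options include_groupby_dataframes, ignore_groupby_dataframes, include_groupby_columns, and ignore_groupby_columns. These options specify which dataframes and columns to include or ignore as groupings for inputs. By default, DFS groups only by foreign key columns. Specifying include_groupby_columns overrides this default, so that DFS groups only by the columns given. In contrast, ignore_groupby_columns keeps the default of using only foreign key columns, and skips any of the specified columns that are foreign keys. Note that if non-foreign-key columns are included as groupings, they must be categorical columns.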
[5]:
features_list = ft.dfs(
    entityset=es,
    target_dataframe_name="log",
    agg_primitives=[],
    trans_primitives=[],
    groupby_trans_primitives=["cum_sum", "cum_count"],
    primitive_options={
        "cum_sum": {"ignore_groupby_columns": {"log": ["product_id"]}},
        "cum_count": {
            "include_groupby_columns": {"log": ["product_id", "priority_level"]},
            "ignore_groupby_dataframes": ["sessions"],
        },
    },
    features_only=True,
)
features_list
[5]:
[<Feature: session_id>,
<Feature: product_id>,
<Feature: value>,
<Feature: value_2>,
<Feature: zipcode>,
<Feature: countrycode>,
<Feature: subregioncode>,
<Feature: value_many_nans>,
<Feature: priority_level>,
<Feature: purchased>,
<Feature: CUM_COUNT(countrycode) by priority_level>,
<Feature: CUM_COUNT(countrycode) by product_id>,
<Feature: CUM_COUNT(priority_level) by priority_level>,
<Feature: CUM_COUNT(priority_level) by product_id>,
<Feature: CUM_COUNT(product_id) by priority_level>,
<Feature: CUM_COUNT(product_id) by product_id>,
<Feature: CUM_COUNT(subregioncode) by priority_level>,
<Feature: CUM_COUNT(subregioncode) by product_id>,
<Feature: CUM_COUNT(zipcode) by priority_level>,
<Feature: CUM_COUNT(zipcode) by product_id>,
<Feature: CUM_SUM(value) by session_id>,
<Feature: CUM_SUM(value_2) by session_id>,
<Feature: CUM_SUM(value_many_nans) by session_id>,
<Feature: sessions.customer_id>,
<Feature: sessions.device_type>,
<Feature: sessions.device_name>,
<Feature: products.department>,
<Feature: products.rating>,
<Feature: sessions.customers.age>,
<Feature: sessions.customers.région_id>,
<Feature: sessions.customers.cohort>,
<Feature: sessions.customers.loves_ice_cream>,
<Feature: sessions.customers.cancel_reason>,
<Feature: sessions.customers.engagement_level>,
<Feature: CUM_COUNT(countrycode) by products.department>,
<Feature: CUM_COUNT(priority_level) by products.department>,
<Feature: CUM_COUNT(product_id) by products.department>,
<Feature: CUM_COUNT(products.department) by priority_level>,
<Feature: CUM_COUNT(products.department) by product_id>,
<Feature: CUM_COUNT(sessions.device_name) by priority_level>,
<Feature: CUM_COUNT(sessions.device_name) by product_id>,
<Feature: CUM_COUNT(sessions.device_name) by products.department>,
<Feature: CUM_COUNT(sessions.device_type) by priority_level>,
<Feature: CUM_COUNT(sessions.device_type) by product_id>,
<Feature: CUM_COUNT(sessions.device_type) by products.department>,
<Feature: CUM_COUNT(subregioncode) by products.department>,
<Feature: CUM_COUNT(zipcode) by products.department>,
<Feature: CUM_SUM(products.rating) by session_id>,
<Feature: CUM_SUM(products.rating) by sessions.customer_id>,
<Feature: CUM_SUM(value) by sessions.customer_id>,
<Feature: CUM_SUM(value_2) by sessions.customer_id>,
<Feature: CUM_SUM(value_many_nans) by sessions.customer_id>]
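We ignore product_id as a grouping for cum_sum, but still use any other foreign key column in that dataframe or any other dataframe. For cum_count, the only columns of log used as groupings are product_id and priority_level. Note that cum_sum does not use priority_level because it is not a foreign key column, whereas we explicitly include it for cum_count. Finally, note that specifying groupby options does not affect which features the primitive is applied to. For example, cum_count ignores the dataframe sessions for groupings, but the feature <Feature: CUM_COUNT(sessions.device_name) by product_id> is still generated. The grouping comes from the target dataframe log, so the feature is valid given the associated options. To ignore the sessions dataframe for cum_count, the ignore_dataframes option for cum_count would need to include sessions.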
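The grouping rules for columns of the target dataframe can be summarized in a few lines. This is a simplified model, not the library's implementation: grouping_columns is a hypothetical helper, the foreign keys of log are taken to be session_id and product_id based on the listing above, and the categorical requirement and groupings through related dataframes (such as sessions.customer_id) are outside the sketch.

```python
def grouping_columns(dataframe, foreign_keys, options):
    """Return the columns of one dataframe used as groupings."""
    include = options.get("include_groupby_columns", {})
    if dataframe in include:
        # An explicit include list replaces the foreign-key default.
        return list(include[dataframe])
    ignore = options.get("ignore_groupby_columns", {}).get(dataframe, [])
    # Otherwise group by foreign keys, minus any that are ignored.
    return [c for c in foreign_keys if c not in ignore]


# Assumption: the foreign key columns of log, inferred from the listing above.
log_foreign_keys = ["session_id", "product_id"]

default_groups = grouping_columns("log", log_foreign_keys, {})
cum_sum_groups = grouping_columns(
    "log", log_foreign_keys, {"ignore_groupby_columns": {"log": ["product_id"]}}
)
cum_count_groups = grouping_columns(
    "log",
    log_foreign_keys,
    {"include_groupby_columns": {"log": ["product_id", "priority_level"]}},
)
```

These results line up with the direct groupings in the listing: CUM_SUM features are grouped by session_id only, while CUM_COUNT features are grouped by product_id and priority_level.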
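Specifying for Each Input of Multiple-Input Primitives
For primitives that take multiple columns as input, such as Trend, the options above can be specified for each input by passing them in as a list. If only one option dictionary is given, it is used for all inputs. The length of the list must match the number of inputs the primitive takes.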
[6]:
features_list = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["trend"],
    trans_primitives=[],
    primitive_options={
        "trend": [
            {"ignore_columns": {"log": ["value_many_nans"]}},
            {"include_columns": {"customers": ["signup_date"], "log": ["datetime"]}},
        ]
    },
    features_only=True,
)
features_list
[6]:
[<Feature: age>,
<Feature: région_id>,
<Feature: cohort>,
<Feature: loves_ice_cream>,
<Feature: cancel_reason>,
<Feature: engagement_level>,
<Feature: TREND(log.value, datetime)>,
<Feature: TREND(log.value_2, datetime)>,
<Feature: cohorts.cohort_name>,
<Feature: régions.language>,
<Feature: cohorts.TREND(customers.age, signup_date)>,
<Feature: cohorts.TREND(log.value, datetime)>,
<Feature: cohorts.TREND(log.value_2, datetime)>,
<Feature: régions.TREND(customers.age, signup_date)>,
<Feature: régions.TREND(log.value, datetime)>,
<Feature: régions.TREND(log.value_2, datetime)>]
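Here, we pass in a list of options for the trend primitive. We ignore the column value_many_nans for the first input to trend, and for the second input we include only the column signup_date from customers and the column datetime from log.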
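The list-versus-dictionary rule can be sketched as a normalization step that expands the options to one entry per input. The helper per_input_options is hypothetical, and raising ValueError on a length mismatch is an assumption of this sketch; the text above states only that the lengths must match, not how Featuretools reports a violation.

```python
def per_input_options(options, n_inputs):
    """Expand primitive options to one option dictionary per input."""
    if isinstance(options, dict):
        # A single dictionary is reused for every input.
        return [options] * n_inputs
    if len(options) != n_inputs:
        raise ValueError(
            f"expected {n_inputs} option dictionaries, got {len(options)}"
        )
    return list(options)


# trend takes two inputs: the value column and the time column.
trend_options = [
    {"ignore_columns": {"log": ["value_many_nans"]}},
    {"include_columns": {"customers": ["signup_date"], "log": ["datetime"]}},
]

per_input = per_input_options(trend_options, 2)
shared = per_input_options({"ignore_dataframes": ["cohorts"]}, 2)
```

With a list, per_input[0] restricts the first input and per_input[1] the second. With a single dictionary, both inputs share the same restrictions.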