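Feature Selection
Featuretools provides utilities for removing features that are unlikely to be useful in building an effective machine learning model. Reducing the number of features in the feature matrix can both improve model results and lower the computational cost of making predictions.
Featuretools lets users perform feature selection on the results of Deep Feature Synthesis with the following three functions: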
ft.selection.remove_highly_null_features
ft.selection.remove_single_value_features
ft.selection.remove_highly_correlated_features
We will describe each of these functions in depth, but first we must create an entity set on which to run ft.dfs.
[1]:
import pandas as pd
import featuretools as ft
from featuretools.demo.flight import load_flight
from featuretools.selection import (
    remove_highly_correlated_features,
    remove_highly_null_features,
    remove_single_value_features,
)
es = load_flight(nrows=50)
es
Downloading data ...
[1]:
Entityset: Flight Data
DataFrames:
trip_logs [Rows: 50, Columns: 21]
flights [Rows: 6, Columns: 9]
airlines [Rows: 1, Columns: 1]
airports [Rows: 4, Columns: 3]
Relationships:
trip_logs.flight_id -> flights.flight_id
flights.carrier -> airlines.carrier
flights.dest -> airports.dest
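Remove Highly Null Features
Our dataset may contain columns with many null values. Deep Feature Synthesis may build features on top of those null columns, creating even more highly null features. In that case, we might want to remove any feature whose percentage of null values is above a certain threshold. Below is an example of a feature matrix where this is the case: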
[2]:
fm, features = ft.dfs(
    entityset=es,
    target_dataframe_name="trip_logs",
    cutoff_time=pd.DataFrame(
        {
            "trip_log_id": [30, 1, 2, 3, 4],
            "time": pd.to_datetime(["2016-09-22 00:00:00"] * 5),
        }
    ),
    trans_primitives=[],
    agg_primitives=[],
    max_depth=2,
)
fm
[2]:
trip_log_id | flight_id | dep_delay | taxi_out | taxi_in | arr_delay | diverted | air_time | distance | carrier_delay | weather_delay | national_airspace_delay | security_delay | late_aircraft_delay | canceled | flights.origin | flights.origin_city | flights.origin_state | flights.dest | flights.distance_group | flights.carrier | flights.flight_num | flights.airports.dest_city | flights.airports.dest_state |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
30 | AA-494:RSW->CLT | NaN | NaN | NaN | NaN | <NA> | NaN | 600.0 | NaN | NaN | NaN | NaN | NaN | <NA> | RSW | Fort Myers, FL | FL | CLT | 3 | AA | 494 | Charlotte, NC | NC |
1 | AA-494:CLT->PHX | NaN | NaN | NaN | NaN | <NA> | NaN | 1773.0 | NaN | NaN | NaN | NaN | NaN | <NA> | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
2 | AA-494:CLT->PHX | NaN | NaN | NaN | NaN | <NA> | NaN | 1773.0 | NaN | NaN | NaN | NaN | NaN | <NA> | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
3 | AA-494:CLT->PHX | NaN | NaN | NaN | NaN | <NA> | NaN | 1773.0 | NaN | NaN | NaN | NaN | NaN | <NA> | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
4 | NaN | NaN | NaN | NaN | NaN | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Looking at the feature matrix above, we decide to remove the highly null features:
[3]:
ft.selection.remove_highly_null_features(fm)
[3]:
trip_log_id | flight_id | distance | flights.origin | flights.origin_city | flights.origin_state | flights.dest | flights.distance_group | flights.carrier | flights.flight_num | flights.airports.dest_city | flights.airports.dest_state |
---|---|---|---|---|---|---|---|---|---|---|---|
30 | AA-494:RSW->CLT | 600.0 | RSW | Fort Myers, FL | FL | CLT | 3 | AA | 494 | Charlotte, NC | NC |
1 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
2 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
3 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
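Notice that calling remove_highly_null_features did not remove every feature that contains a null value. By default, we only remove features where the percentage of null values in the calculated feature matrix is above 95%. If we want to lower that threshold, we can set the pct_null_threshold parameter ourselves.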
[4]:
remove_highly_null_features(fm, pct_null_threshold=0.2)
[4]:
trip_log_id |
---|
30 |
1 |
2 |
3 |
4 |
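To make the threshold concrete: the quantity being compared is the fraction of missing values in each column, which plain pandas can compute with isnull().mean(). The sketch below uses a small hypothetical frame (toy and its column names are invented for illustration and are not part of the flight data). The strict less-than comparison is an assumption inferred from the output above, where every remaining column is 20% null and all of them are dropped at pct_null_threshold=0.2. It is not a copy of the library's implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy matrix, not the flight data used in this guide.
toy = pd.DataFrame(
    {
        "mostly_null": [np.nan, np.nan, np.nan, np.nan, 1.0],
        "one_null": [1.0, 2.0, 3.0, 4.0, np.nan],
        "complete": [1, 2, 3, 4, 5],
    }
)

# Fraction of missing values in each column (0.0 to 1.0).
null_fraction = toy.isnull().mean()

# Assumed rule: keep a column only if its null fraction is strictly
# below the threshold, so a column at exactly 20% null is dropped at 0.2.
threshold = 0.2
kept = [col for col in toy.columns if null_fraction[col] < threshold]

print(null_fraction.to_dict())
print(kept)
```

Checking null_fraction before choosing a threshold shows which columns sit right at the boundary, which is exactly the situation in the flight example.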
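Remove Single Value Features
Another situation we might run into is that our calculated features have no variance. In those cases, we will most likely want to remove the uninteresting features. For that, we use remove_single_value_features.
Let's see what happens when we remove the single value features from the feature matrix below.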
[5]:
fm
[5]:
trip_log_id | flight_id | dep_delay | taxi_out | taxi_in | arr_delay | diverted | air_time | distance | carrier_delay | weather_delay | national_airspace_delay | security_delay | late_aircraft_delay | canceled | flights.origin | flights.origin_city | flights.origin_state | flights.dest | flights.distance_group | flights.carrier | flights.flight_num | flights.airports.dest_city | flights.airports.dest_state |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
30 | AA-494:RSW->CLT | NaN | NaN | NaN | NaN | <NA> | NaN | 600.0 | NaN | NaN | NaN | NaN | NaN | <NA> | RSW | Fort Myers, FL | FL | CLT | 3 | AA | 494 | Charlotte, NC | NC |
1 | AA-494:CLT->PHX | NaN | NaN | NaN | NaN | <NA> | NaN | 1773.0 | NaN | NaN | NaN | NaN | NaN | <NA> | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
2 | AA-494:CLT->PHX | NaN | NaN | NaN | NaN | <NA> | NaN | 1773.0 | NaN | NaN | NaN | NaN | NaN | <NA> | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
3 | AA-494:CLT->PHX | NaN | NaN | NaN | NaN | <NA> | NaN | 1773.0 | NaN | NaN | NaN | NaN | NaN | <NA> | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
4 | NaN | NaN | NaN | NaN | NaN | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
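Note
A list of feature definitions, such as the one created by dfs, can be passed to the feature selection functions. Doing so changes the output to include an updated list of feature definitions.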
[6]:
new_fm, new_features = remove_single_value_features(fm, features=features)
new_fm
[6]:
trip_log_id | flight_id | distance | flights.origin | flights.origin_city | flights.origin_state | flights.dest | flights.distance_group | flights.airports.dest_city | flights.airports.dest_state |
---|---|---|---|---|---|---|---|---|---|
30 | AA-494:RSW->CLT | 600.0 | RSW | Fort Myers, FL | FL | CLT | 3 | Charlotte, NC | NC |
1 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | Phoenix, AZ | AZ |
2 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | Phoenix, AZ | AZ |
3 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | Phoenix, AZ | AZ |
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Now that we have the feature definitions for the updated feature matrix, we can see that the features that were removed are:
[7]:
set(features) - set(new_features)
[7]:
{<Feature: air_time>,
<Feature: arr_delay>,
<Feature: canceled>,
<Feature: carrier_delay>,
<Feature: dep_delay>,
<Feature: diverted>,
<Feature: flights.carrier>,
<Feature: flights.flight_num>,
<Feature: late_aircraft_delay>,
<Feature: national_airspace_delay>,
<Feature: security_delay>,
<Feature: taxi_in>,
<Feature: taxi_out>,
<Feature: weather_delay>}
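When only the feature matrices are at hand and not the feature definitions, the same comparison can be made on column names. The following is a minimal sketch with hypothetical frames (before and after are invented stand-ins for a matrix and its filtered version); it relies only on pandas' Index.difference.

```python
import pandas as pd

# Hypothetical stand-ins for a feature matrix before and after selection.
before = pd.DataFrame(
    {
        "distance": [600.0, 1773.0],
        "carrier": ["AA", "AA"],
        "origin": ["RSW", "CLT"],
    }
)
after = before.drop(columns=["carrier"])

# Column names present before selection but missing afterwards.
removed = before.columns.difference(after.columns).tolist()
print(removed)
```

This reports names only. The set difference on feature definitions shown above remains the way to recover the Feature objects themselves.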
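When remove_single_value_features is called as it is above, null values are not considered when counting a feature's unique values. If we would like NaN to count as a value of its own, we can set count_nan_as_value to True, and flights.carrier and flights.flight_num will then appear in the matrix.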
[8]:
new_fm, new_features = remove_single_value_features(
    fm, features=features, count_nan_as_value=True
)
new_fm
[8]:
trip_log_id | flight_id | distance | flights.origin | flights.origin_city | flights.origin_state | flights.dest | flights.distance_group | flights.carrier | flights.flight_num | flights.airports.dest_city | flights.airports.dest_state |
---|---|---|---|---|---|---|---|---|---|---|---|
30 | AA-494:RSW->CLT | 600.0 | RSW | Fort Myers, FL | FL | CLT | 3 | AA | 494 | Charlotte, NC | NC |
1 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
2 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
3 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
The features that were removed are:
[9]:
set(features) - set(new_features)
[9]:
{<Feature: air_time>,
<Feature: arr_delay>,
<Feature: canceled>,
<Feature: carrier_delay>,
<Feature: dep_delay>,
<Feature: diverted>,
<Feature: late_aircraft_delay>,
<Feature: national_airspace_delay>,
<Feature: security_delay>,
<Feature: taxi_in>,
<Feature: taxi_out>,
<Feature: weather_delay>}
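The difference between the two modes can be pictured with pandas' Series.nunique and its dropna flag. The sketch below runs on hypothetical data (toy and the helper single_value_columns are invented for illustration). That remove_single_value_features counts values in this way is an assumption consistent with the outputs above, not a statement about its implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical columns mirroring the three cases seen in the flight example.
toy = pd.DataFrame(
    {
        "carrier": ["AA", "AA", "AA", None],     # one real value plus a null
        "origin": ["RSW", "CLT", "CLT", None],   # two real values plus a null
        "all_null": [np.nan, np.nan, np.nan, np.nan],
    }
)


def single_value_columns(df, count_nan_as_value=False):
    """Return the columns that hold at most one distinct value.

    Nulls are ignored by default; with count_nan_as_value=True a null
    counts as one more distinct value.
    """
    return [
        col
        for col in df.columns
        if df[col].nunique(dropna=not count_nan_as_value) <= 1
    ]


print(single_value_columns(toy))
print(single_value_columns(toy, count_nan_as_value=True))
```

In this sketch a column that is entirely null is flagged in both modes, which matches the delay columns being removed in both outputs above, while a column with one real value plus a null is flagged only when nulls are ignored.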