Feature Tools

This part of the library is designed for feature selection.

quest_selection

It follows these instructions:

Calculate the accuracy without feature selection, if it was not calculated.
In each iteration, ignore one feature. For all features, do it. After that, calculate the new accuracy. If it is greater than or equal to the general accuracy or the difference between them is smaller than or equal to flag_one_tol, then take this feature to the next step.
Create random combinations with these features and calculate the new accuracy. If one of the combinations provides an accuracy that is greater than or equal to general accuracy or their difference is smaller than or equal to fin_tol, then print it out to the console.

Parameters	Datatype	Default Value
model_class	any AI model class
X_train	multidimensional array
y_train	1D array
X_test	multidimensional array
y_test	1D array
features	list
flag_one_tol	float
fin_tol	float
params	dict	None
normal_acc	float	None
trials	integer	100

Note

features need to contain the names of the columns in X array.

Note

params is for the model, model does not have to be created in default settings, it can be manipulated.

Note

normal_acc is the accuracy score for dataset without feature selection.

Attention

trials dedicates the third step to how many times it will be repeated.

Note

This function only prints out to the console, nothing returns.

list_deletings

It analyses the dataframe and shows or deletes columns that were found to be improper by the algorithm.

Parameters	Datatype	Default Value
df	pandas dataframe
extra	list	None
del_null	boolean	True
null_tolerance	integer or float	20
del_single	boolean	True
del_almost_single	boolean	False
almost_tolerance	integer or float	50
suggest_extra	boolean	True
return_extra	boolean	False
unique_tolerance	integer or boolean	10

Note

The column which their names are inside the extra library are deleted directly.

Note

While del_null is true, the columns that have null values greater than null_tolerance% of the total sample amount are deleted.

Note

If del_single is true, then the columns that have only one different value are deleted.

Note

While del_almost_single is true, the columns that have the same value that more than almost_tolerance% of the total sample amount are deleted.

Note

While suggest_extra is true, the string data held columns that have unique values greater than unique_tolerance% of the total sample amount are printed out in the console. The list of columns is also returned if return_extra is true.

Priority (in return)	Returns	Datatype	Condition
1	df	pandas dataframe	always
2	columns	list	return_extra is True

multi_split

It is designed to create different arrays with requested threshold values for train-test splitting.

Parameters	Datatype	Default Value
df	pandas dataframe
test_size	float
output	string
threshold_set	list

Note

test_size must be between 0 and 1.

Attention

The very first multidimensional array inside lists is always created without any selection.

Attention

The output arrays are created in order to the order inside threshold_set.

Attention

Output arrays (y_train and y_test) are always the same for every output set. It is because measuring correctly the success rate between different approaches.

Priority (in return)	Returns	Datatype	Condition
1	X_trains	list	always
2	X_tests	list	always
3	y_train	1D array	always
4	y_test	1D array	always

rand_arr

It creates a one dimensional randomized array with sticking on a strategy.

Parameters

Datatype

Default Value

outputs

list

values

int list

None

strategy

string

equal

arr_size

int

1

Note

There are three determined strategies: ‘equal’, ‘weighted’ and ‘piled’. In ‘equal’, each output has the same probability. In ‘weighted’, the ith output has values[i] / sum(values) probability to come out. With a small difference in ‘piled’, the ith output has (values[i] - values[i-1] (or 0 if i is zero)) / values[-1] probability. Because of the behavior of piled, the values array must be sorted.

Priority (in return)	Returns	Datatype	Condition
1	column	1D Array	always