Data Tools

This part of the library is designed for data analysis and data manipulation.

col_types

The function returns the datatype of each column in the dataframe.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| df | pandas dataframe | |
| print_columns | boolean | False |

Attention

It prints out the results if print_columns is True.

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | types | list | always |

import pandas as pd
from wolta.data_tools import col_types

df = pd.read_csv('data.csv')

columns = list(df.columns)
types = col_types(df)

# print the datatype of each column
for column, dtype in zip(columns, types):
    print('{}: {}'.format(column, dtype))

# or let the function print them itself
col_types(df, print_columns=True)

unique_amounts

The function returns the number of unique values in each column of the dataframe.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| df | pandas dataframe | |
| strategy | list | None |
| print_dict | boolean | False |

Attention

To examine only certain columns, list their names in strategy. The results are also printed if print_dict is True.

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | space | dict | always |

make_numerics

This function converts categorical data to numerical data.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| column | 1D array | |
| space_requested | boolean | False |

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | column | 1D array | always |
| 2 | space | dict | space_requested is True |
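
The idea behind this kind of conversion can be sketched in plain Python. The helper below, encode_categories, is a hypothetical stand-in and not the library's implementation. It assumes that categories are numbered in order of first appearance and that the returned dictionary maps each category to its number; make_numerics itself may number or store them differently.

```python
def encode_categories(column, space_requested=False):
    """Map each distinct value to an integer code (hypothetical helper)."""
    space = {}
    encoded = []
    for value in column:
        # Assumption: a category gets the next free integer when first seen.
        if value not in space:
            space[value] = len(space)
        encoded.append(space[value])
    if space_requested:
        return encoded, space
    return encoded


colors = ['red', 'green', 'red', 'blue']
codes, mapping = encode_categories(colors, space_requested=True)
print(codes)    # [0, 1, 0, 2]
print(mapping)  # {'red': 0, 'green': 1, 'blue': 2}
```

Keeping the mapping is what makes it possible to translate the numbers back into the original categories later.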

scale_x

It scales input values using scikit-learn’s StandardScaler.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| X_train | multi dimensional array | |
| X_test | multi dimensional array | |

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | X_train | multi dimensional array | always |
| 2 | X_test | multi dimensional array | always |
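
The arithmetic behind standard scaling can be sketched without scikit-learn. The snippet assumes the usual convention of computing each column's mean and standard deviation on the training data only and reusing them for the test data; whether scale_x follows exactly this convention is an assumption. The name standardize is hypothetical.

```python
from statistics import fmean, pstdev


def standardize(X_train, X_test):
    """Standardize columns with statistics taken from X_train only (hypothetical helper)."""
    columns = list(zip(*X_train))
    means = [fmean(col) for col in columns]
    # Population standard deviation; constant columns fall back to 1.0
    # so that the division below is always defined.
    stds = [pstdev(col) or 1.0 for col in columns]

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(row, means, stds)]
                for row in rows]

    return scale(X_train), scale(X_test)


X_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
X_test = [[2.0, 40.0]]
train_scaled, test_scaled = standardize(X_train, X_test)
print(train_scaled)
print(test_scaled)
```

Reusing the training statistics keeps information about the test data out of the scaling step.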

examine_floats

It checks whether the requested columns, which are assumed to be float, actually contain only integer values.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| df | pandas dataframe | |
| float_columns | list | |
| get | string | float |

Hint

float_columns contains the names of the columns that will be checked by the algorithm.

Hint

get accepts two values: ‘float’ and ‘int’. If it is ‘float’, the function returns the names of the columns that contain non-integer values. Otherwise, it returns the names of the columns that contain only integer values.

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | space | list | always |
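
The check described above can be illustrated in plain Python. In this sketch a dictionary of lists stands in for the dataframe, and split_float_columns is a hypothetical name; it only mirrors the behaviour of the get parameter as documented here and is not the library's code.

```python
def split_float_columns(data, float_columns, get='float'):
    """Return the columns that do (get='float') or do not (get='int')
    contain fractional values (hypothetical helper)."""
    selected = []
    for name in float_columns:
        only_integers = all(float(v).is_integer() for v in data[name])
        # Keep integer-only columns for 'int', the others for 'float'.
        if (get == 'int') == only_integers:
            selected.append(name)
    return selected


data = {'age': [21.0, 35.0, 40.0], 'price': [9.99, 12.5, 3.0]}
print(split_float_columns(data, ['age', 'price'], get='float'))  # ['price']
print(split_float_columns(data, ['age', 'price'], get='int'))    # ['age']
```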

calculate_bounds

It returns the datatypes required to hold the values, based on their minimum and maximum values.

Note

Sometimes a dataset is too large for the system’s capabilities. In that case, reducing the space it requires can be essential. This function is designed for that purpose.

Tip

In that case, I also suggest using the Dask library to get better results.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| gen_types | list | |
| min_val | int or float | |
| max_val | int or float | |

Hint

gen_types can be obtained easily with the col_types function.

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | types | list | always |
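
To make the idea concrete, here is a minimal sketch of choosing a narrow datatype from a minimum and a maximum. It is not the library's implementation: smallest_dtype and the two lookup tables are hypothetical, the dtype names follow NumPy's naming, and only signed integers and floats are considered. For floats the sketch looks at range only and ignores precision, which a real implementation may need to take into account.

```python
# Representable ranges of common signed integer types.
INT_BOUNDS = [
    ('int8', -2**7, 2**7 - 1),
    ('int16', -2**15, 2**15 - 1),
    ('int32', -2**31, 2**31 - 1),
    ('int64', -2**63, 2**63 - 1),
]
# Largest finite magnitude of each IEEE 754 float type.
FLOAT_BOUNDS = [
    ('float16', 65504.0),
    ('float32', 3.4028234663852886e38),
    ('float64', 1.7976931348623157e308),
]


def smallest_dtype(kind, min_val, max_val):
    """Pick the narrowest dtype name whose range covers [min_val, max_val]."""
    if kind == 'int':
        for name, low, high in INT_BOUNDS:
            if low <= min_val and max_val <= high:
                return name
    else:
        for name, limit in FLOAT_BOUNDS:
            if -limit <= min_val and max_val <= limit:
                return name
    raise ValueError('value range does not fit any listed dtype')


print(smallest_dtype('int', 0, 100))       # int8
print(smallest_dtype('int', -5, 40000))    # int32
print(smallest_dtype('float', -1.5, 7e4))  # float32
```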

calculate_min_max

It returns the minimum value, maximum value and datatype of each column.

Note

This function is useful when the device cannot hold the whole dataset in memory, because it does not load all of the data at once.

Hint

Splitting the dataset into multiple small data files is recommended. You can then use the glob library to collect the paths easily.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| paths | list | |
| deleted_columns | list | None |

Attention

The columns listed in deleted_columns are excluded from the process.

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | columns | list | always |
| 2 | types | list | always |
| 3 | max_val | list | always |
| 4 | min_val | list | always |
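
The memory-friendly approach described in the note can be sketched with the standard library alone: each file is read row by row and only the running minimum and maximum are kept. This is an illustration under assumptions, not the library's code. The name running_min_max is hypothetical, every remaining column is assumed to be numeric, and datatype detection is left out.

```python
import csv
import os
import tempfile


def running_min_max(paths, deleted_columns=None):
    """Track per-column min and max while reading one file at a time."""
    skipped = set(deleted_columns or [])
    min_val, max_val = {}, {}
    for path in paths:
        with open(path, newline='') as handle:
            for row in csv.DictReader(handle):
                for name, text in row.items():
                    if name in skipped:
                        continue
                    value = float(text)
                    min_val[name] = min(value, min_val.get(name, value))
                    max_val[name] = max(value, max_val.get(name, value))
    return min_val, max_val


# Build two small CSV parts in a temporary directory for the demonstration.
with tempfile.TemporaryDirectory() as folder:
    paths = []
    for index, rows in enumerate([[(1, 5.5), (4, 2.0)], [(-3, 9.25)]]):
        path = os.path.join(folder, 'part{}.csv'.format(index))
        with open(path, 'w', newline='') as handle:
            writer = csv.writer(handle)
            writer.writerow(['a', 'b'])
            writer.writerows(rows)
        paths.append(path)
    lows, highs = running_min_max(paths)

print(lows)   # {'a': -3.0, 'b': 2.0}
print(highs)  # {'a': 4.0, 'b': 9.25}
```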

load_by_parts

It loads multiple subsets of a dataset and combines them into a single dataframe, with extensive options.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| paths | list | |
| strategy | string | default |
| deleted_columns | list | None |
| print_description | boolean | False |
| shuffle | boolean | False |
| encoding | string | utf-8 |

Note

strategy accepts two values: ‘default’ and ‘efficient’. The only difference between them is that ‘efficient’ determines the datatypes with the calculate_bounds function. It is not recommended unless strictly required.

Note

encoding accepts any value that is valid for the encoding parameter of pandas’ read_csv function.

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | df | pandas dataframe | always |

create_chunks

It splits the dataset into small CSV files.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| path | string | |
| sample_amount | integer | |
| target_dir | string | None |
| print_description | boolean | False |
| chunk_name | string | part |

Attention

This function does not return a value.
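
As an illustration of the chunking idea, the sketch below splits one CSV file into parts of at most sample_amount rows using only the standard library. It is not the library's implementation: write_chunks and _write_part are hypothetical names, the file naming scheme (chunk name followed by an index) is an assumption, and the sketch returns the written paths purely so the result can be inspected.

```python
import csv
import os
import tempfile


def _write_part(header, rows, target_dir, chunk_name, index):
    """Write one chunk, repeating the header so each file stands alone."""
    out_path = os.path.join(target_dir, '{}{}.csv'.format(chunk_name, index))
    with open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)
    return out_path


def write_chunks(path, sample_amount, target_dir, chunk_name='part'):
    """Split a CSV into files holding at most sample_amount rows each."""
    written = []
    with open(path, newline='') as source:
        reader = csv.reader(source)
        header = next(reader)
        rows = []
        for row in reader:
            rows.append(row)
            if len(rows) == sample_amount:
                written.append(_write_part(header, rows, target_dir,
                                           chunk_name, len(written)))
                rows = []
        if rows:  # leftover rows form a final, smaller chunk
            written.append(_write_part(header, rows, target_dir,
                                       chunk_name, len(written)))
    return written


with tempfile.TemporaryDirectory() as folder:
    source_path = os.path.join(folder, 'data.csv')
    with open(source_path, 'w', newline='') as handle:
        csv.writer(handle).writerows([['x']] + [[i] for i in range(5)])
    chunk_paths = write_chunks(source_path, sample_amount=2, target_dir=folder)
    chunk_names = [os.path.basename(p) for p in chunk_paths]

print(chunk_names)  # ['part0.csv', 'part1.csv', 'part2.csv']
```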

transform_data

It transforms data using a set of predetermined techniques. For further reading, see the Transformations article.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| X | multi dimensional array | |
| y | 1D array | |
| strategy | string | log-m |

Note

strategy can have these values: ‘log’, ‘log-m’, ‘log2’, ‘log2-m’, ‘log10’, ‘log10-m’, ‘sqrt’, ‘sqrt-m’, ‘cbrt’

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | X | multi dimensional array | always |
| 2 | y | 1D array | always |
| 3 | amin_x | integer or float | strategy ends with -m |
| 4 | amin_y | integer or float | strategy ends with -m |

transform_pred

It can be thought of as the reverse of transform_data. For further reading, see the Transformations article.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| y_pred | 1D array | |
| strategy | string | log-m |
| amin_y | integer or float | 0 |

Note

strategy can have these values: ‘log’, ‘log-m’, ‘log2’, ‘log2-m’, ‘log10’, ‘log10-m’, ‘sqrt’, ‘sqrt-m’, ‘cbrt’, ‘cbrt-m’

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | y_pred | 1D array | always |
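
The relationship between transform_data and transform_pred can be illustrated with a round trip. The sketch below is built on an assumption: that a strategy ending in -m shifts the values by their minimum (the returned amin value) so that the logarithm's argument stays positive. The exact formulas the library uses are the ones given in the Transformations article and may differ. Both function names are hypothetical.

```python
import math


def log_m_forward(y):
    """Assumed 'log-m' idea: shift so the smallest value maps to 1, then take ln."""
    amin_y = min(y)
    return [math.log(v - amin_y + 1) for v in y], amin_y


def log_m_inverse(y_pred, amin_y=0):
    """Undo log_m_forward: exponentiate, then remove the shift."""
    return [math.exp(v) + amin_y - 1 for v in y_pred]


y = [-2.0, 0.0, 5.0]
y_transformed, amin_y = log_m_forward(y)
restored = log_m_inverse(y_transformed, amin_y)
print(amin_y)    # -2.0
print(restored)  # matches y up to floating-point rounding
```

The round trip only works when the same amin_y that was produced during the forward step is passed to the inverse step.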

make_categorical

It turns continuous values into discrete ones using the normal distribution. With this approach, the function separates the given data into three groups. For further reading, see the Categorization With Normal Distribution article.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| y | 1D array | |
| strategy | string | normal |

Note

strategy can have these values: ‘normal’ and ‘normal-extra’.

is_normal

It checks whether the given set behaves like a normal distribution.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| y | 1D array | |

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | result | boolean | always |

seek_null

It checks each column in the dataframe for null values and counts how many there are. It then returns a list of the names of the columns that contain null values.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| df | pandas dataframe | |
| print_columns | boolean | False |

Attention

When print_columns is True, the number of null values in each column is printed to the console.

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | null_columns | list | always |

make_null

Sometimes null values are represented in other ways inside the dataset (for example, ‘unknown’ in string data) instead of being actual nulls. This function replaces such values with null.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| matrix | pandas dataframe or numpy array | |
| replace | anything | |
| type | string | df |

Attention

type declares whether matrix is a pandas dataframe or a numpy array. It accepts two values: ‘df’ for a pandas dataframe and ‘np’ for a numpy array.

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | matrix | pandas dataframe or numpy array | always |
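
A small plain-Python sketch of the replacement idea follows. A list of rows stands in for the matrix, NaN stands in for null, and to_null is a hypothetical name; none of this is the library's own code.

```python
import math


def to_null(rows, replace):
    """Swap every cell equal to `replace` for NaN (hypothetical helper)."""
    return [[math.nan if cell == replace else cell for cell in row]
            for row in rows]


rows = [['alice', 'unknown'], ['unknown', 'berlin']]
cleaned = to_null(rows, 'unknown')
print(cleaned)  # [['alice', nan], [nan, 'berlin']]
```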

stat_sum

It summarises statistics about the dataframe, similar to the describe method in pandas.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| df | pandas dataframe | |
| requested | list | |
| only | list | None |
| exclude | list | None |
| get_dict | boolean | False |
| verbose | boolean | True |

Tip

If only is not None, only the columns listed in it are examined.

Tip

If exclude is not None, all columns except those listed in it are examined.

Here is the list of valid keywords for requested:

| Valid Keyword | Meaning |
| --- | --- |
| all | if it is at index zero of the list, all keywords are assumed to be requested |
| min | minimum |
| max | maximum |
| width | the difference between max and min |
| mean | arithmetic mean |
| std | standard deviation |
| med | median |
| var | variance |

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | gen_results | dictionary | if get_dict is True |
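
The keywords above can be illustrated for a single numeric column with the standard library. This is a sketch, not the library's code: summarise is a hypothetical name, and using the population forms of the standard deviation and variance is an assumption, since the page does not say which form stat_sum uses.

```python
import statistics


def summarise(values, requested):
    """Compute the requested statistics for one numeric column."""
    available = {
        'min': min,
        'max': max,
        'width': lambda v: max(v) - min(v),
        'mean': statistics.fmean,
        'std': statistics.pstdev,     # assumption: population form
        'med': statistics.median,
        'var': statistics.pvariance,  # assumption: population form
    }
    # 'all' at index zero stands for every keyword.
    if requested and requested[0] == 'all':
        keys = list(available)
    else:
        keys = requested
    return {key: available[key](values) for key in keys}


ages = [20, 30, 40, 50]
print(summarise(ages, ['min', 'max', 'width', 'med']))
# {'min': 20, 'max': 50, 'width': 30, 'med': 35.0}
```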

extract_float

Sometimes float data is stored in a different representation (‘3.5$’ for example). In that case, this function deletes the unwanted symbols (the dollar sign in this example) and converts the values from string to float.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| column | 1D array | |
| symbols | list | |

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | column | 1D array | always |
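
The cleaning step can be sketched in a few lines of plain Python. The helper strip_symbols is hypothetical and only illustrates the idea of removing the listed symbols before converting to float; it is not the library's implementation.

```python
def strip_symbols(column, symbols):
    """Remove each unwanted symbol, then convert the rest to float."""
    cleaned = []
    for text in column:
        for symbol in symbols:
            text = text.replace(symbol, '')
        # float() tolerates surrounding whitespace left after the removal.
        cleaned.append(float(text))
    return cleaned


prices = ['3.5$', '$12', '7.25 $']
print(strip_symbols(prices, ['$']))  # [3.5, 12.0, 7.25]
```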

col_counts

It prints the frequency of the unique values in each column, using the value_counts function from pandas.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| df | pandas dataframe | |
| exclude | list | None |
| only | list | None |

Tip

If only is not None, only the columns listed in it are examined.

Tip

If exclude is not None, all columns except those listed in it are examined.

Attention

This function only prints out the result to the console. It does not return anything.

check_similarity

Sometimes the very same information is held in two different columns with different representations. For example, area codes can be stored as digits in one column and as their actual names in another, yet both columns hold the same thing. This function checks whether that is the case.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| col1 | 1D array | |
| col2 | 1D array | |

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | similarity | boolean | always |

Attention

If similarity is True, the two columns hold the same information.
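
One way to test for this kind of redundancy is to check that the values of the two columns map to each other one-to-one. The sketch below takes that approach; it is an assumption about what "the same information" means here, and same_information is a hypothetical name rather than the library's code.

```python
def same_information(col1, col2):
    """True when each value in col1 always pairs with one value in col2,
    and the other way round."""
    forward, backward = {}, {}
    for a, b in zip(col1, col2):
        # setdefault stores the first partner seen and returns it afterwards.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


codes = [212, 312, 212, 415]
names = ['New York', 'Chicago', 'New York', 'San Francisco']
print(same_information(codes, names))                 # True
print(same_information(codes, ['A', 'B', 'C', 'D']))  # False
```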

find_broke

If the datatype of a column differs from what is expected, this function can be used to examine the column and find the reason.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| column | 1D array | |
| dtype | datatype | float |
| get_indexes | boolean | True |
| get_words | boolean | False |
| verbose | boolean | True |
| verbose_limit | integer | 10 |

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | indexes | list | get_indexes is True |
| 2 | words | list | get_words is True |
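
The underlying idea can be sketched as a loop that tries to convert every value and records the ones that fail. The helper find_unconvertible is hypothetical and always returns both lists; it is an illustration, not the library's implementation.

```python
def find_unconvertible(column, dtype=float):
    """Return the positions and values that cannot be converted to dtype."""
    indexes, words = [], []
    for index, value in enumerate(column):
        try:
            dtype(value)
        except (TypeError, ValueError):
            indexes.append(index)
            words.append(value)
    return indexes, words


column = ['1.5', '2', 'n/a', '4.0', '?']
indexes, words = find_unconvertible(column)
print(indexes)  # [2, 4]
print(words)    # ['n/a', '?']
```

A single stray string such as 'n/a' is enough for a numeric column to be read as text, which is why listing the offending values is useful.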

expand_df

It oversamples the dataset by using SMOTE.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| df | pandas dataframe | |
| output | string | |
| sampling_strategy | string or dictionary | |

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | df | pandas dataframe | always |

split_as_df

It splits X and y arrays into train and test pandas dataframes instead of arrays.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| X | multidimensional array | |
| y | 1D array | |
| features | list | |
| output | string | |
| test_size | float | |
| random_state | int | 42 |
| shuffle | boolean | True |
| stratify | 1D array | None |

Note

features is the list of the column names in the X array. output is the name of the column in the y array.

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | dftrain | pandas dataframe | always |
| 2 | dftest | pandas dataframe | always |

train_test_val_split

It splits the data into three groups: train, validation and test.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| X | multidimensional array | |
| y | 1D array | |
| test_size | float | |
| val_size | float | |
| random_state | int | 42 |
| shuffle | boolean | True |
| stratify | 1D array | None |
| stratify_for_val | boolean | True |

Attention

test_size and val_size must both be given as ratios of the whole dataset.

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | X_train | multidimensional array | always |
| 2 | X_test | multidimensional array | always |
| 3 | X_val | multidimensional array | always |
| 4 | y_train | 1D array | always |
| 5 | y_test | 1D array | always |
| 6 | y_val | 1D array | always |
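
The arithmetic behind ratios that refer to the whole dataset can be worked through in a few lines. Both helpers are hypothetical. The second one assumes the split is carried out in two stages (test set first, validation set from the remainder), which is a common pattern but not confirmed by this page.

```python
def split_counts(n_samples, test_size, val_size):
    """Return (train, val, test) sample counts when both ratios
    refer to the whole dataset."""
    if test_size + val_size >= 1:
        raise ValueError('test_size + val_size must be below 1')
    n_test = round(n_samples * test_size)
    n_val = round(n_samples * val_size)
    return n_samples - n_test - n_val, n_val, n_test


def second_stage_ratio(test_size, val_size):
    """Ratio to apply to the data left after the test set is removed,
    so that the validation set is still val_size of the whole."""
    return val_size / (1 - test_size)


print(split_counts(1000, test_size=0.2, val_size=0.1))  # (700, 100, 200)
print(second_stage_ratio(0.2, 0.1))                     # 0.125
```

With test_size=0.2 and val_size=0.1, the validation set is 10% of all samples, which corresponds to 12.5% of the 80% that remains after the test set is taken out.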

synthetic_expand

It generates synthetic data based on actual data. It is designed to create datasets for educational purposes in data science.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| df | pandas dataframe | |
| feature_info | dictionary | |
| shape_zero | integer | |

Attention

The keys of feature_info must include every requested feature name from the dataframe. Each value must be one of two options: continuous or discrete.

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | df | pandas dataframe | always |

multi_split

It splits a multi-label dataset. The function shuffles the dataset several times (set by times) and, each time, seeks the same sample-amount distribution in the test set for the classes of every label.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| df | pandas dataframe | |
| labels | list | |
| test_size | float | |
| times | integer | 50 |

Note

labels holds the list of the label names in the dataframe.

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | X_train | multidimensional array | always |
| 2 | X_test | multidimensional array | always |
| 3 | y_trains | dictionary | always |
| 4 | y_tests | dictionary | always |

Attention

y_trains and y_tests are dictionaries that map each label name to a 1D array.

corr_analyse

It calculates the correlation between every pair of features in the dataset. After the calculation, it also groups the correlations by their strength.

Attention

Each correlation value lies between -1 and 1. In the grouping phase, the range is taken as 0 to 1, with the negative side treated as symmetric to the positive side. This range is then split into four groups: ‘uncorrelated’, ‘weak’, ‘strong’ and ‘perfect’.

| group | interval |
| --- | --- |
| uncorrelated | 0 <= score <= un_w |
| weak | un_w < score <= w_s |
| strong | w_s < score <= s_p |
| perfect | s_p < score <= 1 |

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| array | 2D array | |
| columns | list | |
| un_w | float | 0.1 |
| w_s | float | 0.5 |
| s_p | float | 0.9 |
| verbose | boolean | True |
| get_matrix | boolean | False |
| csv_path | string | None |

Note

If csv_path is not None, the scores are logged to a CSV file.

Note

The function always returns a dictionary with keys ‘columns’ and ‘score’.

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | results | dictionary | always |
| 2 | matrix | 2D array | get_matrix is True |
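
The grouping rule from the interval table can be written as a short function. The sketch uses the documented default thresholds and treats a negative score by its absolute value, as the attention note describes. The name strength_group is hypothetical, and the code is an illustration rather than the library's implementation.

```python
def strength_group(score, un_w=0.1, w_s=0.5, s_p=0.9):
    """Place a correlation score into one of the four strength groups."""
    magnitude = abs(score)  # the negative side mirrors the positive side
    if not 0 <= magnitude <= 1:
        raise ValueError('correlation must lie between -1 and 1')
    if magnitude <= un_w:
        return 'uncorrelated'
    if magnitude <= w_s:
        return 'weak'
    if magnitude <= s_p:
        return 'strong'
    return 'perfect'


for score in (0.05, -0.3, 0.9, -0.97):
    print(score, strength_group(score))
# 0.05 uncorrelated
# -0.3 weak
# 0.9 strong
# -0.97 perfect
```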

scale_df

It scales the values of a pandas dataframe and returns the result as a pandas dataframe. If requested, the output label(s) can be excluded from this process.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| df | pandas dataframe | |
| output | string or string list | None |
| mode | string | minmax |
| params | dictionary | None |

Note

The function supports four different scaling methods: MinMaxScaler (when mode is ‘minmax’), StandardScaler (when mode is ‘standard’), RobustScaler (when mode is ‘robust’) and MaxAbsScaler (when mode is ‘maxabs’).

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | df | pandas dataframe | always |

corr_high

It gets the names of the features that have a high correlation with the output label.

| Parameters | Datatype | Default Value |
| --- | --- | --- |
| df | pandas dataframe | |
| output | string | |
| strengths | string list | [‘perfect’, ‘strong’, ‘weak’] |
| verbose | boolean | True |

Note

strengths can only contain ‘perfect’, ‘strong’, ‘weak’ and ‘uncorrelated’.

| Priority (in return) | Returns | Datatype | Condition |
| --- | --- | --- | --- |
| 1 | feature_high | string list | always |