Data Tools
This part of the library is designed for data analysis and data manipulation.
col_types
The function gets information about datatypes for each column inside the dataframe.
| Parameters | Datatype | Default Value |
|---|---|---|
| df | pandas dataframe | |
| print_columns | boolean | False |
Attention
It prints out the results if print_columns is True.
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | types | list | always |
import pandas as pd
from wolta.data_tools import col_types
df = pd.read_csv('data.csv')
columns = list(df.columns)
types = col_types(df)
# prints out the datatype for each column
for i in range(len(columns)):
    print('{}: {}'.format(columns[i], types[i]))
# or just
col_types(df, print_columns=True)
unique_amounts
The function returns the number of unique values for each column inside the dataframe.
| Parameters | Datatype | Default Value |
|---|---|---|
| df | pandas dataframe | |
| strategy | list | None |
| print_dict | boolean | False |
Attention
To examine only specific columns, list them in strategy. The results are also printed if print_dict is True.
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | space | dict | always |
make_numerics
This function converts categorical data to numerical data.
| Parameters | Datatype | Default Value |
|---|---|---|
| column | 1D array | |
| space_requested | boolean | False |
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | column | 1D array | always |
| 2 | space | dict | space_requested is True |
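The idea of turning categories into numbers, together with the optional mapping dictionary, can be sketched in plain Python. This is only an illustration and not the library's implementation: the helper name `encode_categories` and the choice to number categories in order of first appearance are assumptions.

```python
def encode_categories(column, space_requested=False):
    """Hypothetical helper: map each distinct category to an integer code."""
    space = {}      # category -> integer code (assumed: order of first appearance)
    encoded = []
    for value in column:
        if value not in space:
            space[value] = len(space)
        encoded.append(space[value])
    if space_requested:
        return encoded, space
    return encoded


colours = ['red', 'blue', 'red', 'green']
codes, mapping = encode_categories(colours, space_requested=True)
print(codes)    # [0, 1, 0, 2]
print(mapping)  # {'red': 0, 'blue': 1, 'green': 2}
```

Keeping the mapping makes it possible to translate the numeric codes back to the original categories later.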
scale_x
It scales input values by using scikit-learn's StandardScaler.
| Parameters | Datatype | Default Value |
|---|---|---|
| X_train | multi dimensional array | |
| X_test | multi dimensional array | |
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | X_train | multi dimensional array | always |
| 2 | X_test | multi dimensional array | always |
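Standard scaling subtracts the mean of each feature and divides by its standard deviation. The pure-Python sketch below shows that arithmetic under the assumption that the statistics are learned from the training set only and then reused for the test set; the helper name `standardise` is hypothetical and this is not the library's code.

```python
from statistics import fmean, pstdev


def standardise(X_train, X_test):
    """Hypothetical helper: z-score both sets with statistics from X_train."""
    n_features = len(X_train[0])
    means = [fmean([row[j] for row in X_train]) for j in range(n_features)]
    # Population standard deviation; a constant feature keeps a scale of 1.0
    stds = [pstdev([row[j] for row in X_train]) or 1.0 for j in range(n_features)]

    def scale(X):
        return [[(row[j] - means[j]) / stds[j] for j in range(n_features)]
                for row in X]

    return scale(X_train), scale(X_test)


X_train = [[1.0, 10.0], [3.0, 30.0]]
X_test = [[2.0, 40.0]]
scaled_train, scaled_test = standardise(X_train, X_test)
print(scaled_train)  # [[-1.0, -1.0], [1.0, 1.0]]
print(scaled_test)   # [[0.0, 2.0]]
```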
examine_floats
It checks whether the requested columns, which are assumed to be float, actually contain only integer values.
| Parameters | Datatype | Default Value |
|---|---|---|
| df | pandas dataframe | |
| float_columns | list | |
| get | string | float |
Hint
float_columns contains the names of the columns that will be checked by the algorithm.
Hint
get may take two different values: ‘float’ and ‘int’. If it equals ‘float’, the function returns the names of the columns that hold non-integer values. Otherwise, it returns the names of the columns that hold only integer values.
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | space | list | always |
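The check described above can be sketched without pandas by treating the dataset as a dictionary of column lists. The helper name `split_float_columns` and the dictionary layout are assumptions made for illustration only.

```python
def split_float_columns(data, float_columns, get='float'):
    """Hypothetical helper: pick the columns that are (or are not) integer-only."""
    result = []
    for name in float_columns:
        # True when every value has no fractional part, e.g. 21.0
        only_integers = all(float(v).is_integer() for v in data[name])
        if (get == 'int') == only_integers:
            result.append(name)
    return result


data = {'age': [21.0, 35.0, 40.0], 'height': [1.7, 1.82, 1.6]}
print(split_float_columns(data, ['age', 'height']))             # ['height']
print(split_float_columns(data, ['age', 'height'], get='int'))  # ['age']
```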
calculate_bounds
It returns the datatypes required to hold the values, based on their minimum and maximum.
Note
Sometimes datasets can be too large for the system’s capabilities. At this point, decreasing the required space might be essential. This function is designed for this purpose.
Tip
In such cases, I also suggest using the Dask library to get better results.
| Parameters | Datatype | Default Value |
|---|---|---|
| gen_types | list | |
| min_val | int or float | |
| max_val | int or float | |
Hint
gen_types can be easily obtained by using the col_types function.
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | types | list | always |
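To give a feeling for how a datatype can be chosen from a value range, here is a minimal sketch. It assumes that the general types are strings such as 'int64' or 'float64' and that NumPy-style type names are wanted; the helper name `smallest_dtype` and the selection rules are hypothetical and may differ from what the library does.

```python
# Value ranges of the signed integer types, smallest first
INT_RANGES = [
    ('int8', -2**7, 2**7 - 1),
    ('int16', -2**15, 2**15 - 1),
    ('int32', -2**31, 2**31 - 1),
    ('int64', -2**63, 2**63 - 1),
]
FLOAT32_MAX = 3.4028234663852886e38  # largest finite float32 value


def smallest_dtype(gen_type, min_val, max_val):
    """Hypothetical helper: narrowest type name whose range covers the values."""
    if gen_type.startswith('int'):
        for name, low, high in INT_RANGES:
            if low <= min_val and max_val <= high:
                return name
    elif gen_type.startswith('float'):
        # Range check only; float32 also has less precision than float64
        if -FLOAT32_MAX <= min_val and max_val <= FLOAT32_MAX:
            return 'float32'
        return 'float64'
    return gen_type  # other types are left unchanged


gen_types = ['int64', 'int64', 'float64', 'object']
min_val = [0, -40000, 0.5, None]
max_val = [200, 40000, 99.9, None]
types = [smallest_dtype(t, lo, hi) for t, lo, hi in zip(gen_types, min_val, max_val)]
print(types)  # ['int16', 'int32', 'float32', 'object']
```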
calculate_min_max
It provides very detailed information about min values, max values and datatypes for each column.
Note
This function might be beneficial when the device cannot hold the whole dataset in memory, because it is designed with an approach that does not load all data at once.
Hint
Splitting the dataset into multiple small data files is suggested. You may then use the glob library to obtain the paths easily.
| Parameters | Datatype | Default Value |
|---|---|---|
| paths | list | |
| deleted_columns | list | None |
Attention
The indicated columns inside deleted_columns will be excluded during the process.
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | columns | list | always |
| 2 | types | list | always |
| 3 | max_val | list | always |
| 4 | min_val | list | always |
load_by_parts
It loads multiple subsets of a dataset into a single dataframe, with extensive options.
| Parameters | Datatype | Default Value |
|---|---|---|
| paths | list | |
| strategy | string | default |
| deleted_columns | list | None |
| print_description | boolean | False |
| shuffle | boolean | False |
| encoding | string | utf-8 |
Note
strategy can have two different values: ‘default’ and ‘efficient’. The only difference between them is that ‘efficient’ detects datatypes by using the calculate_bounds function. It is not recommended unless strictly required.
Note
encoding accepts any value that is valid for the encoding parameter of pandas’ read_csv function.
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | df | pandas dataframe | always |
create_chunks
It splits the dataset into small csv files.
| Parameters | Datatype | Default Value |
|---|---|---|
| path | string | |
| sample_amount | integer | |
| target_dir | string | None |
| print_description | boolean | False |
| chunk_name | string | part |
Attention
This function does not return a value.
transform_data
It transforms data by using some predetermined techniques. For further reading, you may read the Transformations article.
| Parameters | Datatype | Default Value |
|---|---|---|
| X | multi dimensional array | |
| y | 1D array | |
| strategy | string | log-m |
Note
strategy can have these values: ‘log’, ‘log-m’, ‘log2’, ‘log2-m’, ‘log10’, ‘log10-m’, ‘sqrt’, ‘sqrt-m’, ‘cbrt’
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | X | multi dimensional array | always |
| 2 | y | 1D array | always |
| 3 | amin_x | integer or float | strategy ends with -m |
| 4 | amin_y | integer or float | strategy ends with -m |
transform_pred
It can be thought of as the reverse of transform_data. For further reading, you may read the Transformations article.
| Parameters | Datatype | Default Value |
|---|---|---|
| y_pred | 1D array | |
| strategy | string | log-m |
| amin_y | integer or float | 0 |
Note
strategy can have these values: ‘log’, ‘log-m’, ‘log2’, ‘log2-m’, ‘log10’, ‘log10-m’, ‘sqrt’, ‘sqrt-m’, ‘cbrt’, ‘cbrt-m’
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | y_pred | 1D array | always |
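The relationship between a transformation and its reverse can be illustrated with a small round trip. The sketch below is an assumption about what a ‘-m’ style logarithmic strategy might look like: the minimum is subtracted first so that every value is non-negative, and the same minimum is added back after the reverse step. The exact formulas used by the library are not stated here, and the helper names are hypothetical.

```python
import math


def log_m_forward(y):
    """Hypothetical forward step: shift by the minimum, then apply log(1 + v)."""
    amin_y = min(y)
    return [math.log1p(v - amin_y) for v in y], amin_y


def log_m_inverse(y_pred, amin_y=0):
    """Hypothetical reverse step: undo the logarithm, then undo the shift."""
    return [math.expm1(v) + amin_y for v in y_pred]


y = [-3.0, 0.0, 5.5]
y_transformed, amin_y = log_m_forward(y)
y_restored = log_m_inverse(y_transformed, amin_y=amin_y)
print(amin_y)       # -3.0
print(y_restored)   # approximately the original values
```

Storing the minimum is what makes the reverse step possible, which is why it has to be passed along with the predictions.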
make_categorical
It turns continuous values into discrete ones by using normal distribution. With this approach, the function separates the given data into three groups. For further reading, you may read the Categorization With Normal Distribution article.
| Parameters | Datatype | Default Value |
|---|---|---|
| y | 1D array | |
| strategy | string | normal |
Note
strategy can have these values: ‘normal’ and ‘normal-extra’.
is_normal
It checks whether the given set behaves like a normal distribution.
| Parameters | Datatype | Default Value |
|---|---|---|
| y | 1D array | |
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | result | boolean | always |
seek_null
It checks each column in the dataframe for null values and counts them if there are any. After the process, it returns a list of the names of the columns that have null values.
| Parameters | Datatype | Default Value |
|---|---|---|
| df | pandas dataframe | |
| print_columns | boolean | False |
Attention
When print_columns is True, the number of null values in each column is printed out to the console.
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | null_columns | list | always |
make_null
Sometimes null values are represented in other ways inside the dataset (using ‘unknown’ in string data for example) instead of being actual nulls. This function converts such values, given by replace, into null.
| Parameters | Datatype | Default Value |
|---|---|---|
| matrix | pandas dataframe or numpy array | |
| replace | anything | |
| type | string | df |
Attention
type declares whether matrix is a pandas dataframe or a numpy array. It has two valid values: ‘df’ means pandas dataframe and ‘np’ means numpy array.
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | matrix | pandas dataframe or numpy array | always |
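The idea of turning a placeholder into a real null can be sketched on a plain list of rows. This is an illustration only: the helper name `to_null` is hypothetical, and Python's None stands in for whatever null representation the dataframe or array actually uses.

```python
def to_null(rows, replace):
    """Hypothetical helper: swap every cell equal to `replace` for None."""
    return [[None if cell == replace else cell for cell in row] for row in rows]


rows = [
    ['Alice', 'unknown'],
    ['unknown', 'Berlin'],
]
print(to_null(rows, 'unknown'))  # [['Alice', None], [None, 'Berlin']]
```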
stat_sum
It summarises statistical information about the dataframe, like the describe method from pandas.
| Parameters | Datatype | Default Value |
|---|---|---|
| df | pandas dataframe | |
| requested | list | |
| only | list | None |
| exclude | list | None |
| get_dict | boolean | False |
| verbose | boolean | True |
Tip
If only is not None, only indicated columns in the list are examined.
Tip
If exclude is not None, all columns except those indicated in the list are examined.
Here is the list of valid keywords for requested:
| Valid Keyword | Meaning |
|---|---|
| all | if the list has it at index zero then it presumes that it contains all keywords |
| min | minimum |
| max | maximum |
| width | the difference between max and min |
| mean | arithmetic mean |
| std | standard deviation |
| med | median |
| var | variance |
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | gen_results | dictionary | if get_dict is True |
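The keywords above map naturally onto standard statistics. The sketch below computes them for a single column with Python's statistics module. The helper name `summarise` is hypothetical, and the use of the sample standard deviation and variance (the pandas default) is an assumption, not something this page states.

```python
import statistics

# Keyword -> function computing that statistic for one column
STATS = {
    'min': min,
    'max': max,
    'width': lambda values: max(values) - min(values),
    'mean': statistics.fmean,
    'std': statistics.stdev,      # sample standard deviation (assumed)
    'med': statistics.median,
    'var': statistics.variance,   # sample variance (assumed)
}


def summarise(values, requested):
    """Hypothetical helper: compute the requested statistics for one column."""
    if requested and requested[0] == 'all':
        requested = list(STATS)
    return {key: STATS[key](values) for key in requested}


column = [2, 4, 4, 4, 5, 5, 7, 9]
print(summarise(column, ['min', 'max', 'width', 'med']))
# {'min': 2, 'max': 9, 'width': 7, 'med': 4.5}
print(summarise(column, ['all']))
```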
extract_float
Sometimes float data might be held with a different representation (‘3.5$’ for example). In that case, the unwanted symbols (the dollar sign in this example) can be deleted and the datatype of the list can be converted from string to float.
| Parameters | Datatype | Default Value |
|---|---|---|
| column | 1D array | |
| symbols | list | |
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | column | 1D array | always |
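As a minimal sketch of the cleaning step described above, the following plain-Python helper removes each unwanted symbol and converts what is left to float. The name `strip_to_float` is hypothetical and the code is not taken from the library.

```python
def strip_to_float(column, symbols):
    """Hypothetical helper: delete every symbol, then convert to float."""
    cleaned = []
    for value in column:
        text = str(value)
        for symbol in symbols:
            text = text.replace(symbol, '')
        cleaned.append(float(text))
    return cleaned


prices = ['3.5$', '$12', '7.25$']
print(strip_to_float(prices, ['$']))  # [3.5, 12.0, 7.25]
```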
col_counts
It reports the frequency of unique values in columns by using the value_counts function from pandas.
| Parameters | Datatype | Default Value |
|---|---|---|
| df | pandas dataframe | |
| exclude | list | None |
| only | list | None |
Tip
If only is not None, only indicated columns in the list are examined.
Tip
If exclude is not None, all columns except those indicated in the list are examined.
Attention
This function only prints out the result to the console. It does not return anything.
check_similarity
Sometimes the very same information can be held in two different columns with different representations. For example, area codes can be stored as digits in one column and as their actual names in another, but both columns hold the same thing. This function checks whether two columns are related in this way.
| Parameters | Datatype | Default Value |
|---|---|---|
| col1 | 1D array | |
| col2 | 1D array | |
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | similarity | boolean | always |
Attention
If similarity is true, then it means that these columns have the same information.
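One way to make "the same information" precise is to require a one-to-one mapping between the values of the two columns. The sketch below uses that definition as an assumption; the helper name `same_information` is hypothetical and the library may apply a different test.

```python
def same_information(col1, col2):
    """Hypothetical helper: True if the values map one-to-one onto each other."""
    if len(col1) != len(col2):
        return False
    forward, backward = {}, {}
    for a, b in zip(col1, col2):
        # setdefault stores the first pairing and returns it on later visits
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


codes = [212, 312, 212, 415]
names = ['New York', 'Chicago', 'New York', 'San Francisco']
print(same_information(codes, names))                  # True
print(same_information([212, 312, 212], ['A', 'B', 'B']))  # False
```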
find_broke
If the datatype of a column is different than expected, this function can be used to examine the column and find the reason.
| Parameters | Datatype | Default Value |
|---|---|---|
| column | 1D array | |
| dtype | datatype | float |
| get_indexes | boolean | True |
| get_words | boolean | False |
| verbose | boolean | True |
| verbose_limit | integer | 10 |
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | indexes | list | get_indexes is True |
| 2 | words | list | get_words is True |
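A simple way to locate the entries that break a column's datatype is to try the conversion on every value and record the failures. The following sketch assumes that this is the kind of check intended; the helper name `find_unconvertible` is hypothetical.

```python
def find_unconvertible(column, dtype=float):
    """Hypothetical helper: positions and values that cannot be cast to dtype."""
    indexes, words = [], []
    for i, value in enumerate(column):
        try:
            dtype(value)
        except (TypeError, ValueError):
            indexes.append(i)
            words.append(value)
    return indexes, words


column = ['1.5', '2', 'n/a', '4.0', '']
indexes, words = find_unconvertible(column)
print(indexes)  # [2, 4]
print(words)    # ['n/a', '']
```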
expand_df
It oversamples the dataset by using SMOTE.
| Parameters | Datatype | Default Value |
|---|---|---|
| df | pandas dataframe | |
| output | string | |
| sampling_strategy | string or dictionary | |
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | df | pandas dataframe | always |
split_as_df
It splits X and y arrays into train and test pandas dataframes instead of arrays.
| Parameters | Datatype | Default Value |
|---|---|---|
| X | multidimensional array | |
| y | 1D array | |
| features | list | |
| output | string | |
| test_size | float | |
| random_state | int | 42 |
| shuffle | boolean | True |
| stratify | 1D array | None |
Note
features are the list of the names of columns in the X array. output is the name of the column in the y array.
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | dftrain | pandas dataframe | always |
| 2 | dftest | pandas dataframe | always |
train_test_val_split
It splits the data into three groups: train, validation and test.
| Parameters | Datatype | Default Value |
|---|---|---|
| X | multidimensional array | |
| y | 1D array | |
| test_size | float | |
| val_size | float | |
| random_state | int | 42 |
| shuffle | boolean | True |
| stratify | 1D array | None |
| stratify_for_val | boolean | True |
Attention
The ratios given as test_size and val_size must be relative to the whole dataset.
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | X_train | multidimensional array | always |
| 2 | X_test | multidimensional array | always |
| 3 | X_val | multidimensional array | always |
| 4 | y_train | 1D array | always |
| 5 | y_test | 1D array | always |
| 6 | y_val | 1D array | always |
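Because both ratios refer to the whole dataset, a two-stage split has to rescale the validation ratio for the second stage. The arithmetic below shows this under the assumption that the test set is cut first and the validation set is then taken from the remainder; the helper name `split_counts` is hypothetical and the library may organise the split differently.

```python
def split_counts(n_samples, test_size, val_size):
    """Hypothetical helper: sample counts of a two-stage train/test/val split."""
    if test_size <= 0 or val_size <= 0 or test_size + val_size >= 1:
        raise ValueError('test_size and val_size must be positive and sum to less than 1')
    n_test = round(n_samples * test_size)
    remaining = n_samples - n_test
    # val_size refers to the whole dataset, so rescale it for the remainder
    second_ratio = val_size / (1 - test_size)
    n_val = round(remaining * second_ratio)
    return remaining - n_val, n_test, n_val


n_train, n_test, n_val = split_counts(1000, test_size=0.2, val_size=0.1)
print(n_train, n_test, n_val)  # 700 200 100
```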
synthetic_expand
It generates synthetic data based on actual data. It is designed to create datasets for educational purposes in data science.
| Parameters | Datatype | Default Value |
|---|---|---|
| df | pandas dataframe | |
| feature_info | dictionary | |
| shape_zero | integer | |
Attention
The keys of feature_info must contain the name of every requested feature from the dataframe. Each value must be one of two options: continuous or discrete.
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | df | pandas dataframe | always |
multi_split
It splits a multi-label dataset. The function shuffles the dataset n times and, each time, looks for the same sample-amount distribution in the test set for the classes in every label.
| Parameters | Datatype | Default Value |
|---|---|---|
| df | pandas dataframe | |
| labels | list | |
| test_size | float | |
| times | integer | 50 |
Note
labels holds the list of the label names in the dataframe.
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | X_train | multidimensional array | always |
| 2 | X_test | multidimensional array | always |
| 3 | y_trains | dictionary | always |
| 4 | y_tests | dictionary | always |
Attention
y_trains and y_tests are dictionaries that map each label name to a 1D array.
corr_analyse
It calculates the correlation between each pair of features in the dataset. After the calculation, it also groups them with respect to their strength.
Attention
Each correlation value must be between -1 and 1. In the grouping phase, the range is taken as 0 to 1, because the negative side is treated as symmetric to the positive side. The new range is then split into four groups, which are ‘uncorrelated’, ‘weak’, ‘strong’ and ‘perfect’.
| group | interval |
|---|---|
| uncorrelated | 0 <= score <= un_w |
| weak | un_w < score <= w_s |
| strong | w_s < score <= s_p |
| perfect | s_p < score <= 1 |
| Parameters | Datatype | Default Value |
|---|---|---|
| array | 2D array | |
| columns | list | |
| un_w | float | 0.1 |
| w_s | float | 0.5 |
| s_p | float | 0.9 |
| verbose | boolean | True |
| get_matrix | boolean | False |
| csv_path | string | None |
Note
If csv_path is not None, the scores are logged in a CSV file.
Note
The function always returns a dictionary with keys ‘columns’ and ‘score’.
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | results | dictionary | always |
| 2 | matrix | 2D array | get_matrix is True |
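The grouping rule from the interval table can be written as a small function. The thresholds and their default values come from the tables above; the helper name `strength` is hypothetical and the function only illustrates the rule, not the library's code.

```python
def strength(score, un_w=0.1, w_s=0.5, s_p=0.9):
    """Hypothetical helper: name the strength group of one correlation score."""
    if not -1 <= score <= 1:
        raise ValueError('a correlation score must be between -1 and 1')
    magnitude = abs(score)  # the negative side mirrors the positive side
    if magnitude <= un_w:
        return 'uncorrelated'
    if magnitude <= w_s:
        return 'weak'
    if magnitude <= s_p:
        return 'strong'
    return 'perfect'


for value in (0.05, -0.3, 0.9, -0.95):
    print(value, strength(value))
# 0.05 uncorrelated
# -0.3 weak
# 0.9 strong
# -0.95 perfect
```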
scale_df
It scales the values of the pandas dataframe and returns a pandas dataframe again. If requested, output label(s) might be excluded from this process.
| Parameters | Datatype | Default Value |
|---|---|---|
| df | pandas dataframe | |
| output | string or string list | None |
| mode | string | minmax |
| params | dictionary | None |
Note
The function supports four different scaling methods: MinMaxScaler (when mode is ‘minmax’), StandardScaler (when mode is ‘standard’), RobustScaler (when mode is ‘robust’) and MaxAbsScaler (when mode is ‘maxabs’).
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | df | pandas dataframe | always |
corr_high
It gets the names of the features that have a high correlation with the output label.
| Parameters | Datatype | Default Value |
|---|---|---|
| df | pandas dataframe | |
| output | string | |
| strengths | string list | [‘perfect’, ‘strong’, ‘weak’] |
| verbose | boolean | True |
Note
strengths can only contain ‘perfect’, ‘strong’, ‘weak’ and ‘uncorrelated’.
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | feature_high | string list | always |