pysemantic package
Submodules
pysemantic.cli module
semantic

Usage:
    semantic list [--project=<PROJECT_NAME>]
    semantic add PROJECT_NAME PROJECT_SPECFILE
    semantic remove PROJECT_NAME [--dataset=<dname>]
    semantic set-schema PROJECT_NAME SCHEMA_FPATH
    semantic set-specs PROJECT_NAME --dataset=<dname> [--path=<pth>] [--dlm=<sep>]
    semantic add-dataset DATASET_NAME --project=<pname> --path=<pth> --dlm=<sep>
    semantic export PROJECT_NAME [--dataset=<dname>] OUTPATH

Options:
    -h --help             Show this screen
    -d --dataset=<dname>  Name of the dataset to modify
    --path=<pth>          Path to a dataset
    --dlm=<sep>           Declare the delimiter for a dataset
    -p --project=<pname>  Name of the project to modify
    -v --version          Print the version of PySemantic
pysemantic.cli.cli(arguments)
    cli - The main CLI argument parser.

    Parameters: arguments (dict) – command line arguments, as parsed by docopt
    Returns: None
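The arguments dict that docopt produces maps commands and options to values; a minimal sketch of how such a dict might be dispatched (the handler return values here are hypothetical stand-ins for calls into pysemantic.project, purely for illustration):

```python
def dispatch(arguments):
    """Dispatch a docopt-style arguments dict.

    `arguments` maps command and option names to values, e.g.
    {'list': True, '--project': 'skynet', 'add': False}.
    The returned tuples are hypothetical stand-ins for calls into
    pysemantic.project.
    """
    if arguments.get("list"):
        return ("list", arguments.get("--project"))
    if arguments.get("add"):
        return ("add", arguments.get("PROJECT_NAME"),
                arguments.get("PROJECT_SPECFILE"))
    if arguments.get("remove"):
        return ("remove", arguments.get("PROJECT_NAME"),
                arguments.get("--dataset"))
    return None
```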
pysemantic.cli.main()
pysemantic.custom_traits module
Customized traits for advanced validation.
class pysemantic.custom_traits.AbsFile(value='', filter=None, auto_set=False, entries=0, exists=False, **metadata)
    Bases: traits.trait_types.File

    A File trait whose value must be an absolute path to an existing file.

    validate(obj, name, value)
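The check AbsFile performs can be sketched outside the Traits machinery with plain os.path calls (a simplified stand-alone version, not the actual trait code):

```python
import os

def validate_absfile(value):
    """Accept only an absolute path to an existing file, raising
    ValueError otherwise -- the invariant an AbsFile trait enforces."""
    if not os.path.isabs(value):
        raise ValueError("%r is not an absolute path" % value)
    if not os.path.isfile(value):
        raise ValueError("%r is not an existing file" % value)
    return value
```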
class pysemantic.custom_traits.DTypeTraitDictObject(trait, object, name, value)
    Bases: traits.trait_handlers.TraitDictObject

    Subclassed from the parent to aid the validation of DTypesDicts.
class pysemantic.custom_traits.DTypesDict(key_trait=None, value_trait=None, value=None, items=True, **metadata)
    Bases: traits.trait_types.Dict

    A trait whose keys are strings and whose values are Type traits. Ideally this is the kind of dictionary that is passed as the dtypes argument to pandas.read_table.

    validate(obj, name, value)
        Subclassed from the parent to return a DTypeTraitDictObject instead of traits.trait_handlers.TraitDictObject.
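The invariant this trait guards (string keys mapping to type objects, the shape pandas expects for dtype specifications) can be sketched as a plain check, outside the Traits machinery:

```python
def validate_dtypes(dtypes):
    """Verify a dtypes-style mapping: every key must be a string and
    every value a Python type. Returns the mapping unchanged."""
    for key, value in dtypes.items():
        if not isinstance(key, str):
            raise TypeError("key %r is not a string" % (key,))
        if not isinstance(value, type):
            raise TypeError("value %r is not a type" % (value,))
    return dtypes
```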
pysemantic.errors module
Errors.

exception pysemantic.errors.MissingConfigError
    Bases: exceptions.Exception

    Error raised when the pysemantic configuration file is not found.
exception pysemantic.errors.MissingProject
    Bases: exceptions.Exception

    Error raised when a project is not found.

pysemantic.exporters module
Exporters from PySemantic to databases or other data sinks.
class pysemantic.exporters.AbstractExporter
    Bases: object

    Abstract exporter for dataframes that have been cleaned.

    get(**kwargs)

    set(**kwargs)
class pysemantic.exporters.AerospikeExporter(config, dataframe)
    Bases: pysemantic.exporters.AbstractExporter

    Example class for exporting to an Aerospike database.

    run()

    set(key_tuple, bins)
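The exporter classes follow a simple template pattern: `run` iterates over the data and delegates each record to `set`. A sketch with a hypothetical in-memory exporter standing in for the Aerospike client:

```python
class AbstractExporter(object):
    """Template for exporters of cleaned dataframes."""

    def get(self, **kwargs):
        raise NotImplementedError

    def set(self, **kwargs):
        raise NotImplementedError


class DictExporter(AbstractExporter):
    """Hypothetical exporter that stores records in a dict, mirroring
    the (key_tuple, bins) interface of AerospikeExporter.set."""

    def __init__(self):
        self.store = {}

    def set(self, key_tuple, bins):
        self.store[key_tuple] = bins

    def run(self, records):
        # Each record is a (key_tuple, bins) pair.
        for key_tuple, bins in records:
            self.set(key_tuple, bins)
```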
pysemantic.project module
The Project class.
class pysemantic.project.Project(project_name=None, parser=None, schema=None)
    Bases: object

    The Project class, the entry point for most things in this module.

    datasets
        List the datasets registered under the parent project.

        Example:
        >>> project = Project('skynet')
        >>> project.datasets
        ['sarah connor', 'john connor', 'kyle reese']
    export_dataset(dataset_name, dataframe=None, outpath=None)
        Export a dataset to an exporter defined in the schema. If nothing is specified in the schema, the dataset is exported to a CSV file named <dataset_name>.csv.

        Parameters:
            dataset_name (str) – Name of the dataset to export.
            dataframe – Pandas dataframe to export. If None (default), the dataframe is loaded using the load_dataset method.

    get_dataset_specs(dataset_name)
        Returns the specifications for the specified dataset in the project.

        Parameters: dataset_name (str) – Name of the dataset
        Returns: Parser arguments required to import the dataset in pandas.
        Return type: dict

    get_project_specs()
        Returns a dictionary containing the schema for all datasets listed under this project.

        Returns: Parser arguments for all datasets listed under the project.
        Return type: dict

    load_dataset(dataset_name)
        Load and return a dataset.

        Parameters: dataset_name (str) – Name of the dataset
        Returns: A pandas DataFrame containing the dataset.
        Return type: pandas.DataFrame

        Example:
        >>> demo_project = Project('pysemantic_demo')
        >>> iris = demo_project.load_dataset('iris')
        >>> type(iris)
        pandas.core.DataFrame

    load_datasets()
        Load and return all datasets.

        Returns: dictionary like {dataset_name: dataframe}
        Return type: dict

    reload_data_dict()
        Reload the data dictionary and re-populate the schema.

    set_dataset_specs(dataset_name, specs, write_to_file=False)
        Set the specifications for the dataset. Using this is not recommended; all specifications for datasets should be handled through the data dictionary.

        Parameters:
            dataset_name (str) – Name of the dataset for which specifications need to be modified.
            specs (dict) – A dictionary containing the new specifications for the dataset.
            write_to_file (bool) – If True, the data dictionary is updated with the new specifications. If False (the default), the new specifications are used for the respective dataset only for the lifetime of the Project object.
        Returns: None

    update_dataset(dataset_name, dataframe, path=None, **kwargs)
        This is tricky.
pysemantic.project.add_dataset(project_name, dataset_name, dataset_specs)
    Add a dataset to a project.

    Parameters:
        project_name (str) – Name of the project to which the dataset is added.
        dataset_name (str) – Name of the dataset to add.
        dataset_specs (dict) – Specifications of the dataset.
    Returns: None
pysemantic.project.add_project(project_name, specfile)
    Add a project to the global configuration file.

    Parameters:
        project_name (str) – Name of the project to add.
        specfile (str) – Path to the data dictionary used by the project.
    Returns: None
pysemantic.project.get_datasets(project_name=None)
    Get the names of all datasets registered under the project project_name.

    Parameters: project_name (str) – Name of the project to list the datasets from. If None (default), datasets under all projects are returned.
    Returns: List of datasets listed under project_name, or, if project_name is None, a dictionary such that {project_name: [list of datasets]}
    Return type: dict or list

    Example:
    >>> get_datasets('skynet')
    ['sarah_connor', 'john_connor', 'kyle_reese']
    >>> get_datasets()
    {'skynet': ['sarah_connor', 'john_connor', 'kyle_reese'],
     'south park': ['stan', 'kyle', 'cartman', 'kenny']}
pysemantic.project.get_default_specfile(project_name)
    Returns the specifications file used by the given project. The configuration file is searched for first in the current directory and then in the home directory.

    Parameters: project_name (str) – Name of the project for which to get the specfile.
    Returns: Path to the data dictionary of the project.
    Return type: str

    Example:
    >>> get_default_specfile('skynet')
    '/home/username/projects/skynet/schema.yaml'
pysemantic.project.get_projects()
    Get the list of projects currently registered with pysemantic.

    Returns: List of tuples, such that each tuple is (project_name, location_of_specfile)
    Return type: list

    Example:
    >>> get_projects()
    ['skynet', 'south park']
pysemantic.project.get_schema_specs(project_name, dataset_name=None)
    Get the specifications of a dataset as specified in the schema.

    Parameters:
        project_name (str) – Name of the project containing the dataset.
        dataset_name (str) – Name of the dataset. If None (default), specifications for all datasets in the project are returned.
    Returns: Schema for the dataset
    Return type: dict

    Example:
    >>> get_schema_specs('skynet')
    {'sarah connor': {'path': '/path/to/sarah_connor.csv', 'delimiter': ','},
     'kyle reese': {'path': '/path/to/kyle_reese.tsv', 'delimiter': ' '},
     'john connor': {'path': '/path/to/john_connor.txt', 'delimiter': ' '}}
pysemantic.project.locate_config_file()
    Locates the configuration file used by semantic.

    Returns: Path of the pysemantic config file.
    Return type: str

    Example:
    >>> locate_config_file()
    '/home/username/pysemantic.conf'
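The search order used by functions like this one (current directory first, then the home directory, per get_default_specfile above) can be sketched as follows; the default filename and directory list are assumptions for illustration:

```python
import os

def locate_file(filename="pysemantic.conf", search_dirs=None):
    """Return the path of the first directory containing `filename`,
    searching the current directory and then the home directory by
    default; None if the file is found nowhere."""
    if search_dirs is None:
        search_dirs = [os.getcwd(), os.path.expanduser("~")]
    for dirname in search_dirs:
        candidate = os.path.join(dirname, filename)
        if os.path.isfile(candidate):
            return candidate
    return None
```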
pysemantic.project.remove_dataset(project_name, dataset_name)
    Removes a dataset from a project.

    Parameters:
        project_name (str) – Name of the project containing the dataset.
        dataset_name (str) – Name of the dataset to remove.
    Returns: None
pysemantic.project.remove_project(project_name)
    Remove a project from the global configuration file.

    Parameters: project_name (str) – Name of the project to remove.
    Returns: True if the project existed
    Return type: bool

    Example:
    >>> view_projects()
    Project skynet with specfile at /path/to/skynet.yaml
    Project south park with specfile at /path/to/south_park.yaml
    >>> remove_project('skynet')
    >>> view_projects()
    Project south park with specfile at /path/to/south_park.yaml
pysemantic.project.set_schema_fpath(project_name, schema_fpath)
    Set the schema path for a given project.

    Parameters:
        project_name (str) – Name of the project whose schema path is set.
        schema_fpath (str) – Path to the new schema file.
    Returns: True, if setting the schema path was successful.

    Example:
    >>> set_schema_fpath('skynet', '/path/to/new/schema.yaml')
    True
pysemantic.project.set_schema_specs(project_name, dataset_name, **kwargs)
    Set the schema specifications for a dataset.

    Parameters:
        project_name (str) – Name of the project containing the dataset.
        dataset_name (str) – Name of the dataset whose specifications are set.
        kwargs – Schema fields to set, passed as keyword arguments.
    Returns: None

    Example:
    >>> set_schema_specs('skynet', 'kyle reese', path='/path/to/new/file.csv', delimiter=new_delimiter)
pysemantic.project.view_projects()
    View a list of all projects currently registered with pysemantic.

    Example:
    >>> view_projects()
    Project skynet with specfile at /path/to/skynet.yaml
    Project south park with specfile at /path/to/south_park.yaml
pysemantic.utils module
Miscellaneous bells and whistles.
class pysemantic.utils.TypeEncoder(skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, encoding='utf-8', default=None)
    Bases: json.encoder.JSONEncoder

    default(obj)
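TypeEncoder carries no docstring here; judging by its name, it exists so that Python type objects (such as the values of a dtypes dict) survive JSON serialization. A hedged sketch of an encoder with that behavior (not necessarily the library's actual implementation):

```python
import json

class TypeNameEncoder(json.JSONEncoder):
    """JSONEncoder whose default() renders Python type objects by
    name, so a mapping like {'age': int} can be dumped to JSON."""

    def default(self, obj):
        if isinstance(obj, type):
            return obj.__name__
        return json.JSONEncoder.default(self, obj)
```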
pysemantic.utils.colnames(filename, **kwargs)
    Read the column names of a delimited file without actually reading the whole file. This is simply a wrapper around pandas.read_csv that reads only one row and returns the column names.

    Parameters:
        filename (str) – Path to the file to be read
        kwargs – Arguments to be passed to pandas.read_csv
    Return type: list

    Example: Suppose we want to see the column names of the Fisher iris dataset.
    >>> colnames("/path/to/iris.csv")
    ['Sepal Length', 'Petal Length', 'Sepal Width', 'Petal Width', 'Species']
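The same effect can be sketched with the stdlib csv module, reading only the header row instead of pulling in pandas (a simplified stand-in, not the library's implementation):

```python
import csv

def read_header(filename, delimiter=","):
    """Return the first row of a delimited file as a list of column
    names, without reading the rest of the file."""
    with open(filename, "r", newline="") as fh:
        reader = csv.reader(fh, delimiter=delimiter)
        return next(reader)
```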
pysemantic.utils.generate_questionnaire(filepath)
    Generate a questionnaire for data at filepath.

    This questionnaire will be presented to the client, which helps us automatically generate the schema.

    Parameters: filepath (str) – Path to the file that needs to be ingested.
    Returns: A dictionary of questions and their possible answers. The format of the dictionary is such that every key is a question to be put to the client, and its value is a list of possible answers. The first item in the list is the default value.
    Return type: dict
pysemantic.utils.get_md5_checksum(filepath)
    Get the MD5 checksum of a file.

    Parameters: filepath (str) – Path to the file of which to calculate the MD5 checksum.
    Returns: MD5 checksum of the file.
    Return type: str

    Example:
    >>> get_md5_checksum('pysemantic/tests/testdata/iris.csv')
    '9b3ecf3031979169c0ecc5e03cfe20a6'
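An equivalent checksum can be computed with the stdlib hashlib module, streaming the file in chunks so large files are never read into memory at once (a sketch of the idea, not the library's exact code):

```python
import hashlib

def md5_checksum(filepath, chunk_size=65536):
    """Stream `filepath` through an MD5 hash and return the hex digest."""
    digest = hashlib.md5()
    with open(filepath, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```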
pysemantic.validator module
Traited Data validator for pandas.DataFrame objects.
class pysemantic.validator.DataFrameValidator
    Bases: traits.has_traits.HasTraits

    A validator class for pandas.DataFrame objects.

    clean()
        Return the converted dataframe after enforcing all rules.

    rename_columns()
        Rename columns in the dataframe as per the schema.
class pysemantic.validator.SchemaValidator(**kwargs)
    Bases: traits.has_traits.HasTraits

    A validator class for schema in the data dictionary.

    classmethod from_dict(specification)
        Get a validator from a schema dictionary.

        Parameters: specification – Dictionary containing schema specifications.

    classmethod from_specfile(specfile, name, **kwargs)
        Get a validator from a schema file.

        Parameters:
            specfile – Path to the schema file.
            name – Name of the project to create the validator for.

    get_parser_args()
        Return parser args as required by pandas parsers.

    set_parser_args(specs, write_to_file=False)
        Magic method required by Property traits.

    to_dict()
        Return parser args as required by pandas parsers.
class
pysemantic.validator.
SeriesValidator
¶ Bases:
traits.has_traits.HasTraits
A validator class for pandas.Series objects.
-
apply_minmax_rules
()¶ Restrict the series to the minimum and maximum from the schema.
-
apply_regex
()¶ Apply a regex filter on strings in the series.
-
apply_uniques
()¶ Remove all values not included in the uniques.
-
clean
()¶ Return the converted dataframe after enforcing all rules.
-
do_drop_duplicates
()¶ Drop duplicates from the series if required.
-
do_drop_na
()¶ Drop NAs from the series if required.
-
do_postprocessing
()¶
-
drop_excluded
()¶ Remove all values specified in exclude_values.
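The uniques and exclusion rules can be sketched on plain lists (the real methods operate on pandas.Series objects, but the filtering logic is the same):

```python
def apply_uniques(values, uniques):
    """Keep only values that appear in `uniques`."""
    return [v for v in values if v in uniques]

def drop_excluded(values, exclude_values):
    """Remove every value listed in `exclude_values`."""
    return [v for v in values if v not in exclude_values]
```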
Module contents

class pysemantic.Project(project_name=None, parser=None, schema=None)
    Bases: object

    The Project class, the entry point for most things in this module.

    datasets
        List the datasets registered under the parent project.

        Example:
        >>> project = Project('skynet')
        >>> project.datasets
        ['sarah connor', 'john connor', 'kyle reese']
    export_dataset(dataset_name, dataframe=None, outpath=None)
        Export a dataset to an exporter defined in the schema. If nothing is specified in the schema, the dataset is exported to a CSV file named <dataset_name>.csv.

        Parameters:
            dataset_name (str) – Name of the dataset to export.
            dataframe – Pandas dataframe to export. If None (default), the dataframe is loaded using the load_dataset method.

    get_dataset_specs(dataset_name)
        Returns the specifications for the specified dataset in the project.

        Parameters: dataset_name (str) – Name of the dataset
        Returns: Parser arguments required to import the dataset in pandas.
        Return type: dict

    get_project_specs()
        Returns a dictionary containing the schema for all datasets listed under this project.

        Returns: Parser arguments for all datasets listed under the project.
        Return type: dict

    load_dataset(dataset_name)
        Load and return a dataset.

        Parameters: dataset_name (str) – Name of the dataset
        Returns: A pandas DataFrame containing the dataset.
        Return type: pandas.DataFrame

        Example:
        >>> demo_project = Project('pysemantic_demo')
        >>> iris = demo_project.load_dataset('iris')
        >>> type(iris)
        pandas.core.DataFrame

    load_datasets()
        Load and return all datasets.

        Returns: dictionary like {dataset_name: dataframe}
        Return type: dict

    reload_data_dict()
        Reload the data dictionary and re-populate the schema.

    set_dataset_specs(dataset_name, specs, write_to_file=False)
        Set the specifications for the dataset. Using this is not recommended; all specifications for datasets should be handled through the data dictionary.

        Parameters:
            dataset_name (str) – Name of the dataset for which specifications need to be modified.
            specs (dict) – A dictionary containing the new specifications for the dataset.
            write_to_file (bool) – If True, the data dictionary is updated with the new specifications. If False (the default), the new specifications are used for the respective dataset only for the lifetime of the Project object.
        Returns: None
    update_dataset(dataset_name, dataframe, path=None, **kwargs)
        This is tricky.

pysemantic.test()
    Interactive loader for tests.