Contributing to sdssdb¶
Contributions to sdssdb
are most welcome. Product development happens on its GitHub repository. For details on how to develop for an SDSS product refer to the coding style guide. All contributions to sdssdb
need to be done as pull requests against the main
branch.
Contributing a new database or schema¶
In addition to improvements to the code, you can contribute database connections and model classes for your databases. To do so, first remember the directory structure of sdssdb
sdssdb
|
|__ peewee
| |
| |__ database_name
| |
| |__ __init__.py
| |__ schema1.py
| |__ schema2.py
|
|__ sqlalchemy
| |
| |__ database_name
| |
| |__ __init__.py
| |__ schema1.py
| |__ schema2.py
Let’s imagine you want to create files for a database called awesomedb
which has two schemas: amazing
and stupendous
. Depending on whether you are creating model classes for peewee or sqlalchemy (or both),
you will need to create a directory called awesomedb
under the correct library directory with a
__init__.py
file and two amazing.py
and stupendous.py
files. The following sections will show you
how to fill out those files depending on the library used.
Peewee¶
For an example of how to implement a database with Peewee you can look at the sdss5db implementation. Let’s start with the database connection in the __init__.py
file. The basic structure is quite simple and would look something like
from sdssdb.connection import PeeweeDatabaseConnection
from sdssdb.peewee import BaseModel
class AwesomedbDatabaseConnection(PeeweeDatabaseConnection):
dbname = 'awesomedb'
database = AwesomedbDatabaseConnection(autoconnect=True)
# Create a new base model class for the observatory and bind the database
class AwesomedbModel(BaseModel):
class Meta:
database = database
The first two lines simply import the base classes for the database connection and base model class. We then subclass PeeweeDatabaseConnection
to create the connection for awesomedb
, overriding the dbname
attribute. We then instantiate the database connection as database
. Note the autoconnect=True
parameter which tells the database connection to try to use the best available profile to connect when the class gets instantiated. Finally, we subclass BaseModel
and we bind the database connection to it.
Next we need to creates the model classes themselves. At its simplest, a model class represents a table in a given schema and contains a list of the columns in the table, each one as a class attribute. Model classes must subclass from a base class (AwesomedbModel
in our example) that has been linked to the database connection. The default mode in Peewee is to explicitely define all columns, as opposed to autoloaded. To help with this task you can use the pwiz model generator. For example, to create a file with the list of model classes for stupendous
you would run, from a terminal
python -m pwiz -e postgresql -s stupendous awesomedb > stupendous.py
Note that you may need to pass additional flags with the username, host, or port. Refer to the documentation for those.
Once the file has been generated you will need to do some changes. On the top of the file import
from . import database
from . import AwesomedbModel as BaseModel
The first line conveniently allows for access to the database connection from the schema submodule. The second one renames AwesomedbModel
to BaseModel
so that all the model classes in the file inherit from it. You’ll probably need to make some other changes to the file, especially to the foreign keys to make sure they match your naming requirements.
Using reflection with Peewee¶
Peewee provides a reflection utility (internally used by pwiz
). Based on this tool we developed a reflection metaclass
that can be used to expedite the creating of models by only requiring to define foreign keys. Note that this is not an official component of Peewee and it comes with certain caveats. Before using the reflection metaclass, make sure to read the API documentation
.
To define a base class with reflection we do
import peewee
from sdssdb.peewee import ReflectMeta
class ReflectBaseModel(peewee.Model, metaclass=ReflectMeta):
class Meta:
primary_key = False # To make sure Peewee doesn't add its own PK.
use_reflection = False # We'll enable reflection manually for certain models.
database = database
class AwesomedbModel(ReflectBaseModel):
class Meta:
use_reflection = True
schema = 'stupendous'
table_name = 'stupendous_table'
When the connection is created this model will be reflected and autocompleted with all the columns that exist in the table. The reflection will include the foreign keys
that have been defined for the table in the database. Sometimes this is not desirable and we’d rather create them manually. In this case we can add the attribute reflection_options = {'skip_foreign_keys': True}
to Meta
in the ReflectBaseModel
. You can check the catalogdb models for an implementation of this type.
If a model inherits from a base model using ReflectMeta
, it is possible to call Model.reflect
manually to trigger a model reflection. This works even if use_reflection=False
.
The default Peewee reflection process requires multiple queries against the database for each table. This can become quite slow if the schema contains many tables or if the connection has significant overhead, for example when connected over an SSH tunnel. In these cases one can set Meta
with reflection_options = {'use_peewee_reflection': False}
. This will use a reflection system that is designed to minimise the number of queries needed for schema introspection. Note that this system is experimental and doesn’t reflect foreign keys or constraints. If you find problems revert to the default use_peewee_reflection=True
.
SQLAlchemy¶
Creating a database connection and model classes for SQLALchemy is quite similar to Peewee. As before, refer to the implementation of sdss5db for a good example. In this case the __init__.py
file would look like
from sdssdb.connection import SQLADatabaseConnection
from sqlalchemy.ext.declarative import declarative_base, DeferredReflection
from sdssdb.sqlalchemy import BaseModel
# we need a shared common Base when joining across multiple schema
AwesomedbBase = declarative_base(cls=(DeferredReflection, BaseModel,))
class AwesomedbDatabaseConnection(SQLADatabaseConnection):
dbname = 'sdss5db'
base = AwesomedbBase
database = AwesomedbDatabaseConnection(autoconnect=True)
Note that we define AwesomedbBase
as a declarative_base
using the SQLAlchemy sdssdb
base class and a DeferredReflection
base class. The latter allows for the autoloading of table columns but only at the time at which the model classes are prepared.
For the model classes you will need to write the files manually but there is no need to fill out all the column names. The deferred reflection will take care of that. An example of how the stupendous.py
file would look is
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import AbstractConcreteBase, declared_attr
from sqlalchemy.orm import relationship
from sdssdb.sqlalchemy.awesome import AwesomedbBase, database
class Base(AbstractConcreteBase, AwesomedbBase):
__abstract__ = True
_schema = 'stupendous'
_relations = 'define_relations'
@declared_attr
def __table_args__(cls):
return {'schema': cls._schema}
class User(Base):
__tablename__ = 'user'
class Address(Base):
__tablename__ = 'address'
print_fields = ['zipcode']
def define_relations():
User.address = relationship(Address, backref='user')
database.add_base(Base)
In this example we have two tables, user
and address
that we model as User
and Address
respectively. Note that we don’t need to specify any column at this point, just the __tablename__
metadata property. All model classes need to subclass from Base
, which in turn subclasses from
AbstractConcreteBase
and AwesomedbBase
. We can use the special attribute
print_fields
to define a list of fields that will be output in the standard representation of the model
instances (primary keys and label
fields are always output).
The define_relations
function must contain all the foreign key relationships for this model. In this
case there only one relationship that allows to retrieve the address for a given User
(and its
back reference). We need to encapsulate the relationships in a function so that they can be recreated if
we change the database connection to point to a different database. Finally, we add the
database.add_base(Base)
statement to bind the base to the database connection.
Testing Your New Database¶
After creating your database, you will want to ensure its stability and robustness as you expand its
capabilities over time. This can be done by writing tests against your database. The testing directory system
is very similar to the sdssdb
database directory, with test database files located within separate library
folders for peewee
databases (pwdbs
) or sqlalchemy
databases (sqladbs
).
tests
|
|__ pwdbs
| |
| |__ __init__.py
| |__ conftest.py
| |__ models.py
| |__ factories.py
| |
| |__ test_database1.py
| |__ test_database2.py
|
|__ sqladbs
| |
| |__ __init__.py
| |__ conftest.py
| |__ models.py
| |__ factories.py
| |
| |__ test_database1.py
| |__ test_database2.py
|
|__ conftest.py
|__ test_generic_items.py
Most Python testing frameworks look for tests in files named test_xxxx.py
. Under each library we create a
test_xxxx
file for each new database we want to test. Since we’ve created a new awesomedb
database, our
testing file will be test_awesomedb.py
. This file gets placed under either the pwdbs
or sqladbs
(or both)
depending on if your database is using peeewee
or sqlalchemy
.
sdssdb
uses pytest as its testing framework, and assumes user
familiarity with pytest. The test directories contain conftest.py
files which are files used for sharing
fixture functions between tests. See here
for more details. You will also see files called models
and factories
. We will come back to these later.
Peewee¶
Let’s see what an example test_awesomedb.py
might look like
import pytest
from sdssdb.peewee.awesomedb import database, stupendous
@pytest.mark.parametrize('database', [database], indirect=True)
class TestStupdendous(object):
def test_user_count(self):
''' test that count of user table returns results '''
user_ct = stupendous.User.select().count()
assert user_ct > 0
We follow pytest’s test naming convention
for naming test files as well as tests within files. In our test_awesomedb
file, we group similar tests
by schema together into Test
classes, i.e. for the stupendous
schema, we create a TestStupendous
class.
All tests related to the stupendous
schema will be defined in this class. Individual tests within each class
are defined as methods on the class, named with test_xxxx
.
In order for our test class to understand that we wish to use the awesomedb
database for all defined tests, we
use the provided database
fixture function and parametrize it with the awesomedb
database. See
fixture parametrization to learn more
about how to parametrize tests or fixtures.
We’ve defined a simple test, test_user_count
, that checks that our user
table returns
some number of results > 0. In this case, we are a performing a simple select statement that does not modify the
database. If we are writing tests that perform write operations on the database, we could use the provided
transaction
fixture to ensure all changes are rolled back.
SQLAlchemy¶
The example test_awesomedb.py
file for a sqlalchemy
database will look very similar to the
peewee
version.
import pytest
from sdssdb.sqlalchemy.awesomedb import database
if database.connected:
from sdssdb.sqlalchemy.awesomedb import stupendous
@pytest.mark.parametrize('database', [database], indirect=True)
class TestStupdendous(object):
def test_user_count(self, session):
''' test that count of user table returns results '''
user_ct = session.query(stupendous.User).count()
assert user_ct > 0
There are two main differences in this file from the peewee
version. The first is that we must wrap the
import of the stupendous
models inside a conditional that checks if the database has been successfully
connected to. This is needed because importing sqlalchemy
models when no database exists, or
cannot connect, breaks other succcessful database imports. The second change is the use of the session
fixture inside the test. Since sqlalchemy
needs a db session to perform queries, we use the
provided session
pytest fixture. This fixture will ensure that all changes made to the database
are rolled back and not permanent.
Generating and Inserting Fake Data into Your Database Tables¶
If you are only interested in writing simple tests that test real data in your database tables, then you can stop here and start writing your tests. Sometimes, however, you may want to write tests for special database queries or model functions where you don’t quite have the right data, or enough of it, loaded. In these cases, we can generate fake data and insert it dynamically into our database tables. To do so, we have to create a “model factory”. This factory creates fake data based on a database Model.
The following examples use the following resources to generate fake data:
factory_boy - creates db model factories to generate fake entries
faker - creates fake data as needed by models
pytest-factoryboy - turns model factories into pytest fixtures
Let’s see how to create factories to generate fake Users and Addressess, inside the factories.py
file,
using the peewee
library implementation as an example.
from sdssdb.peewee.awesomedb import database as awesomedb, stupendous
from .factoryboy import PeeweeModelFactory
class AddressFactory(PeeweeModelFactory):
# define a Meta class with the associated model and database
class Meta:
model = stupendous.Address
database = awesomedb
# define fake data generators for all columns in the table
pk = factory.Sequence(lambda n: n)
street = factory.Faker('street_address')
city = factory.Faker('city')
state = factory.Faker('state_abbr')
zipcode = factory.Faker('zipcode')
full = factory.LazyAttribute(lambda a: f'{a.street}\n{a.city}, {a.state} {a.zipcode}')
class UserFactory(PeeweeModelFactory):
class Meta:
model = stupendous.User
database = awesomedb
pk = factory.Sequence(lambda n: n)
first = factory.Faker('first_name')
last = factory.Faker('last_name')
name = factory.LazyAttribute(lambda u: f'{u.first} {u.last}')
# establishes the one-to-one relationship
address = factory.SubFactory(AddressFactory)
If the User
and Address
models created previously have the following columns on each table, we use
the factorboy declarations
and factory.Faker providers to assign each column
a fake data generator. For each factory we need to define a Meta
class in it that defines the database
model associated with it, as well as the database it belongs to.
These factories allow us to create fake instances of data that automatically inserts into the
designated database table. To create an instance locally without database insertion, you can use
UserFactory.build
or to create in bulk, use UserFactory.create_batch
.
>>> user = UserFactory()
>>> user
>>> <User: pk=1, name='Walter Brown'>
>>> user.address
>>> <Address: pk=1>
The more common use however will be in tests. These factories automatically get converted into pytest
fixture functions using pytest-factoryboy
. Let’s see how we would use this in test_awesomedb.py
.
@pytest.mark.parametrize('database', [database], indirect=True)
class TestStupdendous(object):
def test_new_user(self, user_factory):
''' test that we add a new user '''
user_factory.create(first='New Bob')
user = stupendous.User.get(stupendous.User.first=='New Bob')
assert user.first == 'New Bob'
Notice the lowercase-underscore syntax. This is the fixture name of the UserFactory
. The above examples
were written using the peeweee
implementation. For real examples, see the sdss5db tests in
tests/pwdbs/test_sdss5db.py
and associated factories in test/pwdbs/factories.py
. The sqlalchemy
version of defining a factory is very similar.
import factory
from sdssdb.tests.sqladbs import get_model_from_database
from sdssdb.sqlalchemy.awesomedb import database as awesomedb
stupendous = get_model_from_database(awesomedb, 'stupendous')
if stupendous:
class UserFactory(factory.alchemy.SQLAlchemyModelFactory):
''' factory for stupendous user table '''
class Meta:
model = stupendous.User
sqlalchemy_session = aweseomdb.Session # the SQLAlchemy session object
# column definitions as before
pk = factory.Sequence(lambda n: n)
...
Because sqlalchemy
models cannot be imported when no database exists locally, we must use
get_model_from_database
to conditionally import the models we need, and place the factory class inside
a conditional. Additionally, the factory Meta class needs the sqlalchemy
Session rather the database itself.
All other behaviours and defintions are the same. For examples of sqlalchemy
factories and their uses, see
tests/sqladbs/factories.py
and the mangadb tests in tests/sqladbs/test_mangadb.py
.
Using a Generic Test Database¶
Sometimes you may want to test a function common to many databases, or a generic database connection, or simply
not want to mess with real databases. In these cases, a temporary test postgres database is available to use.
By default, when no real database is passed into the database
fixture function, the test database is generated.
For example, the peewee
test example case from earlier would now be the following, with the pytest
parametrization line removed.
class TestStupdendous(object):
def test_user_count(self):
''' test that count of user table returns results '''
user_ct = stupendous.User.select().count()
assert user_ct > 0
This test would now use the temporary database, which is setup and destroyed for each test module. Because
the test database is created as a blank slate, all database models must be created as well, in addition to any
model factories. These models can be stored in the models.py
file under the respective library directories.
See any of the models.py
files for examples of creating test database models, and factories.py
for their
associated factories. See any of the tests defined in test_factory.py
for examples of how to write tests
against temporary database models defined in models.py
.
Should I use Peewee or SQLAlchemy?¶
Use the one that you prefer! Both Peewee and SQLAlchemy have their pros and cons and their own funclubs. Ideally you’ll want to provide at least basic support for both library (and, indeed, it’s not difficult if you follow the instructions above) to reach a wider audience. But if you only provide support for one library choose the one that you are more familiar with or the one that feels right.