Reading Open-Source Code: kaptan, a Configuration File Parsing Library

1. Reading Python open-source code: kaptan

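kaptan is a flexible configuration file parsing library; its git repository is at https://github.com/emre/kaptan. How flexible is it? Let's first list the configuration file formats commonly used in Python projects: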

json
py files
ini files
yaml files

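kaptan supports all four of the formats above, and it also accepts a Python dict as configuration. Despite supporting so many formats, kaptan exposes a single unified interface, so users do not have to spend extra effort handling each format separately. This is one of kaptan's main selling points.

Another thing that attracted me to kaptan is how configuration values are retrieved. Whichever format you use, you read values the same way, and that way is extremely simple. Let's use the official example to show how convenient kaptan is. Create a file named config.ini with the following content: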

[development]
database_uri = mysql://root:123456@localhost/posts

[production]
database_uri = mysql://poor_user:poor_password@localhost/poor_posts

Parse the configuration file with kaptan:

import kaptan

config = kaptan.Kaptan(handler="ini")
config.import_config('config.ini')

print(config.get("production.database_uri"))
# output: mysql://poor_user:poor_password@localhost/poor_posts

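When creating the Kaptan object you specify the file type, then load the configuration file with the import_config method. To read a value, you pass a key of the form xxx.xxx.xxx, which starts at the top level of the configuration and descends one level per segment until it reaches the item you want.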

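Two technical points in this library interest me:

Different configuration file types have different data formats. How is the data stored after loading so that a single, unified lookup method can serve all of them?
Lookups may use a key of the form xxx.xxx.xxx to locate a configuration item. How is that implemented?

With these two questions in mind, let's read the source code together.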

2. The Kaptan class

2.1 HANDLER_MAP

The core class of kaptan is Kaptan, defined in kaptan/__init__.py. The class defines a class attribute named HANDLER_MAP:

class Kaptan(object):

    HANDLER_MAP = {
        'json': JsonHandler,
        'dict': DictHandler,
        'yaml': YamlHandler,
        'file': PyFileHandler,
        'ini': IniHandler,
    }

    def __init__(self, handler=None):
        self.configuration_data = dict()
        self.handler = None
        if handler:
            self.handler = self.HANDLER_MAP[handler]()

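Tying this back to the earlier example: when a Kaptan instance is created, the handler argument determines the value of self.handler. HANDLER_MAP holds five handler classes, each responsible for processing data from one type of configuration source.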
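The lookup-then-instantiate pattern behind HANDLER_MAP can be sketched in isolation. The classes below are simplified stand-ins written for this illustration, and make_handler is a hypothetical helper that does not exist in kaptan; the sketch only shows how a string key selects the class that does the work.

```python
import json


class BaseHandler(object):
    def load(self, data):
        raise NotImplementedError


class DictHandler(BaseHandler):
    def load(self, data):
        return data  # a dict is already in the target format


class JsonHandler(BaseHandler):
    def load(self, data):
        return json.loads(data)  # parse JSON text into Python objects


# Registry: a short string key maps to the class responsible for that format.
HANDLER_MAP = {"dict": DictHandler, "json": JsonHandler}


def make_handler(name):
    """Hypothetical helper: look up the handler class by name and instantiate it."""
    return HANDLER_MAP[name]()


handler = make_handler("json")
print(handler.load('{"debug": false}'))  # {'debug': False}
```

Supporting a new format with this layout means writing one more class and adding one more entry to the mapping; the calling code stays the same.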

2.2 import_config

import_config is the core method of Kaptan; it loads the configuration:

    def import_config(self, value):
        if isinstance(value, dict):  # load python dict
            self.handler = self.HANDLER_MAP['dict']()
            data = value
        elif os.path.isfile(value) and not self._is_python_file(value):
            if not self.handler:
                try:
                    key = HANDLER_EXT.get(os.path.splitext(value)[1][1:], None)
                    self.handler = self.HANDLER_MAP[key]()
                except:
                    raise RuntimeError("Unable to determine handler")
            with open(value) as f:
                data = f.read()
        elif self._is_python_file(value):  # is a python file
            self.handler = self.HANDLER_MAP[HANDLER_EXT['py']]()
            if not value.endswith('.py'):
                value += '.py'  # in case someone is referring to a module
            data = os.path.abspath(os.path.expanduser(value))
            if not os.path.isfile(data):
                raise IOError('File {0} not found.'.format(data))
        else:
            if not self.handler:
                raise RuntimeError("Unable to determine handler")

            data = value

        self.configuration_data = self.handler.load(data)
        return self

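Although the initializer offers a handler parameter, that parameter defaults to None, so you are allowed to leave it unset. For that reason, when no handler has been chosen, import_config inspects the file extension and uses it to decide which handler to use.

Is this design redundant? I don't think so. Even though the handler type can be set in the initializer, the caller might still pass a different kind of input to import_config. For dict and Python-file inputs, import_config checks the type again and picks the handler based on that check; for other files, the extension is consulted only when no handler was set.

Although the design is not redundant, I don't think it is really necessary either. Choosing the handler in the initializer is enough: users should know exactly which type of configuration file they are loading, and if the file actually loaded does not match the chosen handler, raising an exception would be the better choice.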
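As a rough sketch of the stricter alternative suggested here, the check below raises when the file extension contradicts the handler chosen up front. The function check_handler_matches and the table EXT_TO_HANDLER are hypothetical names written for this illustration; they are not part of kaptan, and kaptan's own extension table may differ.

```python
import os

# Assumed extension-to-handler table, made up for this sketch.
EXT_TO_HANDLER = {
    "json": "json",
    "yaml": "yaml",
    "yml": "yaml",
    "ini": "ini",
    "py": "file",
}


def check_handler_matches(handler_name, filename):
    """Raise ValueError when the file extension contradicts the chosen handler."""
    ext = os.path.splitext(filename)[1][1:].lower()
    expected = EXT_TO_HANDLER.get(ext)
    # Unknown extensions are let through; only a known mismatch is an error.
    if expected is not None and expected != handler_name:
        raise ValueError(
            "handler {0!r} does not match file {1!r} (expected {2!r})".format(
                handler_name, filename, expected
            )
        )


check_handler_matches("ini", "config.ini")  # passes silently

try:
    check_handler_matches("ini", "config.json")
except ValueError as exc:
    print(exc)
```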

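The loaded data is stored in the configuration_data attribute, and the get method extracts values from this instance attribute. Only after working out how configuration_data is structured can we understand how get resolves a string of the form xxx.xxx.xxx to a configuration value.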

3. The five handlers

kaptan provides five handlers for the different types of data. Let's start with the simplest one, DictHandler.

3.1 DictHandler

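DictHandler has very little code, yet it tells us two valuable things:

Every handler is a subclass of BaseHandler
Whatever the format of the configuration file, the loaded data is stored as a dict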

from __future__ import print_function, unicode_literals

from . import BaseHandler


class DictHandler(BaseHandler):

    def load(self, data):
        return data

    def dump(self, data):
        return data

BaseHandler is defined as follows:

class BaseHandler(object):
    """Base class for data handlers."""

    def load(self, data):
        raise NotImplementedError

    def dump(self, data):
        raise NotImplementedError

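The author's intent is that every subclass of BaseHandler must implement both load and dump, but he did not use an abstract base class to enforce it. If BaseHandler were rewritten as an abstract base class (Python 3 syntax), it could look like this: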

from abc import ABCMeta, abstractmethod

class BaseHandler(metaclass=ABCMeta):
    """Base class for data handlers."""

    @abstractmethod
    def load(self, data):
        pass

    @abstractmethod
    def dump(self, data):
        pass

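A subclass of BaseHandler that does not implement every abstract method cannot be instantiated. The code below is guaranteed to raise a TypeError: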

class DictHandler(BaseHandler):
    pass

dict_handler = DictHandler()
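To see that failure concretely, here is a small self-contained sketch for Python 3. IncompleteHandler is a hypothetical name introduced for the demo; it stands in for any subclass that forgets to implement the abstract methods, while DictHandler shows a subclass that implements both.

```python
from abc import ABCMeta, abstractmethod


class BaseHandler(metaclass=ABCMeta):
    """Abstract base: subclasses must implement load and dump."""

    @abstractmethod
    def load(self, data):
        pass

    @abstractmethod
    def dump(self, data):
        pass


class IncompleteHandler(BaseHandler):
    pass  # neither abstract method is implemented


class DictHandler(BaseHandler):
    def load(self, data):
        return data

    def dump(self, data):
        return data


try:
    IncompleteHandler()
except TypeError as exc:
    # Instantiation is rejected before any method is ever called.
    print("cannot instantiate:", exc)

print(DictHandler().load({"a": 1}))  # {'a': 1}
```

With the original NotImplementedError approach, the mistake only surfaces when load or dump is eventually called; with the abstract base class it surfaces at instantiation time.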

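DictHandler's load method returns a dict, which suggests that the load methods of the other handler subclasses return dicts as well. I make this guess because Kaptan offers one shared get method for reading values. If the subclasses' load methods returned different data types, the adaptation would have to happen inside get. If every load method guarantees the same return type, then converting and adapting each file format is the job of the individual subclasses, which is the more sensible design.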

3.2 JsonHandler

After the analysis in section 3.1, you can guess even without reading JsonHandler that it uses the json module's loads function to parse the configuration text:

class JsonHandler(BaseHandler):

    def load(self, data):
        return json.loads(data)

    def dump(self, data, **kwargs):
        return json.dumps(data, **kwargs)

json.loads also returns a dict, provided the top-level JSON value is an object.
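A quick check with the standard library shows why that condition matters: the return type of json.loads follows the top-level JSON value. The sample strings are made up for the demo.

```python
import json

as_object = json.loads('{"debug": false, "pagination": {"limit": 20}}')
as_array = json.loads('[1, 2, 3]')

print(type(as_object))  # <class 'dict'>
print(type(as_array))   # <class 'list'>
```

A configuration file whose top level is a JSON array would therefore hand a list, not a dict, to the rest of the library.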

3.3 YamlHandler

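yaml files can be loaded with the yaml module (PyYAML), and the loaded data is likewise a dict when the top level of the document is a mapping: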

class YamlHandler(BaseHandler):

    def load(self, data, safe=True):
        if safe:
            func = yaml.safe_load
        else:
            func = yaml.load
        return func(data)

    def dump(self, data, safe=True, **kwargs):
        if safe:
            func = yaml.safe_dump
        else:
            func = yaml.dump
        return func(data, **kwargs)

3.4 IniHandler

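Compared with the previous handlers, IniHandler is a little more involved and not as easy to follow. ini files can be parsed with the configparser module, but the parse result is not a plain nested dict, so a conversion step is required.

First, let's see how to parse an ini file with the configparser module: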

import configparser

config = configparser.ConfigParser()
config.read("config.ini", encoding="utf-8")
database_uri = config.get("development", 'database_uri')
print(database_uri)     # mysql://root:123456@localhost/posts

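Judging by the amount of code alone, parsing ini files with kaptan is much more convenient. ConfigParser is a subclass of RawConfigParser, and the author defined his own subclass of RawConfigParser: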

class KaptanIniParser(configparser.RawConfigParser):
    def from_dict(self, dictionary):
        self._sections = dictionary

    def as_dict(self):
        d = dict(self._sections)
        for k in d:
            d[k] = dict(self._defaults, **d[k])
            d[k].pop('__name__', None)
        return d

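In KaptanIniParser the author implements as_dict. As far as I can tell, from_dict serves no practical purpose here, so we only need to look at as_dict. To understand that method, we first have to find out what self._sections contains: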

import configparser

class KaptanIniParser(configparser.RawConfigParser):
    def from_dict(self, dictionary):
        self._sections = dictionary

    def as_dict(self):
        d = dict(self._sections)
        for k in d:
            d[k] = dict(self._defaults, **d[k])
            d[k].pop('__name__', None)
        return d

config = KaptanIniParser()
config.read("config.ini", encoding="utf-8")
print(config._sections)

Program output:

OrderedDict([('development', OrderedDict([('database_uri', 'mysql://root:123456@localhost/posts')])), 
('production', OrderedDict([('database_uri', 'mysql://poor_user:poor_password@localhost/poor_posts')]))])

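Here self._sections is an OrderedDict. OrderedDict is a subclass of dict, so you can use it exactly as you would a dict. (Since Python 3.8, configparser uses a plain dict by default, so the output prints as an ordinary dict there.) The as_dict method copies this structure into plain dicts, merges the parser's default values (self._defaults) into every section, and removes the '__name__' key if present.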
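The effect of merging self._defaults is easiest to see with a [DEFAULT] section. The sketch below reuses the as_dict logic shown above in a parser class named IniParser (a name chosen for this demo), and feeds it an ini text with a made-up charset default. It relies on the private attributes _sections and _defaults, just as the original does.

```python
import configparser
from collections import OrderedDict

# OrderedDict really is a dict subclass.
print(issubclass(OrderedDict, dict))  # True


class IniParser(configparser.RawConfigParser):  # demo name, mirrors KaptanIniParser
    def as_dict(self):
        d = dict(self._sections)
        for k in d:
            # Defaults first, then the section's own values override them.
            d[k] = dict(self._defaults, **d[k])
            d[k].pop('__name__', None)
        return d


INI_TEXT = """
[DEFAULT]
charset = utf8

[development]
database_uri = mysql://root:123456@localhost/posts
"""

parser = IniParser()
parser.read_string(INI_TEXT)
result = parser.as_dict()
print(result)
# {'development': {'charset': 'utf8', 'database_uri': 'mysql://root:123456@localhost/posts'}}
```

The DEFAULT section does not appear as a key of its own; its values are folded into every regular section.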

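The data argument passed to IniHandler's load method is a string, as the import_config method of Kaptan confirms. However, RawConfigParser's read method takes file names and its read_file method takes a file-like object; neither accepts the configuration text itself. Many file-oriented APIs behave this way. In this situation you can use the StringIO class from the io module: it wraps a string in an in-memory object that behaves like a file handle and supports file-style I/O.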

class IniHandler(BaseHandler):

    def load(self, value):
        config = KaptanIniParser()
        # ConfigParser.ConfigParser wants to read value as file / IO
        config.read_file(StringIO(value))
        return config.as_dict()
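The StringIO technique can be tried with the standard library alone. The sketch below parses the same ini text twice: once through read_file with a StringIO wrapper, as IniHandler does, and once through read_string, which configparser has offered since Python 3.2 for exactly this case. The variable names are chosen for the demo.

```python
import configparser
from io import StringIO

INI_TEXT = (
    "[production]\n"
    "database_uri = mysql://poor_user:poor_password@localhost/poor_posts\n"
)

# read_file consumes a file-like object, so the string is wrapped in StringIO.
via_file = configparser.RawConfigParser()
via_file.read_file(StringIO(INI_TEXT))

# read_string takes the text directly.
via_string = configparser.RawConfigParser()
via_string.read_string(INI_TEXT)

uri_from_file = via_file.get("production", "database_uri")
uri_from_string = via_string.get("production", "database_uri")

print(uri_from_file)  # mysql://poor_user:poor_password@localhost/poor_posts
print(uri_from_file == uri_from_string)  # True
```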

3.5 PyFileHandler

PyFileHandler deserves a careful look. Start with the official example and create a file named config.py:

setting = {
    'environment': 'DEV',
    'redis_uri': 'redis://localhost:6379/0',
    'debug': False,
    'pagination': {
        'per_page': 10,
        'limit': 20,
    }
}

Load and parse it with kaptan:

import kaptan

config = kaptan.Kaptan(handler='file')
config.import_config('config.py')
print(config.get('setting'))

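The key to loading a Python file is to bring it in as a module. Normally we bring in a module with the import statement, so PyFileHandler has to implement something similar to import in order to load the Python file. The function that does this is import_pyfile: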

def import_pyfile(pathname, mod_name=''):
    if not os.path.isfile(pathname):
        raise IOError('File {0} not found.'.format(pathname))

    if sys.version_info[0] == 3 and sys.version_info[1] > 2:  # Python >= 3.3
        import importlib.machinery
        loader = importlib.machinery.SourceFileLoader('', pathname)
        mod = loader.load_module(mod_name)
    else:  # 2.6 >= Python <= 3.2
        import imp
        mod = imp.load_source(mod_name, pathname)
    return mod

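The function inspects sys.version_info to decide whether the interpreter is Python 3.3 or later, or an older version. We will only look at how a Python module is loaded dynamically on Python 3: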

    import importlib.machinery
    loader = importlib.machinery.SourceFileLoader('', pathname)
    mod = loader.load_module(mod_name)

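It takes only three lines. importlib is part of the Python standard library, although there are not many articles online that introduce it. The SourceFileLoader class creates a loader object from a source file path, and its load_module method loads the Python file as a module, with much the same effect as the import statement. Note that load_module has been deprecated since Python 3.4.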
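For comparison, here is a sketch of the same idea using importlib.util, the route the standard library documentation recommends on current Python versions. This is not kaptan's code: load_module_from_path is a hypothetical helper, and the config.py content is written to a temporary directory only so the example runs on its own.

```python
import importlib.util
import os
import tempfile


def load_module_from_path(pathname, mod_name="config"):
    """Hypothetical helper: load a Python source file as a module object."""
    spec = importlib.util.spec_from_file_location(mod_name, pathname)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return module


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.py")
    with open(path, "w") as f:
        f.write("setting = {'debug': False, 'pagination': {'limit': 20}}\n")
    mod = load_module_from_path(path)

print(mod.setting["pagination"]["limit"])  # 20
```

Unlike the import statement, this approach does not register the module in sys.modules unless you add it yourself, which suits one-off loading of a configuration file.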

Once the configuration file has been loaded as a Python module, the settings inside the module still have to be extracted:

class PyFileHandler(BaseHandler):

    def load(self, file_):
        module = import_pyfile(file_)
        data = dict()
        for key in dir(module):
            value = getattr(module, key)
            if not key.startswith("__"):
                data.update({key: value})
        return data

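The dir function returns a list of the attributes and methods an object has. Given a module, the list contains the names of everything defined in it: variables, functions, classes and so on. You can try this directly with config.py: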

import config

print(dir(config))

The output is:

['__builtins__', '__cached__', '__doc__',
'__file__', '__loader__', '__name__', 
'__package__', '__spec__', 'setting']

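Besides setting, there are many names that start with a double underscore. These are attributes every module has, and the load method must filter them out.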
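One consequence of filtering only on the double-underscore prefix is worth seeing in action: anything else at module level survives, including imported modules and single-underscore names. The sketch below builds a throwaway module in memory with types.ModuleType instead of reading a file; the SOURCE text is made up for the demo.

```python
import types

SOURCE = """
import os
_private = 1
setting = {'debug': False}
"""

module = types.ModuleType("config")  # empty in-memory module
exec(SOURCE, module.__dict__)        # run the source in the module's namespace

# Same filter as in PyFileHandler.load: drop names starting with "__".
data = {
    key: getattr(module, key)
    for key in dir(module)
    if not key.startswith("__")
}

print(sorted(data))  # ['_private', 'os', 'setting']
```

So a config.py that imports helper modules will expose them as configuration keys too, which is something to keep in mind when writing such files.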

4. Getting configuration values

Configuration values are read with Kaptan's get method, which calls the _get method:

    def _get(self, key):
        current_data = self.configuration_data

        for chunk in key.split('.'):
            if isinstance(current_data, collections_abc.Mapping):
                current_data = current_data[chunk]
            elif isinstance(current_data, collections_abc.Sequence):
                chunk = int(chunk)

                current_data = current_data[chunk]
            else:
                # A scalar type has been found
                return current_data

        return current_data

The key here could be "setting.pagination.limit". Let's use this example to analyze how the code works.

self.configuration_data is the dict returned by PyFileHandler's load method, and its content is:

{
    'setting': {
        'environment': 'DEV',
        'redis_uri': 'redis://localhost:6379/0',
        'debug': False,
        'pagination': {
            'per_page': 10,
            'limit': 20,
        }
    }
}

key.split('.') splits the key into three segments: setting, pagination and limit. The core of the algorithm is the following five lines:

            if isinstance(current_data, collections_abc.Mapping):
                current_data = current_data[chunk]
            elif isinstance(current_data, collections_abc.Sequence):
                chunk = int(chunk)
                current_data = current_data[chunk]

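If current_data is a Mapping instance, the [] operator fetches the value for chunk and assigns it back to current_data. On the next iteration current_data has therefore changed: in terms of the dict hierarchy it has moved one level deeper and one step closer to the configuration item we want.

If current_data is a Sequence instance, the current chunk must be an index, so it is converted to int and the [] operator fetches the element at that index.

At this point half of the puzzle is solved. What are Mapping and Sequence? They are abstract base classes from collections.abc that built-in types such as dict and list count as subclasses of: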

import collections.abc as collections_abc


print(issubclass(dict, collections_abc.Mapping))                    # True
print(issubclass(list, collections_abc.Sequence))                   # True

print(isinstance({'name': 'python'}, collections_abc.Mapping))      # True
print(isinstance([1, 2, 3], collections_abc.Sequence))              # True

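We established earlier that self.configuration_data is a dict, and a multi-level nested one, so its values may themselves be dicts, or lists and other Sequence objects. A key of the form xxx.xxx.xxx is split with split, and the segments are consumed one at a time from the first, going one level deeper each time. How a value is fetched depends on the type of current_data, and the fetched value is assigned back to current_data so that the next chunk continues from where the previous iteration left off.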
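The whole lookup can be reproduced as a standalone function, which makes it easy to experiment with. get_by_path is a hypothetical name for this sketch, and the hosts list is made-up sample data added to exercise the Sequence branch; the loop itself follows the _get method shown above.

```python
from collections.abc import Mapping, Sequence


def get_by_path(data, key):
    """Walk nested dicts and lists following a dotted key such as 'a.b.0'."""
    current = data
    for chunk in key.split("."):
        if isinstance(current, Mapping):
            current = current[chunk]       # descend by dict key
        elif isinstance(current, Sequence):
            current = current[int(chunk)]  # descend by list index
        else:
            return current                 # scalar reached: stop early
    return current


config = {
    "setting": {
        "pagination": {"per_page": 10, "limit": 20},
        "hosts": ["alpha", "beta"],
    }
}

print(get_by_path(config, "setting.pagination.limit"))  # 20
print(get_by_path(config, "setting.hosts.1"))           # beta
```

One detail to be aware of: str is also a Sequence, so a path that continues past a string value indexes into its characters. With the data above, "setting.hosts.1.0" yields "b".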

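5. Summary

The design and implementation of kaptan are not complicated, which makes it well suited for newcomers who want to study its design ideas and implementation. In terms of design, files of every format are converted into one data type, the dict, after loading, which keeps the get method simple. Each format is handled by its own handler class, which fits object-oriented design well. Avoid defining a single class with a separate method for each file format: that would also work in the end, but the code would be poorly organized and hard to extend to more formats.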
