python解析sql提取表名

本文介绍三种用python解析sql提取表名的方法，分别是正则表达式提取，使用sqlparse 库进行提取，使用sql_metadata进行提取。

解析sql提取出表名，在特定场景下是非常有应用价值的操作，假设你提供了一个供用户执行sql的客户端，不论是c/s还是b/s架构，都可能会面临一个强烈的需求，控制不同的人访问表的权限。不同的用户，可以访问不同的表，这种权限控制是十分常见的要求，解决的思路之一便是解析sql，从sql语句中提取出表名，然后根据用户是否拥有该表的操作权限来决定这个sql是否可以执行。

目标很明确，思路很清晰，but，从sql语句中提取出表名却并不是一件容易的事情，调研了几日，总算是找到一种相对靠谱的方法。

1. 正则表达式提取

通过正则表达式来提取sql语句中的表名，听起来是一个不错的选择，来看示例代码

import ply.lex as lex, re


def extract_table_name_from_sql(sql_str):

    # remove the /* */ comments
    q = re.sub(r"/\*[^*]*\*+(?:[^*/][^*]*\*+)*/", "", sql_str)

    # remove whole line -- and # comments
    lines = [line for line in q.splitlines() if not re.match("^\s*(--|#)", line)]

    # remove trailing -- and # comments
    q = " ".join([re.split("--|#", line)[0] for line in lines])

    # split on blanks, parens and semicolons
    tokens = re.split(r"[\s)(;]+", q)

    # scan the tokens. if we see a FROM or JOIN, we set the get_next
    # flag, and grab the next one (unless it's SELECT).

    result = set()
    get_next = False
    for token in tokens:
        if get_next:
            if token.lower() not in ["", "select"]:
                result.add(token)
            get_next = False
        get_next = token.lower() in ["from", "join"]

    return result

print(extract_table_name_from_sql("select * from user"))
print(extract_table_name_from_sql("select * from user left join teacher on user.teacher_id=teacher.id"))

输出结果

{'user'}
{'teacher', 'user'}

仅仅看着两条测试语句，效果还不错，可是随着测试的深入，就会发现至少这段代码有许多情况无法处理，比如下面两个sql

print(extract_table_name_from_sql("select user.name, teacher.name from user,teacher"))
print(extract_table_name_from_sql("update user set age = 14 where user.id= 100"))

输出结果为

{'user,teacher'}
set()

不能因为这段示例代码不能有效提取sql中的表名就否定使用正则表达式的思路，但几天的搜索下来，确实没有找到让我满意的方法。

2.sqlparse

sqlparse 是一个python的专门用于解析sql的库，它的解析相比于正则就要高级了一些，它可以区分出哪些是关键字，哪些是标识符，不需要像正则那样去处理注释，而且官方给出了一个抽取表名的示例

import sqlparse
from sqlparse.sql import IdentifierList, Identifier
from sqlparse.tokens import Keyword, DML


def is_subselect(parsed):
    if not parsed.is_group:
        return False
    for item in parsed.tokens:
        if item.ttype is DML and item.value.upper() == 'SELECT':
            return True
    return False


def extract_from_part(parsed):
    from_seen = False
    for item in parsed.tokens:
        if from_seen:
            if is_subselect(item):
                yield from extract_from_part(item)
            elif item.ttype is Keyword:
                return
            else:
                yield item
        elif item.ttype is Keyword and item.value.upper() == 'FROM':
            from_seen = True


def extract_table_identifiers(token_stream):
    for item in token_stream:
        if isinstance(item, IdentifierList):
            for identifier in item.get_identifiers():
                yield identifier.get_name()
        elif isinstance(item, Identifier):
            yield item.get_name()
        # It's a bug to check for Keyword here, but in the example
        # above some tables names are identified as keywords...
        elif item.ttype is Keyword:
            yield item.value


def extract_tables(sql):
    stream = extract_from_part(sqlparse.parse(sql)[0])
    return list(extract_table_identifiers(stream))


if __name__ == '__main__':
    sql = """
    select K.a,K.b from (select H.b from (select G.c from (select F.d from
    (select E.e from A, B, C, D, E), F), G), H), I, J, K order by 1,2;
    """

    tables = ', '.join(extract_tables(sql))
    print('Tables: {}'.format(tables))

输出结果

Tables: A, B, C, D, E, F, G, H, I, J, K

看起来也还不错哦，可还是有严重缺陷

if __name__ == '__main__':
    sql = "select * from user as u where u.id = 99"

    tables = ', '.join(extract_tables(sql))
    print('Tables: {}'.format(tables))

对于这样的语句，什么都解析不出来

3. sql_metadata

sql_metadata 是一个基于sqlparse的sql解析库，提供了get_query_tables 方法，从sql中提取表名的效果是所有调研的解析方法中最优质的

import sql_metadata

sql = '   select x1,x2 from liepin.a as atable left         join b on atable.id = b.id right join c on c.id = atable.id'
sql = ' '.join(sql.split())


print(sql_metadata.get_query_tables("select * from user, user2 left join c on user.id = c.id right join d on d.id = e.id"))
print(sql_metadata.get_query_tables(sql))
print(sql_metadata.get_query_tables("select x1, x2 from (select x1, x2 from (select x1, x2 from apple.a)) left join orange.b as ob on a.id=ob.id   where b.id in (select id from f)"))
print(sql_metadata.get_query_tables("select * from user as u where u.id = 99"))

输出结果

['user', 'user2', 'c', 'd']
['liepin.a', 'b', 'c']
['apple.a', 'orange.b', 'f']
['user']

get_query_tables方法的效果是最好的，但也不是没有瑕疵，这个问题出在sqlparse上，在解析sql时，'left join'会被解析成一个token，sql_metadata设置了一个关键词列表

table_syntax_keywords = [
        # SELECT queries
        'FROM', 'WHERE', 'JOIN', 'INNER JOIN', 'FULL JOIN', 'FULL OUTER JOIN',
        'LEFT OUTER JOIN', 'RIGHT OUTER JOIN',
        'LEFT JOIN', 'RIGHT JOIN', 'ON',
        # INSERT queries
        'INTO', 'VALUES',
        # UPDATE queries
        'UPDATE', 'SET',
        # Hive queries
        'TABLE',  # INSERT TABLE
    ]

如果left 和 join之间存在多个空格，就无法匹配这里面的关键词，导致解析不正确，因此需要在解析前，对sql进行预处理。虽然sqlparse也提供了sql美化的功能，但并不能处理left join 之间的空格。我在上面的代码中使用了比较讨巧的方法，使用split函数对sql字符串进行切分，再用空格连接，可以得到比较理想的效果。

4. 对sql进行处理

尽管sql_metadata已经表现的很优异，但仍有瑕疵，如何sql语句中存在注释，sqlparsse在解析sql语句时就会出现问题导致解析结果的不准确，因此需要对sql进行清洗，这里就借鉴了正则表达式的方法，去除那些无用的注释

def clean(sql_str):
    # remove the /* */ comments
    q = re.sub(r"/\*[^*]*\*+(?:[^*/][^*]*\*+)*/", "", sql_str)
    # remove whole line -- and # comments
    lines = [line for line in q.splitlines() if not re.match("^\s*(--|#)", line)]
    # remove trailing -- and # comments
    q = " ".join([re.split("--|#", line)[0] for line in lines])
    q = ' '.join(q.split())
    return q

在抽取table之前，先对数据进行清洗，避免注释影响结果