SourceQL Paper 1: Final Report for EECS 433 (Database Systems)

18 pages

Please download to get full document.

View again

of 18
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Introduces the idea of SourceQL, a query language and indexing layer over source code management systems to allow simple and powerful data mining.
  SourceQL: A Query Language for Source CodeManagement Systems Tim Henderson and Steve JohnsonCase Western Reserve UniversityDecember 17, 2009 Abstract Version control repositories contain vast amounts of data about the history of aproject. To date, people have not tried to extract much useful information from theserepositories, largely because data formats vary and access time is very slow. Thispaper presents the foundations for a system which aims to vastly improve read timesfor specific kinds of repository data and make queries against this repository easier towrite. In short, we want to treat the repository like a database as much as possible,including reasonable access time and a powerful query language.This paper focuses on storing and querying path data in the repository, i.e. parent-child relationships between nodes. Specifically, it describes a system for storing a par-allel, efficient representation of the connectivity graph, as well as an efficient algorithmfor finding all nodes on any path between two nodes.We were not able to rigorously test the real-world performance of our algorithmsdue to time constraints and changing designs, but we have opened up many areas forfuture research. 1 Introduction Since the invention of computers, code bases have become progressively larger and theirdevelopment has involved more people working in parallel. Early SCMs were created todeal with this growth, with varying degrees of success. Today, many code bases consist of hundreds of thousands of lines of code, with thousands of changesets in the repository.As software projects become larger, it becomes more difficult to find out things aboutthe source code, such as where bugs are located, or whether duplicate code is being writ-ten. Source Code Management (SCM) systems have become nearly ubiquitous in softwaredevelopment environments, and the information stored in them can be useful in answeringcertain questions.The data contained in these repositories contain a wealth of information about the sourcecode. Currently, that information is only used to perform transformations on text files, not1  provide information about those files. By treating a repository as a database, providingefficient and intuitive ways to query it in a reasonable aount of time, we can discover newways to measure progress, find bugs, and identify general qualities in a software project.Before any information can be read from a repository, the search space must be defined.Since the repository is a directed acyclic graph, one cannot simply specify a linear chainof commits, since many of those commits are based on branches and merges by differentcommitters who contribute equally. A more sophisticated and powerful way to define thesearch space is to specify two end points and find all nodes that are on any path betweenthem.This paper focuses exclusively on defining the search space, or “subgraph selection.” Todo this in a reasonable amount of time, the entire repository must be indexed. This indexmust be updated incrementally for each new commit and can be optimized significantly dueto the nature of version control repositories.Section 2 outlines some special characteristics of repository graphs, which helps to deter-mine what is possible, what is efficient, and what is correct. Section 3 describes an efficientway to create and maintain an index of the parent-child relationships in a repository graph.Section 4 describes an algorithm for subgraph selection on this index. Section 5 describes theexact file structure of the graph representation introduced in section 3. Section 7 outlinesvarious approaches we tried and abandoned. Finally, section 8 gives some conclusions of ourcurrent research. 2 Assumptions Since a repository is a special case of a graph, we can make the following assumptions basedon the nature of version control.1. New commits are based on existing data2. The commit time of a node is always less than that of its children3. A directed graph with all arrows parent → child or child → parent is acyclic4. Edges and nodes never change once created5. No node has more than two in-edges 2.1 Commits Based on Existing Data A commit must be based on data that exists at the time of the commit, meaning that newinformation cannot be added later. This restriction is enforced by the interface of each SCM.2  2.2 Commit Time Since a new commit is based on a previous commit, that previous commit must have beenmade earlier in time. Therefore, the new commit gets a new time stamp.This means that a sorting of commits by real-world commit time is also a topologicalsort. However, we do not use commit time as a topological sort because individual users’clocks may differ. 2.3 Directed Graph is Acyclic We can construct a directed graph following either commit parents or children. Such a graphwill have no cycles.This assumption follows from (2.2). If there were a cycle, then there would be a commitwith a child with an earlier time stamp, violating the topological sort. 2.4 No Changes to Edges or Nodes This assumption follows from (2.1). Since a commit is based on existing data, if that datachanges, then the commit will no longer refer to valid data.This assumption does not hold under certain special cases in some SCMs. Most modernSCMs allow changes to the most recent commit. A few allow the user to convert branchesinto a series of linear commits in order to make the history log more readable [2, ch. 6.4].When this happens, it is impossible for a stateless observer to determine what changes weremade without scanning the entire repository from scratch.Changes such as these cannot be made after a repository has been merged with anothercopy of the repository in a distributed SCM, as this would make further merges impossible.However, it might present a problem in centralized SCMs if the administrator decides tomake changes at the server level.In any case, dealing with these edge cases is out of the scope of this paper. 2.5 In-Edges According to all existing SCMs, a commit is a set of changes to a previous state of therepository. If those changes are the result of the user, then a commit will have one in-edgerepresenting the previous commit. If those changes are the result of a merge from anotherbranch, then the commit will have two in-edges, one for each branch.This assumption allows us to state that O ( E  ) = O ( V  ) for all possible repositories, al-though E  will be higher on average than V  . 3 Graph Representation Most SCMs store repository information in a way that is optimized for reading and writingas little information as possible. In order to optimize queries and to put an abstraction layer3  over the SCM implementation, we create a separate graph representation of the repository.Since a repository is a sparse DAG with a low upper bound on incoming edges, the mostefficient representation is adjacency lists. 3.1 Initialization and Update Almost all SCMs offer “hooks” which allow custom scripts to be run when SCM commands(e.g. commit, merge) are executed. SourceQL can update its graph representation usingthese hooks.Some SCMs, including Mercurial[5], provide information to certain hooks. Mercurialprovides the first ‘new’ commit in a set of incoming commits, so a list of all new commitscan be obtained by traversing those children which are not already in the graph.If we have a list of newly-created commits, then all new commits can simply be addedto the vertex list, and then edges can be added to various adjacency lists. This can be donewithout regard to order.If the SCM does not provide this information, then it must be determined through othermeans. The easiest way to do this is to start at the last node (a leaf node) and traverse theparents which are not already in the graph. 3.2 Topological Sort Maintaining a topological sort order of the repository is useful for a number of reasons. Oneis that it allows us to refer to each commit as a 32-bit unsigned integer rather than relyingon whatever the SCM uses, which is usually either a revision number or a SHA1 hash. Italso allows us to make some optimizations to our searching algorithms.Commit time provides a topological sort order in theory, but in practice, it is subject toindividual machines’ internal clocks, which might be wrong for various reasons. It is betterfor us to maintain our own sort order. Fortunately, we can do that in constant time for asingle commit, and in linear time for a set of commits. One algorithm to do this is describedin Kahn[3].The internals of SourceQL represent all nodes as their topological sort position to savespace and maintain a layer of abstraction over the SCM. 3.3 Reduced Graphs Since repository graphs have long, linear chains of nodes, we can make some significantoptimizations by compressing chains of nodes into single nodes called collapsed nodes .Some SCMs already do this[4], but they have not documented their approaches.We define a collapsed node as a node and all of its children until a node division. Thereare 4 places where collapsed nodes are divided:1. Root of a repository2. Leaf (head) of a repository4
Related Search
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks

We need your sign to support Project to invent "SMART AND CONTROLLABLE REFLECTIVE BALLOONS" to cover the Sun and Save Our Earth.

More details...

Sign Now!

We are very appreciated for your Prompt Action!