SourceQL Paper 4: Progress Report 1 for EECS 405 (Data Structures & File Management)

8 pages
120 views

Please download to get full document.

View again

of 8
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Share
Description
Re-contextualizes SourceQL for use as the semester project in another simultaneous course. Includes a literature survey of existing SCM abilities, paradigms, and storage mechanisms.
Tags
Transcript
  EECS 405 Project Progress Report #1 Tim Henderson, Steve Johnson, and Gary DoranCase Western Reserve UniversityMarch 16, 2010 Abstract We are implementing a handful of disk-based file structures in order to supportspecialized indices in the SourceQL project. SourceQL is a large project which is beingdeveloped for the EECS 395 course, but it is large enough that modular parts of itcan be considered their own project. For the EECS 405 portion of the project we areimplementing disk-based blocks and records, B-trees, B+-trees, sequential files, andhigh-level relations on top of the structures. 1 Background 1.1 Source Code Management Systems For a project with a large code base that requires collaboration between a team of programmers, a Source Code Management (SCM) system is an invaluable tool fordevelopers. Modern SCMs allow programmers to commit changes made to sourcecode, which is stored in a repository. A commit represents a state of a set of files.Changes are represented using diffs , which can be thought of as transitions betweenstates of the repository. Therefore, one can construct a commit graph, which representsthe history of a repository, where nodes correspond to repository states and directededges store information such as the diff, the user who committed the change, and whenthe change was made.Because changes can only be applied to a repository in a limited number of ways,commit graphs have certain properties that distinguish them from the general case.First, because each commit is based on exactly one, two, or zero previous commits,commit graphs are directed acyclic graphs. Second, because commit graphs representthe history of a repository, nodes and edges are never deleted or changed after they arecreated. The properties described above are useful to those who need to access andmanipulate data stored in a commit graph. 1.2 SourceQL SourceQL is a project under development for the EECS 395 course (Senior Projects).We are using it for both courses because it is such a large project. We are including a 1  Figure 1: SourceQL Architecture ReTrieSourceGraphpath + regex queryquery parserregex compilerpath querycompilermatches(rev_id) joinmatches(rev_id, file, line #, text)matches(rev_id, file, line #, text)regex VMtrieindexgraph query VMcommitgraph summary of it here, but a more complete description can be found in [5]. 1.2.1 Purpose Currently, SCMs are used mostly for the purposes of textually transforming a repos-itory between states, and for browsing through the history of a repository. There isa wealth of information contained in commit graphs that might be used for code re-views, debugging, or detecting duplicate code. However, there does not yet exist aframework for systematically and efficiently querying repository history. The goal of the SourceQL project is to design a query language and implement a system to performregular expression searches of commit graphs [4]. 1.2.2 Design Overview SourceQL queries are comprised of two major components, as shown in Figure 1. Onecomponent is a regular expression search of commits, which is performed using a triestructure. The other component is a path query of the commit graph, which allowsusers to specify results occuring between commits X  and Y   in the repository history.Both components of a query require efficient disk-based index structures and an efficient join procedure to combine results. 2  1.2.3 Status SourceQL is an ongoing senior project by Tim Henderson, Steve Johnson, and EvanKrall. Because it is a large project, it can be compartmentalized into subsystems, someof which require implementation of algorithms and data structures discussed in EECS405. The subsystems that will be implemented for the EECS 405 project are discussedin Section 2, and the status of the remaining subsystems are briefly discussed here forcontext.Steve Johnson has worked on integrating SourceQL with a popular SCM system,Mercurial. This allows data from commits to be obtained and inserted into the datastructures used by SourceQL. Some data structures are prototyped naively, but as theappropriate algorithms and data structures are implemented, they are being replaced.Evan Krall has worked on implementing trie-based data structures including B-tries, Patricia tries, and suffix tries. These structures are currently implemented inmemory, but will be replaced with disk-based implementations as they are available.Evan is also implementing a virtual machine for running regular expressions on stringsusing the data structures he has created.Tim Henderson has worked on implementing a framework for disk-based data struc-tures in the relatively new systems programming language, Go. Tim has implementeda block-based file format for use with B-trees and similar data structures. 2 Problem Definition There are several subsystems of SourceQL that pertain to topics discussed in EECS405, as well as active areas of research in data structures. Because the Go programminglanguage is relatively new, there are not yet libraries for low-level, disk-based indexstructures. While Tim Henderson has worked to implement a block-based file formatfor tries, B-Trees and other data structures, these implementations do not yet supportdeletion, and block allocation is accomplished by simply appending a new block to theend of the file.Additionally, graphs must be indexed so that path queries can be efficiently per-formed. Research papers can be consulted to determine both how to effectively storecommit graph information on disk and how to quickly query the data. If possible, tech-niques that take advantage of the properties of commit graphs will be used to informimplementations of data structures and algorithms in Go.Finally, a query language and compliler must be designed and implemented whichtake input queries and split them into regular expressions and path queries. Bothquery types must be converted into statements that efficiently utilize the strucutresimplemented for the graph and trie data. The subcomponents of SourceQL discussedabove require techniques which are both material taught in EECS 405, as well asactive areas of research. Therefore, the design and implementation of these SourceQLcomponents is both substantial and relevant to this course. 3  3 Literature Survey 3.1 History of VCSes The modern concept of version control systems goes back to 1972 when the SourceCode Control System (SCCS) began to be constructed [11]. SCCS was developed atBell Labs concurrently with Unix and other projects at the time. It was developedas part of development tool called the Programmer Workbench and integrated tightlywith it.SCCS introduced many important concepts such as storing changes as deltas, lock-ing files during editing and assigning unique identifiers to each revision of a file. How-ever, SCCS is fundementally very different from a modern system. It operates on onlyone file at a time (refered to as a module in the paper) and does not treat versionhistory as a graph or tree but rather as a linked list.Another important (and still widely used) early version control system is known asRCS. We first learn of its development in a 1985 paper in reference [14]. Accordingto the source code of the program Version 3 was released in 1983, and we have noknowledge of its state before that point. RCS continued in the tradition of SCCSoperating mainly on one file, and using locking to synchronize edits. It did introduce acrucial development: rather than view the history of a file as a list, like SCCS, insteadRCS views the history as a tree.When RCS was developed, and indeed when SCCS was developed, using computernetworking for software developement did not happen for all pracitical purposes. Alldevelopers worked on the same mainframe computer. Thus, file level locking seemed tobe the best approach for synchronizing developement. Additionally, solving the cornercases for merging divergent changes appeared to be untractably difficult [10].Developements in computer networking in the 1980s dramatically changed softwaredevelopement. The Concurrent Versioning System (CVS) was developed to addressthese radical changes. CVS was built directly on top of RCS. It used RCS for the basicversioning operations on files. Indeed, the first version of CVS was simply a collectionof shell scripts which attached RCS to the network. These shell scripts soon became fullC programs, and for quite sometime CVS dominated the open source version controlscene [10].Unfortunately CVS had fundemental architectual problems thanks to its relianceon RCS for basic operations. Subversion (SVN) was developed in the early 2000’s toaddress these issues and has rapidly replaced CVS for many open source and propri-etary projects. SVN represents to Raymond’s mind the pinnacle of the centralizedlocking model of version control developement.The early and mid 2000s witnessed an explosion of work on a new kind of VCS.The fundemental properties of this new type of VCS (Raymond’s Type 3, and Alwis’sDVCS) include: decentralization, synchronization through merging, and cheap branch-ing. The earliest DVCS system was Reliable Software’s Code Co-Op which was builtin 1997. The company had found that the centralized model of version control did notfit its needs since its development teams were spread well accross the globe [2]. For afairly comprehensive survey of these systems please see [10].The most important DVCSs to come out of the cambrian explosion of version 4
Related Search
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks
SAVE OUR EARTH

We need your sign to support Project to invent "SMART AND CONTROLLABLE REFLECTIVE BALLOONS" to cover the Sun and Save Our Earth.

More details...

Sign Now!

We are very appreciated for your Prompt Action!

x