Searching

tappatekie · Post by **tappatekie** » Wed Nov 23, 2011 5:38 pm

Hello, I am currently developing a social network but with a twist. The web server and database server is being written from scratch in c# with my own scripting language to execute the site.

I was wondering, is there a quicker way to find an entry in a database rather than looping through every one? I currently have a (what I like to call) "bucket search" which was inspired by Pixar, it basically searches multiple locations of the database at the same time, once it finds the entry, it "kills" all the other buckets and returns the row. Every bucket is multi-threaded.

superLED · Post by **superLED** » Wed Nov 23, 2011 6:45 pm

Are you searchin the whole database, or only in a specific table?
And are you using mysql or similar?

tappatekie · Post by **tappatekie** » Wed Nov 23, 2011 6:55 pm

Laughed at the picture(s) lol!
But like I said in the post. I am literally writing everything from scratch including the database system and yes I am searching every entry in the database but only in a given table.
At the moment it searches with 7 buckets as in 7 threads are created. I don't want to make too much or too little because of the server load.

LeonBlade · Post by **LeonBlade** » Thu Nov 24, 2011 5:08 am

>C#
No thanks.

By the way, be careful you don't want to spill your buckets.

tappatekie · Post by **tappatekie** » Thu Nov 24, 2011 5:20 am

LeonBlade wrote:By the way, be careful you don't want to spill your buckets.

ROFL and spilling the buckets if you mean technically is impossible. But laughed at the metaphor

Post by **dandymcgee** » Thu Nov 24, 2011 10:46 pm

tappatekie wrote:I am literally writing everything from scratch including the database system and yes I am searching every entry in the database but only in a given table.

If this project is for personal experience, or research, then this sounds awesome. However, if you plan to actually use your system for a public website then you are grossly underestimating the intricacies of a modern day relational database. You will almost certainly want to use MySQL or similar rather than a home brewed solution. For now, I'll assume this is a personal research project.

Even if you don't plan to have this system used by another aside from yourself, and perhaps a few friends, there is much that can be learned from how existing databases manage data storage and queries. If our databases did a simple linear search on the entire data set every time a query was made, most of today's websites would not be even close to possible. A good place to start learning elementary methods for searching data sets is reading up on Binary Search Trees. BSTs are the founding idea of even the most complex structures for querying data. Any large database is going to need at least one sorted index. More realistically, there is an index for any particular field (generally corresponding to one column in the database) that will be searched on.

If you have any specific questions, feel free to ask them here.

tappatekie · Post by **tappatekie** » Fri Nov 25, 2011 1:42 am

Thanks for the advice

It's a bit of both with the commercial/experience because I am seeing how far I can go with the database system.

However I have benchmarked the query [with the buckets] on a database with 100,000 entries all randomly generated with the last entry fixed to a integer value and I used a query to find the last entry, I was surprised by the result of it taking 125ms on a Intel Atom processor.

Oh and the database entries have their own index flag at the first column

szdarkhack · Post by **szdarkhack** » Fri Nov 25, 2011 5:52 am

tappatekie wrote:Thanks for the advice

It's a bit of both with the commercial/experience because I am seeing how far I can go with the database system.

However I have benchmarked the query [with the buckets] on a database with 100,000 entries all randomly generated with the last entry fixed to a integer value and I used a query to find the last entry, I was surprised by the result of it taking 125ms on a Intel Atom processor.

Oh and the database entries have their own index flag at the first column

This is a great result that clearly shows that commercial database managers (software around 60MB or more), are way way ahead. This is due to 2 facts. First, the size. When a piece of software (pure software, not resources) is that large, it means it was created by a large team and over a long period of time. Much more than what a single person, however smart, can possibly achieve. Second, the development time. Most such systems have been in development for many years now, so the techniques they've come to use are the result of countless iterations.

It still is, however, a very useful experience, so let's see how you can improve your running time. I have a number of suggestions:

-First off, you are going to HAVE TO use sorted indices. Linear search (O(n)) can NOT compete in any way with BSTs (O(logn)), so that's a good start. There are other data structures as well (such as hash indices for a specific type of queries), but BSTs are very efficient and very simple and they work quite well in all kinds of indexed queries, so i would stick with that.

-Second, you might want to try to write a very simple query optimizer. Nothing too complex, just convert SQL to a relational algebra tree and try to identify when you can use indexed columns to speed up your queries. If you have no idea what the above meant, i assume you are using your own query language. Again, try to identify when an indexed column is used in the query to use BST search instead of linear search.

-Third, BST search is a structure that lends itself very well to parallelism. What i mean is that when you search through the tree, expanding one node does not depend on it's siblings, so you can create a bunch of threads to vastly improve your search time. And the awesome thing is that you don't even have to lock anything in this particular case, each thread will be completely independent.
-EDIT: Actually never mind that, the overhead of threading would kill any benefit because each thread would not really traverse the tree. Stupid idea, you won't need threading with BSTs. It WOULD help, however, in linear search, i.e. when an index cannot be used in the query. In this case, you can split the data (say, you have 100k rows, you can make 5 threads each searching only 20k rows). This WILL improve your search time, as long as you don't keep reading data from the disc. What i mean is that you should read each thread's "slice" of the data and keep it in memory BEFORE you start it. If the column as a whole is too large, you should "schedule" its reads. For example, as soon as one of the 5 threads is done, release it's memory, load the next 20k records and restart it with the new data. You can optimize this of course, but i think you get the idea

These are just a few suggestions for what you're trying to do. I would also strongly recommend http://www.db-class.org/, since you're interested on the subject. If you're not already taking the course, even the lecture videos alone are a great help. Of course there is homework if you're willing to devote your time, but even the videos alone will give you a much better idea on how a DBMS works.

tappatekie · Post by **tappatekie** » Fri Nov 25, 2011 11:06 am

Thanks, I will attempt to do the first one and look at the site.

And btw, the server pre-loads the database into memory (I know it's a bad plan but planning on a re-write since some of the database was old code) but Ive coded a way for multiple servers to communicate with each other. Like 1 query will go to 4 servers and if one server sends the found Y, then it tells the other servers to stop.

Atm, the file for the database is like this (I did'nt plan or think of the format that much so don't think im stupid for doing it like that) [column length][row count][entry byte length][entry] table token = byte[] column length = byte row count = byte[] entry length = byte[] entry = byte[] Each entry has a start byte value of how big that data is. And yes I am using my own query language (sort of) It's really simple and basic e.g "0=Hello&1=Test" checks if the first and second index equals "Hello" and "Test" Oh and also I found a way to have a ulong (unsigned 64 bit integer) into a byte array It works like this [byte length][8byte chunk of the binary code] It converts the ulong to binary, then splits it up into 8 bits each, hense the byte bit.. My plan is to compress all this into a format like this [table token][column count][index byte length][index=700] (location in the file)[/index][entry][end token][/entry] table token = byte[] column count = byte index byte length = byte[] entry = byte[] end token = byte This is to allow the sysem to load the file and only get the position it wants quickly without loading the whole file, and also skips the other entries since it knows exactly where in the stream it is. What I could possible have as well is sorted index at the header [I use QuickSort for the sorting] then if I want say a value to have "Hello", it knows where "H" is in the sorted list (vague idea)

szdarkhack · Post by **szdarkhack** » Fri Nov 25, 2011 12:17 pm

If i understand correctly, you have the table as an "attribute" for each row. This can lead to some issues, for example, if you insert rows into two tables, one after the other, the rows will start getting mixed up and you'll have to do a lot of work to seek to the correct row. I would suggest that you store each table in a separate file for simplicity and consistency. Of course, you can also keep many tables in one file, but inserting will become a big hassle.

Now, indices. You should have your indexed column *sorted*, either in a separate file or at the beginning of the table file in the following structure:

[value][seek_position]

value = the value of the index
seek_position = the position of the corresponding row in the table file

In this manner, when you run a query using an index, you will only need to do a BST search through the sorted index, get the seek position and seek to it in the table file. Much faster than searching through a sea of unrelated rows (even rows that are not in the same table).

Again, make sure to take a look at the course i linked you to, it's obvious that you're really interested in database systems, and that course will help you understand much more about how they work. Anyway, i hope the suggestion above gets you on the right track on speeding your system up

tappatekie · Post by **tappatekie** » Fri Nov 25, 2011 3:13 pm

the is only there to tell the parser that it's a table, it's not to split up tables. I have 1 file per table and it works like The index byte length is their to tell the stream how much it should read when it wants the table index [table token][column count][index byte length] [index=700] (location in the file)[/index] [index=700] (location in the file)[/index] [index=700] (location in the file)[/index] etc... [entry][end token][/entry] [entry][end token][/entry] [entry][end token][/entry] etc.... Sorry if I was'nt clear And i meant by index=700 by [index value]=[seek value] or (location in file) It has'nt actually got an equal character, it's there for simplicity so in the db it'l be like 103700 it's length then value (due to simplicity, ive done the length of the string not the byte value for 7000 or 0) length: 1 value: 0 ===EQUAL=== length: 3 value: 700 And in the query, the index in the query is the x coord, like if the first entry on the row is.... etc I have had a look at the database lecture site and I see some pretty interesting stuff, but I can't watch any properly yet since I have to comment every line of code (nearly) for the webserver, database [current build] and file management system. It comes down to 10,000 lines of code :8, I did'nt comment because I was, well lazy but then realized that, "wait, I may need to go back and see this code..." so I just decided that now is the best time to do it before I write another line.

tappatekie · Post by **tappatekie** » Fri Nov 25, 2011 3:36 pm

Sorry don't mean to spam, but I just came up with a good Idea,

What if I have all the rows in one go as in

Row1Entry1,Row1Entry2,Row2Entry1,Row2Entry2

Then just say like, I want 0 to be "Row2Entry1"

So I pass y=1, x=0
so the location will be (y+1)*(x+1) //Still thinking of this formula....

Maybe I can pre-index all there lengths? then do
seek = (loop until y but add the length of all the rows it comes across) + x

Rebornxeno · Post by **Rebornxeno** » Sat Nov 26, 2011 6:01 pm

How do you find the book your looking for in a library? Maybe the two things aren't much different! :O

tappatekie · Post by **tappatekie** » Sun Nov 27, 2011 9:24 am

I suppose

Post by **dandymcgee** » Sun Nov 27, 2011 9:58 am

Rebornxeno wrote:How do you find the book your looking for in a library? Maybe the two things aren't much different! :O

Wow...

Elysian Shadows

Searching

Searching

Re: Searching

Re: Searching

Re: Searching

Re: Searching

Re: Searching

Re: Searching

Re: Searching

Re: Searching

Re: Searching

Re: Searching

Re: Searching

Re: Searching

Re: Searching

Re: Searching