[solved] Too many regexes: how to optimize processing speed?

- Estecka
- - Joined 21 Dec, 2015
  - 12 topics • 56 posts
- 28 Jun, 2016
I wrote some code that analyzes a provided HTML chunk to find all the occurrences of a certain tag and retrieve the unique data within each of these tags.

It's a matter of about 200 (maybe 1,000) occurrences per string, from a file of 3 million characters, with several of these files.

So far, with a shorter file of 32 occurrences, it already takes 10 seconds to find them all, but with the bigger files it takes 44 s to find only 32 of the occurrences, so I expect about 5 min to process the whole file.

There are obvious flaws in my code; it's kind of dirty, but I know no other way of going about it.

Most notably, I don't know how to retrieve more than one variable per regex test, and I don't know how to retrieve the Nth match of a regex if it has multiple matches.

As a result, my code looks like this:
- Every tick, if the string matches the regex (first test)
- - Set a dictionary entry to a RegexMatchAt() (second test)
- - Set the original string to a RegexReplace() (third test) to remove the previously gathered match, so it's not matched again.
It starts over at the next tick until done.

38 million characters processed three thousand times does sound like a lot of processing.

The back of my mind is telling me I could gather all the data in a single test, but I just have no idea how.
- Gearworkdragon
- - Joined 2 Feb, 2015
  - 58 topics • 390 posts
- 29 Jun, 2016
Isn't Construct 2's debug performance check limited to a single core, rather than using multi-threading or multiple cores? Have you tried exporting it as a standalone exe to see whether your own computer can handle it without the limitations of the debugger?
- Estecka
- - Joined 21 Dec, 2015
  - 12 topics • 56 posts
- 29 Jun, 2016
"Isn't Construct 2's debug performance check limited to a single core, rather than using multi-threading or multiple cores?"

I... simply didn't understand that bit.
- R0J0hound
- - Joined 15 Jun, 2009
  - 91 topics • 7,641 posts
- 29 Jun, 2016
Without a capx it's hard to come up with a solution. You could get it all in one go with a loop:

repeat RegexMatchCount times

--- dictionary: add key RegexMatchAt(loopindex)

To make it so it doesn't stop at the first occurrence you could use "g" as the flags parameter. That way you wouldn't need to replace the text and make the regex need to process the text again.
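
As a rough illustration of the single-pass idea, here is a minimal sketch in Python rather than in Construct events. The sample HTML, the tag and the `data-id` attribute are made up for the example, and `re.finditer` stands in for a regex used with the "g" flag: it walks every match in one scan, so the text never has to be modified and rescanned.

```python
import re

# Hypothetical sample standing in for the real HTML; the tag and the
# attribute name are invented for illustration.
html = '<p>junk</p><a data-id="A1">x</a>junk<a data-id="B2">y</a><a data-id="C3">z</a>'

# One compiled pattern, one pass over the text. finditer visits every
# non-overlapping match, much like a regex with the "g" flag.
pattern = re.compile(r'<a data-id="([^"]+)">', re.IGNORECASE)

found = {}
for index, match in enumerate(pattern.finditer(html)):
    # group(1) is the captured value; extra capture groups could pull
    # several fields out of the same match.
    found[index] = match.group(1)

print(found)
```

Capture groups are also one way to get more than one value out of a single regex test, since each group of a match can be read separately.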

There may be other ways to do it. Maybe this could give ideas:
- Estecka
- - Joined 21 Dec, 2015
  - 12 topics • 56 posts
- 29 Jun, 2016
Thanks for the suggestion.

I tried to avoid an "actual" loop so I could display the progress in real time (not processing everything in a single tick), but making up a "fake" loopindex shouldn't be too hard, so I'll look into it.
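
The "fake" loopindex idea can be sketched as follows, in Python for illustration only. The function name `matches_in_batches`, the batch size and the sample text are all hypothetical; each batch plays the role of one tick, so progress can be reported between batches while the text is still scanned only once. How this maps onto Construct events (presumably a global counter incremented every tick) is an assumption.

```python
import re
from typing import Iterator, List


def matches_in_batches(text: str, pattern: str, batch_size: int) -> Iterator[List[str]]:
    """Yield captured values a few at a time without rescanning the text."""
    batch: List[str] = []
    for match in re.finditer(pattern, text):
        batch.append(match.group(1))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Hand over whatever is left after the last full batch.
        yield batch


# Hypothetical sample text; "-" stands for junk.
text = "--DATA1----DATA2--DATA3------DATA4---DATA5-"

collected: List[str] = []
ticks = 0
for batch in matches_in_batches(text, r"(DATA[0-9])", batch_size=2):
    ticks += 1  # one batch per simulated tick
    collected.extend(batch)
    print(f"tick {ticks}: {len(collected)} matches so far")
```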

Let's work out how to use that g flag.

I'm relying heavily on Pode's HTML plugin in this project, so I don't know whether you could open the capx without the plugin installed.
- Estecka
- - Joined 21 Dec, 2015
  - 12 topics • 56 posts
- 1 Jul, 2016
I found an absolutely brutal optimization. 3,500% faster!

A string that used to take 10 min to process is now done in 17 s!

Let's suppose my string looks like this, where - represents junk data:

----DATA1-------DATA2------------------------DATA3------

There is an awful lot of junk between the data I want to collect; I only have use for 1 character out of 14.

So what I'm doing before processing the string is filtering out all of the junk thanks to a single regex test:

RegexReplace(TextBox.Text, ".+?(DATA[0-9]).+?", "giu", "| $1 |")

which returns an amazingly shorter string:

|DATA1||DATA2||DATA3|------

There's still a bit of junk at the end somehow, but so little compared to before that I barely even care.
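
A possible explanation for the leftover junk: the pattern only matches where a DATA token is present, so whatever follows the last token never matches and is left untouched. The sketch below reproduces this in Python (an assumption: its regex behaves the same way as Construct's for this pattern) on the sample string from above, and also shows that collecting the tokens directly skips the junk without any replace step.

```python
import re

# The sample string from the post, built explicitly so the junk lengths are clear.
text = "-" * 4 + "DATA1" + "-" * 7 + "DATA2" + "-" * 24 + "DATA3" + "-" * 6

# Same pattern as in the post: lazy junk, the wanted token, then at least
# one more junk character. Text after the last token contains no token,
# so the pattern cannot match there and that tail is kept as-is.
filtered = re.sub(r".+?(DATA[0-9]).+?", r"| \1 |", text, flags=re.IGNORECASE)
print(filtered)

# Collecting the tokens directly ignores the junk entirely.
tokens = re.findall(r"DATA[0-9]", text, flags=re.IGNORECASE)
print(tokens)
```

Note that `.` does not match line breaks by default, so on real multi-line HTML any line without a token would also survive the replace.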