LFS Forum - Simple(?) PHP regex

#1 - broken

Simple(?) PHP regex

Tue, 1 Feb 2011 13:54

Hi, I am having a little difficulty with regexes, as you have already probably guessed from the title.

Actually, I've learned them a few times now. But every time I need to use one, I need to learn them again. They just don't want to stay in my head.

Basically, what I want to achieve is to match a specific letter and all the letters that follow it, until it reaches a character that is no longer a letter. I want to do it for Cyrillic letters, if that matters. But it should be the same thing as for Latin ones, I guess.

I am using http://www.functions-online.com/preg_replace.html to test as soon as I find something that might solve the task.
Here's what I'm testing with now:

$pattern = array('/я[а-я]+[^а-я]/', '/ф[\p{L}]+[^\p{L}]/', '/м[\p{Cyrillic}][^\p{Cyrillic}]/');
$replacement = array('january', 'february', 'march');
$subject = 'януари февруари some english март';

January almost works.
February is a little worse.
March just doesn't.

In January, I think I only need to make it not stop after same matches? I tried placing a dot before the plus ('/я[а-я].+[^а-я]/'), but then it returned only "january".. xD

Thanks for any help in advance!

[E] The result from this should be "january february some english march".
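
For reference, here is a minimal sketch of one way to get that result, assuming the input is UTF-8. The `u` modifier and the `(?<!\p{L})` lookbehind are not part of the patterns above. They are assumptions added here so that PCRE treats each Cyrillic letter as one character instead of two bytes, and so that a match can only begin at the start of a word.

```php
<?php
// Assumed patterns (not from the post above): a fixed prefix, then any
// further letters. The "u" modifier switches PCRE to UTF-8 mode.
$patterns = array(
    '/(?<!\p{L})я\p{L}*/u',
    '/(?<!\p{L})ф\p{L}*/u',
    '/(?<!\p{L})мар\p{L}*/u',
);
$replacements = array('january', 'february', 'march');
$subject = 'януари февруари some english март';

// preg_replace() applies each pattern in turn to the whole subject.
$result = preg_replace($patterns, $replacements, $subject);
echo $result, "\n"; // january february some english march
```

Because `\p{L}*` stops at the first non-letter, the space after each word is left untouched.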

#2 - thisnameistaken

Tue, 1 Feb 2011 14:12

I am having trouble understanding what this is supposed to do, I think all the Cyrillic characters are confusing me.

So the first pattern should pull the word for january out of the string? And replace it with the English word 'january'?

If you're writing a pattern to match only one specific substring (it seems you're using different patterns for february and march) wouldn't it make more sense to just use str_replace()?

#3 - broken

Tue, 1 Feb 2011 14:25

Quote from thisnameistaken :I am having trouble understanding what this is supposed to do, I think all the Cyrillic characters are confusing me.

So the first pattern should pull the word for january out of the string? And replace it with the English word 'january'?

Yes.

Quote from thisnameistaken :If you're writing a pattern to match only one specific substring (it seems you're using different patterns for february and march) wouldn't it make more sense to just use str_replace()?

Well, I just want it to be more flexible. You know, in English, you can write February, or you can write Feb. You can do the same in Bulgarian.

Anyway, I think I found a solution.
Just make the pattern like so: F[ebruary].
Of course, that would mean that June and July patterns would be like this: Jun[e] and Jul[y].
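
It may be worth checking what a pattern like F[ebruary] actually matches before relying on it. The sketch below uses the English month name only for readability; the prefix pattern `/\bFeb[a-z]*/` is an assumed alternative, not something taken from this thread.

```php
<?php
// A character class matches exactly ONE character from the set, so
// /F[ebruary]/ consumes "F" plus the single letter that follows it.
preg_match('/F[ebruary]/', 'February', $m);
$classMatch = $m[0]; // "Fe"

// Only "Fe" is replaced; the rest of the word is left behind.
$classReplace = preg_replace('/F[ebruary]/', 'February', 'February');

// Assumed alternative: a fixed prefix followed by any further Latin letters.
$prefixPattern = '/\bFeb[a-z]*/';
$full  = preg_replace($prefixPattern, 'February', 'February');
$short = preg_replace($prefixPattern, 'February', 'Feb');

echo $classMatch, "\n";   // Fe
echo $classReplace, "\n"; // Februarybruary
echo $full, "\n";         // February
echo $short, "\n";        // February
```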

#4 - thisnameistaken

Tue, 1 Feb 2011 14:42

Ah OK. Well using str_replace would mean you need 24 strings/replacements instead of 12, which might seem like an untidy way to do it, but str_replace runs a lot faster than preg_replace.
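
A rough sketch of that str_replace approach, using only three months as an illustrative subset. The ordering (full names before abbreviations) is an assumption made here, because str_replace works through the search array in order and a short form would otherwise eat the start of the long one.

```php
<?php
// Illustrative subset only: full names first, then their abbreviations.
$search  = array('януари', 'февруари', 'март', 'ян', 'фев', 'мар');
$replace = array('January', 'February', 'March', 'January', 'February', 'March');

$result = str_replace($search, $replace, 'януари фев март');
echo $result, "\n"; // January February March
```

Note that str_replace matches substrings anywhere, not whole words, so a short abbreviation would also be replaced inside an unrelated longer word.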

But now I understand what you're trying to do, I don't understand what the problem is.

You say given the Bulgarian word for 'january' your function is returning 'january' - is this not correct behaviour? Can you not just copy the same style of pattern to use for february, march, etc.?

#5 - NotAnIllusion

Tue, 1 Feb 2011 14:49

It's only replacing & returning the first word in the line, but he wants the rest replaced where possible as well I guess.

#6 - thisnameistaken

Tue, 1 Feb 2011 15:11

Quote from NotAnIllusion :It's only replacing & returning the first word in the line, but he wants the rest replaced where possible as well I guess.

But there are no more instances of the word 'january' in the string.

#7 - NotAnIllusion

Tue, 1 Feb 2011 15:31

"$subject = януари февруари some english март" => The result from this should be "january february some english march".

It is supposed to replace and return all cyrillic words in the input string in one go, it seems to me.

#8 - broken

Tue, 1 Feb 2011 16:00

Quote from NotAnIllusion :"$subject = януари февруари some english март" => The result from this should be "january february some english march".

It is supposed to replace and return all cyrillic words in the input string in one go, it seems to me.

Indeed. Sorry for the late response.

But, there are more than 24 instances in the Bulgarian language. It's just to be more flexible as I said. And there is a reason for that.

Different people write them in different ways. If I support 24, why not just support the way they will all start typing them?

Example with what should be turned into "September":
Септември (Translates: September)
Се (Translates: Se. Unlikely, but not impossible to be shortened that way.)
Сеп (Translates: Sep)
Септ (Translates: Sept)

And with the regular expression I'm using (the last one I posted), it's also typo-safe. It may use more resources, but in my case that's better than reloading the page just to tell the user their date has not been recognized.

Here's the code I'm using, and it works fine so far:



<?php 
function validate_months($string)
{
    $invalid = array
    (
        "/я[нуари]/",
        "/ф[евруари]/",
        "/мар[т]/",
        "/ап[рил]/",
        "/май/",
        "/юн[и]/",
        "/юл[и]/",
        "/ав[густ]/",
        "/с[ептември]/",
        "/о[ктомври]/",
        "/н[оември]/",
        "/д[екември]/"
    );
    
    $valid = array
    (
        "January",
        "February",
        "March",
        "April",
        "May",
        "June",
        "July",
        "August",
        "September",
        "October",
        "November",
        "December"
    );
    
    $string = preg_replace($invalid, $valid, mb_strtolower($string));
    // And to strip other weird stuff, I've created another function
    $string = strip_weirdness($string);
    return $string;
}
?>

PS: Sorry for the confusing reply. Hope you understand what I'm trying to say.

#9 - Dygear

Tue, 1 Feb 2011 17:42

Broken, if I understand the situation correctly, I am pretty sure that is the best you're going to do. There is no better solution, that I can see, than the regex you posted. So congratulations, you're home.

#10 - broken

Wed, 2 Feb 2011 16:28

[Edit] Actually, this isn't so true... Needs more tweaking...

Thanks everyone.

For anyone who cares about the final solution: The last code I posted worked fine, until I started testing it thoroughly. It failed here and there, so I started tweaking it. And the final result is exactly the answer to my very first question.

The regex is actually very simple (if you understand them), but it took me around 3 days to figure it out. The code looks like this:


		$invalid = array
		(
			"/[^а-я]я[.^а-я]*/",
			"/[^а-я]ф[.^а-я]*/",
			"/[^а-я]мар[.^а-я]*/",
			"/[^а-я]ап[.^а-я]*/",
			"/[^а-я]май[.^а-я]*/",
			"/[^а-я]юн[.^а-я]*/",
			"/[^а-я]юл[.^а-я]*/",
			"/[^а-я]ав[.^а-я]*/",
			"/[^а-я]сеп[.^а-я]*/",
			"/[^а-я]о[.^а-я]*/",
			"/[^а-я]н[.^а-я]*/",
			"/[^а-я]дек[.^а-я]*/",
			"/[^а-я]сек[.^а-я]*/",
			"/[^а-я]ми[.^а-я]*/",
			"/[^а-я]ч[.^а-я]*/",
			"/[^а-я]ден[.^а-я]*/",
			"/[^а-я]дни[.^а-я]*/",
			"/[^а-я]ме[.^а-я]*/",
			"/[^а-я]г[.^а-я]*/"
		);

And in English, that is similar to:


		$invalid = array
		(
			"/[^a-z]ja[.^a-z]*/",
			"/[^a-z]f[.^a-z]*/",
			"/[^a-z]mar[.^a-z]*/",
			"/[^a-z]ap[.^a-z]*/",
			"/[^a-z]may[.^a-z]*/",
			"/[^a-z]jun[.^a-z]*/",
			"/[^a-z]jul[.^a-z]*/",
			"/[^a-z]au[.^a-z]*/",
			"/[^a-z]s[.^a-z]*/",
			"/[^a-z]o[.^a-z]*/",
			"/[^a-z]n[.^a-z]*/",
			"/[^a-z]d[.^a-z]*/"
		);

I'm using the English one for translation purposes - to bring back the correct Bulgarian name of the month. Yes, maybe it's too complicated and not optimized, but I was looking for a reason to learn something new.

I might explain what the regex does, for those who want. But that is if I have time for it, and if I actually remember to do it.

#11 - broken

Thu, 3 Feb 2011 10:52

Final, perfectly working solution:

Bulgarian to English:


		$invalid = array
		(
			"/(?<=[^р-я])я[.^р-я]*/",
			"/(?<=[^р-я])ф[.^р-я]*/",
			"/(?<=[^р-я])мар[.^р-я]*/",
			"/(?<=[^р-я])ап[.^р-я]*/",
			"/(?<=[^р-я])май[.^р-я]*/",
			"/(?<=[^р-я])юн[.^р-я]*/",
			"/(?<=[^р-я])юл[.^р-я]*/",
			"/(?<=[^р-я])ав[.^р-я]*/",
			"/(?<=[^р-я])сеп[.^р-я]*/",
			"/(?<=[^р-я])о[.^р-я]*/",
			"/(?<=[^р-я])н[.^р-я]*/",
			"/(?<=[^р-я])дек[.^р-я]*/",
			"/(?<=[^р-я])сек[.^р-я]*/",
			"/(?<=[^р-я])ми[.^р-я]*/",
			"/(?<=[^р-я])ч[.^р-я]*/",
			"/(?<=[^р-я])ден[.^р-я]*/",
			"/(?<=[^р-я])дни[.^р-я]*/",
			"/(?<=[^р-я])ме[.^р-я]*/",
			"/(?<=[^р-я])г[.^р-я]*/"
		);
		
		$valid = array
		(
			"January",
			"February",
			"March",
			"April",
			"May",
			"June",
			"July",
			"August",
			"September",
			"October",
			"November",
			"December",
			"seconds",
			"minutes",
			"hours",
			"day",
			"days",
			"months",
			"years"
		);

English to Bulgarian:


		$invalid = array
		(
			"/(?<=[^a-z])ja[.^a-z]*/",
			"/(?<=[^a-z])f[.^a-z]*/",
			"/(?<=[^a-z])mar[.^a-z]*/",
			"/(?<=[^a-z])ap[.^a-z]*/",
			"/(?<=[^a-z])may[.^a-z]*/",
			"/(?<=[^a-z])jun[.^a-z]*/",
			"/(?<=[^a-z])jul[.^a-z]*/",
			"/(?<=[^a-z])au[.^a-z]*/",
			"/(?<=[^a-z])s[.^a-z]*/",
			"/(?<=[^a-z])o[.^a-z]*/",
			"/(?<=[^a-z])n[.^a-z]*/",
			"/(?<=[^a-z])d[.^a-z]*/"
		);
		
		$valid = array
		(
			"Ян",
			"Фев",
			"Мар",
			"Апр",
			"Май",
			"Юни",
			"Юли",
			"Авг",
			"Септ",
			"Окт",
			"Ное",
			"Дек"
		);

#12 - morpha

Thu, 3 Feb 2011 12:04

Which I hope you won't use. There is a reason this is almost exclusively done with <select>: it's reliable, much faster, easier to use, and language independent.

A regex that filters months or even complete dates from full text would be interesting, but from a text field that is expected to contain only a month's name anyway? I understand you wanted to get this working just to learn from it, but you should also understand that that was its sole purpose; it's not really useful.

#13 - joshdifabio

Thu, 3 Feb 2011 17:38

Is [.^a-z] really what you want? That will match '.', '^' or any character in the a-z range whereas [^a-z] will match anything which is not a character in the a-z range.

Strfriend is useful when you're building regexes.

#14 - broken

Thu, 3 Feb 2011 21:15

[.^a-z]*
Not just [.^a-z]

And it won't match . or ^ or any a-z character.
[\.\^a-z] , however will, I believe. At least in PHP.

d[.^a-z]*
This will match anything that starts with d, and keep matching everything on its way, until it hits a non a-z character. Then, it will stop. So basically, in "abcdefghijk lol" this will match "defghijk". And this is exactly what I was looking for. The head of it in the full solution above makes sure that there are no a-z characters preceding it either. So, it just gets a full word that starts with the sequence you specify in-between.

#15 - morpha

Thu, 3 Feb 2011 21:23

[.^a-z] WILL match . ^ and a-z, if you want it to NOT match those you have to start the character class with the circumflex (^) to make it a negated character class. The behaviour of most special characters is different inside character classes because, in fact, they are not special characters in them. Circumflex at the very beginning means "negate" but anywhere else in the class it'll just be the circumflex character.
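
A small sketch that shows the difference; the test string is made up for illustration.

```php
<?php
$subject = 'blah something.^^ blah';

// Inside a class, "." and a non-leading "^" are literal characters,
// so the match runs on through the ".^^" after the word.
preg_match('/s[.^a-z]*/', $subject, $m);
$literal = $m[0]; // "something.^^"

// A plain a-z class stops at the first character that is not a letter.
preg_match('/s[a-z]*/', $subject, $m);
$letters = $m[0]; // "something"

// A leading "^" negates the class: this finds the first run of
// characters that are NOT lowercase a-z, here the first space.
preg_match('/[^a-z]+/', $subject, $m);
$negated = $m[0]; // " "

var_dump($literal, $letters, $negated);
```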

#16 - joshdifabio

Thu, 3 Feb 2011 23:01

Quote from broken :[.^a-z]*
Not just [.^a-z]

And it won't match . or ^ or any a-z character.
[\.\^a-z] , however will, I believe. At least in PHP.

Nah, that's wrong. It is as I described in my last post and morpha goes into more detail above.

Look at the link I provided in my last post; it's very useful when building regexes and understanding what they actually do.

#17 - broken

Fri, 4 Feb 2011 09:43

Yeah, sorry.

It was late and I just replied without even testing if what I was saying is right. I just tested the regex with a "blah blah blah something.^^ blah blah blah", and it did indeed return the "something" with the ".^^" behind it.

I just thought that .^a-z meant (. = everything)(^a-z = until it hits a non a-z character)(* = keep repeating until the non a-z character is hit). And yeah -.- ...
So, I guess the regex in the code now should look like this:
/(?<=[^a-z])start[a-z]*/
Tested that and it seems to work perfectly. Thanks for the help.

And uh, sorry for my ignorance again..
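
One caveat with this form, sketched below: `(?<=[^a-z])` needs an actual character in front of the word, so it cannot match at the very start of the string. The negative lookbehind `(?<![a-z])` is an assumed variant (not from this thread) that only requires that no letter comes before. The prefix "jan" and the test strings are made up for illustration.

```php
<?php
$positive = '/(?<=[^a-z])jan[a-z]*/'; // form used in the post above
$negative = '/(?<![a-z])jan[a-z]*/';  // assumed variant

$atStart = 'january 5';
$inside  = 'on january 5';

$r1 = preg_replace($positive, 'Ян', $atStart); // no match at offset 0
$r2 = preg_replace($negative, 'Ян', $atStart);
$r3 = preg_replace($positive, 'Ян', $inside);
$r4 = preg_replace($negative, 'Ян', $inside);

echo $r1, "\n"; // january 5
echo $r2, "\n"; // Ян 5
echo $r3, "\n"; // on Ян 5
echo $r4, "\n"; // on Ян 5
```

If the real input always has some character before the month name, the positive lookbehind behaves the same as the negative one.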

#18 - joshdifabio

Fri, 4 Feb 2011 11:40

FYI, /(?<=[^a-z])start[a-z]*/ is case sensitive and /(?<=[^a-z])start[a-z]*/i is case insensitive.
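
For the Bulgarian patterns, a sketch of the same idea, assuming UTF-8 input: the `i` modifier is combined with `u` so that PCRE can fold the case of Cyrillic letters. The `\p{L}` pattern here is an assumption for illustration, not the pattern from the posts above.

```php
<?php
$subject = 'Септември 2011';

// Case sensitive: lowercase "сеп" does not match the capitalised word.
$sensitive = preg_match('/(?<!\p{L})сеп\p{L}*/u', $subject);

// Case insensitive in UTF-8 mode: the whole word is matched.
$insensitive = preg_match('/(?<!\p{L})сеп\p{L}*/iu', $subject, $m);

var_dump($sensitive);   // int(0)
var_dump($insensitive); // int(1)
echo $m[0], "\n";       // Септември
```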