Nick Eff ([personal profile] roadrunnertwice) wrote in [site community profile] dw_dev 2020-06-05 02:27 pm

Named Markup Formats

I've been working on a thing, and I could use some feedback on the implementation. It might take a little explaining (because there's a fair amount of backstory), but I'll try and be as concise as possible.

Backstory: Raw and transformed text

DW stores the text of entries and comments raw, exactly as the user entered it. Then, whenever we need to display that text, we transform it to produce nice legible HTML. Those transformations include:

  • Turning <user> tags (which aren't real HTML) into [profile] user, which is really like a span plus an image plus a link.
  • Handling <cut> tags.
  • Several other things, etc., not important right now.

Most of those get applied to everything we display. But there are also some OPTIONAL transformations:

  • Turning normal line breaks into HTML <br> tags.
  • Turning bare URLs into clickable links.

Those get applied by default, but we've always had a "don't autoformat" checkbox (inherited from LJ) that could be used to disable them for an entry or comment. (BTW, under the hood the RTE saves entries as "don't autoformat.")

Then, later, DW added some other optional transforms, which had their own special enabling conditions:

  • Turning Markdown into HTML. (For entries that start with a special !markdown glyph, or comments submitted by email.)
  • Turning @mentions into user tags. (Currently applies to everything except "don't autoformat," but gets suppressed within certain HTML elements or their Markdown equivalents.)

ALL of these transforms get handled by something called the "html cleaner," at LJ::CleanHTML. At this point "cleaner" is kind of a misnomer; in actual fact, it's the central place where we handle all transformations of raw user-entered text into a fragment of display HTML.
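
To make the "stored raw, transformed at display time" idea concrete, here is a minimal Python sketch. It is not the cleaner's actual code (LJ::CleanHTML is Perl and far more involved); the function names, the URL pattern, and the `dont_autoformat` parameter are all assumptions made up for illustration.

```python
import re

# Hypothetical stand-ins for the two optional transforms described above.
# These are simplified illustrations, not LJ::CleanHTML's real rules.
URL_RE = re.compile(r"(?<![\"'>=])\bhttps?://[^\s<]+")

def autolink(text: str) -> str:
    """Wrap bare http(s) URLs in anchor tags."""
    return URL_RE.sub(lambda m: f'<a href="{m.group(0)}">{m.group(0)}</a>', text)

def linebreaks_to_br(text: str) -> str:
    """Turn each newline into an HTML line break."""
    return text.replace("\n", "<br>\n")

def render(raw: str, dont_autoformat: bool = False) -> str:
    """Transform raw stored text into display HTML at read time."""
    html = raw  # the always-on transforms (<user>, <cut>, ...) would run here
    if not dont_autoformat:
        html = autolink(html)
        html = linebreaks_to_br(html)
    return html
```

The stored text never changes; only the output of `render` does, which is why changing a transform retroactively changes how every old post displays.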

The problems

In my understanding, the current state of affairs has two main problems:

  • The interface for choosing text transforms is incoherent. That happened gradually; we've added new transformations over time, and changed the interactions between them, and now it's weird:
    • Half of the interfaces for entering entry/comment text don't even include the "don't autoformat" checkbox anymore.
    • The way you enable Markdown has always been a mystery. For example, I want to use Markdown in comments (because typing html angle brackets on a telephone is bullshit), but currently it's impossible except when responding via email.
  • Introducing new text transforms is dangerous and chaotic. In mid-2019, we enabled @mentions for HTML-formatted content (previously they only worked in Markdown content), and about 40% of hell broke loose:
    • Current content suffered because we didn't have a good way to beta-test @mentions, so we didn't have a chance to learn about bugs and edge cases from our users (who are more ingenious at doing weird textual shit than we are) before enabling them globally.
    • Old posts weren't written to expect @mentions, so we ended up totally vandalizing any historical post that ever discussed CSS code (@media (min-width: etc...)), Perl or Ruby or Objective-C code, or a wide variety of other things that involve @ signs.
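
The second problem is easy to reproduce in miniature. The Python sketch below uses a deliberately naive mention rule (an assumption for illustration, not DW's actual pattern) and shows how gating the transform on a per-post format would leave old content alone. The format IDs are the ones proposed later in this post; the helper names are hypothetical.

```python
import re

# Deliberately naive rule: any "@word" becomes a <user> tag.
# Hypothetical pattern, for illustration only.
MENTION_RE = re.compile(r"@(\w+)")

# Formats that enable @mentions, per the format list proposed in this post.
FORMATS_WITH_MENTIONS = {"markdown0", "html_casual1"}

def naive_mentions(text: str) -> str:
    """Apply the mention transform unconditionally."""
    return MENTION_RE.sub(r'<user name="\1">', text)

def render_mentions(text: str, format_id: str) -> str:
    """Apply the mention transform only when the post's format asks for it."""
    if format_id in FORMATS_WITH_MENTIONS:
        return naive_mentions(text)
    return text

old_post = "Use @media (min-width: 600px) for wide screens."
```

Calling `naive_mentions(old_post)` turns the CSS at-rule into a user tag, while `render_mentions(old_post, "html_casual0")` returns the text unchanged.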

Questions: Does anyone disagree with those two problems or my characterization of them? Does anyone see any closely related problems that I'm not recognizing here?

The solution, maybe

I've got a pull request up for discussion right now that tries to solve these immediate problems, as well as several other problems I can see coming down the road at us.

In brief (lol, sorry, I'm trying!!), here's what I'm proposing we do:

  • Instead of a set of disconnected text transform options that can be independently twiddled, we codify a set of named markup formats. Each entry or comment will specify exactly one format.
    • To start with, this will include all of the valid combinations of the current formatting options, which ends up being fewer than you might think:
      • markdown0 - Markdown via the Text::Markdown module, plus our own @mentions.
      • html_casual1 - Classic HTML-but-it-respects-your-linebreaks, plus @mentions.
      • html_casual0 - Like casual 1, except without @mentions. Old content was written to expect this, but we don't respect that anymore. Imported and syndicated content still uses this, though.
      • html_raw0 - "Don't autoformat." No @mentions.
      • html_extra_raw - Syndicated content that we know will never use special DW tags like <user> or <cut>.
    • (For existing posts that don't specify a format, we use the existing independent options to guess a format. We also take the post date into account, so we can restore casual 0 on posts that were originally written in it.)
  • When formatting text for display, the specified format determines all of the transforms we use.
    • No more independent twiddlies. The format captures the entirety of the user's intent for how their markup should be handled.
  • Formats are versioned.
    • If we want to introduce new formatting behavior (like @mentions) or make changes that might have non-linear consequences (like switching from Text::Markdown to a different Markdown processor), we have to implement it as a new format (e.g. markdown1).
    • Old versions stick around in the HTML cleaner FOR ALL ETERNITY, so that we can continue to display content as it was originally intended.
    • This doesn't preclude making small changes for safety or consistency -- for example, putting a user tag inside a link always used to generate illegal link-within-a-link markup that we'd just ignore and let your browser sort out, but I recently made it so we strip the inner link. But anything that changes what a user would expect to get from their markup should cut a new format.
  • When posting or editing a comment or entry in the web UI, the form includes a drop-down for choosing which format you want to use.
    • (With a descriptive name, not weird IDs like html_casual1.)
    • We only offer users the "active" formats -- although we keep around legacy formats for displaying old content, we don't clutter the UI with them. Today, the active formats for non-syndicated posts would be markdown0, html_casual1, and html_raw0.
      • The exception is if you're editing older content that uses a legacy format. In that case, the menu includes the active formats PLUS the post's existing format, so you don't have to update all formatting within the post just to make a minor change.
    • We remember the last format you chose when making a new post, and use that as your default.
    • (new detail) If you deliberately change the format to something other than your current default, we give you the option to set that as your new default after saving your post/edit.
    • If we're introducing a new format, we might set it as active but mark it as beta, and keep the format it replaces for a while. When posting with a beta format, you're accepting the possibility that the markup behavior might retroactively change for that post, until the beta ends.
  • The selected format in a comment/entry form might change other behavior of the form.
    • Specifically, I'm thinking of markup helper buttons for the most common text styles, which is another back-burner project I'm working on -- the bold/italic/link buttons would add HTML tags for the HTML formats, and Markdown styles for Markdown.
    • Also thinking of the future replacement RTE here -- RTE would get its own named format (even though practically speaking it will probably emit something that looks like html_raw0), and switching your format to "Rich" would initialize the editor and its controls. Switching your format to something non-rich would serialize the editor's buffer as HTML and take you back to a plain textarea; we wouldn't need a separate control for switching to the RTE beyond the normal format drop-down that everything else uses.
    • ...I guess we could also add syntax highlighting or something for HTML and Markdown, but I don't think anyone's planning on it and I doubt anyone's desperate for it.
  • (new detail) When posting via email, we'll add a new post header for formatting. You can specify either a format ID (like html_casual0), or a shortcut ID to choose the most recent versions of casual HTML or Markdown. If you omit the header, it posts with the current global default for that type of content, ignoring your user default.
    • Why: Email is sort of halfway between a UI and an API. People who are automating email posts want to choose a stable and predictable format (so use IDs), but people who are writing in their real mail client probably want their posts to act like the version of the web form they're accustomed to. We'll ignore user defaults because your email client doesn't give you any feedback about what's going to happen, so we don't want it to depend on hidden settings state.
  • (new detail) When posting via the new API, you can specify a format ID or fall back to a global default. User default is ignored.
  • (new detail) Email and API posts aren't limited to the currently active formats; they can specify any format they want, including obsolete ones. This should help us avoid breaking automation that people build.
  • (new detail) When posting via the old XMLRPC API (so, old LJ clients like Semagic), you cannot select a named format. You'll be limited to today's formatting options (the "don't autoformat" checkbox if supported by the client, and the !markdown secret glyph), which will forever behave exactly like they do in mid 2020 (resulting in either markdown0, html_casual1, or html_raw0).
    • Why: Old client programs want things to keep working the way they've always worked... so that's what they'll get. And we don't want to add more secret glyphs, because they make things harder to maintain and understand.
  • (new detail) If a browser extension happened to mangle the web forms to re-enable obsolete formats, we wouldn't care. From our perspective, that's the same thing as an API client posting with a weird format, which is fine. (Removing obsolete formats from the menu is just about keeping things simple for most users.)
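
For readers who think better in code, here is a rough Python sketch of the core of the proposal: a registry of named formats, a guess for legacy posts, and the format menu shown when editing. It is not taken from the pull request. The class, the function names, and the cutoff date are assumptions (the post only says "mid-2019"), and treating html_extra_raw as mention-free is also a guess.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass(frozen=True)
class MarkupFormat:
    id: str
    mentions: bool  # whether @mentions become user tags
    active: bool    # whether the web UI offers it for new content

FORMATS = {f.id: f for f in (
    MarkupFormat("markdown0", mentions=True, active=True),
    MarkupFormat("html_casual1", mentions=True, active=True),
    MarkupFormat("html_casual0", mentions=False, active=False),
    MarkupFormat("html_raw0", mentions=False, active=True),
    MarkupFormat("html_extra_raw", mentions=False, active=False),  # assumed
)}

# ASSUMPTION: placeholder date for when @mentions reached HTML content.
MENTIONS_ENABLED_ON = date(2019, 7, 1)

def guess_format(is_markdown: bool, dont_autoformat: bool, posted: date) -> str:
    """Guess a format for a legacy post from its old independent options."""
    if is_markdown:
        return "markdown0"
    if dont_autoformat:
        return "html_raw0"
    return "html_casual0" if posted < MENTIONS_ENABLED_ON else "html_casual1"

def resolve_format(explicit: Optional[str], is_markdown: bool,
                   dont_autoformat: bool, posted: date) -> str:
    """Prefer the stored format; fall back to guessing for legacy posts."""
    if explicit in FORMATS:
        return explicit
    return guess_format(is_markdown, dont_autoformat, posted)

def menu_formats(current: Optional[str] = None) -> List[str]:
    """Active formats, plus the post's own format if it is a legacy one."""
    ids = [f.id for f in FORMATS.values() if f.active]
    if current in FORMATS and current not in ids:
        ids.append(current)
    return ids
```

Under these assumptions, `menu_formats()` offers only the three active formats, and `menu_formats("html_casual0")` appends the legacy format so a minor edit does not force a reformat.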

I THINK this will free us up to be much more nimble about modernizing the way we process text, and will increase user choice while ALSO making the site less confusing to use, which is a rare combo.

Questions: Does the approach make sense? Can you think of anything it would sabotage or prevent us from doing in the future? Can you see anything excessively complicated in the code itself? (By which I mean, can we make this simpler while still solving those two problems.)

Non-linear consequences

Here are the things I can think of that might be affected in weird ways by this change:

  • External clients and integrations. We'll be expecting new entries and comments to arrive with a format specified (using the new-ish editor property), so anything that posts content without that will be locked into the "guessed" format that fits their other options. (That's raw 0 if they set "don't autoformat," and casual 1 if they didn't.) Over time, that'll drift more and more out of date; when we add some new @mention-like thing that forces us to cut a casual 2, we're going to keep guessing casual 1 for posts with no metadata, because all of OUR stuff properly sets the format type.
    • TBH, I think this is 100% fine. Feel free to make a case for otherwise, tho.
  • All the documentation/FAQ pages about formatting stuff will be immediately out of date.
    • On it, don't worry.
  • Once we finally make Markdown discoverable, people might start requesting we add other oddball lightweight markup formats, like textile or RST.
    • I think the answer to that is basically "no?" It should be "no."
  • Another thing that might happen is that we start getting pressure to add some of the niceties of more modern Markdown implementations, like fenced code blocks and ascii-art tables.
    • Well, at least we'd be able to do that safely, by cutting a markdown1 and leaving existing content on markdown0. It's just a question of what we actually want our Markdown to act like, and THAT is going to be an exciting conversation.
  • Doubling down on the importance and centrality of the HTML cleaner, which is a highly complex thicket of code that not a lot of people feel comfortable working with.
    • I'm not aware of any plans to replace the cleaner, but this would definitely require them to be rethought. That said, we're always going to need SOME central thing that governs text transformations, and we could move the implementation to a different spot in the code if need be.

Please suggest more of these if you think of them.

[personal profile] azurelunatic 2020-06-05 10:47 pm (UTC)
This sounds like it will retain the obnoxious bug-in-practice that goes like:

* User goes to start an entry
* User picks a format at semi-random
* User posts
* User discovers that it was not the intended format
* User edits to pick intended format
[some time passes]
* User makes new entry
* Format of new entry retains the original, incorrect, randomly selected format
* User believes that last entry was posted with correct format, is annoyed that this one isn't
* User edits most recent entry
* Cycle repeats
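
The cycle above can be replayed with a tiny Python model. This is only a sketch of the described behavior, assuming the default is remembered on new posts but not on edits; the class and the `update_on_edit` switch are hypothetical.

```python
class FormatPreference:
    """Tracks one user's default markup format (hypothetical model)."""

    def __init__(self, default="html_casual1", update_on_edit=False):
        self.default = default
        self.update_on_edit = update_on_edit

    def new_post(self, chosen=None):
        """Post with the chosen format, remembering it as the new default."""
        fmt = chosen or self.default
        self.default = fmt
        return fmt

    def edit_post(self, new_format):
        """Change an existing post's format; optionally update the default."""
        if self.update_on_edit:
            self.default = new_format
        return new_format


def replay(prefs):
    """Replay the steps listed above and return the next entry's format."""
    prefs.new_post("html_raw0")   # picked at semi-random
    prefs.edit_post("markdown0")  # corrected after posting
    return prefs.new_post()       # next entry uses the remembered default
```

With the defaults, `replay(FormatPreference())` lands back on the mistaken format; with `update_on_edit=True` the correction sticks.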


RTE

[personal profile] dennisgorelik 2020-06-06 03:40 am (UTC)
What is "RTE"?


[personal profile] pauamma 2020-06-06 04:57 am (UTC)
I believe this (or perhaps a simplified version) should be posted to a user-centric community instead of/in addition to here, so that end users can opine.
[personal profile] chalcedony_starlings 2020-06-06 10:41 am (UTC)

I would like to mention tangentially, from a relatively technical end-user perspective, that the inconsistency in places is horribly grating even when I do know how to and often prefer to write raw HTML. Specifically, though I'm not sure where this fits into the above, there's the amount of <!--, newline, --> in our profile text, which is there because there's no way to disable having a <br> jammed into the middle of what would otherwise be a void between list items or something.

Also, I've gotten into the irritating habit of never writing newlines in comments (I'm doing it right now, in fact), because I have an Aversion to having </p><p> be rendered as <br><br>, and the “no, don't give me this comment form, give me the one which allows me to do that but also doesn't show context in a remotely ergonomic way etc.” alternative got too grating to use over time. I take it the Markdown conversion does better at this?


Reduce complexity first?

[personal profile] dennisgorelik 2020-06-06 04:57 am (UTC)
Would it be possible to delete some features before adding new features?
E.g., is it possible to delete "Disable Auto-Formatting" option?


[staff profile] mark 2020-06-06 07:15 am (UTC)
This is thorough and thought provoking.

As the person primarily responsible for much of the mess we're in today, the driving goal behind Markdown and @-mentions is to make Dreamwidth easier to use and more accessible (as a technology). Having to learn to write HTML is not something one should _have_ to do in order to express oneself, and so I strongly feel that we should be moving in directions to make it easier.

That said, we are of course built on the idea of customization and being able to really go deep in what kinds of things you can do to your content. That has many positive things, but ultimately, LJ::CleanHTML is the kind of thing you end up with when you want to make real ultimate cosmic power but also combine it with a reading page that has to combine _my_ journal customization insanity with _your_ entry customization insanity. God help us all.

So here we are. You're untangling my hacks, and I appreciate that.

I think basically I agree with your proposal and what you're thinking here, and I concur that we should draw the line at Markdown (and possibly some additions to it, since the Perl Markdown renderer is pretty minimal).

Ultimately tho, it doesn't look like this will really _change_ user behavior/experience? They will see a new dropdown/UI change that lets them explicitly select their formatting, which will save itself (but I saw the comments above, which is a good point to try to make sure we don't confuse/annoy users by changing their defaults unexpectedly).

One thing not mentioned (I think because it's obvious, but I'll say it anyway so it's documented): crossposting doesn't change here, right? Since we just render that out as HTML and pass it raw to the target site and ask them not to mangle it. That should work just fine.

(As an aside, something I've been pondering as I've looked at LJ::CleanHTML in the past is if we could drastically simplify it with some modern technologies. It does a lot of work to try to block things like scripts, but possibly we can just use CSP or such to block those by policy and then simplify? I suppose this doesn't fix a lot of the "HTML fixing" we do though, so ultimately it probably doesn't actually simplify the cleaner by any material amount. Oh well.)
Edited (yes I got savaged by my own feature why) 2020-06-06 07:15 (UTC)


[personal profile] pinterface 2020-06-06 05:56 pm (UTC)

I agree, [personal profile] roadrunnertwice has done a really nice job thinking this through.

« [LJ::CleanHTML] does a lot of work to try to block things like scripts, but possibly we can just use CSP or such to block those by policy and then simplify? »

It might just be my age and remembering a time before CSP, but I'd pretty strongly recommend against relying on CSP alone. CSP is great as a backup for when something gets through, but there are a number of cases where even a perfectly-specified CSP is not going to work very well:

  • Any browser that doesn't understand CSP. (Older browsers, at least.)
  • Cross-posting. Gotta clean stuff before shipping it off, both so DW doesn't look like an attacker to any automated systems, and because relying on the other end to be secure seems like a not-great idea.
  • Non-HTML contexts. Presumably you'd still want to clean things before they end up in RSS, for instance, so JS doesn't end up in somebody's feed reader.

More generally, I'd argue that relying on CSP would move security issues from DW's control to third-party control (browser vendors, cross-post targets, e-mail clients, feed readers), which is a lot of faith to put in other people getting security right. Yes, use CSP! But in conjunction with existing security measures, not as a replacement.

It might be possible to simplify things by layering them—one layer to do input transformations, and another for security transformations, say—but that would require very careful design if input transformations need to produce things that would be rejected by the security layer (so no reserved HTML classes or ids, at least without shenanigans like using a random nonce or signed token that gets replaced with the reserved value).
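
As a toy illustration of that layering idea, here is a Python sketch using the random-nonce trick. Everything in it is an assumption: the reserved class name `site-user`, the three function names, and the use of regexes (a real cleaner would work on parsed HTML, not string patterns).

```python
import re
import secrets

RESERVED_CLASS = "site-user"  # hypothetical reserved class name

def input_layer(raw: str, nonce: str) -> str:
    """Expand <user> tags, marking the output with a per-render nonce."""
    return re.sub(
        r'<user name="(\w+)">',
        lambda m: f'<span class="{nonce}">{m.group(1)}</span>',
        raw,
    )

def security_layer(html: str) -> str:
    """Strip any reserved class that arrived in user-supplied markup."""
    return re.sub(rf'\s+class="{re.escape(RESERVED_CLASS)}"', "", html)

def finalize(html: str, nonce: str) -> str:
    """Swap the nonce for the reserved class once security checks are done."""
    return html.replace(f'class="{nonce}"', f'class="{RESERVED_CLASS}"')

def render(raw: str) -> str:
    nonce = secrets.token_hex(16)  # unguessable, so users cannot forge it
    return finalize(security_layer(input_layer(raw, nonce)), nonce)
```

The security layer can reject the reserved class outright because the input layer never emits it directly; only the final substitution does, after cleaning.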

[personal profile] hitchhiker 2020-06-06 09:28 pm (UTC)
i took a look at LJ::CleanHTML and the code was way nicer than i was expecting from these comments :) then again, i just read it rather than trying to modify it - are there specific things that people find hard to work with in practice?


[personal profile] alyndra 2020-06-06 09:08 am (UTC)
When it sets your default for you, would that apply to posts or comments you email in, too? Or would you still have to identify if you want markdown at the top of your email?

I think editing a post to a different format should also set that format as your default, if we’re going with using a format sets it as default. I think if anything there should be a ticky (possibly on an interstitial page) asking to check if the user doesn’t want it set as the new default.

Otherwise just don’t set it automatically at all anywhere, and always include a ticky for making it your default. Those are the two options I see. I don’t see a compelling reason for the editing page to behave differently than post or comment.
[personal profile] annathecrow 2020-06-06 01:46 pm (UTC)
As for "any closely related problems", I'm wondering how would it behave when you switch between formats? Currently, when switching between RTE and raw HTML, it sometimes "bakes in" line breaks, adding <br> tags where you didn't want them, even repeatedly. (I don't have specific reproducible steps but I'm fairly certain it wasn't just a user error.)

Otherwise, I think this is a very neat solution to the problem and can't see any obvious issues with it.

(Also, yay for markdown in comment forms! I'm not very taken with the current DW markdown flavor, but it still beats writing HTML markup by hand.)
html_casual0

[personal profile] eftychia 2020-06-06 07:51 pm (UTC)
I might be the only one affected by this, but ... "only offer users the "active" formats -- although we keep around legacy formats for displaying old content, we don't clutter the UI with them. Today, the active formats for non-syndicated posts would be markdown0, html_casual1, and html_raw0." -- I use a shell script to crosspost between DJ, IJ, & DW (I was using a command-line client for all three but it stopped working with DW so I switched to the script using post-by-mail for here), which means I'm basically using coding without @mentions, and have been using @ to mark Twitter usernames in attributions for quotations. So the ability to select html_casual0, especially in post-by-email, would be useful to me. (I've currently got the usual way Twitter IDs show up in my entries special-cased in the script
:%s/(@\([^ ]*\))/(@\1.twitter)/g

but sometimes they're not in quite that context and I have to hand-edit my entry after it's posted.)
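
For anyone who doesn't read vim substitute syntax, the command above is roughly equivalent to this Python sketch (the function name is made up; note that vim and Python swap which parentheses need escaping):

```python
import re

def tag_twitter_handles(text: str) -> str:
    """Rewrite "(@name)" as "(@name.twitter)", like the vim command above."""
    return re.sub(r"\(@([^ ]*)\)", r"(@\1.twitter)", text)
```

Like the original, it only catches handles wrapped directly in parentheses, which is why attributions in any other context still need hand-editing.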

Basically, I'm trying to post as though it were syndicated, despite not using a syndication API. So it'd be nice if I had access to a syndication-friendly input format.

(Also, is there a way to specify "do not autoformat" in post-by-email? My script currently tries to reformat from html_casual0 into something suitable for allow-autoformat.)

The other reason I'd like to keep an html_casual0 option is that it's easier for me to edit in HTML with line breaks in my source, but specify presentation linebreaks with <p> and <br />, which is what I used to get with "clive -p". So my script'd be more reliable if I can strip out the bit where it attempts to remove existing linebreaks and replace <p>..</p> and <br /> with new linebreaks.

I have no idea how many other people are still using command-line clients to post entries though. (BTW, if anybody's hacked Clive to work with DW again like it used to, and is willing to share their diffs, I'd be grateful. I was using Clive 0.4.9 when the DW API changed just enough to break it, but looking at the Changelog for Clive 0.4.10 it doesn't look like any DW support was added to it.)
[personal profile] runpunkrun 2020-06-06 10:23 pm (UTC)
I read this, and understood some. Here's my contribution: Because I am a monster, and because DW allows it, I'll often specify !markdown in a post and then go ahead and use an unholy mix of HTML and Markdown. Would that mixed usage--perhaps for older posts only--be covered by one of your format modules?

Also, I use Semagic to post to DW, so I wouldn't be able to select a format for each post. Instead of having the site guess what format to use, would it be possible to create a "!markdown"-like command to put at the head of a post to indicate the format? Or maybe have an option in settings where a user can choose a default format for new posts.


[personal profile] fred_mouse 2020-06-07 09:22 am (UTC)
I've read this (and bits of the git repository you linked to) and while I have not noticed any issues, I'm not likely to, because I don't understand enough of the code I'm reading! I am, however, very willing to do some testing and see if I can break things, if there is a way for me to do so.


[personal profile] jducoeur 2020-06-07 06:25 pm (UTC)
I don't know the code (haven't touched Perl seriously in 15 years), but as a techie user who wrestles with similar problems on my own site: LGTM. In particular, it sounds like it gets me what I *personally* want, which is to be able to use Markdown consistently without needing the glyph...
[personal profile] squirrelitude 2020-06-26 09:52 pm (UTC)
This looks great! Couple of questions:

----

1) What happens if you find a security issue in an older formatter and need to overhaul or replace the implementation? (Perhaps the implementation is so broken it *can't* be fixed, and needs replacement.)

Can DW reasonably do a search to sample for posts and comments where the change makes a meaningful difference in output? And what happens if there's a discrepancy?

Technically, your proposal wouldn't change this situation all that much, since DW *already* has this problem. But it might offer more solutions. For example, you could "clone" markdown1 as markdown1_pre_20210101, set old markdown1 posts to use it, and fix markdown1 in place. If new posts are *forbidden* from using markdown1_pre_20210101, and editing an old post bearing that formatter switches it to markdown1 so that people can't use them to host exploits, then you could have your cake and eat it too...

This would be more complexity, and I'm not suggesting it's a great idea, but I think it's worth considering how that might play out. Old libraries sometimes have to be entirely replaced, and those corner cases can be *hard* to replicate.

(Another possibility is rendering all the old content to HTML in order to freeze it, and storing that alongside the source markup.)
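
To make the "clone" idea from question 1 concrete, here is a small Python sketch of the bookkeeping it would need. The suffix comes from the example name above; the function names and the dict-based post records are hypothetical, and this says nothing about how DW actually stores formats.

```python
FROZEN_SUFFIX = "_pre_20210101"  # suffix taken from the example name above

def freeze_existing(posts, format_id):
    """Point every existing post on format_id at its frozen clone."""
    for post in posts:
        if post["format"] == format_id:
            post["format"] = format_id + FROZEN_SUFFIX

def allowed_for_new_posts(format_id):
    """Frozen clones display old content but cannot be chosen for new posts."""
    return not format_id.endswith(FROZEN_SUFFIX)

def format_after_edit(format_id):
    """Editing a post on a frozen clone moves it to the fixed format."""
    if format_id.endswith(FROZEN_SUFFIX):
        return format_id[: -len(FROZEN_SUFFIX)]
    return format_id
```

The point of the third function is that an edit is the only way content can change, so forcing the fixed formatter at that moment keeps the frozen clone from hosting new exploits.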

----

2) You write about the scenario where someone edits an old post:

« The exception is if you're editing older content that uses a legacy format. In that case, the menu includes the active formats PLUS the post's existing format, so you don't have to update all formatting within the post just to make a minor change. »

What happens if I edit an old post and change the format selector to a newer format, and realize I've made a mistake and actually need it to go back to the deprecated format? Will the selector still show the *previous/original* format?
