roadrunnertwice: Dee perpetrates some Mess. (Arts and crafts (Little Dee))
Nick Eff ([personal profile] roadrunnertwice) wrote in [site community profile] dw_dev 2020-06-05 02:27 pm

Named Markup Formats

I've been working on a thing, and I could use some feedback on the implementation. It might take a little explaining (because there's a fair amount of backstory), but I'll try and be as concise as possible.

Backstory: Raw and transformed text

DW stores the text of entries and comments raw, exactly as the user entered it. Then, whenever we need to display that text, we transform it to produce nice legible HTML. Those transformations include:

  • Turning <user> tags (which aren't real HTML) into [profile] user, which is really like a span plus an image plus a link.
  • Handling <cut> tags.
  • Several other things, etc., not important right now.

Most of those get applied to everything we display. But there are also some OPTIONAL transformations:

  • Turning normal line breaks into HTML <br> tags.
  • Turning bare URLs into clickable links.

Those get applied by default, but we've always had a "don't autoformat" checkbox (inherited from LJ) that could be used to disable them for an entry or comment. (BTW, under the hood the RTE saves entries as "don't autoformat.")

Then, later, DW added some other optional transforms, which had their own special enabling conditions:

  • Turning Markdown into HTML. (For entries that start with a special !markdown glyph, or comments submitted by email.)
  • Turning @mentions into user tags. (Currently applies to everything except "don't autoformat," but gets suppressed within certain HTML elements or their Markdown equivalents.)

ALL of these transforms get handled by something called the "html cleaner," at LJ::CleanHTML. At this point "cleaner" is kind of a misnomer; in actual fact, it's the central place where we handle all transformations of raw user-entered text into a fragment of display HTML.
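
To make the two classic optional transforms concrete, here is a toy sketch in Python. It is not how LJ::CleanHTML is written; the function names and the URL regex are invented for illustration, and the real rules have many more edge cases.

```python
import re

# Illustrative only: a bare URL is one not already sitting inside an
# attribute value or directly after a tag (hence the lookbehind).
URL_RE = re.compile(r"(?<![\"'>=])\bhttps?://[^\s<>\"]+")


def autoformat(text: str) -> str:
    """Apply the two optional transforms: link bare URLs, keep line breaks."""
    linked = URL_RE.sub(
        lambda m: f'<a href="{m.group(0)}">{m.group(0)}</a>', text
    )
    return linked.replace("\n", "<br>\n")


def render(text: str, no_autoformat: bool = False) -> str:
    """The "don't autoformat" checkbox simply skips both transforms."""
    return text if no_autoformat else autoformat(text)


if __name__ == "__main__":
    print(render("hello\nhttps://example.com"))
    print(render("<p>hello</p>\n<p>world</p>", no_autoformat=True))
```

The point is only that a single checkbox gates both behaviors at once, which is part of why the later additions (Markdown, @mentions) ended up with their own separate enabling conditions.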

The problems

In my understanding, the current state of affairs has two main problems:

  • The interface for choosing text transforms is incoherent. That happened gradually; we've added new transformations over time, and changed the interactions between them, and now it's weird:
    • Half of the interfaces for entering entry/comment text don't even include the "don't autoformat" checkbox anymore.
    • The way you enable Markdown has always been a mystery. For example, I want to use Markdown in comments (because typing HTML angle brackets on a telephone is bullshit), but currently it's impossible except when responding via email.
  • Introducing new text transforms is dangerous and chaotic. In mid-2019, we enabled @mentions for HTML-formatted content (previously they only worked in Markdown content), and about 40% of hell broke loose:
    • Current content suffered because we didn't have a good way to beta-test @mentions, so we didn't have a chance to learn about bugs and edge cases from our users (who are more ingenious at doing weird textual shit than we are) before enabling them globally.
    • Old posts weren't written to expect @mentions, so we ended up totally vandalizing any historical post that ever discussed CSS code (@media (min-width: etc...)), Perl or Ruby or Objective-C code, or a wide variety of other things that involve @ signs.

Questions: Does anyone disagree with those two problems or my characterization of them? Does anyone see any closely related problems that I'm not recognizing here?

The solution, maybe

I've got a pull request up for discussion right now that tries to solve these immediate problems, as well as several other problems I can see coming down the road at us.

In brief (lol, sorry, I'm trying!!), here's what I'm proposing we do:

  • Instead of a set of disconnected text transform options that can be independently twiddled, we codify a set of named markup formats. Each entry or comment will specify exactly one format.
    • To start with, this will include all of the valid combinations of the current formatting options, which ends up being fewer than you might think:
      • markdown0 - Markdown via the Text::Markdown module, plus our own @mentions.
      • html_casual1 - Classic HTML-but-it-respects-your-linebreaks, plus @mentions.
      • html_casual0 - Like casual 1, except without @mentions. Old content was written to expect this, but we don't respect that anymore. Imported and syndicated content still uses this, though.
      • html_raw0 - "Don't autoformat." No @mentions.
      • html_extra_raw - Syndicated content that we know will never use special DW tags like <user> or <cut>.
    • (For existing posts that don't specify a format, we use the existing independent options to guess a format. We also take the post date into account, so we can restore casual 0 on posts that were originally written in it.)
  • When formatting text for display, the specified format determines all of the transforms we use.
    • No more independent twiddlies. The format captures the entirety of the user's intent for how their markup should be handled.
  • Formats are versioned.
    • If we want to introduce new formatting behavior (like @mentions) or make changes that might have non-linear consequences (like switching from Text::Markdown to a different Markdown processor), we have to implement it as a new format (e.g. markdown1).
    • Old versions stick around in the HTML cleaner FOR ALL ETERNITY, so that we can continue to display content as it was originally intended.
    • This doesn't preclude making small changes for safety or consistency -- for example, putting a user tag inside a link always used to generate illegal link-within-a-link markup that we'd just ignore and let your browser sort out, but I recently made it so we strip the inner link. But anything that changes what a user would expect to get from their markup should cut a new format.
  • When posting or editing a comment or entry in the web UI, the form includes a drop-down for choosing which format you want to use.
    • (With a descriptive name, not weird IDs like html_casual1.)
    • We only offer users the "active" formats -- although we keep around legacy formats for displaying old content, we don't clutter the UI with them. Today, the active formats for non-syndicated posts would be markdown0, html_casual1, and html_raw0.
      • The exception is if you're editing older content that uses a legacy format. In that case, the menu includes the active formats PLUS the post's existing format, so you don't have to update all formatting within the post just to make a minor change.
    • We remember the last format you chose when making a new post, and use that as your default.
    • (new detail) If you deliberately change the format to something other than your current default, we give you the option to set that as your new default after saving your post/edit.
    • If we're introducing a new format, we might set it as active but mark it as beta, and keep the format it replaces for a while. When posting with a beta format, you're accepting the possibility that the markup behavior might retroactively change for that post, until the beta ends.
  • The selected format in a comment/entry form might change other behavior of the form.
    • Specifically, I'm thinking of markup helper buttons for the most common text styles, which is another back-burner project I'm working on -- the bold/italic/link buttons would add HTML tags for the HTML formats, and Markdown styles for Markdown.
    • Also thinking of the future replacement RTE here -- RTE would get its own named format (even though practically speaking it will probably emit something that looks like html_raw0), and switching your format to "Rich" would initialize the editor and its controls. Switching your format to something non-rich would serialize the editor's buffer as HTML and take you back to a plain textarea; we wouldn't need a separate control for switching to the RTE beyond the normal format drop-down that everything else uses.
    • ...I guess we could also add syntax highlighting or something for HTML and Markdown, but I don't think anyone's planning on it and I doubt anyone's desperate for it.
  • (new detail) When posting via email, we'll add a new post header for formatting. You can specify either a format ID (like html_casual0), or a shortcut ID to choose the most recent versions of casual HTML or Markdown. If you omit the header, it posts with the current global default for that type of content, ignoring your user default.
    • Why: Email is sort of halfway between a UI and an API. People who are automating email posts want to choose a stable and predictable format (so use IDs), but people who are writing in their real mail client probably want their posts to act like the version of the web form they're accustomed to. We'll ignore user defaults because your email client doesn't give you any feedback about what's going to happen, so we don't want it to depend on hidden settings state.
  • (new detail) When posting via the new API, you can specify a format ID or fall back to a global default. User default is ignored.
  • (new detail) Email and API posts aren't limited to the currently active formats; they can specify any format they want, including obsolete ones. This should help us avoid breaking automation that people build.
  • (new detail) When posting via the old XMLRPC API (so, old LJ clients like Semagic), you cannot select a named format. You'll be limited to today's formatting options (the "don't autoformat" checkbox if supported by the client, and the !markdown secret glyph), which will forever behave exactly like they do in mid-2020 (resulting in either markdown0, html_casual1, or html_raw0).
    • Why: Old client programs want things to keep working the way they've always worked... so that's what they'll get. And we don't want to add more secret glyphs, because they make things harder to maintain and understand.
  • (new detail) If a browser extension happened to mangle the web forms to re-enable obsolete formats, we wouldn't care. From our perspective, that's the same thing as an API client posting with a weird format, which is fine. (Removing obsolete formats from the menu is just about keeping things simple for most users.)
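
A rough sketch of the core idea in the list above: one table maps each format ID to the full set of transforms it implies, and the web UI menu is derived from that table. This is Python for illustration only, not the code in the pull request. The class, the transform names, and the assumption that markdown0 also handles site tags like <user> and <cut> are all mine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MarkupFormat:
    id: str               # stable ID stored alongside the entry or comment
    transforms: tuple     # hypothetical names for the transforms applied
    active: bool = False  # offered in the web UI for new posts?


# One row per named format from the list above. "autoformat" stands for the
# linebreak and bare-URL handling; "site_tags" for <user>, <cut>, and friends.
FORMATS = {
    f.id: f
    for f in (
        MarkupFormat("markdown0", ("markdown", "mentions", "site_tags"), active=True),
        MarkupFormat("html_casual1", ("autoformat", "mentions", "site_tags"), active=True),
        MarkupFormat("html_casual0", ("autoformat", "site_tags")),
        MarkupFormat("html_raw0", ("site_tags",), active=True),
        MarkupFormat("html_extra_raw", ()),
    )
}


def transforms_for(format_id: str) -> tuple:
    """The format alone decides every transform; no independent toggles."""
    return FORMATS[format_id].transforms


def menu_choices(existing_format: Optional[str] = None) -> list:
    """Active formats, plus the post's own legacy format when editing."""
    choices = [f.id for f in FORMATS.values() if f.active]
    if existing_format in FORMATS and existing_format not in choices:
        choices.append(existing_format)
    return choices


if __name__ == "__main__":
    print(menu_choices())                # new post
    print(menu_choices("html_casual0"))  # editing an old post
```

Cutting a new version (say, a markdown1) would then mean adding a row and flipping the active flags, while the old row stays in place so old content keeps rendering the way it was written.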

I THINK this will free us up to be much more nimble about modernizing the way we process text, and will increase user choice while ALSO making the site less confusing to use, which is a rare combo.
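
The email and API rules from the list above could be sketched roughly like this. Again, this is illustrative Python rather than the real implementation. The shortcut names, the single global default, and the choice to reject unknown IDs with an error are assumptions; the proposal only says that shortcuts exist for email and that the global default depends on the type of content.

```python
from typing import Optional

KNOWN_FORMATS = {
    "markdown0", "html_casual1", "html_casual0", "html_raw0", "html_extra_raw",
}

# Hypothetical shortcut names that point at the newest version of a family.
SHORTCUTS = {"markdown": "markdown0", "html": "html_casual1"}

# Assumed to be a single value here for simplicity.
GLOBAL_DEFAULT = "html_casual1"


def resolve_posted_format(
    requested: Optional[str], user_default: Optional[str] = None
) -> str:
    """Pick the format for an email or API post.

    user_default is accepted only to show that it is deliberately ignored:
    these channels give no feedback, so hidden per-user state is avoided.
    """
    if requested is None or not requested.strip():
        return GLOBAL_DEFAULT
    key = requested.strip().lower()
    if key in KNOWN_FORMATS:
        # Any known ID is allowed, including obsolete ones.
        return key
    if key in SHORTCUTS:
        return SHORTCUTS[key]
    raise ValueError(f"unknown markup format: {requested!r}")


if __name__ == "__main__":
    print(resolve_posted_format(None, user_default="markdown0"))
    print(resolve_posted_format("html_casual0"))
    print(resolve_posted_format("markdown"))
```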

Questions: Does the approach make sense? Can you think of anything it would sabotage or prevent us from doing in the future? Can you see anything excessively complicated in the code itself? (By which I mean, can we make this simpler while still solving those two problems.)

Non-linear consequences

Here are the things I can think of that might be affected in weird ways by this change:

  • External clients and integrations. We'll be expecting new entries and comments to arrive with a format specified (using the new-ish editor property), so anything that posts content without that will be locked into the "guessed" format that fits their other options. (That's raw 0 if they set "don't autoformat," and casual 1 if they didn't.) Over time, that'll drift more and more out of date; when we add some new @mention-like thing that forces us to cut a casual 2, we're going to keep guessing casual 1 for posts with no metadata, because all of OUR stuff properly sets the format type.
    • TBH, I think this is 100% fine. Feel free to make a case for otherwise, tho.
  • All the documentation/FAQ pages about formatting stuff will be immediately out of date.
    • On it, don't worry.
  • Once we finally make Markdown discoverable, people might start requesting we add other oddball lightweight markup formats, like textile or RST.
    • I think the answer to that is basically "no?" It should be "no."
  • Another thing that might happen is that we start getting pressure to add some of the niceties of more modern Markdown implementations, like fenced code blocks and ascii-art tables.
    • Well, at least we'd be able to do that safely, by cutting a markdown1 and leaving existing content on markdown0. It's just a question of what we actually want our Markdown to act like, and THAT is going to be an exciting conversation.
  • Doubling down on the importance and centrality of the HTML cleaner, which is a highly complex thicket of code that not a lot of people feel comfortable working with.
    • I'm not aware of any plans to replace the cleaner, but this would definitely require them to be rethought. That said, we're always going to need SOME central thing that governs text transformations, and we could move the implementation to a different spot in the code if need be.

Please suggest more of these if you think of them.
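
For reference, the format-guessing fallback described above (for content that arrives or already exists without a named format) might look something like this sketch. It is illustrative Python, not the real code. The cutoff date is a placeholder, since the post only says @mentions reached HTML content in "mid-2019", and checking the !markdown glyph before the "don't autoformat" flag is my assumption about precedence.

```python
from datetime import date

# Placeholder value: the actual date @mentions were enabled for HTML content
# is not given above, only "mid-2019".
MENTIONS_CUTOFF = date(2019, 7, 1)


def guess_format(text: str, no_autoformat: bool, posted: date) -> str:
    """Guess a named format for content that doesn't specify one."""
    if text.startswith("!markdown"):
        return "markdown0"
    if no_autoformat:
        return "html_raw0"
    if posted < MENTIONS_CUTOFF:
        # Written before @mentions existed in HTML content: restore casual 0.
        return "html_casual0"
    # Stays casual 1 even after newer versions are cut.
    return "html_casual1"


if __name__ == "__main__":
    print(guess_format("@media (min-width: 40em) { }", False, date(2015, 3, 1)))
    print(guess_format("hi @someone", False, date(2020, 6, 5)))
```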

dennisgorelik: 2020-06-13 in my home office (Default)

Reduce complexity first?

[personal profile] dennisgorelik 2020-06-06 04:57 am (UTC)(link)
Would it be possible to delete some features before adding new ones?
E.g., is it possible to delete the "Disable Auto-Formatting" option?

"don't autoformat," checkbox

[personal profile] dennisgorelik 2020-06-06 06:01 am (UTC)(link)
> If you mean the checkbox labeled "don't autoformat,"

Where can I see the "don't autoformat" checkbox?

Here:
https://www.dreamwidth.org/update.bml
I see a "Disable Auto-Formatting" checkbox.
alyndra: (circular reasoning)

Re: "don't autoformat," checkbox

[personal profile] alyndra 2020-06-06 02:41 pm (UTC)(link)
If you click reply to any comment and then click More Options, you will see “don’t auto-format.” But it really means the same thing as “disable auto-formatting.”

But to answer your original questions about if it is possible to get rid of some features and not have the RTE anymore, you might really like the Beta Create Entries page. You can toggle it on or off at dreamwidth.org/beta and if you click the Settings wheel there you can choose which options you want visible on the page when you’re making posts.
Edited 2020-06-06 14:46 (UTC)

Re: "don't autoformat," checkbox

[personal profile] dennisgorelik 2020-06-06 03:07 pm (UTC)(link)
Why does Dreamwidth use different labels for the same thing?
The label on that checkbox should be "Don't auto-format:" or "Disable Auto-Formatting", but not both, right?
Or better yet, do not have that checkbox at all and activate the option through some form of markup, because it is needed only for [rare] advanced usage.
azurelunatic: Vivid pink Alaskan wild rose. (Default)

Re: "don't autoformat," checkbox

[personal profile] azurelunatic 2020-06-06 05:13 pm (UTC)(link)
They definitely should become consistent.

"Create Entries" Beta page

[personal profile] dennisgorelik 2020-06-06 03:19 pm (UTC)(link)
I opened:
https://www.dreamwidth.org/beta
and clicked the "Turn ON beta testing" button in the "New Create Entries Page" section.
My first impression is that the Beta version of:
https://www.dreamwidth.org/entry/new
looks better than the standard version.
It is cleaner and customizable.
I still need to actually use it (post a blog entry) to decide whether it is actually an improvement.

The "Create Entries" Beta page may lose changes if I collapse "Settings" without saving, which is a little counterintuitive. But it is a nitpick at this point.

Thank you.

Further cleanup of "Create Entries" Beta page

[personal profile] dennisgorelik 2020-06-06 03:52 pm (UTC)(link)
The "Spell check" button is no longer needed, because browsers' built-in spellchecking is good enough already.
If a significant number of Dreamwidth users really need "Dreamwidth spellchecking", make the "Spell check" button "hideable"/optional.

This is not a critical change, of course, but as with any other cleanup, fewer features make the product easier to use and easier to maintain.

Re: Further cleanup of "Create Entries" Beta page

[personal profile] azurelunatic 2020-06-06 05:17 pm (UTC)(link)
Spell check is planned for removal for all the good reasons you mentioned.

Re: Further cleanup of "Create Entries" Beta page

[personal profile] dennisgorelik 2020-06-06 05:30 pm (UTC)(link)
Cool.
I also generally prioritize feature removal over feature addition, because removing features makes the codebase much easier to operate.
"Delete feature A and then add feature B" is easier than "Add feature B and then delete feature A".
Edited 2020-06-06 17:31 (UTC)

Re: Further cleanup of "Create Entries" Beta page

[personal profile] azurelunatic 2020-06-06 05:43 pm (UTC)(link)
From the perspective of maintaining the code, definitely.

From the perspective of having a working website, the deletion of some things can't go to production until the replacement is ready.

Re: Further cleanup of "Create Entries" Beta page

[personal profile] dennisgorelik 2020-06-06 06:28 pm (UTC)(link)
> until the replacement is ready

"Spellchecker replacement" was ready and deployed many years ago.
alexseanchai: Katsuki Yuuri wearing a blue jacket and his glasses and holding a poodle, in front of the asexual pride flag with a rainbow heart inset. (Default)

Re: Further cleanup of "Create Entries" Beta page

[personal profile] alexseanchai 2020-06-06 08:08 pm (UTC)(link)
Create Entries replacement was not ready any length of time ago, because the rich text editor is not ready yet.

Re: Further cleanup of "Create Entries" Beta page

[personal profile] dennisgorelik 2020-06-06 08:18 pm (UTC)(link)
I meant that the "Spellchecker" feature probably should have been deleted earlier.
I did not mean that the Create Entries replacement should be implemented after deleting the old version of the Create Entries page.
I find the current approach, with both versions active during testing and transition, reasonable.

Re: Reduce complexity first?

[personal profile] dennisgorelik 2020-06-06 06:05 am (UTC)(link)
> If you mean taking away the ability to do raw HTML without getting your linebreaks mangled

I do not mean that.
Advanced formatting (including a command to preserve raw HTML) could be done with advanced syntax in the post text.